ir.strategy

Indexing strategies — the “what do we index?” seam.

An IndexingStrategy decomposes one artifact into an IndexPlan: the filter_fields (hard-filterable metadata, not embedded) and a list of Surface (embeddable units). This is the central extensibility point of ir: a naive corpus uses WholeText; a structured corpus (a package) decomposes into several heterogeneous surfaces so a query can match the right part of an artifact, and constrains candidates by metadata before semantic ranking.

Shipped strategies:

  • WholeText — one surface = the whole text. The out-of-the-box default.

  • Chunked — split the text into overlapping chunks (one surface each).

  • Skill — embed name + description only (the body stays on disk, per the capability-discovery research); name/parent become filter fields.

  • Packagename + description plus README chunks as surfaces; name/owner/deps become filter fields (AI synopsis / problem-class surfaces are a documented extension).

Every strategy is a plain callable-ish object with a decompose method, so custom strategies need only match the IndexingStrategy protocol.

class ir.strategy.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')[source]

Split the artifact’s text into overlapping chunk surfaces.

class ir.strategy.ClaudeTurn(*, include_full: bool = False)[source]

Index a Claude Code session turn-pair: user prompt + assistant summary.

Two surfaces by default — user_prompt (what the human asked) and assistant_summary (the assistant’s final end-of-turn text, the highest-signal “here’s what I did”; the deliberation before it is mostly noise) — so a query can target either side via surfaces={"user_prompt"} / {"assistant_summary"}. With include_full=True a third assistant_full surface (all the turn’s assistant natural-language text) trades noise for recall — off by default. Session / project / time / model / tool-use become hard-filter fields.

The raw artifact is a turn-pair record (see priv.claude_transcripts.turn_pair_records()): a mapping with user_prompt / assistant_summary / assistant_full plus metadata.

class ir.strategy.IndexingStrategy(*args, **kwargs)[source]

Decompose one artifact into filter fields + embeddable surfaces.

class ir.strategy.Package(*, chunk_size: int = 1500, overlap: int = 200, embed_deps: bool = False, deps_template: Callable[[list[str]], str] | None = None)[source]

Package strategy: name + description surface plus README chunks.

Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.

With embed_deps=True, decompose additionally emits one Surface(kind="deps", granularity="field") whose text is a prefix-form serialization of the bare dependency names (deps_template, default _default_deps_text()) — so a query for a domain matches a package by the libraries it depends on (e.g. sentence-transformers -> embeddings, networkx -> graphs), and the BM25 leg picks up exact dep-token matches. The deps bag is kept separate from prose (its own surface) so a rare library name is not diluted, and deps remain a filter field regardless. embed_deps defaults False (today’s behavior); it folds into the strategy id, so toggling it re-decomposes incrementally. The deps surface is appended last, leaving the description (position 0) and readme_chunk indices unchanged.

Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == jsurface_index is plan-global, chunk_index per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()). n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change) — read it with metadata.get("n_chunks").

ir.strategy.STRATEGY_REGISTRY: dict[str, type] = {'Chunked': <class 'ir.strategy.Chunked'>, 'ClaudeTurn': <class 'ir.strategy.ClaudeTurn'>, 'Package': <class 'ir.strategy.Package'>, 'Skill': <class 'ir.strategy.Skill'>, 'WholeText': <class 'ir.strategy.WholeText'>}

Shipped strategies addressable by name, for persisting an IndexingStrategy in a registry entry and reconstructing it (#58). New shipped strategies register here so a corpus can name its segmentation.

class ir.strategy.Skill[source]

Capability strategy: embed name + description only.

The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.

class ir.strategy.WholeText(*, text_key: str | None = None, kind: str = 'document')[source]

One surface = the entire text. Sensible default for a naive corpus.

ir.strategy.strategy_from_spec(spec: Mapping[str, Any] | None) IndexingStrategy | None[source]

Reconstruct a shipped strategy from a {"name", "params"} spec.

None (no persisted strategy) returns None so the caller falls back to the source preset’s default strategy — the back-compatible behavior for v1 registry entries.

ir.strategy.strategy_to_spec(strategy: Any) dict[source]

A {"name", "params"} spec for a shipped, scalar-param strategy.

Captures only scalar constructor parameters (the same identity surface ir.index._strategy_id() stamps), so it round-trips the shipped strategies. A custom strategy, or one wrapping another (e.g. the ir.with_synopsis() wrapper), is not captured here — those are set programmatically at build time / by the maintenance layer, not persisted as a segmentation spec.

ir.strategy.text_of(raw: Any, text_key: str | None = None) str[source]

Best-effort text extraction from a raw artifact payload.

The SSOT for turning an opaque raw (a str, a Mapping with a text field or a text_key, or anything else) into embeddable text — reused by the shipped strategies and by ir.synopsis.make_llm_synthesizer() so an injected-free synopsis summarizes the same text a strategy would index.