ir.strategy

Indexing strategies — the “what do we index?” seam.

An IndexingStrategy decomposes one artifact into an IndexPlan: the filter_fields (hard-filterable metadata, not embedded) and a list of Surface (embeddable units). This is the central extensibility point of ir: a naive corpus uses WholeText; a structured corpus (a package) decomposes into several heterogeneous surfaces so a query can match the right part of an artifact, and constrains candidates by metadata before semantic ranking.

Shipped strategies:

  • WholeText — one surface = the whole text. The out-of-the-box default.

  • Chunked — split the text into overlapping chunks (one surface each).

  • Skill — embed name + description only (the body stays on disk, per the capability-discovery research); name/parent become filter fields.

  • Package — name + description plus README chunks as surfaces; name/owner/deps become filter fields (AI synopsis / problem-class surfaces are a documented extension).

Every strategy is a plain callable-ish object with a decompose method, so custom strategies need only match the IndexingStrategy protocol.
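As a rough illustration of what matching the protocol could look like, the sketch below defines a custom strategy with a decompose method. The Surface and IndexPlan dataclasses are local stand-ins so the example runs on its own; their field names (kind, text, metadata, filter_fields, surfaces) and the decompose signature are assumptions, not the library's confirmed API. TitleAndBody is a hypothetical strategy invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Protocol


# Stand-in types, assumed for illustration only; the real ir classes may differ.
@dataclass
class Surface:
    kind: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class IndexPlan:
    filter_fields: dict[str, Any]
    surfaces: list[Surface]


class IndexingStrategy(Protocol):
    # Assumed signature: one raw artifact in, one plan out.
    def decompose(self, raw: Any) -> IndexPlan: ...


class TitleAndBody:
    """Hypothetical strategy: one surface for the title, one for the body."""

    def decompose(self, raw: Mapping[str, Any]) -> IndexPlan:
        surfaces = []
        if raw.get("title"):
            surfaces.append(Surface(kind="title", text=raw["title"]))
        if raw.get("body"):
            surfaces.append(Surface(kind="body", text=raw["body"]))
        # Filter fields are hard-filterable metadata and are not embedded.
        return IndexPlan(filter_fields={"author": raw.get("author")}, surfaces=surfaces)


# Structural typing: TitleAndBody never inherits from IndexingStrategy.
strategy: IndexingStrategy = TitleAndBody()
plan = strategy.decompose({"title": "Intro", "body": "Hello world", "author": "ann"})
print([s.kind for s in plan.surfaces])
```

Because the protocol is structural, the custom class only needs the right method shape; no base class or registration step is assumed here.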

class ir.strategy.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')[source]

Split the artifact’s text into overlapping chunk surfaces.
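To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of overlapping chunking. It assumes the sizes are measured in characters and that the window advances by chunk_size - overlap each step; the library's actual splitting rule is not stated here and may differ. chunk_text is a hypothetical helper, not part of ir.

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 200) -> list[str]:
    """Split text into windows where consecutive windows share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Small sizes so the overlap is easy to see.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Under these assumptions, each chunk would become one surface, and the shared characters keep a sentence that straddles a boundary visible in both neighbours.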

class ir.strategy.IndexingStrategy(*args, **kwargs)[source]

Decompose one artifact into filter fields + embeddable surfaces.

class ir.strategy.Package(*, chunk_size: int = 1500, overlap: int = 200)[source]

Package strategy: name + description surface plus README chunks.

Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.

Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == j. Here surface_index is plan-global and chunk_index is per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()). n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change), so read it with metadata.get("n_chunks").
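The index arithmetic above can be sketched as follows. This covers only the documented case where the description surface is present at plan position 0. The helper name and the two metadata dictionaries are hypothetical, made up to show why a tolerant read of n_chunks is advisable.

```python
def surface_index_for_chunk(chunk_index: int) -> int:
    """Plan-global position of a README chunk when the description surface sits at 0."""
    return chunk_index + 1


# Hypothetical surface metadata: one record stamped with n_chunks, one built before the stamp.
metadata_new = {"chunk_index": 2, "n_chunks": 5}
metadata_old = {"chunk_index": 2}

for metadata in (metadata_new, metadata_old):
    j = metadata["chunk_index"]
    # .get() yields None for older records instead of raising KeyError.
    n_chunks = metadata.get("n_chunks")
    print(surface_index_for_chunk(j), n_chunks)
```

The sketch deliberately stops short of building record ids: sibling ids should come from the ledger rather than from this arithmetic.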

class ir.strategy.Skill[source]

Capability strategy: embed name + description only.

The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.

class ir.strategy.WholeText(*, text_key: str | None = None, kind: str = 'document')[source]

One surface = the entire text. Sensible default for a naive corpus.

ir.strategy.text_of(raw: Any, text_key: str | None = None) → str[source]

Best-effort text extraction from a raw artifact payload.

The single source of truth (SSOT) for turning an opaque raw (a str, a Mapping with a text field or a text_key, or anything else) into embeddable text. It is reused by the shipped strategies and by ir.synopsis.make_llm_synthesizer(), so an injected-free synopsis summarizes the same text a strategy would index.
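A minimal sketch of this kind of best-effort extraction is shown below. It is not the library's implementation: the function name text_of_sketch, the fallback key "text", the empty string returned for a missing key, and the str() fallback for other payloads are all assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any


def text_of_sketch(raw: Any, text_key: str | None = None) -> str:
    """Best-effort text extraction, under the assumptions stated above."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, Mapping):
        # Assumed rule: an explicit text_key wins, otherwise look for "text".
        key = text_key if text_key is not None else "text"
        value = raw.get(key)
        # Assumed fallback: a missing field yields an empty string.
        return "" if value is None else str(value)
    # Assumed fallback for any other payload type.
    return str(raw)


print(text_of_sketch("plain string"))
print(text_of_sketch({"text": "from the text field"}))
print(text_of_sketch({"body": "from a custom key"}, text_key="body"))
```

Routing every caller through one function like this is what keeps a strategy and a synopsis step looking at the same text.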