ir.strategy
Indexing strategies — the “what do we index?” seam.
An IndexingStrategy decomposes one artifact into an
IndexPlan: the filter_fields (hard-filterable metadata,
not embedded) and a list of Surface (embeddable units). This
is the central extensibility point of ir: a naive corpus uses
WholeText; a structured corpus (a package) decomposes into several
heterogeneous surfaces so a query can match the right part of an artifact,
and candidates can be constrained by metadata before semantic ranking.
Shipped strategies:
- WholeText: one surface = the whole text. The out-of-the-box default.
- Chunked: split the text into overlapping chunks (one surface each).
- Skill: embed name + description only (the body stays on disk, per the capability-discovery research); name/parent become filter fields.
- Package: name + description plus README chunks as surfaces; name/owner/deps become filter fields (AI synopsis / problem-class surfaces are a documented extension).
Every strategy is a plain callable-ish object with a decompose method, so
custom strategies need only match the IndexingStrategy protocol.
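As an illustration of what matching the protocol could look like, here is a minimal sketch. The Surface and IndexPlan dataclasses below are hypothetical stand-ins (the real classes in ir may carry different fields), the single-argument decompose signature is an assumption, and TitleAndBody is an invented example strategy.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


# Hypothetical stand-ins for ir's Surface / IndexPlan; field names are assumed.
@dataclass
class Surface:
    kind: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class IndexPlan:
    filter_fields: dict
    surfaces: list


class IndexingStrategy(Protocol):
    # Assumed signature: one raw artifact in, one plan out.
    def decompose(self, raw: Any) -> IndexPlan: ...


class TitleAndBody:
    """Invented custom strategy: one surface per non-empty title/body field."""

    def decompose(self, raw: Any) -> IndexPlan:
        surfaces = []
        for kind in ("title", "body"):
            text = raw.get(kind, "")
            if text:  # skip empty fields so they produce no surface
                surfaces.append(Surface(kind=kind, text=text))
        # Metadata that should be filterable but not embedded.
        return IndexPlan(filter_fields={"author": raw.get("author")}, surfaces=surfaces)


def build_plan(strategy: IndexingStrategy, raw: Any) -> IndexPlan:
    # Any object with a compatible decompose method is accepted (structural typing).
    return strategy.decompose(raw)


plan = build_plan(
    TitleAndBody(),
    {"title": "Retry policy", "body": "Back off exponentially.", "author": "ops"},
)
```

Because the protocol is structural, TitleAndBody does not need to inherit from anything; having a decompose method of the right shape is enough.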
- class ir.strategy.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')
Split the artifact’s text into overlapping chunk surfaces.
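The effect of chunk_size and overlap can be sketched with a simple sliding window. This is not the library's implementation: it assumes both parameters are measured in characters and that each chunk starts chunk_size - overlap after the previous one, and the function name overlapping_chunks is hypothetical.

```python
def overlapping_chunks(text: str, chunk_size: int = 1200, overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small sizes make the overlap visible: consecutive chunks share one character.
chunks = overlapping_chunks("abcdefghij", chunk_size=4, overlap=1)
# chunks == ['abcd', 'defg', 'ghij']
```

Under these assumptions the defaults (1200 / 200) turn a 3000-character text into three chunks, each of which would become one surface.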
- class ir.strategy.IndexingStrategy(*args, **kwargs)
Decompose one artifact into filter fields + embeddable surfaces.
- class ir.strategy.Package(*, chunk_size: int = 1500, overlap: int = 200)
Package strategy: name + description surface plus README chunks.
Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.
Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == j. surface_index is plan-global; chunk_index is per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()). n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change), so read it with metadata.get("n_chunks").
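The relationship between the two indices can be made concrete with a small model of the plan layout. The sketch below does not call ir: plan_positions is a hypothetical helper that only reproduces the numbering described above, and the behaviour when the description surface is dropped (chunks shifting down to position 0) is an assumption inferred from that description.

```python
def plan_positions(has_description: bool, n_readme_chunks: int) -> list:
    """Model the plan order: optional description surface first, then README chunks."""
    kinds = (["description"] if has_description else []) + ["readme_chunk"] * n_readme_chunks
    rows = []
    chunk_index = 0  # per-kind counter, restarts at 0 for the readme chunks
    for surface_index, kind in enumerate(kinds):  # plan-global position
        row = {"surface_index": surface_index, "kind": kind}
        if kind == "readme_chunk":
            row["chunk_index"] = chunk_index
            row["n_chunks"] = n_readme_chunks
            chunk_index += 1
        rows.append(row)
    return rows


rows = plan_positions(has_description=True, n_readme_chunks=3)
# With a description surface present, every chunk sits one position later
# in the plan than its chunk_index suggests.
offsets = [r["surface_index"] - r["chunk_index"] for r in rows if r["kind"] == "readme_chunk"]

# Records from older corpora may lack n_chunks, so read it defensively.
legacy_metadata = {"chunk_index": 0}
n_chunks = legacy_metadata.get("n_chunks")  # None rather than a KeyError
```

The varying offset is the reason sibling ids should come from the ledger rather than from arithmetic on chunk_index.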
- class ir.strategy.Skill
Capability strategy: embed name + description only.
The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.
- class ir.strategy.WholeText(*, text_key: str | None = None, kind: str = 'document')
One surface = the entire text. Sensible default for a naive corpus.
- ir.strategy.text_of(raw: Any, text_key: str | None = None) -> str
Best-effort text extraction from a raw artifact payload.
The single source of truth (SSOT) for turning an opaque raw (a str, a Mapping with a text field or a text_key, or anything else) into embeddable text. It is reused by the shipped strategies and by ir.synopsis.make_llm_synthesizer() so an injected-free synopsis summarizes the same text a strategy would index.
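A best-effort extractor of this kind might look roughly like the sketch below. It is not the library's code: text_of_sketch is a hypothetical name, and both the lookup order (text_key before the text field) and the str(raw) fallback for other payloads are assumptions.

```python
from collections.abc import Mapping
from typing import Any, Optional


def text_of_sketch(raw: Any, text_key: Optional[str] = None) -> str:
    """Best-effort text extraction; precedence and fallback are assumed."""
    if isinstance(raw, str):
        return raw  # already text
    if isinstance(raw, Mapping):
        # Assumption: an explicit text_key wins over the conventional "text" field.
        if text_key is not None and text_key in raw:
            return str(raw[text_key])
        if "text" in raw:
            return str(raw["text"])
    # Assumption: anything else falls back to its string representation.
    return str(raw)


from_string = text_of_sketch("plain text")
from_mapping = text_of_sketch({"text": "from the text field"})
from_key = text_of_sketch({"body": "from body", "text": "ignored"}, text_key="body")
```

Routing every caller through one function like this is what keeps a synopsis and an index surface working from identical text.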