ir.strategy
Indexing strategies — the “what do we index?” seam.
An IndexingStrategy decomposes one artifact into an
IndexPlan: the filter_fields (hard-filterable metadata,
not embedded) and a list of Surface (embeddable units). This
is the central extensibility point of ir: a naive corpus uses
WholeText; a structured corpus (a package) decomposes into several
heterogeneous surfaces so a query can match the right part of an artifact
and constrain candidates by metadata before semantic ranking.
Shipped strategies:
- WholeText: one surface = the whole text. The out-of-the-box default.
- Chunked: split the text into overlapping chunks (one surface each).
- Skill: embed name + description only (the body stays on disk, per the capability-discovery research); name/parent become filter fields.
- Package: name + description plus README chunks as surfaces; name/owner/deps become filter fields (AI synopsis / problem-class surfaces are a documented extension).
Every strategy is a plain callable-ish object with a decompose method, so
custom strategies need only match the IndexingStrategy protocol.
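As a rough illustration of what matching the protocol could look like, the sketch below defines stand-in Surface and IndexPlan dataclasses and a hypothetical TitleAndBody strategy. The field names, the single-argument decompose signature, and the "author" filter field are assumptions made for this example; the real ir classes may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Surface:
    """Hypothetical stand-in for an embeddable unit."""
    kind: str
    text: str


@dataclass
class IndexPlan:
    """Hypothetical stand-in: filter fields plus a list of surfaces."""
    filter_fields: dict[str, Any] = field(default_factory=dict)
    surfaces: list[Surface] = field(default_factory=list)


class IndexingStrategy(Protocol):
    """Structural type: any object with a matching decompose method."""

    def decompose(self, raw: Any) -> IndexPlan: ...


class TitleAndBody:
    """Hypothetical custom strategy: one surface per non-empty field."""

    def decompose(self, raw: Any) -> IndexPlan:
        surfaces = [
            Surface(kind=kind, text=raw[kind])
            for kind in ("title", "body")
            if raw.get(kind)
        ]
        # "author" is filterable metadata here; it is not embedded.
        return IndexPlan(filter_fields={"author": raw.get("author")},
                         surfaces=surfaces)


def plan_for(strategy: IndexingStrategy, raw: Any) -> IndexPlan:
    # No inheritance needed: TitleAndBody satisfies the protocol structurally.
    return strategy.decompose(raw)


plan = plan_for(
    TitleAndBody(),
    {"title": "Graph search", "body": "Notes on BFS.", "author": "ana"},
)
print([s.kind for s in plan.surfaces], plan.filter_fields)
```

The point of the structural Protocol is that a custom strategy does not have to subclass anything; it only needs a decompose method of the right shape.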
- class ir.strategy.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')[source]
Split the artifact’s text into overlapping chunk surfaces.
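The overlap arithmetic can be sketched as a sliding window whose step is chunk_size - overlap. This is a minimal sketch, assuming chunk_size and overlap are measured in characters and that splitting stops once a window reaches the end of the text; the shipped Chunked may measure or cut differently. The helper name overlapping_chunks is hypothetical.

```python
def overlapping_chunks(text: str, chunk_size: int = 1200,
                       overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window covers the tail, so no chunk is a pure
        # suffix of the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = overlapping_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Each chunk would then become one surface, with neighbouring chunks repeating the overlap region so that a sentence straddling a boundary is still fully present in at least one chunk.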
- class ir.strategy.ClaudeTurn(*, include_full: bool = False)[source]
Index a Claude Code session turn-pair: user prompt + assistant summary.
Two surfaces by default: user_prompt (what the human asked) and assistant_summary (the assistant’s final end-of-turn text, the highest-signal “here’s what I did”; the deliberation before it is mostly noise), so a query can target either side via surfaces={"user_prompt"} / {"assistant_summary"}. With include_full=True a third assistant_full surface (all the turn’s assistant natural-language text) trades noise for recall; it is off by default. Session / project / time / model / tool-use become hard-filter fields.
The raw artifact is a turn-pair record (see priv.claude_transcripts.turn_pair_records()): a mapping with user_prompt / assistant_summary / assistant_full plus metadata.
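The effect of include_full on the emitted surfaces might look like the sketch below. The helper turn_surfaces, the (kind, text) tuple shape, the "session" metadata key, and the choice to skip empty text are all assumptions made for illustration, not the library's implementation.

```python
from typing import Any, Mapping


def turn_surfaces(record: Mapping[str, Any],
                  include_full: bool = False) -> list[tuple[str, str]]:
    """Return (kind, text) pairs for one turn-pair record."""
    kinds = ["user_prompt", "assistant_summary"]
    if include_full:
        # Third surface: more recall, more noise.
        kinds.append("assistant_full")
    # Assumption: a missing or empty field yields no surface.
    return [(kind, record[kind]) for kind in kinds if record.get(kind)]


record = {
    "user_prompt": "Why does the build fail?",
    "assistant_summary": "Pinned the compiler version; the build passes.",
    "assistant_full": "Reading the log. Pinned the compiler version; the build passes.",
    "session": "s-1",  # hypothetical metadata key, would become a filter field
}

default_kinds = [kind for kind, _ in turn_surfaces(record)]
full_kinds = [kind for kind, _ in turn_surfaces(record, include_full=True)]
print(default_kinds, full_kinds)
```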
- class ir.strategy.IndexingStrategy(*args, **kwargs)[source]
Decompose one artifact into filter fields + embeddable surfaces.
- class ir.strategy.Package(*, chunk_size: int = 1500, overlap: int = 200, embed_deps: bool = False, deps_template: Callable[[list[str]], str] | None = None)[source]
Package strategy: name + description surface plus README chunks.
Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.
With embed_deps=True, decompose additionally emits one Surface(kind="deps", granularity="field") whose text is a prefix-form serialization of the bare dependency names (deps_template, default _default_deps_text()), so a query for a domain matches a package by the libraries it depends on (e.g. sentence-transformers -> embeddings, networkx -> graphs), and the BM25 leg picks up exact dep-token matches. The deps bag is kept separate from prose (its own surface) so a rare library name is not diluted, and deps remain a filter field regardless. embed_deps defaults to False (today’s behavior); it folds into the strategy id, so toggling it re-decomposes incrementally. The deps surface is appended last, leaving the description (position 0) and readme_chunk indices unchanged.
Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == j; surface_index is plan-global, chunk_index is per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()). n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change), so read it with metadata.get("n_chunks").
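The relationship between plan position and chunk_index can be illustrated with a small model of the plan order described above. This is a sketch under assumptions: the helper plan_positions, the dict-based surfaces, the way name and description are joined, the "depends on:" default template, and skipping the deps surface when there are no deps are all hypothetical choices for this example.

```python
from typing import Callable, Optional


def plan_positions(
    name: str,
    description: str,
    readme_chunks: list[str],
    deps: list[str],
    embed_deps: bool = False,
    deps_template: Optional[Callable[[list[str]], str]] = None,
) -> list[dict]:
    """Model the surface order: description, readme chunks, then deps."""
    surfaces: list[dict] = []
    if name or description:
        # Kept whenever name or description is non-empty: plan position 0.
        surfaces.append({"kind": "description",
                         "text": f"{name} {description}".strip()})
    for j, chunk in enumerate(readme_chunks):
        surfaces.append({"kind": "readme_chunk", "text": chunk,
                         "chunk_index": j,  # per-kind index
                         "n_chunks": len(readme_chunks)})
    if embed_deps and deps:
        # Hypothetical default template; the real default is not shown here.
        template = deps_template or (lambda names: "depends on: " + " ".join(names))
        surfaces.append({"kind": "deps", "text": template(deps)})  # appended last
    for surface_index, surface in enumerate(surfaces):
        surface["surface_index"] = surface_index  # plan-global index
    return surfaces


plan = plan_positions("ir", "retrieval toolkit", ["chunk a", "chunk b"],
                      ["networkx"], embed_deps=True)
readme = [s for s in plan if s["kind"] == "readme_chunk"]
print([(s["chunk_index"], s["surface_index"]) for s in readme])
```

Because the deps surface goes last, turning embed_deps on or off in this model leaves every readme chunk at the same surface_index, which is why the two index spaces must not be conflated when looking up sibling records.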
- ir.strategy.STRATEGY_REGISTRY: dict[str, type] = {'Chunked': <class 'ir.strategy.Chunked'>, 'ClaudeTurn': <class 'ir.strategy.ClaudeTurn'>, 'Package': <class 'ir.strategy.Package'>, 'Skill': <class 'ir.strategy.Skill'>, 'WholeText': <class 'ir.strategy.WholeText'>}
Shipped strategies addressable by name, for persisting an IndexingStrategy in a registry entry and reconstructing it (#58). New shipped strategies register here so a corpus can name its segmentation.
- class ir.strategy.Skill[source]
Capability strategy: embed name + description only.
The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.
- class ir.strategy.WholeText(*, text_key: str | None = None, kind: str = 'document')[source]
One surface = the entire text. Sensible default for a naive corpus.
- ir.strategy.strategy_from_spec(spec: Mapping[str, Any] | None) -> IndexingStrategy | None[source]
Reconstruct a shipped strategy from a {"name", "params"} spec.
None (no persisted strategy) returns None so the caller falls back to the source preset’s default strategy, which is the back-compatible behavior for v1 registry entries.
- ir.strategy.strategy_to_spec(strategy: Any) -> dict[source]
A {"name", "params"} spec for a shipped, scalar-param strategy.
Captures only scalar constructor parameters (the same identity surface ir.index._strategy_id() stamps), so it round-trips the shipped strategies. A custom strategy, or one wrapping another (e.g. the ir.with_synopsis() wrapper), is not captured here; those are set programmatically at build time / by the maintenance layer, not persisted as a segmentation spec.
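The round trip between a strategy and its {"name", "params"} spec can be sketched with stand-in classes. Everything below is an illustrative model, not the ir implementation: the stand-in WholeText and Chunked classes only mirror the documented constructor signatures, REGISTRY, to_spec and from_spec are hypothetical names, and the sketch assumes each constructor parameter is stored on an attribute of the same name.

```python
import inspect
from typing import Any, Mapping, Optional


class WholeText:
    def __init__(self, *, text_key: Optional[str] = None, kind: str = "document"):
        self.text_key = text_key
        self.kind = kind


class Chunked:
    def __init__(self, *, chunk_size: int = 1200, overlap: int = 200,
                 text_key: Optional[str] = None, kind: str = "chunk"):
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.text_key = text_key
        self.kind = kind


# Stand-in for a name -> class registry of shipped strategies.
REGISTRY = {"WholeText": WholeText, "Chunked": Chunked}
SCALARS = (str, int, float, bool, type(None))


def to_spec(strategy: Any) -> dict:
    """Capture only scalar constructor parameters."""
    params = {}
    for pname in inspect.signature(type(strategy).__init__).parameters:
        if pname == "self":
            continue
        value = getattr(strategy, pname)
        if isinstance(value, SCALARS):  # callables and wrappers are skipped
            params[pname] = value
    return {"name": type(strategy).__name__, "params": params}


def from_spec(spec: Optional[Mapping[str, Any]]) -> Any:
    """Rebuild from a spec; None means 'no persisted strategy'."""
    if spec is None:
        return None  # caller falls back to its own default
    return REGISTRY[spec["name"]](**spec.get("params", {}))


spec = to_spec(Chunked(chunk_size=800))
rebuilt = from_spec(spec)
print(spec, type(rebuilt).__name__)
```

Restricting the spec to scalars keeps it trivially serializable, at the cost of not being able to persist strategies that carry callables or wrap other strategies.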
- ir.strategy.text_of(raw: Any, text_key: str | None = None) -> str[source]
Best-effort text extraction from a raw artifact payload.
The single source of truth (SSOT) for turning an opaque raw (a str, a Mapping with a text field or a text_key, or anything else) into embeddable text. It is reused by the shipped strategies and by ir.synopsis.make_llm_synthesizer() so an injected-free synopsis summarizes the same text a strategy would index.
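One plausible reading of the three input cases is sketched below. The function name text_of_sketch is hypothetical, and two behaviors are guesses rather than documented facts: that text_key takes precedence over the "text" field, and that anything else falls back to str(raw).

```python
from collections.abc import Mapping
from typing import Any, Optional


def text_of_sketch(raw: Any, text_key: Optional[str] = None) -> str:
    """Best-effort text extraction from an opaque payload."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, Mapping):
        # Assumption: an explicit text_key wins over the "text" field.
        if text_key is not None and text_key in raw:
            return str(raw[text_key])
        if "text" in raw:
            return str(raw["text"])
    # Assumption: unknown shapes are stringified rather than rejected.
    return str(raw)


print(text_of_sketch("plain string"))
print(text_of_sketch({"text": "from text field"}))
print(text_of_sketch({"body": "from key", "text": "ignored"}, text_key="body"))
```

Routing every strategy and the synopsis synthesizer through one such function is what keeps the indexed text and the summarized text identical.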