ir
ir — an information-retrieval substrate for agentic systems.
One uniform “find the relevant things in this corpus” contract that scales from an ad-hoc search over an ephemeral list to a maintained search engine. Retrieval is the core; selection/expansion/reranking/generation are layered on top.
Quick start:
import ir
# Define a corpus source (abstract strategy + parameters, smart defaults):
source = ir.CorpusSource.from_md_reports() # project docs/ reports
corpus = ir.build(source) # index (incremental)
hits = ir.search(corpus, "how do I deploy the app") # ranked SearchHits
# Light, dependency-free embedding for fast tests:
corpus = ir.build(source, embedder="light")
A corpus source is defined by a scope (what is in the corpus), a
change_signal (what counts as stale), an indexing_strategy (how a raw
item becomes filter fields + embeddable surfaces), and an embedder. The
default embedder is a decent local model (all-MiniLM-L6-v2); "light"
selects a numpy-only hashing embedder. Data persists under XDG dirs through a
dol repository layer.
- class ir.Artifact(id: str, raw: Any, metadata: dict = <factory>)[source]
A logical corpus item before decomposition into surfaces.
- class ir.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')[source]
Split the artifact’s text into overlapping chunk surfaces.
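To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of fixed-width overlapping chunking. It assumes character-based windows; `chunk_text` is a hypothetical helper, not the library's implementation, and `Chunked` may well split differently (for example on sentence or token boundaries).

```python
def chunk_text(text, chunk_size=1200, overlap=200):
    """Split text into windows of chunk_size chars sharing `overlap` chars.

    Hypothetical helper: character-based windows are an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


small = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Each window starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share exactly `overlap` characters.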
- class ir.Corpus(name: str, store: CorpusStore, embedder: Callable, embedder_id: str)[source]
A built, queryable corpus: a store plus its embedder.
- search(query, **kwargs)[source]
Search this corpus for query.
**kwargs (k / mode / filter / surfaces / per_artifact / …) are forwarded to ir.retrieve.search().
- class ir.CorpusGraph(store_or_corpus: Any)[source]
A GraphStore over one corpus: artifact nodes, links edges.
node_id is an artifact_id. graph[aid] is the artifact’s stored records (its scorable surfaces, in plan order); graph.neighbors(aid, edge_type=...) reads the corpus store’s links view. Single-corpus, so it resolves intra-corpus targets; cross-corpus [source, artifact_id] targets are returned by neighbors() verbatim, but __getitem__ only dereferences ids in this corpus (federated traversal is a follow-up).
- neighbors(node_id: str, *, edge_type: str | None = None) list[source]
Outgoing neighbor ids of node_id, optionally of one edge_type.
Returns target ids in stored form (a bare artifact_id, whose source is this graph’s source; or a [source, artifact_id] list for a cross-corpus edge), de-duplicated with first-seen order preserved. An artifact with no edges, or a store without a links view, yields []. Pass each target to canonical_node_id() to get the canonical (source, artifact_id) key for a traversal’s visited-set.
node_id is an intra-corpus artifact_id (a str); a cross-corpus target fetched from another graph is out of contract here (it has no edges in this corpus).
- source
The corpus name, when known: the source half of this graph’s node identities (None for a bare store).
- class ir.CorpusSource(name: str, scope: ~collections.abc.Mapping[str, ~typing.Any], indexing_strategy: ~ir.strategy.IndexingStrategy = <factory>, change_signal: ~collections.abc.Callable[[str, ~typing.Any], str] = <function content_hash_signal>, embedder: ~typing.Any = 'default', metadata_of: ~collections.abc.Callable[[str, ~typing.Any], ~collections.abc.Mapping[str, ~typing.Any]] | None = None)[source]
A corpus definition: scope + change signal + strategy + embedder.
- change_signal(raw: Any) str
Default change signal: a content hash of the raw payload.
- classmethod from_files(root: str | Path, *, name: str | None = None, pattern: str = '.*\\.md$', exclude: Callable[[str], bool] | None = None, strategy: IndexingStrategy | None = None, **kwargs) CorpusSource[source]
A directory tree of text files as a corpus (lazy dol scope).
- classmethod from_mapping(mapping: Mapping[str, Any], *, name: str, strategy: IndexingStrategy | None = None, **kwargs) CorpusSource[source]
Any mapping {id -> raw} (dict, dol store) as a corpus.
- classmethod from_md_reports(*, name: str = 'reports', projects_root: str | Path | None = None, strategy: IndexingStrategy | None = None, **kwargs) CorpusSource[source]
Markdown reports under projects’ docs/ and misc/docs/.
Excludes ALL-CAPS filenames (README/CLAUDE/MEMORY/SKILL…). Each record is a project-tagged document; ids are paths relative to the projects root.
- classmethod from_packages(*, name: str = 'packages', manifest: str | Path | None = None, readme_chars: int = 20000, strategy: IndexingStrategy | None = None, **kwargs) CorpusSource[source]
The local package ecosystem, scanned from the .pth manifest.
- classmethod from_skills(*, name: str = 'skills', filter: Any = None, fetcher: Callable[[], list] | None = None, strategy: IndexingStrategy | None = None, **kwargs) CorpusSource[source]
The agent-skills corpus, via priv.skills_index.
fetcher overrides the source of skill records (each a mapping with name / description / parent); inject a test double to avoid the priv dependency.
- class ir.CorpusStore(meta: MutableMapping[str, Any], vectors: MutableMapping[str, ndarray], ledger: MutableMapping[str, Any], config: MutableMapping[str, Any], calibration: MutableMapping[str, Any] | None = None, links: MutableMapping[str, Any] | None = None)[source]
Repository bundling the meta/vectors/ledger/config views of one corpus.
- delete_links(artifact_id: str) None[source]
Remove an artifact’s edges; a missing entry is tolerated.
- delete_record(record_id: str) None[source]
Remove a record’s metadata + vector; a missing id is tolerated.
- get_calibration(mode: str) dict | None[source]
The stored calibration record for ranking mode (None if absent).
A deep copy, so a caller cannot mutate the nested grid back into the stored record (in-memory stores share their objects by reference).
- get_links(artifact_id: str) dict[source]
The outgoing edges of artifact_id: {edge_type: [target, ...]}.
Empty dict when the artifact has no stored edges (or no links view). A copy, so a caller cannot mutate the persisted adjacency in place.
- get_record(record_id: str) Record[source]
Reassemble the Record for record_id (KeyError if absent).
- ledger_items() Iterator[tuple[str, dict]][source]
Iterate (key, entry) ledger pairs (the ledger may be mutated while iterating).
- link_items() Iterator[tuple[str, dict]][source]
Iterate (artifact_id, {edge_type: [target]}) adjacency pairs.
- classmethod local(name: str) CorpusStore[source]
File-backed store under ~/.local/share/ir/corpora/<name>.
- matrix() tuple[list[str], ndarray, list[dict]][source]
Return (record_ids, normalized_matrix, metas) for brute force.
Rows are L2-normalized so cosine similarity is a dot product. Empty corpora return a (0, 0) matrix. Cached until the next write.
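The claim that L2-normalized rows turn cosine similarity into a dot product can be checked in a few lines. The sketch below uses plain Python lists instead of the store's ndarray to stay dependency-free; the record ids and vectors are made-up illustration data, and none of the helper names belong to ir.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity on un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustration data only: two stored rows and one query vector.
rows = {"doc-a": [3.0, 4.0], "doc-b": [1.0, 0.0]}
query = [6.0, 8.0]

# Normalize once at write time, then score every row with a plain dot product.
unit_rows = {rid: l2_normalize(vec) for rid, vec in rows.items()}
unit_query = l2_normalize(query)
scores = {rid: dot(vec, unit_query) for rid, vec in unit_rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Normalizing at write time moves the square roots out of the query path, which is what makes a cached brute-force matrix cheap to score.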
- classmethod memory() CorpusStore[source]
In-memory store (no dependencies); ideal for tests.
- put_record(record: Record) None[source]
Persist record’s metadata + vector, invalidating the search matrix.
- set_calibration(mode: str, record: Mapping[str, Any]) None[source]
Persist a calibration record for ranking mode (one per mode).
mode keys a file in the calibration store, so it must be a non-empty string with no path separator (the real modes, dense / lexical / hybrid, already satisfy this).
- set_config(settings: Mapping[str, Any]) None[source]
Persist the corpus build settings (name / embedder spec + id).
- set_ledger_entry(key: str, entry: Mapping[str, Any]) None[source]
Write the ledger entry (version / embedder id / record ids) for key.
- set_links(artifact_id: str, edges: Mapping[str, Any]) None[source]
Persist artifact_id’s outgoing edges ({edge_type: [target]}).
Empty edge-type lists are dropped; an empty result deletes the entry (no empty adjacency rows linger). Targets are stored verbatim: a bare artifact_id or a [source, artifact_id] pair.
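The links semantics described above (empty edge-type lists dropped, an empty result deleting the entry, reads returning copies) can be sketched with a toy in-memory view. `LinksView` is a hypothetical stand-in written for illustration; it is not ir's CorpusStore and only mirrors the documented behaviour.

```python
import copy


class LinksView:
    """Toy adjacency store mirroring the documented links contract."""

    def __init__(self):
        self._links = {}

    def set_links(self, artifact_id, edges):
        # Drop edge types whose target list is empty.
        cleaned = {etype: list(targets) for etype, targets in edges.items() if targets}
        if cleaned:
            self._links[artifact_id] = cleaned
        else:
            # Nothing left: delete the entry so no empty row lingers.
            self._links.pop(artifact_id, None)

    def get_links(self, artifact_id):
        # A deep copy, so callers cannot mutate the stored adjacency.
        return copy.deepcopy(self._links.get(artifact_id, {}))

    def delete_links(self, artifact_id):
        # A missing entry is tolerated.
        self._links.pop(artifact_id, None)


view = LinksView()
view.set_links("pkg", {"REF": ["dol", ["skills", "deploy"]], "PARENT": []})
first_read = view.get_links("pkg")
first_read["REF"].append("tampered")   # mutates only the copy
second_read = view.get_links("pkg")
view.set_links("pkg", {"REF": []})     # empty result removes the entry
after_clear = view.get_links("pkg")
```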
- class ir.Disclosure(artifact_id: str, level: str, name: str, score: float, summary: str, body: str | None = None, pointer: str | None = None, metadata: Mapping[str, ~typing.Any]=<factory>, source: str | None = None, passage: str | None = None)[source]
The progressively-disclosed payload for one selected artifact.
- artifact_id
the artifact this payload belongs to.
- Type:
str
- level
how much was loaded: "metadata" (no I/O), "body" (the pointer’s full text), or "bundled" (body + extras).
- Type:
str
- name
a display name (the name filter field, else the id).
- Type:
str
- score
the selecting hit’s score.
- Type:
float
- summary
the matched surface text; always present, always cheap.
- Type:
str
- body
the full payload (SKILL.md / file text); None below "body" level or when the pointer could not be read.
- Type:
str | None
- pointer
the source pointer (skill_path / path), the “package pointer” an agent follows to act; None if the hit has none.
- Type:
str | None
- metadata
the hit’s filter metadata, plus a disclosure note when a pointer was present but unreadable (stale/moved/deleted), and an expansion note when expansion was requested but not possible for this hit.
- Type:
collections.abc.Mapping[str, Any]
- source
the corpus/source name the selecting hit came from (None when unattributed); the attribution a federated caller needs to tell two same-id artifacts from different corpora apart.
- Type:
str | None
- passage
the expanded neighborhood text (disclose(..., expand=...)): the mid-granularity payload between summary (the matched surface) and body (the pointer’s full text); None when expansion was not requested or not possible. Assembled from the corpus’s stored records (see ir.expand), unlike body, which dereferences the pointer to an external resource.
- Type:
str | None
- class ir.DiscoveryResult(query: str, mode: str, strategy: str, disclose_level: str, results: list[~ir.select.Disclosure], abstained: bool, reason: str, n_retrieved: int, signals: ~collections.abc.Mapping[str, ~typing.Any] = <factory>)[source]
The result of discover(): retrieve → select → (optional) disclose.
The qh-exposable payload: to_dict() is fully JSON-serializable (lists of dicts, floats, strings, bools; no numpy, no objects), so a FastAPI facade can return it directly.
- property ids: list[str]
The committed artifact ids, best-first.
- class ir.GraphStore(*args, **kwargs)[source]
Structural contract a traversal operator binds to — node + neighbors.
Deliberately minimal (two methods) so it is satisfied by ir’s CorpusGraph and by any external graph: __getitem__ resolves a node id to its scorable payload, neighbors lists adjacent node ids (optionally of one edge type). Granularity-agnostic on purpose: an artifact graph and a surface-level tree are both GraphStores.
runtime_checkable makes isinstance(x, GraphStore) a structural check on attribute names only (not signatures): enough to tell a conforming adapter from an arbitrary object, but it cannot validate that neighbors takes the right arguments; treat it as a smoke check.
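The "names only, not signatures" caveat is standard typing.Protocol behaviour and is easy to demonstrate. In the sketch below, `GraphStoreLike` is a hypothetical protocol modelled on the two methods described above (it is not imported from ir), and `DictGraph` / `WrongSignature` are illustration classes.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class GraphStoreLike(Protocol):
    """Hypothetical two-method contract: node lookup + neighbors."""

    def __getitem__(self, node_id: str) -> Any: ...

    def neighbors(self, node_id: str, *, edge_type: Optional[str] = None) -> list: ...


class DictGraph:
    """A conforming adapter over plain dicts."""

    def __init__(self, payloads, edges):
        self._payloads = payloads
        self._edges = edges  # {node_id: {edge_type: [target, ...]}}

    def __getitem__(self, node_id):
        return self._payloads[node_id]

    def neighbors(self, node_id, *, edge_type=None):
        by_type = self._edges.get(node_id, {})
        types = [edge_type] if edge_type is not None else list(by_type)
        return [target for etype in types for target in by_type.get(etype, [])]


class WrongSignature:
    """Has the right attribute names but an unusable neighbors()."""

    def __getitem__(self, node_id):
        return None

    def neighbors(self):  # missing node_id: isinstance cannot see this
        return []


graph = DictGraph({"a": "payload-a"}, {"a": {"REF": ["b"], "PARENT": ["root"]}})
checks = (
    isinstance(graph, GraphStoreLike),
    isinstance(WrongSignature(), GraphStoreLike),
    isinstance(object(), GraphStoreLike),
)
```

The second check passing is exactly why the isinstance test is only a smoke check: the mismatch surfaces as a TypeError when neighbors is actually called.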
- class ir.IndexPlan(filter_fields: dict = <factory>, surfaces: list[Surface] = <factory>)[source]
An IndexingStrategy’s output for one artifact.
- class ir.IndexingStrategy(*args, **kwargs)[source]
Decompose one artifact into filter fields + embeddable surfaces.
- class ir.Package(*, chunk_size: int = 1500, overlap: int = 200)[source]
Package strategy: name + description surface plus README chunks.
Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.
Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == j; surface_index is plan-global, chunk_index per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()).
n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change); read it with metadata.get("n_chunks").
- class ir.Passage(artifact_id: str, surface_kind: str, score: float, text: str, record_ids: tuple[str, ...] = (), source: str | None = None, surface_index: int | None = None)[source]
An expanded hit: the seed’s identity + the stitched neighborhood text.
artifact_id / surface_kind / score / source / surface_index are the seed hit’s; expansion never disturbs hit identity (source, artifact_id) or scores. text is the assembled neighborhood (overlap-deduped, plan order) and record_ids the ordered stored segments it was stitched from (empty when expansion degraded to the seed’s own text).
- class ir.Record(id: str, artifact_id: str, surface_kind: str, surface_index: int, text: str, vector: ndarray, metadata: dict = <factory>)[source]
A stored, embedded surface — one row of the index.
- static make_id(artifact_id: str, surface_kind: str, surface_index: int) str[source]
Deterministic storage id for a surface of an artifact.
surface_index is the surface’s plan-global position (its enumeration index across all surfaces of the artifact’s IndexPlan, regardless of kind) as assigned by ir.index.build(). On multi-kind strategies it therefore differs from per-kind counters like metadata["chunk_index"] (e.g. Package: the description surface takes position 0, shifting readme_chunk j to surface_index j+1; the offset is plan-dependent, since empty surfaces are dropped).
Ids of already-built corpora are a stability contract: never re-derive a sibling’s id from a per-kind index; address siblings through the ledger via ir.retrieve.records_for_artifact().
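The difference between the plan-global surface_index and a per-kind chunk_index is easiest to see by numbering a small plan both ways. `plan_positions` below is a hypothetical helper that only mimics the numbering described above (empty surfaces dropped first, then enumerated); it does not reproduce ir's id format or its build code.

```python
def plan_positions(surfaces):
    """Return (surface_index, kind, per_kind_index) for each kept surface.

    surfaces: list of (kind, text) in plan order. Dropping empty texts
    before numbering is an assumption based on the documented behaviour.
    """
    kept = [(kind, text) for kind, text in surfaces if text]
    per_kind = {}
    rows = []
    for surface_index, (kind, _text) in enumerate(kept):
        kind_index = per_kind.get(kind, 0)  # counter restarts for each kind
        per_kind[kind] = kind_index + 1
        rows.append((surface_index, kind, kind_index))
    return rows


# A Package-like plan with a description: readme chunk j sits at position j + 1.
with_description = plan_positions([
    ("description", "name + description text"),
    ("readme_chunk", "part 0"),
    ("readme_chunk", "part 1"),
])

# Same plan with an empty description: the offset disappears.
without_description = plan_positions([
    ("description", ""),
    ("readme_chunk", "part 0"),
    ("readme_chunk", "part 1"),
])
```

Because the offset depends on which surfaces survive, a sibling's position cannot be computed from its per-kind index alone, which is why the text directs callers to the ledger.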
- class ir.SearchHit(artifact_id: str, surface_kind: str, score: float, text: str, metadata: Mapping[str, ~typing.Any]=<factory>, source: str | None = None, surface_index: int | None = None)[source]
A scored record returned by retrieval (higher score = closer).
Maps onto ir_09’s Result: text is the snippet, score the rank score, metadata the meta, and pointer the key into a resource store (ir_09 §5). to_dict() is the serialization-clean form for a cross-process / subagent boundary (no numpy scalars leak).
source is the corpus/source name the hit came from (None when unattributed, e.g. an ad-hoc corpus without a name). It is a first-class field, not a metadata key, because metadata is the strategy-owned hard-filter namespace and provenance is structural: artifact identity is only unique within a source, so any cross-source operation keys on (source, artifact_id) (see best_per_artifact()).
surface_index is the stored Record.surface_index of the hit’s surface (its plan-global position among the artifact’s surfaces), so a hit can name which surface of its artifact it is (the prerequisite for sibling addressing and context expansion). None when unknown (e.g. a hand-built hit). It is not the per-kind metadata["chunk_index"]; see Record.make_id() for why the two differ on multi-kind strategies.
- property pointer: str | None
The disclosure pointer on this hit, if any (see POINTER_KEYS).
- class ir.Selection(selected: list[~ir.base.SearchHit], candidates: list[~ir.base.SearchHit], abstained: bool, reason: str, signals: ~collections.abc.Mapping[str, ~typing.Any] = <factory>)[source]
A selector’s commitment: the chosen subset of a ranked candidate list.
- selected
the committed hits, best-first (empty iff abstained).
- Type:
list[ir.base.SearchHit]
- candidates
the full ranked input, kept for provenance / audit.
- Type:
list[ir.base.SearchHit]
- abstained
True iff the selector committed to nothing by policy.
- Type:
bool
- reason
which rule ended the commit (e.g. "rel_threshold", "score_gap", "max_k", "abstain:below_floor").
- Type:
str
- signals
concrete, defined numbers behind the decision (top_score, n_candidates, n_selected, min_ratio): the auditable replacement for an opaque “confidence” float.
- Type:
collections.abc.Mapping[str, Any]
- property selected_ids: list[str]
The committed artifact ids, best-first.
- property sufficient: bool
A model-free sufficiency hint for an agent’s Evaluator (ir_09 §3).
True when this selection committed to at least one item (i.e. did not abstain). It is a signal, not a directive: the re-query / refinement decision and the loop belong to the agent layer (the back-edge, ir_09 §4); ir derives this from its own outcome and never acts on it.
- class ir.Skill[source]
Capability strategy: embed name + description only.
The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.
- class ir.Surface(artifact_id: str, kind: str, text: str, granularity: str = 'document', metadata: Mapping[str, ~typing.Any]=<factory>)[source]
One embeddable unit derived from an artifact.
kind names the surface type (e.g. "description", "synopsis", "problem_class", "chunk") so a query can match the right part of an artifact. granularity is a coarse hint ("document" / "chunk" / "field"). metadata is surface-local (e.g. chunk offsets).
- class ir.WalkPolicy(*args, **kwargs)[source]
The pluggable strategy of a walk — graph semantics, not safety.
seed produces the initial frontier; score ranks a node against the query; select chooses which scored frontier nodes to commit/expand this step (beam/greedy; default: all, best-first); expand yields a node’s neighbors; node_id is the hashable visited-set key; stop is the injected sufficiency check; to_hit materializes a committed node as a SearchHit, or None for a router-only node (a summary that routes but is not itself a result).
- class ir.WalkState(query: str, max_depth: int, budget: int, visited: set = <factory>, results: list = <factory>, cache: dict = <factory>)[source]
The operator-owned state of one traverse() call: the safety home.
visited (node ids already committed), budget, and max_depth are the structural safety primitives the operator enforces; results are the emitted hits; cache is scratch space a policy may use (e.g. to embed the query once). A policy reads this but the operator enforces the bounds: a policy cannot opt out of termination.
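The split between "policy supplies graph semantics" and "operator enforces termination" can be sketched as a bounded breadth-first walk. This is an illustration under stated assumptions, not ir's traverse(): `bounded_walk` is a hypothetical function, `expand` stands in for the policy, and budget is read here as a cap on committed nodes (the library may count something else).

```python
from collections import deque


def bounded_walk(seeds, expand, *, max_depth=2, budget=10):
    """Walk a graph breadth-first; the bounds live here, not in `expand`."""
    visited = set()   # node ids already committed
    results = []      # emitted nodes, in commit order
    frontier = deque((node, 0) for node in seeds)
    while frontier and len(results) < budget:   # budget bound
        node, depth = frontier.popleft()
        if node in visited:                     # cycle / duplicate guard
            continue
        visited.add(node)
        results.append(node)
        if depth < max_depth:                   # depth bound
            frontier.extend((nxt, depth + 1) for nxt in expand(node))
    return results


# A cyclic toy graph: even a policy that always yields neighbors terminates.
edges = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"]}


def neighbors(node):
    return edges.get(node, [])


full = bounded_walk(["a"], neighbors, max_depth=5, budget=10)
shallow = bounded_walk(["a"], neighbors, max_depth=1, budget=10)
capped = bounded_walk(["a"], neighbors, max_depth=5, budget=2)
```

Because the visited-set, depth, and budget checks sit in the loop that calls `expand`, no choice of `expand` can make the walk run forever.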
- class ir.WholeText(*, text_key: str | None = None, kind: str = 'document')[source]
One surface = the entire text. Sensible default for a naive corpus.
- ir.as_retriever(corpus_or_name, **search_defaults) Callable[[...], list[SearchHit]][source]
Bind ONE corpus to the uniform Retriever contract.
Returns retrieve(query, **overrides) -> list[SearchHit] that calls search() with search_defaults (a per-call kwarg overrides a bound default). A corpus name is resolved once via ir.open_corpus(); pass an open Corpus to skip that. The returned callable carries the bound corpus on .corpus for introspection.
>>> retr = as_retriever(corpus, mode="hybrid", k=20)
>>> hits = retr("how do I deploy the app")
>>> hits = retr("deploy", filter={"owner": "me"})
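The binding pattern described above (defaults captured once, per-call keywords winning) is a small closure. The sketch below is not as_retriever itself: `bind_retriever` and `fake_search` are hypothetical names, and the fake search simply echoes its arguments so the merge order is visible.

```python
def bind_retriever(search_fn, corpus, **search_defaults):
    """Return retrieve(query, **overrides) bound to one corpus."""

    def retrieve(query, **overrides):
        # Later keys win, so a per-call kwarg overrides a bound default.
        kwargs = {**search_defaults, **overrides}
        return search_fn(corpus, query, **kwargs)

    retrieve.corpus = corpus  # kept on the callable for introspection
    return retrieve


def fake_search(corpus, query, **kwargs):
    """Stand-in for a real search: echoes what it was called with."""
    return {"corpus": corpus, "query": query, **kwargs}


retr = bind_retriever(fake_search, "reports", mode="hybrid", k=20)
default_call = retr("how do I deploy the app")
override_call = retr("deploy", k=5, filter={"owner": "me"})
```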
- ir.build(source: CorpusSource, *, store: CorpusStore | None = None, embedder: Any = None, full: bool = True, batch_size: int = 256, edge_extractor: Callable | None = None) Corpus[source]
Build or incrementally update source into a Corpus.
- Parameters:
store – the persistence backend (default: file-backed under XDG data dir).
embedder – override the source’s embedder spec.
full – when True (default), prune artifacts no longer in the source.
batch_size – embedding batch size.
edge_extractor – an optional EdgeExtractor ((artifact_id, filter_fields) -> {edge_type: [target]}) that populates the corpus’s semantic links graph (see ir.graph; pass ir.default_edge_extractor() for the latent deps/parent edges). Ingest is eager: edges are (re)written for every in-scope artifact, a decompose-only pass with no embedding, so the graph never goes partially stale, while embedding stays fully incremental. Edges are derived state, not part of build identity. A rebuild without an extractor leaves existing edges untouched (they are only refreshed by re-running with one, and only cleared per artifact by the full prune), so dropping edge_extractor does not wipe a graph.
- ir.build_corpus(name, **kwargs)[source]
Build (or update) a registered/preset corpus by name; returns a Corpus.
**kwargs are forwarded to ir.build(), notably store, embedder (e.g. "light" for the numpy-only hashing embedder), full (prune artifacts no longer in the source), and batch_size.
- ir.canonical_node_id(target: Any, *, source: str | None) tuple[str | None, str][source]
Canonicalize a neighbor target to a (source, artifact_id) node id.
The repo’s node identity is (source, artifact_id), the key a traversal visited-set must use so the same id in two corpora stays two nodes. CorpusGraph.neighbors() returns targets in stored form: a bare artifact_id (implicitly in source, the graph it came from) or a [source, artifact_id] cross-corpus pair. This resolves either to the canonical tuple.
>>> canonical_node_id("dol", source="packages")
('packages', 'dol')
>>> canonical_node_id(["skills", "deploy"], source="packages")
('skills', 'deploy')
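A re-implementation consistent with the two doctest results above shows why the canonical tuple matters for a visited-set. `canonical_node_id_sketch` is a hypothetical stand-in written from the documented behaviour only; the real function may validate its input more strictly.

```python
def canonical_node_id_sketch(target, *, source):
    """Resolve a stored neighbor target to a (source, artifact_id) tuple."""
    if isinstance(target, str):
        # Bare artifact_id: implicitly in the graph it came from.
        return (source, target)
    # Cross-corpus pair, stored as [source, artifact_id].
    other_source, artifact_id = target
    return (other_source, artifact_id)


# Neighbor targets as a graph over the "packages" corpus might return them.
targets = ["dol", ["skills", "deploy"], "deploy"]
visited = {canonical_node_id_sketch(t, source="packages") for t in targets}
```

The bare "deploy" and the cross-corpus ["skills", "deploy"] share an artifact id but land on different keys, so a traversal treats them as two nodes.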
- ir.collapsed_tree_policy(*, summary_kinds: Iterable[str] = ('description', 'synopsis', 'capability', 'document'), leaf_kinds: Iterable[str] = ('chunk', 'readme_chunk'), seed_k: int = 10) WalkPolicy[source]
The pure-vector summary-routing / collapsed-tree WalkPolicy.
Seeds on the top seed_k matches among summary_kinds surfaces and descends to each routed artifact’s leaf_kinds surfaces (the emitted results), scored by cosine to the query. No LLM in the loop. A summary surface is a router (suppressed from results) only when its artifact has leaf surfaces; on a single-surface corpus (WholeText document, Skill capability) the summaries are leaf-less and emitted directly, so the walk degrades to flat-over-summaries instead of returning nothing.
The defaults keep document / capability in summary_kinds on purpose: that is what lets a WholeText / Skill corpus seed at all; the structural router check (above) is what keeps those seeds from being silently swallowed.
>>> hits = traverse(q, corpus, policy=collapsed_tree_policy())
- ir.corpora() dict[str, Any]
All registered corpus definitions, keyed by name.
- ir.default_edge_extractor(artifact_id: str, filter_fields: Mapping[str, Any]) dict[str, list][source]
Edges latent in the standard filter fields: deps → REF, parent → PARENT.
deps (Package) → REF edges to each dependency’s bare name (version specifiers / extras / markers stripped). Self-edges and blanks are dropped.
parent (Skill) → a single PARENT edge.
A package whose deps name other packages in the same corpus gets intra-corpus REF edges; third-party deps become REF edges to ids not in the corpus (harmless: CorpusGraph.neighbors() lists them, and a traversal simply finds no node to expand).
Self-edges are dropped case-insensitively (_dep_name lower-cases, so a package "AA" depending on "aa" is recognized as a self-reference).
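The dependency-to-edge rules above can be approximated with a small regular expression. This is a sketch under assumptions: `dep_name` and `edge_extractor_sketch` are hypothetical, the regex is a simplification of real requirement parsing, and omitting empty edge types from the result is a guess about the real extractor's output shape.

```python
import re

# Leading distribution name; stops at the first specifier/extra/marker char.
_NAME = re.compile(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def dep_name(spec):
    """Bare, lower-cased name of a requirement string ('' if none)."""
    match = _NAME.match(spec or "")
    return match.group(1).lower() if match else ""


def edge_extractor_sketch(artifact_id, filter_fields):
    """(artifact_id, filter_fields) -> {edge_type: [target, ...]}."""
    edges = {}
    own = artifact_id.lower()
    refs = [
        name
        for name in map(dep_name, filter_fields.get("deps") or [])
        if name and name != own  # drop blanks and self-edges
    ]
    if refs:
        edges["REF"] = refs
    parent = filter_fields.get("parent")
    if parent:
        edges["PARENT"] = [parent]
    return edges


package_edges = edge_extractor_sketch(
    "AA",
    {"deps": ["aa>=1.0", "NumPy[extra]>=1.2; python_version>'3.8'", "", "dol"]},
)
skill_edges = edge_extractor_sketch("deploy-aws", {"parent": "deploy"})
```

For production parsing of requirement strings, a dedicated parser would be more robust than this regex.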
- ir.disclose(selection: Selection, *, level: str = 'body', loader: Callable[[Mapping[str, Any]], str | None] | None = None, store: Mapping[str, Any] | None = None, expand: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None, corpus: Any = None) list[Disclosure][source]
Reveal the payload of each selected hit at level (append-only, pure).
- Parameters:
selection – a committed Selection.
level – "metadata" (no I/O; summary + pointer only), "body" (load the pointer’s full text), or "bundled" (body + extras; today the same as "body", reserved for bundled scripts/references).
loader – override the body resolver, metadata -> str | None. The default reads the skill_path / path pointer from disk and tolerates a missing target (returns None, never raises).
store – a ResourceStore (pointer -> payload Mapping) to dereference instead of disk: ir_09 §5 pointer-passing over a dol store / URL map / blob storage. Mutually exclusive with loader.
expand – a NeighborhoodPolicy to also stitch each hit’s neighborhood from the corpus’s stored records into Disclosure.passage (see ir.expand). Orthogonal to level, which governs pointer payloads: e.g. level="metadata", expand=sentence_window_policy() reads no pointer at all but still returns mid-granularity passages. Requires corpus=.
corpus – where expand finds each hit’s stored siblings: a Corpus / CorpusStore / name, or, for cross-source selections, a {source_name: corpus} Mapping resolved per hit via hit.source. Only meaningful with expand=.
- Returns:
one Disclosure per selected hit, best-first. This is a pure read: the Selection and its hits are never mutated, so a caller can disclose append-only without disturbing a cached ranked prefix.
- ir.discover(corpus: Any, query: str, *, k: int = 10, mode: str = 'hybrid', strategy: str | Callable[[Sequence[SearchHit]], list[SearchHit]] = 'conservative', disclose_level: str = 'metadata', filter: Mapping[str, Any] | None = None, surfaces: Iterable[str] | None = None, max_k: int = 3, rel: float = 0.9, gap_ratio: float = 0.5, min_score: float | str | Mapping[str, float | str | None] | None = None, merge: str | Callable = 'rrf', merge_weights: Mapping[str, float] | None = None, merge_rrf_k: int | None = None, loader: Callable[[Mapping[str, Any]], str | None] | None = None, store: Mapping[str, Any] | None = None, expand: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None, **search_kw: Any) DiscoveryResult[source]
Find and commit to the capabilities for query: the one search tool.
Retrieves k candidates, commits to a distractor-robust subset, and (optionally) discloses each committed item’s payload. This is the single agent-callable surface the capability-discovery research argues for: one tool that returns few, high-precision answers rather than a long candidate list the model must then filter under context rot.
- Parameters:
corpus – a built Corpus, or a registered corpus name (resolved with ir.open_corpus()). Pass a name for the qh / HTTP surface; it is the JSON-friendly form. Pass a list/tuple of names (or Corpus objects) for single-shot federated discovery across several corpora: each is searched, per-source abstention floors gate before any merging, and the survivors are rank-fused (see merge). The caller names the sources explicitly; ir never chooses the set (source planning is the agent layer’s job, ir_09 §3).
query – the user intent.
k – candidate depth retrieved before selection. Federated: k candidates are retrieved per source, and the fused ranking is also truncated to k before selection.
mode – ranking mode: "hybrid" (default; ir’s strongest overall), "dense", or "lexical".
strategy – selection strategy (see select()).
disclose_level – "metadata" (default; cheap, no body I/O), "body", or "bundled".
filter, surfaces – retrieval constraints (forwarded to ir.retrieve.search()).
max_k, rel, gap_ratio, min_score – selection parameters (see select()). min_score="auto" loads the floor calibrated for this (corpus, mode) by ir.eval.calibrate_min_score() and persisted on the corpus; it is the opt-in that turns on absolute abstention, and it falls back to no floor (with a warning) when no calibration is stored or it is stale (a different embedder). Federated: floors are per-(corpus, mode, embedder), so a single number cannot apply across corpora; pass "auto" (each source’s own calibrated floor), a {name: floor_or_"auto"} mapping, or None; a bare float raises. Floors gate each source on its own raw scores before fusion; the fused ranking is never floored (rank-fused scores are ordinal; see ir_07/ir_08).
merge – federated only. How the per-source rankings combine: "rrf" (default; rank-based, scale-free; see ir.retrieve.fuse_hits()), "score" (raw-score merge, valid only when all corpora share an embedder; verified, raises on mismatch), or a callable {name: hits} -> hits.
merge_weights – federated only. Per-source trust weights for merge="rrf" (default 1.0 each).
merge_rrf_k – federated only. The cross-source RRF rank constant (default: DFLT_RRF_K; distinct from the within-corpus hybrid rrf_k in search_kw).
loader – optional body resolver for disclosure (see disclose()).
expand – a NeighborhoodPolicy; also stitches each committed hit’s neighborhood from its corpus’s stored records into Disclosure.passage (retrieval-time context expansion, see ir.expand). Works at any disclose_level; the federated form resolves each hit’s corpus via its source.
**search_kw – any other ir.retrieve.search() keyword (rrf_k, rerank, bm25, …).
- Returns:
a DiscoveryResult (.to_dict() for JSON / qh). Federated results add signals["per_source"] (per-corpus n_retrieved / top_score / floor / abstained) and each disclosure carries its source.
- ir.expand(hit: SearchHit, corpus: Any, *, policy: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None) Passage[source]
Expand hit into a Passage of its neighborhood in corpus.
Fetches the hit’s sibling records through the ledger (ir.retrieve.records_for_artifact()), asks policy which to keep, and stitches them in plan order with overlap-aware dedupe. The default policy is a ±DFLT_WINDOW sentence window; pass parent_policy() for the whole artifact, or any NeighborhoodPolicy.
- Parameters:
hit – the seed SearchHit (its identity and score pass through to the Passage untouched).
corpus – a Corpus, CorpusStore, or corpus name; whatever records_for_artifact() accepts. Must be the corpus the hit came from.
policy – which siblings make up the neighborhood (default: sentence window). A policy that selects nothing degrades the passage to the hit’s own text (record_ids=()) rather than returning nothing.
- Raises:
KeyError – the corpus has no ledger entry for the hit’s artifact (ir.retrieve.NoLedgerEntry), or the ledger is stale (an entry listing records missing from the store).
SeedNotFound – the default window policy cannot find the seed among its artifact’s stored records (stale hit / wrong corpus).
ValueError – the policy returned records that are not siblings of the hit’s artifact (operator-enforced safety), or the seed hit lacks surface_index (hand-built hit) under the default window policy; use parent_policy(), which needs no seed position.
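The "select siblings, then stitch with overlap-aware dedupe" flow can be sketched on plain strings. Everything here is an assumption made for illustration: `stitch` and `expand_sketch` are hypothetical helpers, the window is taken over sibling positions with an explicit radius (the real default window and its unit are not specified above), and overlap is detected as the longest suffix/prefix match between consecutive segments.

```python
def stitch(texts):
    """Concatenate segments, removing text repeated at each seam."""
    out = ""
    for text in texts:
        longest = min(len(out), len(text))
        # Longest prefix of `text` that already ends `out`.
        for n in range(longest, 0, -1):
            if out.endswith(text[:n]):
                text = text[n:]
                break
        out += text
    return out


def expand_sketch(seed_index, seed_text, siblings, *, radius=1):
    """siblings: list of (surface_index, text) in plan order."""
    kept = [text for index, text in siblings if abs(index - seed_index) <= radius]
    # A selection that keeps nothing degrades to the seed's own text.
    return stitch(kept) or seed_text


# Overlapping chunks of "abcdefghij" (4 chars each, 2 shared with the next).
siblings = [(0, "abcd"), (1, "cdef"), (2, "efgh"), (3, "ghij")]
passage = expand_sketch(2, "efgh", siblings)
fallback = expand_sketch(7, "seed only", siblings)
```

A pure string match can also merge coincidental repeats at a seam, so an implementation with known chunk offsets would do better to dedupe by offset.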
- ir.fuse_hits(hits_by_source: Mapping[str | None, Sequence[SearchHit]], *, rrf_k: int = 60, weights: Mapping[str, float] | None = None, identity: Callable[[SearchHit], Any] | str | None = None, k: int | None = None) → list[SearchHit]
Merge per-source ranked hit lists into one ranking, by rank rather than by score.
The cross-source counterpart of the within-corpus hybrid fusion. Scores from different (corpus, mode, embedder) tuples live on incommensurable scales (ir_07: “a different model re-scales everything”), so raw scores never cross the source boundary. Within each source they order and dedup that source’s hits (one scale, sound); across sources only ranks interact, via weighted Reciprocal Rank Fusion: each hit contributes weights[source] / (rrf_k + rank).
- Parameters:
hits_by_source – {source_name: ranked hits}. Hits without a source are stamped with their mapping key (existing tags win, so one corpus bound under two keys still counts as one source). A None key is the untagged pseudo-source: its hits fuse as one rank group and stay unattributed (source=None). Within each list, duplicate artifacts (and, when identity is given, identity-duplicates) collapse to their best raw score before ranking, so a multi-query / multi-round pool can never double-count one artifact’s RRF mass.
rrf_k – the RRF rank constant (standard default 60).
weights – optional per-source trust dial (default 1.0 each). A source’s contribution scales linearly, with no score comparability needed. Keys naming sources absent from hits_by_source are ignored (a per-round pool may legitimately lack a configured source); callers with a closed source set should validate keys upfront, as federated ir.discover() does.
identity – how cross-source duplicates merge; see Identity. Default None: never; each (source, artifact_id) stays a distinct result.
k – truncate the fused ranking to this many hits.
- Returns:
the fused hits, best-first. Each carries the fused score in score and keeps its pre-fusion magnitude as metadata["source_score"] (+ "source_rank"), so downstream consumers (abstention gates, LLM judges) never lose the per-source signal. When an identity merge combined several sources’ hits, metadata["fused_sources"] lists them and the representative hit is the one with the best rank. Single-source input passes through with raw scores (RRF of one list is that list’s order, the same convention as the hybrid fusion’s single-channel fallback), so the fused-score rescaling only happens when there is genuinely something to fuse. The post-fusion score is ordinal: valid for ordering and relative cuts, meaningless against absolute floors. Apply calibrated min_score floors per source, before fusing (see ir_07/ir_08 and ir.discover’s federated form).
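The rank-only contribution `weights[source] / (rrf_k + rank)` can be sketched without the library. This is a minimal illustration, not the implementation of `ir.fuse_hits`: `rrf_fuse` and the `(artifact_id, score)` pairs are hypothetical stand-ins, ranks are assumed to start at 1, each list is assumed to be already deduplicated, and identity merging and the single-source pass-through are left out.

```python
from collections import defaultdict


def rrf_fuse(hits_by_source, *, rrf_k=60, weights=None, k=None):
    """Weighted Reciprocal Rank Fusion over per-source ranked lists.

    hits_by_source maps a source name to (artifact_id, raw_score) pairs.
    Hypothetical stand-in for illustration; assumes 1-based ranks.
    """
    weights = weights or {}
    fused = defaultdict(float)
    for source, hits in hits_by_source.items():
        # Raw scores only order hits within their own source.
        ranked = sorted(hits, key=lambda h: h[1], reverse=True)
        w = weights.get(source, 1.0)
        for rank, (artifact_id, _score) in enumerate(ranked, start=1):
            # No identity merge here: (source, artifact_id) stays distinct.
            fused[(source, artifact_id)] += w / (rrf_k + rank)
    out = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return out[:k] if k is not None else out


pools = {
    "docs": [("a", 0.91), ("b", 0.40)],    # cosine-like scale
    "tickets": [("x", 17.2), ("y", 3.1)],  # BM25-like scale
}
fused = rrf_fuse(pools, weights={"tickets": 2.0})
```

The two pools use very different score scales, yet only the rank within each pool and the per-source weight affect the fused order.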
- ir.make_llm_formulator(*, rewriter: Callable[[str], str | Sequence[str]] | None = None, prompt: str = 'Rewrite the search query into {n} short, diverse alternative search queries that would retrieve the same target documents: fix typos, expand jargon, and add synonyms, but keep each a terse search phrase. One query per line, no numbering.\n\nQuery: {query}', n: int = 3, fallback: Callable[[str], str | Sequence[str]] | None = None, **prompt_function_kwargs: Any) → Callable[[str], str | Sequence[str]]
An LLM-backed Formulator (rewrite / expand / multi-query).
rewriter is an injectable query -> str | [str, ...] callable (a test double, or your own router); when omitted it is built lazily on oa (oa.prompt_function), so importing this module stays offline. n is the multi-query fan-out width. Any error or empty reply falls back to fallback (default: identity_formulator()).
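The "any error or empty reply falls back" contract can be sketched as a plain wrapper. The names below (`with_fallback`, `flaky`, `silent`, `expander`) are hypothetical and nothing here calls the library or an LLM; the sketch only assumes the fallback defaults to returning the query unchanged.

```python
def with_fallback(rewriter, fallback=lambda query: query):
    """Wrap a query rewriter so errors and empty replies degrade gracefully.

    Hypothetical sketch: the default fallback is an identity function.
    """
    def formulate(query):
        try:
            out = rewriter(query)
        except Exception:
            return fallback(query)  # any error falls back
        if isinstance(out, str):
            return out if out.strip() else fallback(query)
        queries = [q for q in out if q.strip()]
        return queries or fallback(query)  # empty reply falls back
    return formulate


def flaky(query):
    raise RuntimeError("model unavailable")


def silent(query):
    return []


def expander(query):
    return [query, query + " guide"]
```

Because the rewriter is injected, the same wrapper can be exercised offline with test doubles such as these.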
- ir.make_llm_synthesizer(*, summarize: Callable[[str], str] | None = None, prompt: str = 'Write a concise synopsis (2-4 sentences) of the document below: what it is about and what questions it answers, so that a search over synopses can route to it. Output only the synopsis, no preamble.\n\nDocument:\n{text}', model: str | None = None, synthesizer_id: str | None = None, text_key: str | None = None, **prompt_function_kwargs: Any) → Callable[[Artifact], str]
An LLM-backed Synthesizer (Artifact → synopsis).
summarize is an injectable text -> str callable (a test double, or your own summarizer); when omitted it is built lazily on oa (oa.prompt_function) on the first synthesis and reused, so importing this module, and even constructing the synthesizer, stays offline. The artifact’s text is extracted with ir.strategy.text_of() using text_key, which with_synopsis() threads from the inner strategy, so the synopsis summarizes the same field the strategy indexes. An empty text, or any synthesis error, yields "" (the surface is then skipped, never a fabricated summary).
The returned callable carries a synthesizer_id attribute (default "oa:{model}:{sha(prompt)[:12]}") that with_synopsis() reads into the corpus’s strategy_id for staleness: a prompt or model change re-synthesizes.
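The default id format `"oa:{model}:{sha(prompt)[:12]}"` ties staleness to the model name and the prompt text. The sketch below shows how such a stamp behaves; `default_synthesizer_id` is a hypothetical helper, and the use of SHA-256 is an assumption, since the text above says only "sha".

```python
import hashlib


def default_synthesizer_id(model, prompt):
    """Build an id that changes whenever the model or the prompt changes.

    Hypothetical helper; SHA-256 is assumed, the source names no algorithm.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"oa:{model}:{digest[:12]}"


id_a = default_synthesizer_id("model-x", "Summarize:\n{text}")
id_b = default_synthesizer_id("model-x", "Summarize briefly:\n{text}")
id_c = default_synthesizer_id("model-y", "Summarize:\n{text}")
```

Any edit to the prompt or a switch of model yields a different id, which is what lets a staleness check trigger re-synthesis.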
- ir.open_corpus(name: str, *, embedder: Any = None) → Corpus
Reopen a previously built corpus by name (resolves its embedder).
- ir.parent_policy() → Callable[[SearchHit, Sequence[Record]], Sequence[Record]]
The whole artifact (small-to-big): every stored surface, plan order.
The mid-granularity analogue of disclose(level="body"), but assembled from the indexed segments rather than by dereferencing the pointer, so it works for corpora whose artifacts have no on-disk body.
- ir.records_for_artifact(store_or_corpus, artifact_id: str, *, surface_kind: str | None = None) → list[Record]
All stored records of artifact_id, ordered by surface_index.
The sibling-addressing primitive beneath retrieval-time context expansion: a SearchHit names its artifact (and, via surface_index, which surface of it matched); this returns every surface of that artifact, in plan order, so an expansion policy can stitch neighbors / parents around the hit.
Resolution is ledger-backed only: the artifact’s ledger entry lists its record_ids. Record ids are never re-derived from a per-kind index like metadata["chunk_index"]. On multi-kind strategies that index differs from the plan-global surface_index baked into the id (see ir.base.Record.make_id()), so derivation would fetch wrong or missing siblings.
- Parameters:
store_or_corpus – a CorpusStore, anything carrying one as .store (e.g. a Corpus), or a corpus name, resolved straight to its local store: sibling lookup never embeds, so unlike ir.open_corpus() no embedder is loaded.
artifact_id – the artifact whose surfaces to fetch.
surface_kind – restrict to one surface kind (e.g. "readme_chunk"); a known artifact with no surfaces of that kind yields [].
- Raises:
NoLedgerEntry – the ledger has no entry for artifact_id (an unknown artifact, or a corpus built without ir.index.build()’s ledger bookkeeping). A KeyError subclass.
KeyError – an entry exists but lists a record missing from the store (a stale ledger: interrupted build or out-of-band delete_record). This is data corruption, named in the message.
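The ledger-backed lookup and its two failure modes can be sketched with plain dictionaries. Everything below is a hypothetical stand-in: `siblings`, the dict-shaped `ledger` and `store`, and the `"d1#0"` record-id format are illustrative only, and the local `NoLedgerEntry` merely mirrors the documented fact that the real exception is a `KeyError` subclass.

```python
class NoLedgerEntry(KeyError):
    """Local mirror of the documented exception (a KeyError subclass)."""


def siblings(ledger, store, artifact_id, surface_kind=None):
    """Return an artifact's records in plan order, resolved via the ledger.

    Hypothetical sketch: ledger maps artifact_id -> record ids, store maps
    record id -> record dict.
    """
    if artifact_id not in ledger:
        raise NoLedgerEntry(artifact_id)
    records = []
    for record_id in ledger[artifact_id]:
        if record_id not in store:
            # Stale ledger: the entry names a record the store lost.
            raise KeyError(f"stale ledger: record {record_id!r} missing")
        records.append(store[record_id])
    if surface_kind is not None:
        records = [r for r in records if r["kind"] == surface_kind]
    return sorted(records, key=lambda r: r["surface_index"])


store = {
    "d1#0": {"id": "d1#0", "kind": "description", "surface_index": 0},
    "d1#1": {"id": "d1#1", "kind": "readme_chunk", "surface_index": 1},
    "d1#2": {"id": "d1#2", "kind": "readme_chunk", "surface_index": 2},
}
ledger = {"d1": ["d1#2", "d1#0", "d1#1"], "d2": ["d2#0"]}
```

Callers that catch `KeyError` handle both cases; catching `NoLedgerEntry` first separates "unknown artifact" from "corrupt ledger".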
- ir.register(name: str, kind: str, *, embedder: str = 'default', **params) → dict
Register (or overwrite) a named corpus definition.
- ir.retriever_for(name: str, **search_defaults: Any)
A Retriever bound to the registered corpus name.
Opens the corpus (it must have been built) and wraps it with ir.as_retriever(); search_defaults (e.g. mode="hybrid") bind to every call.
- ir.retrievers(**search_defaults: Any) → Mapping[str, Any]
A lazy Mapping[name, Retriever] view over the registry (ir_09 §8).
The query-time projection of the build-recipe registry: each value is a ready-to-call Retriever. This is the source-registry facade an orchestration layer (raglab) consumes. It never opens a corpus until the key is accessed, and always reflects the current registered() set. search_defaults apply to every source.
- ir.search(corpus, query, **kwargs)
Search a Corpus (or a corpus name, reopened lazily).
Thin facade over ir.retrieve.search(); **kwargs are forwarded to it. The useful ones are k (how many hits), mode ("dense" / "lexical" / "hybrid"), filter (a vd Mongo-style metadata filter), surfaces (restrict to surface kinds), and per_artifact (collapse to the best surface per artifact). See ir.retrieve.search() for the full signature and defaults.
- ir.select(hits: Sequence[SearchHit], *, strategy: str | Callable[[Sequence[SearchHit]], list[SearchHit]] = 'conservative', max_k: int = 3, rel: float = 0.9, gap_ratio: float = 0.5, min_score: float | None = None) → Selection
Commit to a subset of ranked hits: the selection stage.
- Parameters:
hits – ranked SearchHits (best first), as returned by ir.retrieve.search().
strategy – "conservative" (the default distractor-robust commit), one of "top_k" / "abs_threshold" / "rel_threshold" / "score_gap", or any Selector callable (hits -> subset), e.g. one built by make_llm_selector().
max_k – never commit to more than this many (caps distractor exposure).
rel – relative-to-top keep threshold for "conservative" / the ratio for "rel_threshold".
gap_ratio – score-gap elbow ratio, used by the "score_gap" strategy only ("conservative" deliberately uses rel alone, not an elbow; see this module’s docstring).
min_score – optional absolute floor; with "conservative" the selector abstains when even the top hit falls below it (also usable as the "abs_threshold" floor).
- Returns:
a Selection. abstained is True iff selected is empty.
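The parameters above suggest how a conservative commit behaves: keep hits close to the top score, cap the count, and abstain below a floor. The sketch below is one reading of that description, not the library's selector: `conservative_select` is hypothetical, hits are plain `(artifact_id, score)` pairs, and scores are assumed positive so that `rel * top` is a meaningful threshold.

```python
def conservative_select(hits, *, max_k=3, rel=0.9, min_score=None):
    """Keep hits within `rel` of the top score, at most `max_k` of them.

    Hypothetical sketch. `hits` is a best-first list of (id, score) pairs
    with positive scores; an empty result means "abstain".
    """
    if not hits:
        return []
    top = hits[0][1]
    if min_score is not None and top < min_score:
        return []  # even the best hit is below the absolute floor
    kept = [h for h in hits if h[1] >= rel * top]
    return kept[:max_k]


ranked = [("a", 0.80), ("b", 0.75), ("c", 0.50)]
selected = conservative_select(ranked)
abstained = conservative_select(ranked, min_score=0.9)
```

With the defaults, `"c"` is dropped because it falls well below 90% of the top score, and raising `min_score` above the top score empties the selection.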
- ir.sentence_window_policy(k: int = 1) → Callable[[SearchHit, Sequence[Record]], Sequence[Record]]
±k same-kind neighbors around the seed (NEXT/PREV expansion).
The window runs over the artifact’s surfaces of the seed’s kind only, in plan order, so a readme_chunk window never swallows the description surface. k=0 selects just the seed’s own record.
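The same-kind window can be sketched over plain tuples. `window` and the `(surface_index, kind, text)` record shape are hypothetical stand-ins for the policy and `Record`; the sketch assumes the seed is located by its `surface_index`.

```python
def window(seed_index, records, k=1):
    """Return the seed plus up to k same-kind neighbors on each side.

    Hypothetical sketch: records are (surface_index, kind, text) tuples
    in plan order; the seed is found by its surface_index.
    """
    seed = next(r for r in records if r[0] == seed_index)
    same_kind = [r for r in records if r[1] == seed[1]]
    pos = same_kind.index(seed)
    return same_kind[max(0, pos - k): pos + k + 1]


records = [
    (0, "description", "what the package does"),
    (1, "readme_chunk", "install"),
    (2, "readme_chunk", "usage"),
    (3, "readme_chunk", "license"),
]
around_first_chunk = window(1, records, k=1)
seed_only = window(2, records, k=0)
```

Filtering to the seed's kind before slicing is what keeps the `description` surface out of a `readme_chunk` window, even though it is adjacent in plan order.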
- ir.tag_source(hits: Sequence[SearchHit], source: str | None) → list[SearchHit]
Stamp source on every hit that doesn’t already carry one.
Existing tags win: a hit already attributed to a corpus keeps that attribution (so re-tagging under a different registry key cannot double-count one corpus as two sources). A None source is the untagged pseudo-source: hits pass through unattributed.
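The "existing tags win" rule is easy to see on a toy hit type. `Hit` and `tag` below are hypothetical stand-ins for `SearchHit` and `ir.tag_source`; the sketch assumes an untagged hit is one whose `source` is `None`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for SearchHit."""
    artifact_id: str
    score: float
    source: Optional[str] = None


def tag(hits, source):
    """Stamp `source` on untagged hits only; a None source tags nothing."""
    if source is None:
        return list(hits)
    return [h if h.source is not None else replace(h, source=source)
            for h in hits]


hits = [Hit("a", 0.9, source="docs"), Hit("b", 0.7)]
tagged = tag(hits, "docs_alias")
untouched = tag(hits, None)
```

Hit `"a"` keeps its original `"docs"` attribution, so binding the same corpus under a second key does not turn it into a second source.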
- ir.traverse(query: str, store: Any, *, policy: WalkPolicy, max_depth: int = 2, node_budget: int = 64, k: int = 10) → list[SearchHit]
Walk store from query under policy, returning the top-k hits.
The loop (score the frontier → select → commit → expand) is the operator’s; the safety primitives are non-negotiable and enforced here: a node id is committed at most once (the visited-set), expansion stops at max_depth, and no more than node_budget nodes are ever committed. A policy whose expand cycles forever and whose stop never fires still terminates.
- Parameters:
query – the user intent.
store – passed to policy verbatim: a Corpus for collapsed_tree_policy(), a CorpusGraph for an artifact-link policy. The operator never inspects it.
policy – the WalkPolicy (e.g. collapsed_tree_policy()).
max_depth – maximum expansion depth from a seed (safety).
node_budget – maximum nodes committed (safety).
k – number of hits to return.
- Returns:
the committed hits, best-first, top-k. Each is a SearchHit with metadata["walk_depth"] / ["seed"].
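The three safety primitives (visited-set, depth cap, node budget) can be shown in a standalone walk. This is a sketch under simplifying assumptions, not the operator itself: `bounded_walk` is hypothetical, the policy is reduced to two plain callables (`score` and `expand`), and the select and stop steps are omitted.

```python
def bounded_walk(seeds, score, expand, *, max_depth=2, node_budget=64, k=10):
    """Breadth-first walk that always terminates.

    Hypothetical sketch: `score(node)` ranks a node, `expand(node)` yields
    its neighbors. Returns (score, node, depth) triples, best first.
    """
    visited = set()
    committed = []
    frontier = [(s, 0) for s in seeds]
    while frontier and len(committed) < node_budget:
        # Score the frontier, best first.
        frontier.sort(key=lambda nd: score(nd[0]), reverse=True)
        next_frontier = []
        for node, depth in frontier:
            if node in visited:
                continue  # a node id is committed at most once
            if len(committed) >= node_budget:
                break     # never commit more than the budget
            visited.add(node)
            committed.append((score(node), node, depth))
            if depth < max_depth:  # expansion stops at max_depth
                next_frontier.extend((n, depth + 1) for n in expand(node))
        frontier = next_frontier
    committed.sort(key=lambda c: c[0], reverse=True)
    return committed[:k]


cyclic = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
shallow = bounded_walk(["a"], lambda n: 1.0, cyclic.__getitem__, max_depth=1)
# An expand that never runs out of new nodes is stopped by the budget.
endless = bounded_walk([0], lambda n: -n, lambda n: [n + 1],
                       max_depth=1000, node_budget=5)
```

The cyclic graph terminates because of the visited-set and depth cap; the endless chain terminates because of the node budget, independent of what the policy does.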
- ir.with_synopsis(strategy: IndexingStrategy, *, synthesize: Callable[[Artifact], str] | None = None, synthesizer_id: str | None = None, synopsis_kind: str = 'synopsis') → IndexingStrategy
Wrap strategy to add one LLM-derived synopsis surface per artifact.
- Parameters:
strategy – the inner IndexingStrategy (Chunked, Package, …). Its surfaces are kept; the synopsis is prepended.
synthesize – an injectable Artifact -> str (test double / custom summarizer). Omitted → make_llm_synthesizer() (lazy oa).
synthesizer_id – explicit identity stamp for staleness (recommended when injecting an unnamed callable / lambda). Omitted → the synthesizer’s own synthesizer_id / __qualname__.
synopsis_kind – the surface kind (default "synopsis", a summary kind).
- Returns:
an IndexingStrategy usable anywhere a strategy is: ir.CorpusSource.from_mapping(docs, name=..., strategy=with_synopsis(...)).
>>> strat = with_synopsis(Chunked(), synthesize=lambda a: "a summary")
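The wrapping behavior described above (inner surfaces kept, synopsis prepended, empty synopsis skipped) can be sketched without the library. All names here are hypothetical: `wrap_with_synopsis` and `two_chunks` stand in for `with_synopsis` and an inner strategy, a strategy is modeled as a function from an artifact dict to `(kind, text)` pairs, and staleness ids are not modeled.

```python
def wrap_with_synopsis(inner, synthesize, *, synopsis_kind="synopsis"):
    """Wrap a strategy so each artifact also gets one synopsis surface.

    Hypothetical sketch: `inner(artifact)` returns (kind, text) pairs and
    `synthesize(artifact)` returns the synopsis text ("" means skip).
    """
    def strategy(artifact):
        surfaces = list(inner(artifact))
        text = synthesize(artifact)
        if text:
            # Prepend, so the synopsis is the first surface in plan order.
            surfaces.insert(0, (synopsis_kind, text))
        return surfaces
    return strategy


def two_chunks(artifact):
    """Toy inner strategy: split the text into two chunk surfaces."""
    text = artifact["text"]
    mid = len(text) // 2
    return [("chunk", text[:mid]), ("chunk", text[mid:])]


doc = {"id": "d1", "text": "alpha beta gamma delta"}
with_summary = wrap_with_synopsis(two_chunks, lambda a: "a summary")
without_summary = wrap_with_synopsis(two_chunks, lambda a: "")
```

Skipping the surface when the synthesizer returns `""` means a failed or empty summary never enters the index as if it were real content.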