ir

ir — an information-retrieval substrate for agentic systems.

One uniform “find the relevant things in this corpus” contract that scales from an ad-hoc search over an ephemeral list to a maintained search engine. Retrieval is the core; selection/expansion/reranking/generation are layered on top.

Quick start:

import ir

# Define a corpus source (abstract strategy + parameters, smart defaults):
source = ir.CorpusSource.from_md_reports()          # project docs/ reports
corpus = ir.build(source)                            # index (incremental)
hits = ir.search(corpus, "how do I deploy the app")  # ranked SearchHits

# Light, dependency-free embedding for fast tests:
corpus = ir.build(source, embedder="light")

A corpus source is defined by a scope (what is in the corpus), a change_signal (what counts as stale), an indexing_strategy (how a raw item becomes filter fields + embeddable surfaces), and an embedder. The default embedder is a decent local model (all-MiniLM-L6-v2); "light" selects a numpy-only hashing embedder. Data persists under XDG dirs through a dol repository layer.
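
To make the "light" option concrete, the following is a minimal, dependency-free sketch of what a hashing embedder does in general: each token is hashed into one of a fixed number of buckets and the bucket counts are L2-normalized. It illustrates the technique only. The names hashing_embed and DIM, the tokenization, and the hash function are assumptions made for this example, not ir's implementation (which the text above describes as numpy-based).

```python
import hashlib
import math
import re

DIM = 64  # hypothetical vector size, chosen only for illustration


def hashing_embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each lowercase word token into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # An empty text has no tokens; return the zero vector unchanged.
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query = hashing_embed("deploy the app")
same = hashing_embed("Deploy the APP")  # case-folded to the same tokens
empty = hashing_embed("")
```

Because the mapping from token to bucket is a pure function of the token, such an embedder needs no model download and is deterministic across runs, which is what makes it suitable for fast tests.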

class ir.Artifact(id: str, raw: Any, metadata: dict = <factory>)[source]: A logical corpus item before decomposition into surfaces.

class ir.Chunked(*, chunk_size: int = 1200, overlap: int = 200, text_key: str | None = None, kind: str = 'chunk')[source]: Split the artifact’s text into overlapping chunk surfaces.
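
As a rough illustration of overlapping chunking, the sketch below slides a window of chunk_size characters forward by chunk_size - overlap each step. Character-based slicing and the helper name chunk_text are assumptions for this example; the text above does not say what unit Chunked measures or how it handles boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Each chunk repeats the last overlap characters of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.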

class ir.ClaudeTurn(*, include_full: bool = False)[source]

Index a Claude Code session turn-pair: user prompt + assistant summary.

Two surfaces by default — user_prompt (what the human asked) and assistant_summary (the assistant’s final end-of-turn text, the highest-signal “here’s what I did”; the deliberation before it is mostly noise) — so a query can target either side via surfaces={"user_prompt"} / {"assistant_summary"}. With include_full=True a third assistant_full surface (all the turn’s assistant natural-language text) trades noise for recall — off by default. Session / project / time / model / tool-use become hard-filter fields.

The raw artifact is a turn-pair record (see priv.claude_transcripts.turn_pair_records()): a mapping with user_prompt / assistant_summary / assistant_full plus metadata.

class ir.Corpus(name: str, store: CorpusStore, embedder: Callable, embedder_id: str)[source]

A built, queryable corpus: a store plus its embedder.

search(query, **kwargs)[source]

Search this corpus for query.

**kwargs (k / mode / filter / surfaces / per_artifact / …) are forwarded to ir.retrieve.search().

class ir.CorpusGraph(store_or_corpus: Any)[source]

A GraphStore over one corpus — artifact nodes, links edges.

node_id is an artifact_id. graph[aid] is the artifact’s stored records (its scorable surfaces, in plan order); graph.neighbors(aid, edge_type=...) reads the corpus store’s links view. Single-corpus, so it resolves intra-corpus targets; cross-corpus [source, artifact_id] targets are returned by neighbors() verbatim but __getitem__ only dereferences ids in this corpus (federated traversal is a follow-up).

edge_types(node_id: str) → list[str][source]: The edge types present on node_id ([] if none).

neighbors(node_id: str, *, edge_type: str | None = None) → list[source]

Outgoing neighbor ids of node_id, optionally of one edge_type.

Returns target ids in stored form (a bare artifact_id, whose source is this graph’s source; or a [source, artifact_id] list for a cross-corpus edge), de-duplicated with first-seen order preserved. An artifact with no edges, or a store without a links view, yields []. Pass each target to canonical_node_id() to get the canonical (source, artifact_id) key for a traversal’s visited-set.

node_id is an intra-corpus artifact_id (a str); a cross-corpus target fetched from another graph is out of contract here (it has no edges in this corpus).

source: The corpus name, when known — the source half of this graph’s node identities (None for a bare store).

class ir.CorpusSource(name: str, scope: ~collections.abc.Mapping[str, ~typing.Any], indexing_strategy: ~ir.strategy.IndexingStrategy = <factory>, change_signal: ~collections.abc.Callable[[str, ~typing.Any], str] = <function content_hash_signal>, embedder: ~typing.Any = 'default', metadata_of: ~collections.abc.Callable[[str, ~typing.Any], ~collections.abc.Mapping[str, ~typing.Any]] | None = None)[source]

A corpus definition: scope + change signal + strategy + embedder.

change_signal(raw: Any) → str: Default change signal: a content hash of the raw payload.

classmethod from_claude_sessions(*, name: str = 'sessions', since: float | None = 90, projects: Any = None, include_full: bool = False, include_session_title: bool = True, max_sessions: int | None = None, root: str | Path | None = None, fetcher: Callable[[], list] | None = None, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]

The user’s Claude Code session transcripts as a corpus (turn pairs).

Each artifact is one user→assistant turn pair; the default ClaudeTurn strategy indexes the user prompt and the assistant’s end-of-turn summary as separate surfaces (target either with surfaces={"user_prompt"} / {"assistant_summary"}). include_full adds the full assistant text surface (off by default — the summary is the signal). include_session_title (default on) also indexes one record per session whose surface is the session’s persisted custom/AI title — a cheap “what was this session about” surface. Scope defaults to the last since days (a full-history build is heavy); narrow with projects (a cwd substring or list) and max_sessions.

fetcher overrides the record source (each a mapping with user_prompt / assistant_summary / … ) — inject a test double to avoid the priv dependency. Otherwise records come from priv.claude_transcripts.turn_pair_records().

classmethod from_files(root: str | Path, *, name: str | None = None, pattern: str = '.*\\.md$', exclude: Callable[[str], bool] | None = None, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]: A directory tree of text files as a corpus (lazy dol scope).

classmethod from_mapping(mapping: Mapping[str, Any], *, name: str, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]: Any mapping {id -> raw} (dict, dol store) as a corpus.

classmethod from_md_reports(*, name: str = 'reports', projects_root: str | Path | None = None, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]

Markdown reports under projects’ docs/ and misc/docs/.

Excludes ALL-CAPS filenames (README/CLAUDE/MEMORY/SKILL…). Each record is a project-tagged document; ids are paths relative to the projects root.

classmethod from_packages(*, name: str = 'packages', manifest: str | Path | None = None, readme_chars: int = 20000, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]: The local package ecosystem, scanned from the .pth manifest.

classmethod from_skills(*, name: str = 'skills', filter: Any = None, fetcher: Callable[[], list] | None = None, strategy: IndexingStrategy | None = None, **kwargs) → CorpusSource[source]

The agent-skills corpus, via priv.skills_index.

fetcher overrides the source of skill records (each a mapping with name/description/parent) — inject a test double to avoid the priv dependency.

items()[source]: Iterate (artifact_id, raw) pairs over the corpus scope.

class ir.CorpusStore(meta: MutableMapping[str, Any], vectors: MutableMapping[str, ndarray], ledger: MutableMapping[str, Any], config: MutableMapping[str, Any], calibration: MutableMapping[str, Any] | None = None, links: MutableMapping[str, Any] | None = None, *, packed_dir: str | Path | None = None)[source]

Repository bundling the meta/vectors/ledger/config views of one corpus.

calibration_modes() → list[str][source]: The ranking modes that currently have a stored calibration.

delete_ledger_entry(key: str) → None[source]: Remove a ledger entry; a missing key is tolerated.

delete_links(artifact_id: str) → None[source]: Remove an artifact’s edges; a missing entry is tolerated.

delete_record(record_id: str) → None[source]: Remove a record’s metadata + vector; a missing id is tolerated.

get_calibration(mode: str) → dict | None[source]

The stored calibration record for ranking mode (None if absent).

A deep copy, so a caller cannot mutate the nested grid back into the stored record (in-memory stores share their objects by reference).
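
The difference between a shallow and a deep copy is easy to see with a plain dict. The record shape below (a "grid" nested under the record) is hypothetical and only mirrors the "nested grid" wording above.

```python
import copy

stored = {"mode": "hybrid", "grid": {"floor": 0.3}}  # hypothetical record shape

# A shallow copy duplicates only the top level; the nested dict is shared.
shallow = dict(stored)
shallow["grid"]["floor"] = 0.9
leaked = stored["grid"]["floor"]  # the stored record was mutated

stored["grid"]["floor"] = 0.3  # reset for the second half of the demo

# A deep copy duplicates the nested dict too, so edits stay local.
deep = copy.deepcopy(stored)
deep["grid"]["floor"] = 0.9
kept = stored["grid"]["floor"]  # the stored record is unchanged
```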

get_config() → dict[source]: The persisted corpus build settings (empty dict if never written).

get_ledger_entry(key: str) → dict | None[source]: The ledger entry for key (None if absent).

get_links(artifact_id: str) → dict[source]

The outgoing edges of artifact_id — {edge_type: [target, ...]}.

Empty dict when the artifact has no stored edges (or no links view). A copy, so a caller cannot mutate the persisted adjacency in place.

get_maintenance_state() → dict[source]

Background-work bookkeeping (e.g. last_maintained); {} if unset.

Kept under a separate config-view key from the build settings: it is regenerable scheduler state (when ir maintain last ran), not part of the corpus’s build identity, so it must never clobber it.

get_record(record_id: str) → Record[source]: Reassemble the Record for record_id (KeyError if absent).

ledger_items() → Iterator[tuple[str, dict]][source]: Iterate (key, entry) ledger pairs (the ledger may be mutated while iterating).

link_items() → Iterator[tuple[str, dict]][source]: Iterate (artifact_id, {edge_type: [target]}) adjacency pairs.

classmethod local(name: str) → CorpusStore[source]: File-backed store under ~/.local/share/ir/corpora/<name>.

matrix() → tuple[list[str], ndarray, list[dict]][source]

Return (record_ids, normalized_matrix, metas) for brute force.

Rows are L2-normalized so cosine similarity is a dot product. Empty corpora return a (0, 0) matrix.

Caching is two-tier: an in-process cache (invalidated on the next write) backed, for file-rooted stores, by an on-disk packed cache — one normalized-matrix .npy plus its ids/metas, written once and reloaded with a single memory-mapped read. The packed cache turns a cold reopen from a per-record vector-file storm (thousands of tiny reads) into three file reads; it is cleared by any record write, so it never goes stale.
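
The reason for normalizing rows can be checked in a few lines: once every vector has unit length, the dot product with a unit-length query equals the cosine similarity. This sketch uses plain Python lists to stay dependency-free (the real matrix is an ndarray); the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(rows):
    """Scale each row to unit length; all-zero rows are left as they are."""
    out = []
    for row in rows:
        norm = math.sqrt(dot(row, row))
        out.append([x / norm for x in row] if norm else list(row))
    return out


def cosine(a, b):
    """Textbook cosine similarity, used here only as a cross-check."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


vectors = [[3.0, 4.0], [1.0, 0.0]]
query = [6.0, 8.0]

unit_rows = l2_normalize(vectors)
unit_query = l2_normalize([query])[0]
scores = [dot(row, unit_query) for row in unit_rows]
```

Normalizing once at write or cache time moves the division out of the per-query loop, so brute-force scoring reduces to a single matrix-vector product.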

classmethod memory() → CorpusStore[source]: In-memory store (no dependencies); ideal for tests.

metas() → tuple[list[str], list[dict]][source]

Return (record_ids, metas) without loading any vectors.

The vector-free counterpart of matrix(), for ranking modes that score on text alone (mode="lexical"): they need candidate metadata (text + filter fields) but never the embedding matrix, so they must not pay its I/O. Reuses the in-process or packed cache when present; else reads only the meta view (not vectors).

put_record(record: Record) → None[source]: Persist record’s metadata + vector, invalidating the search matrix.

record_ids() → Iterator[str][source]: Iterate the record ids currently stored.

set_calibration(mode: str, record: Mapping[str, Any]) → None[source]

Persist a calibration record for ranking mode (one per mode).

mode keys a file in the calibration store, so it must be a non-empty string with no path separator (the real modes — dense / lexical / hybrid — already satisfy this).

set_config(settings: Mapping[str, Any]) → None[source]: Persist the corpus build settings (name / embedder spec + id).

set_ledger_entry(key: str, entry: Mapping[str, Any]) → None[source]: Write the ledger entry (version / embedder id / record ids) for key.

set_links(artifact_id: str, edges: Mapping[str, Any]) → None[source]

Persist artifact_id’s outgoing edges ({edge_type: [target]}).

Empty edge-type lists are dropped; an empty result deletes the entry (no empty adjacency rows linger). Targets are stored verbatim — a bare artifact_id or a [source, artifact_id] pair.
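
The normalization rule described here can be sketched over a plain dict standing in for the links view. The function name set_links_sketch and the "REF" / "PARENT" keys are illustrative assumptions; only the behavior (drop empty lists, delete the entry when nothing remains, keep targets verbatim) comes from the description above.

```python
from typing import Any, Mapping, MutableMapping


def set_links_sketch(
    links: MutableMapping[str, dict],
    artifact_id: str,
    edges: Mapping[str, Any],
) -> None:
    """Write an adjacency row, never leaving an empty one behind."""
    # Keep only edge types that have targets; targets are stored as given.
    cleaned = {etype: list(targets) for etype, targets in edges.items() if targets}
    if cleaned:
        links[artifact_id] = cleaned
    else:
        links.pop(artifact_id, None)  # a missing entry is tolerated


links: dict = {}
set_links_sketch(links, "ir", {"REF": ["dol", ["skills", "deploy"]], "PARENT": []})
after_write = dict(links)

set_links_sketch(links, "ir", {"REF": []})  # nothing left, so the row is removed
```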

set_maintenance_state(state: Mapping[str, Any]) → None[source]: Persist the maintenance bookkeeping for this corpus.

class ir.Disclosure(artifact_id: str, level: str, name: str, score: float, summary: str, body: str | None = None, pointer: str | None = None, metadata: Mapping[str, ~typing.Any]=<factory>, source: str | None = None, passage: str | None = None)[source]

The progressively-disclosed payload for one selected artifact.

artifact_id

the artifact this payload belongs to.

Type: str

level

how much was loaded — "metadata" (no I/O), "body" (the pointer’s full text), or "bundled" (body + extras).

Type: str

name

a display name (the name filter field, else the id).

Type: str

score

the selecting hit’s score.

Type: float

summary

the matched surface text — always present, always cheap.

Type: str

body

the full payload (SKILL.md / file text); None below "body" level or when the pointer could not be read.

Type: str | None

pointer

the source pointer (skill_path / path) — the “package pointer” an agent follows to act; None if the hit has none.

Type: str | None

metadata

the hit’s filter metadata, plus a disclosure note when a pointer was present but unreadable (stale/moved/deleted), and an expansion note when expansion was requested but not possible for this hit.

Type: collections.abc.Mapping[str, Any]

source

the corpus/source name the selecting hit came from (None when unattributed) — the attribution a federated caller needs to tell two same-id artifacts from different corpora apart.

Type: str | None

passage

the expanded neighborhood text (disclose(..., expand=...)) — the mid-granularity payload between summary (the matched surface) and body (the pointer’s full text); None when expansion was not requested or not possible. Assembled from the corpus’s stored records (see ir.expand), unlike body, which dereferences the pointer to an external resource.

Type: str | None

to_dict() → dict[source]: JSON-serializable form (score cast to float).

class ir.DiscoveryResult(query: str, mode: str, strategy: str, disclose_level: str, results: list[~ir.select.Disclosure], abstained: bool, reason: str, n_retrieved: int, signals: ~collections.abc.Mapping[str, ~typing.Any] = <factory>)[source]

The result of discover() — retrieve → select → (optional) disclose.

The qh-exposable payload: to_dict() is fully JSON-serializable (lists of dicts, floats, strings, bools — no numpy, no objects), so a FastAPI facade can return it directly.

property ids: list[str]: The committed artifact ids, best-first.

to_dict() → dict[source]: JSON-serializable result for the qh / HTTP surface.

class ir.GraphStore(*args, **kwargs)[source]

Structural contract a traversal operator binds to — node + neighbors.

Deliberately minimal (two methods) so it is satisfied by ir’s CorpusGraph and by any external graph: __getitem__ resolves a node id to its scorable payload, neighbors lists adjacent node ids (optionally of one edge type). Granularity-agnostic on purpose: an artifact graph and a surface-level tree are both GraphStores.

runtime_checkable makes isinstance(x, GraphStore) a structural check on attribute names only (not signatures) — enough to tell a conforming adapter from an arbitrary object, but it cannot validate that neighbors takes the right arguments; treat it as a smoke check.

class ir.IndexPlan(filter_fields: dict = <factory>, surfaces: list[Surface] = <factory>)[source]: An IndexingStrategy’s output for one artifact.

class ir.IndexingStrategy(*args, **kwargs)[source]: Decompose one artifact into filter fields + embeddable surfaces.

class ir.MaintenancePolicy(reindex: ReindexPolicy = <factory>, synopsis: SynopsisPolicy = <factory>)[source]

The background-work policy for one corpus (reindex + synopsis).

merged(override: dict | None) → MaintenancePolicy[source]: Layer an override dict on top of this policy (entry over defaults).

class ir.MaintenanceResult(name: str, ran: bool, reason: str, reindex: bool = False, synopsis: bool = False, records: int | None = None)[source]: What maintain_corpus() did (or would do) for one corpus.

class ir.Package(*, chunk_size: int = 1500, overlap: int = 200, embed_deps: bool = False, deps_template: Callable[[list[str]], str] | None = None)[source]

Package strategy: name + description surface plus README chunks.

Filter fields capture ownership (ours vs third-party), name, deps. AI synopsis / problem-class surfaces are a documented extension point.

With embed_deps=True, decompose additionally emits one Surface(kind="deps", granularity="field") whose text is a prefix-form serialization of the bare dependency names (deps_template, default _default_deps_text()) — so a query for a domain matches a package by the libraries it depends on (e.g. sentence-transformers -> embeddings, networkx -> graphs), and the BM25 leg picks up exact dep-token matches. The deps bag is kept separate from prose (its own surface) so a rare library name is not diluted, and deps remain a filter field regardless. embed_deps defaults False (today’s behavior); it folds into the strategy id, so toggling it re-decomposes incrementally. The deps surface is appended last, leaving the description (position 0) and readme_chunk indices unchanged.

Surface indexing: the description surface (kept whenever name or description is non-empty) occupies plan position 0, so readme_chunk j is stored with Record.surface_index == j + 1 while its surface metadata says chunk_index == j — surface_index is plan-global, chunk_index per-kind (see ir.base.Record.make_id()). Never derive sibling record ids from chunk_index; use the ledger (ir.retrieve.records_for_artifact()). n_chunks is stamped on readme chunks at decompose time, but corpora built before the stamp keep records without it until the artifact re-indexes (content / embedder / strategy change) — read it with metadata.get("n_chunks").

class ir.Passage(artifact_id: str, surface_kind: str, score: float, text: str, record_ids: tuple[str, ...] = (), source: str | None = None, surface_index: int | None = None)[source]

An expanded hit: the seed’s identity + the stitched neighborhood text.

artifact_id / surface_kind / score / source / surface_index are the seed hit’s — expansion never disturbs hit identity (source, artifact_id) or scores. text is the assembled neighborhood (overlap-deduped, plan order) and record_ids the ordered stored segments it was stitched from (empty when expansion degraded to the seed’s own text).
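
One simple way to stitch overlapping segments without repeating the shared text is to drop, at each seam, the longest prefix of the next segment that the text so far already ends with. This is a sketch of that general idea under the assumption of exact character overlap; how ir.expand actually de-duplicates is not specified here, and the name stitch is illustrative.

```python
def stitch(segments: list[str]) -> str:
    """Join ordered segments, dropping text repeated at each seam."""
    if not segments:
        return ""
    text = segments[0]
    for seg in segments[1:]:
        max_len = min(len(text), len(seg))
        # Longest prefix of seg that is also a suffix of the text so far.
        overlap = next(
            (n for n in range(max_len, 0, -1) if text.endswith(seg[:n])),
            0,
        )
        text += seg[overlap:]
    return text


passage = stitch(["abcd", "cdef", "efgh"])
disjoint = stitch(["ab", "cd"])
```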

to_dict() → dict[source]: JSON-serializable form (score cast to float, ids as a list).

class ir.Record(id: str, artifact_id: str, surface_kind: str, surface_index: int, text: str, vector: ndarray, metadata: dict = <factory>)[source]

A stored, embedded surface — one row of the index.

static make_id(artifact_id: str, surface_kind: str, surface_index: int) → str[source]

Deterministic storage id for a surface of an artifact.

surface_index is the surface’s plan-global position — its enumeration index across all surfaces of the artifact’s IndexPlan, regardless of kind — as assigned by ir.index.build(). On multi-kind strategies it therefore differs from per-kind counters like metadata["chunk_index"] (e.g. Package: the description surface takes position 0, shifting readme_chunk j to surface_index j+1 — and the offset is plan-dependent, since empty surfaces are dropped).

Ids of already-built corpora are a stability contract: never re-derive a sibling’s id from a per-kind index — address siblings through the ledger via ir.retrieve.records_for_artifact().
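
The gap between the two counters can be reproduced with a toy plan. In this sketch a plan is just an ordered list of (kind, text) pairs, which is a simplification of IndexPlan; index_rows is a hypothetical helper, and kind_index stands for the per-kind counter that readme chunks carry as chunk_index.

```python
def index_rows(plan):
    """Return (kind, surface_index, kind_index) for each surface in plan order."""
    counters = {}
    rows = []
    for surface_index, (kind, _text) in enumerate(plan):
        kind_index = counters.get(kind, 0)  # position among surfaces of this kind
        counters[kind] = kind_index + 1
        rows.append((kind, surface_index, kind_index))
    return rows


with_description = index_rows([
    ("description", "dol: storage abstractions"),
    ("readme_chunk", "first README chunk"),
    ("readme_chunk", "second README chunk"),
])

# If the description surface is empty and dropped, the offset disappears.
without_description = index_rows([
    ("readme_chunk", "first README chunk"),
    ("readme_chunk", "second README chunk"),
])
```

The offset between the two counters is 1 in the first plan and 0 in the second, which is why sibling ids should be looked up rather than computed from a per-kind index.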

class ir.ReindexPolicy(on: str = 'source-change', every_hours: float | None = None)[source]: When to (incrementally) rebuild a corpus.

class ir.SearchHit(artifact_id: str, surface_kind: str, score: float, text: str, metadata: Mapping[str, ~typing.Any]=<factory>, source: str | None = None, surface_index: int | None = None)[source]

A scored record returned by retrieval (higher score = closer).

Maps onto ir_09’s Result: text is the snippet, score the rank score, metadata the meta, and pointer the key into a resource store (ir_09 §5). to_dict() is the serialization-clean form for a cross-process / subagent boundary (no numpy scalars leak).

source is the corpus/source name the hit came from (None when unattributed — e.g. an ad-hoc corpus without a name). It is a first-class field, not a metadata key, because metadata is the strategy-owned hard-filter namespace and provenance is structural: artifact identity is only unique within a source, so any cross-source operation keys on (source, artifact_id) (see best_per_artifact()).

surface_index is the stored Record.surface_index of the hit’s surface — its plan-global position among the artifact’s surfaces — so a hit can name which surface of its artifact it is (the prerequisite for sibling addressing and context expansion). None when unknown (e.g. a hand-built hit). It is not the per-kind metadata["chunk_index"]; see Record.make_id() for why the two differ on multi-kind strategies.

property pointer: str | None: The disclosure pointer on this hit, if any (see POINTER_KEYS).

to_dict() → dict[source]: JSON-serializable form (score cast to a Python float).

class ir.Selection(selected: list[~ir.base.SearchHit], candidates: list[~ir.base.SearchHit], abstained: bool, reason: str, signals: ~collections.abc.Mapping[str, ~typing.Any] = <factory>)[source]

A selector’s commitment: the chosen subset of a ranked candidate list.

selected

the committed hits, best-first (empty iff abstained).

Type: list[ir.base.SearchHit]

candidates

the full ranked input, kept for provenance / audit.

Type: list[ir.base.SearchHit]

abstained

True iff the selector committed to nothing by policy.

Type: bool

reason

which rule ended the commit (e.g. "rel_threshold", "score_gap", "max_k", "abstain:below_floor").

Type: str

signals

concrete, defined numbers behind the decision (top_score, n_candidates, n_selected, min_ratio) — the auditable replacement for an opaque “confidence” float.

Type: collections.abc.Mapping[str, Any]

property selected_ids: list[str]: The committed artifact ids, best-first.

property sufficient: bool

A model-free sufficiency hint for an agent’s Evaluator (ir_09 §3).

True when this selection committed to at least one item (i.e. did not abstain). It is a signal, not a directive: the re-query / refinement decision and the loop belong to the agent layer (the back-edge, ir_09 §4) — ir derives this from its own outcome and never acts on it.

to_dict() → dict[source]: JSON-serializable form (scores cast to float).

class ir.Skill[source]

Capability strategy: embed name + description only.

The body (SKILL.md) is loaded post-selection and is not indexed; name and parent are filter fields.

class ir.Surface(artifact_id: str, kind: str, text: str, granularity: str = 'document', metadata: Mapping[str, ~typing.Any]=<factory>)[source]

One embeddable unit derived from an artifact.

kind names the surface type (e.g. "description", "synopsis", "problem_class", "chunk") so a query can match the right part of an artifact. granularity is a coarse hint ("document" / "chunk" / "field"). metadata is surface-local (e.g. chunk offsets).

class ir.SynopsisPolicy(enabled: bool = False, scope: str = 'recent', window_days: int = 30, downtime_hours: tuple[int, int] | None = None)[source]

Whether/when to attach (expensive, LLM-generated) synopses.

downtime_hours is a [start, end) pair of local-clock hours (wrapping past midnight is allowed, e.g. (22, 6)); None means “any time”. scope="recent" limits synthesis to artifacts whose timestamp is within window_days (corpora that expose a time signal); scope="all" synthesizes every artifact (bounded by incrementality).
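
A half-open hour window that may wrap past midnight can be tested as follows. The helper name in_downtime is hypothetical; treating start == end as an empty window is a choice made for this sketch, not something the description above settles.

```python
from typing import Optional, Tuple


def in_downtime(hour: int, window: Optional[Tuple[int, int]]) -> bool:
    """True when `hour` (0-23, local clock) falls inside the [start, end) window."""
    if window is None:
        return True  # None means "any time"
    start, end = window
    if start <= end:
        return start <= hour < end  # ordinary same-day window
    # Wrapping window such as (22, 6): late evening OR early morning.
    return hour >= start or hour < end


night = [h for h in range(24) if in_downtime(h, (22, 6))]
```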

class ir.WalkPolicy(*args, **kwargs)[source]

The pluggable strategy of a walk — graph semantics, not safety.

seed produces the initial frontier; score ranks a node against the query; select chooses which scored frontier nodes to commit/expand this step (beam/greedy — default: all, best-first); expand yields a node’s neighbors; node_id is the hashable visited-set key; stop is the injected sufficiency check; to_hit materializes a committed node as a SearchHit — or None for a router-only node (a summary that routes but is not itself a result).

class ir.WalkState(query: str, max_depth: int, budget: int, visited: set = <factory>, results: list = <factory>, cache: dict = <factory>)[source]

The operator-owned state of one traverse() call — the safety home.

visited (node ids already committed), budget, and max_depth are the structural safety primitives the operator enforces; results are the emitted hits; cache is scratch space a policy may use (e.g. to embed the query once). A policy reads this but the operator enforces the bounds — a policy cannot opt out of termination.
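
The division of labor described here, where the operator owns the safety bounds, can be sketched with a breadth-first walk over a plain adjacency dict. This is not ir's traverse(): scoring, selection, and the stop check are left out, and the assumption that budget caps the number of committed nodes is made only for this example.

```python
from collections import deque


def bounded_walk(graph, seeds, *, max_depth, budget):
    """Breadth-first walk that always terminates, even on cyclic graphs."""
    visited = set()
    results = []
    frontier = deque((node, 0) for node in seeds)
    while frontier and len(results) < budget:  # budget bound
        node, depth = frontier.popleft()
        if node in visited:  # visited-set bound: each node commits once
            continue
        visited.add(node)
        results.append(node)
        if depth < max_depth:  # depth bound: stop expanding past max_depth
            frontier.extend((nbr, depth + 1) for nbr in graph.get(node, []))
    return results


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"]}  # has cycles

shallow = bounded_walk(graph, ["a"], max_depth=1, budget=10)
full = bounded_walk(graph, ["a"], max_depth=5, budget=10)
capped = bounded_walk(graph, ["a"], max_depth=5, budget=2)
```

All three bounds live in the loop itself rather than in any pluggable callback, so no policy choice can make the walk run forever.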

class ir.WholeText(*, text_key: str | None = None, kind: str = 'document')[source]: One surface = the entire text. Sensible default for a naive corpus.

ir.as_retriever(corpus_or_name, **search_defaults) → Callable[[...], list[SearchHit]][source]

Bind ONE corpus to the uniform Retriever contract.

Returns retrieve(query, **overrides) -> list[SearchHit] that calls search() with search_defaults (a per-call kwarg overrides a bound default). A corpus name is resolved once via ir.open_corpus(); pass an open Corpus to skip that. The returned callable carries the bound corpus on .corpus for introspection.

>>> retr = as_retriever(corpus, mode="hybrid", k=20)
>>> hits = retr("how do I deploy the app")
>>> hits = retr("deploy", filter={"owner": "me"})

ir.build(source: CorpusSource, *, store: CorpusStore | None = None, embedder: Any = None, full: bool = True, batch_size: int = 256, edge_extractor: Callable | None = None) → Corpus[source]

Build or incrementally update source into a Corpus.

Parameters:

store – the persistence backend (default: file-backed under the XDG data dir).
embedder – overrides the source’s embedder spec.
full – when True (default), prune artifacts no longer in the source.
batch_size – embedding batch size.
edge_extractor – an optional EdgeExtractor, (artifact_id, filter_fields) -> {edge_type: [target]}, that populates the corpus’s semantic links graph (see ir.graph; pass ir.default_edge_extractor() for the latent deps/parent edges). Ingest is eager: edges are (re)written for every in-scope artifact (a decompose-only pass with no embedding), so the graph never goes partially stale, while embedding stays fully incremental. Edges are derived state, not part of build identity. A rebuild without an extractor leaves existing edges untouched (they are only refreshed by re-running with one, and only cleared per artifact by the full prune), so dropping edge_extractor does not wipe a graph.

ir.build_corpus(name, **kwargs)[source]

Build (or update) a registered/preset corpus by name; returns a Corpus.

**kwargs are forwarded to ir.build() — notably store, embedder (e.g. "light" for the numpy-only hashing embedder), full (prune artifacts no longer in the source), and batch_size.

ir.canonical_node_id(target: Any, *, source: str | None) → tuple[str | None, str][source]

Canonicalize a neighbor target to a (source, artifact_id) node id.

The repo’s node identity is (source, artifact_id) — the key a traversal visited-set must use so the same id in two corpora stays two nodes. CorpusGraph.neighbors() returns targets in stored form: a bare artifact_id (implicitly in source, the graph it came from) or a [source, artifact_id] cross-corpus pair. This resolves either to the canonical tuple.

>>> canonical_node_id("dol", source="packages")
('packages', 'dol')
>>> canonical_node_id(["skills", "deploy"], source="packages")
('skills', 'deploy')
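
A minimal re-implementation consistent with the two examples above shows why the canonical form matters for a visited-set: the stored [source, artifact_id] form is a list and therefore unhashable, while the canonical tuple can go straight into a set. The function below is a sketch written for this illustration (hence the _sketch suffix), not the library's code, and it does no validation.

```python
from typing import Any, Optional, Tuple


def canonical_node_id_sketch(
    target: Any, *, source: Optional[str]
) -> Tuple[Optional[str], str]:
    """Map a stored neighbor target to a hashable (source, artifact_id) tuple."""
    if isinstance(target, str):
        return (source, target)  # bare id: implicitly in the graph's own source
    target_source, artifact_id = target  # [source, artifact_id] cross-corpus pair
    return (target_source, artifact_id)


stored_targets = ["dol", ["skills", "deploy"], "dol"]  # as neighbors() might return

visited = set()
order = []
for target in stored_targets:
    node = canonical_node_id_sketch(target, source="packages")
    if node not in visited:
        visited.add(node)
        order.append(node)
```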

ir.collapsed_tree_policy(*, summary_kinds: Iterable[str] = ('description', 'synopsis', 'capability', 'document'), leaf_kinds: Iterable[str] = ('chunk', 'readme_chunk'), seed_k: int = 10) → WalkPolicy[source]

The pure-vector summary-routing / collapsed-tree WalkPolicy.

Seeds on the top seed_k matches among summary_kinds surfaces and descends to each routed artifact’s leaf_kinds surfaces (the emitted results), scored by cosine to the query. No LLM in the loop. A summary surface is a router (suppressed from results) only when its artifact has leaf surfaces; on a single-surface corpus (WholeText document, Skill capability) the summaries are leaf-less and emitted directly, so the walk degrades to flat-over-summaries instead of returning nothing.

The defaults keep document / capability in summary_kinds on purpose — that is what lets a WholeText / Skill corpus seed at all; the structural router check (above) is what keeps those seeds from being silently swallowed.

>>> hits = traverse(q, corpus, policy=collapsed_tree_policy())

ir.corpora() → dict[str, Any]: All registered corpus definitions, keyed by name.

ir.default_edge_extractor(artifact_id: str, filter_fields: Mapping[str, Any]) → dict[str, list][source]

Edges latent in the standard filter fields: deps → REF, parent → PARENT.

deps (Package) → REF edges to each dependency’s bare name (version specifiers / extras / markers stripped). Self-edges and blanks are dropped.
parent (Skill) → a single PARENT edge.

A package whose deps name other packages in the same corpus gets intra-corpus REF edges; third-party deps become REF edges to ids not in the corpus (harmless — CorpusGraph.neighbors() lists them, and a traversal simply finds no node to expand).

Self-edges are dropped case-insensitively (_dep_name lower-cases, so a package "AA" depending on "aa" is recognized as a self-reference).
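
The rules above can be approximated in a short sketch. The helper names, the regular expression used to cut a requirement string down to its bare name, and the literal "REF" / "PARENT" dictionary keys are all assumptions for illustration; the real extractor and its _dep_name helper may differ in detail.

```python
import re
from typing import Any, Mapping


def dep_name(spec: str) -> str:
    """Bare lower-cased name: cut at the first extras/version/marker character."""
    return re.split(r"[\s\[\]()<>=!~;@]", spec.strip(), maxsplit=1)[0].lower()


def extract_edges(artifact_id: str, filter_fields: Mapping[str, Any]) -> dict:
    """Derive {edge_type: [target]} from the deps / parent filter fields."""
    edges: dict = {}
    refs = []
    for spec in filter_fields.get("deps") or []:
        name = dep_name(spec)
        # Drop blanks and (case-insensitive) self-references.
        if name and name != artifact_id.lower():
            refs.append(name)
    if refs:
        edges["REF"] = refs
    parent = filter_fields.get("parent")
    if parent:
        edges["PARENT"] = [parent]  # a single parent edge
    return edges


package_edges = extract_edges(
    "AA",
    {"deps": ["aa", "numpy>=1.24", " ", "requests[socks]; python_version<'3.12'"]},
)
skill_edges = extract_edges("deploy-aws", {"parent": "deploy"})
```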

ir.default_policy_for_kind(kind: str) → MaintenancePolicy[source]: The smart-default MaintenancePolicy for a corpus kind.

ir.disclose(selection: Selection, *, level: str = 'body', loader: Callable[[Mapping[str, Any]], str | None] | None = None, store: Mapping[str, Any] | None = None, expand: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None, corpus: Any = None) → list[Disclosure][source]

Reveal the payload of each selected hit at level — append-only, pure.

Parameters:

selection – a committed Selection.
level – "metadata" (no I/O — summary + pointer only), "body" (load the pointer’s full text), or "bundled" (body + extras; today the same as "body", reserved for bundled scripts/references).
loader – override the body resolver — metadata -> str | None. The default reads the skill_path / path pointer from disk and tolerates a missing target (returns None, never raises).
store – a ResourceStore (pointer -> payload Mapping) to dereference instead of disk — ir_09 §5 pointer-passing over a dol store / URL map / blob storage. Mutually exclusive with loader.
expand – a NeighborhoodPolicy to also stitch each hit’s neighborhood from the corpus’s stored records into Disclosure.passage (see ir.expand). Orthogonal to level, which governs pointer payloads: e.g. level="metadata", expand=sentence_window_policy() reads no pointer at all but still returns mid-granularity passages. Requires corpus=.
corpus – where expand finds each hit’s stored siblings — a Corpus / CorpusStore / name, or, for cross-source selections, a {source_name: corpus} Mapping resolved per hit via hit.source. Only meaningful with expand=.

Returns:

one Disclosure per selected hit, best-first. This is a pure read: the Selection and its hits are never mutated, so a caller can disclose append-only without disturbing a cached ranked prefix.
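
A custom loader only has to honor the metadata -> str | None contract and avoid raising on a missing target. The sketch below is one hypothetical loader in that spirit; the tuple of pointer keys is taken from the skill_path / path wording above, and everything else (names, encoding, which exceptions are swallowed) is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Any, Mapping, Optional

pointer_keys = ("skill_path", "path")  # illustrative; mirrors the fields named above


def load_body(metadata: Mapping[str, Any]) -> Optional[str]:
    """Read the first pointer found in metadata; None if absent or unreadable."""
    for key in pointer_keys:
        pointer = metadata.get(key)
        if pointer:
            try:
                return Path(pointer).read_text(encoding="utf-8")
            except (OSError, UnicodeDecodeError):
                return None  # stale / moved / deleted target: tolerate it
    return None


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "SKILL.md"
    target.write_text("# Deploy\n", encoding="utf-8")
    found = load_body({"skill_path": str(target)})
    missing = load_body({"path": str(Path(tmp) / "moved.md")})

no_pointer = load_body({"name": "deploy"})
```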

ir.discover(corpus: Any, query: str, *, k: int = 10, mode: str = 'hybrid', strategy: str | Callable[[Sequence[SearchHit]], list[SearchHit]] = 'conservative', disclose_level: str = 'metadata', filter: Mapping[str, Any] | None = None, surfaces: Iterable[str] | None = None, max_k: int = 3, rel: float = 0.9, gap_ratio: float = 0.5, min_score: float | str | Mapping[str, float | str | None] | None = None, merge: str | Callable = 'rrf', merge_weights: Mapping[str, float] | None = None, merge_rrf_k: int | None = None, loader: Callable[[Mapping[str, Any]], str | None] | None = None, store: Mapping[str, Any] | None = None, expand: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None, **search_kw: Any) → DiscoveryResult[source]

Find and commit to the capabilities for query — the one search tool.

Retrieves k candidates, commits to a distractor-robust subset, and (optionally) discloses each committed item’s payload. This is the single agent-callable surface the capability-discovery research argues for: one tool that returns few, high-precision answers rather than a long candidate list the model must then filter under context rot.

Parameters:

corpus – a built Corpus, or a registered corpus name (resolved with ir.open_corpus()). Pass a name for the qh / HTTP surface — it is the JSON-friendly form. Pass a list/tuple of names (or Corpus objects) for single-shot federated discovery across several corpora: each is searched, per-source abstention floors gate before any merging, and the survivors are rank-fused (see merge). The caller names the sources explicitly; ir never chooses the set (source planning is the agent layer’s job, ir_09 §3).
query – the user intent.
k – candidate depth retrieved before selection. Federated: k candidates are retrieved per source, and the fused ranking is also truncated to k before selection.
mode – ranking mode — "hybrid" (default; ir’s strongest overall), "dense", or "lexical".
strategy – selection strategy (see select()).
disclose_level – "metadata" (default; cheap, no body I/O), "body", or "bundled".
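filter – retrieval constraint: a vd Mongo-style metadata filter, applied as a hard pre-filter (forwarded to ir.retrieve.search()).
surfaces – retrieval constraint: restrict retrieval to these surface kinds (forwarded to ir.retrieve.search()).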
max_k – selection parameter: never commit to more than this many hits (see select()).
rel – selection parameter: the relative-to-top keep threshold (see select()).
gap_ratio – selection parameter: the score-gap elbow ratio, used by the "score_gap" strategy only (see select()).
min_score – selection parameter: the absolute floor (see select()). min_score="auto" loads the floor calibrated for this (corpus, mode) by ir.eval.calibrate_min_score() and persisted on the corpus; it is the opt-in that turns on absolute abstention, and it falls back to no floor (with a warning) when no calibration is stored or the stored one is stale (a different embedder). Federated: floors are per-(corpus, mode, embedder), so a single number cannot apply across corpora. Pass "auto" (each source’s own calibrated floor), a {name: floor_or_"auto"} mapping, or None; a bare float raises. Floors gate each source on its own raw scores before fusion; the fused ranking is never floored (rank-fused scores are ordinal; see ir_07/ir_08).
merge – federated only — how the per-source rankings combine: "rrf" (default; rank-based, scale-free — see ir.retrieve.fuse_hits()), "score" (raw-score merge, valid only when all corpora share an embedder — verified, raises on mismatch), or a callable {name: hits} -> hits.
merge_weights – federated only — per-source trust weights for merge="rrf" (default 1.0 each).
merge_rrf_k – federated only — the cross-source RRF rank constant (default: DFLT_RRF_K; distinct from the within-corpus hybrid rrf_k in search_kw).
loader – optional body resolver for disclosure (see disclose()).
expand – a NeighborhoodPolicy — also stitch each committed hit’s neighborhood from its corpus’s stored records into Disclosure.passage (retrieval-time context expansion, see ir.expand). Works at any disclose_level; the federated form resolves each hit’s corpus via its source.
**search_kw – any other ir.retrieve.search() keyword (rrf_k, rerank, bm25, …).

Returns:

a DiscoveryResult (.to_dict() for JSON / qh). Federated results add signals["per_source"] (per-corpus n_retrieved / top_score / floor / abstained) and each disclosure carries its source.
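
The federated min_score rules above can be sketched in plain Python. This is a minimal illustration, not ir's implementation: resolve_floors, the calibrated lookup table, the ValueError type, and the treatment of names missing from a mapping (no floor) are all assumptions made for the example.

```python
def resolve_floors(min_score, sources, calibrated):
    """Return {source_name: floor or None} for a federated query.

    min_score  -- None, "auto", or a {name: float | "auto" | None} mapping
    sources    -- the source names the caller passed explicitly
    calibrated -- hypothetical {name: stored calibrated floor} lookup
    """
    if min_score is None:
        return {name: None for name in sources}
    if isinstance(min_score, (int, float)):
        # One number cannot apply across corpora with different score scales.
        # (The exception type is an assumption; the docs only say it raises.)
        raise ValueError("federated min_score must be 'auto', a mapping, or None")
    if min_score == "auto":
        spec = {name: "auto" for name in sources}
    else:
        spec = dict(min_score)
    floors = {}
    for name in sources:
        wanted = spec.get(name)  # assumption: a missing name means no floor
        # "auto" with no stored calibration degrades to no floor.
        floors[name] = calibrated.get(name) if wanted == "auto" else wanted
    return floors
```

Each resolved floor would then gate its own source's raw scores before any rank fusion, which is why the fused ranking itself is never compared against a floor.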

ir.expand(hit: SearchHit, corpus: Any, *, policy: Callable[[SearchHit, Sequence[Record]], Sequence[Record]] | None = None) → Passage[source]

Expand hit into a Passage of its neighborhood in corpus.

Fetches the hit’s sibling records through the ledger (ir.retrieve.records_for_artifact()), asks policy which to keep, and stitches them in plan order with overlap-aware dedupe. The default policy is a ±DFLT_WINDOW sentence window; pass parent_policy() for the whole artifact, or any NeighborhoodPolicy.

Parameters:

hit – the seed SearchHit (its identity and score pass through to the Passage untouched).
corpus – a Corpus, CorpusStore, or corpus name — whatever records_for_artifact() accepts. Must be the corpus the hit came from.
policy – which siblings make up the neighborhood (default: sentence window). A policy that selects nothing degrades the passage to the hit’s own text (record_ids=()) rather than returning nothing.

Raises:

KeyError – the corpus has no ledger entry for the hit’s artifact (ir.retrieve.NoLedgerEntry), or the ledger is stale — an entry listing records missing from the store.
SeedNotFound – the default window policy cannot find the seed among its artifact’s stored records (stale hit / wrong corpus).
ValueError – the policy returned records that are not siblings of the hit’s artifact (operator-enforced safety), or the seed hit lacks surface_index (hand-built hit) under the default window policy — use parent_policy(), which needs no seed position.

ir.retrieve.fuse_hits(): merge per-source ranked hit lists into one ranking, by rank rather than by score.

The cross-source counterpart of the within-corpus hybrid fusion: scores from different (corpus, mode, embedder) tuples live on incommensurable scales (ir_07: “a different model re-scales everything”), so raw scores never cross the source boundary — within each source they order and dedup that source’s hits (one scale, sound), and across sources only ranks interact, via weighted Reciprocal Rank Fusion: each hit contributes weights[source] / (rrf_k + rank).

Parameters:

hits_by_source – {source_name: ranked hits}. Hits without a source are stamped with their mapping key (existing tags win, so one corpus bound under two keys still counts as one source). A None key is the untagged pseudo-source: its hits fuse as one rank group and stay unattributed (source=None). Within each list, duplicate artifacts — and, when identity is given, identity-duplicates — collapse to their best raw score before ranking, so a multi-query / multi-round pool can never double-count one artifact’s RRF mass.
rrf_k – the RRF rank constant (standard default 60).
weights – optional per-source trust dial (default 1.0 each) — a source’s contribution scales linearly, no score comparability needed. Keys naming sources absent from hits_by_source are ignored (a per-round pool may legitimately lack a configured source); callers with a closed source set should validate keys upfront, as federated ir.discover() does.
identity – how cross-source duplicates merge — see Identity. Default None: never; each (source, artifact_id) stays a distinct result.
k – truncate the fused ranking to this many hits.

Returns:

the fused hits, best-first. Each carries the fused score in score and keeps its pre-fusion magnitude as metadata["source_score"] (+ "source_rank"), so downstream consumers (abstention gates, LLM judges) never lose the per-source signal. When an identity merge combined several sources’ hits, metadata["fused_sources"] lists them and the representative hit is the one with the best rank. Single-source input passes through with raw scores (RRF of one list is that list’s order — same convention as the hybrid fusion’s single-channel fallback), so the fused-score rescaling only happens when there is genuinely something to fuse. The post-fusion score is ordinal: valid for ordering and relative cuts, meaningless against absolute floors — apply calibrated min_score floors per source, before fusing (see ir_07/ir_08 and ir.discover’s federated form).
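
The weighted Reciprocal Rank Fusion described above can be sketched without any ir types. This is an illustrative reduction, not the library's code: hits are plain (artifact_id, raw_score) tuples, ranks are assumed to be 1-based, identity merging is left at its default (never), and weighted_rrf is a hypothetical name. Unlike the documented function, the sketch does not pass single-source input through with raw scores.

```python
def weighted_rrf(hits_by_source, rrf_k=60, weights=None, k=None):
    """Fuse {source: [(artifact_id, raw_score), ...]} using ranks only."""
    weights = weights or {}
    fused = {}
    for source, hits in hits_by_source.items():
        # Within one source the scale is shared, so raw scores may dedup:
        # duplicate artifacts collapse to their best raw score.
        best = {}
        for artifact_id, score in hits:
            if artifact_id not in best or score > best[artifact_id]:
                best[artifact_id] = score
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        weight = weights.get(source, 1.0)
        for rank, (artifact_id, score) in enumerate(ranked, start=1):
            # Across sources only the rank contributes to the fused score.
            fused[(source, artifact_id)] = {
                "score": weight / (rrf_k + rank),
                "source_score": score,
                "source_rank": rank,
            }
    ordered = sorted(fused.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return ordered if k is None else ordered[:k]
```

Because a raw score of 12.0 in one source and 0.9 in another never meet, the only way to favor a source is its weight, which scales its contribution linearly.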

ir.maintain(name: str | None = None, *, all: bool = False, now: datetime | None = None, dry_run: bool = False) → list[MaintenanceResult][source]

Run due background work for one corpus (name) or every registered one (all).

With neither, defaults to all registered corpora. Returns one MaintenanceResult per corpus considered.

ir.maintain_corpus(name: str, *, now: datetime | None = None, dry_run: bool = False, full: bool = True) → MaintenanceResult[source]

Do the due background work for one corpus (idempotent).

Reads the corpus’s resolved policy and its last_maintained time, decides whether a (synopsis-aware) reindex is due and permitted now, and — unless dry_run — runs the incremental build and records the run.

ir.make_llm_formulator(*, rewriter: Callable[[str], str | Sequence[str]] | None = None, prompt: str = 'Rewrite the search query into {n} short, diverse alternative search queries that would retrieve the same target documents: fix typos, expand jargon, and add synonyms, but keep each a terse search phrase. One query per line, no numbering.\n\nQuery: {query}', n: int = 3, fallback: Callable[[str], str | Sequence[str]] | None = None, **prompt_function_kwargs: Any) → Callable[[str], str | Sequence[str]][source]

An LLM-backed Formulator (rewrite / expand / multi-query).

rewriter is an injectable query -> str | [str, ...] callable (a test double, or your own router); when omitted it is built lazily on aix (aix.prompt_func), so importing this module stays offline. n is the multi-query fan-out width. Any error or empty reply falls back to fallback (default: identity_formulator()).
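
The injectable-rewriter and fallback behaviour can be illustrated with a small stand-in. This is a sketch under assumptions, not the library's code: make_formulator is a hypothetical name, the line-splitting of the reply mirrors the "one query per line" prompt, and the fallback here simply returns the original query.

```python
def make_formulator(rewriter, n=3, fallback=lambda query: query):
    """Wrap a query -> str | [str, ...] rewriter with error/empty fallback."""

    def formulate(query):
        try:
            reply = rewriter(query)
        except Exception:
            # Any rewriter error degrades to the fallback.
            return fallback(query)
        texts = [reply] if isinstance(reply, str) else list(reply)
        # One query per non-empty line; blank lines are dropped.
        queries = [
            line.strip()
            for text in texts
            for line in text.splitlines()
            if line.strip()
        ]
        # An empty reply also degrades to the fallback.
        return queries[:n] if queries else fallback(query)

    return formulate
```

Passing a deterministic rewriter like this is also how a test can exercise multi-query fan-out without any network access.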

ir.make_llm_synthesizer(*, summarize: Callable[[str], str] | None = None, prompt: str = 'Write a concise synopsis (2-4 sentences) of the document below: what it is about and what questions it answers, so that a search over synopses can route to it. Output only the synopsis, no preamble.\n\nDocument:\n{text}', model: str | None = None, synthesizer_id: str | None = None, text_key: str | None = None, **prompt_function_kwargs: Any) → Callable[[Artifact], str][source]

An LLM-backed Synthesizer (Artifact → synopsis).

summarize is an injectable text -> str callable (a test double, or your own summarizer); when omitted it is built lazily on aix (aix.prompt_func) on the first synthesis and reused — so importing this module, and even constructing the synthesizer, stays offline. The artifact’s text is extracted with ir.strategy.text_of() using text_key — which with_synopsis() threads from the inner strategy, so the synopsis summarizes the same field the strategy indexes. An empty text, or any synthesis error, yields "" (the surface is then skipped, never a fabricated summary).

The returned callable carries a synthesizer_id attribute (default "aix:{model}:{sha(prompt)[:12]}") that with_synopsis() reads into the corpus’s strategy_id for staleness — a prompt or model change re-synthesizes.

ir.make_search(corpus: Any, *, name: str | None = None, description: str | None = None, k: int = 8, mode: str = 'hybrid') → Callable[[...], dict][source]

Return a corpus-bound search(query, k=...) -> dict tool.

The returned function exposes only query (and k) — the corpus is fixed — so a connector built over it surfaces exactly one corpus and nothing else. Its __name__ / __doc__ are set so an MCP/agent host shows a clean tool name and description. Use this when wiring a single-corpus connector; use search() when the caller should choose the corpus.
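
The corpus-binding idea can be sketched as a closure. This is an assumption-laden illustration rather than ir's implementation: make_bound_search and the search_fn parameter are hypothetical stand-ins, and the real tool's result dict is not reproduced here.

```python
def make_bound_search(search_fn, corpus, *, name, description, default_k=8):
    """Return a search(query, k=...) tool with the corpus fixed."""

    def search(query, k=default_k):
        # The corpus is captured by the closure, never exposed as a parameter.
        return search_fn(corpus, query, k=k)

    # A clean name and docstring are what an agent host would display.
    search.__name__ = name
    search.__doc__ = description
    return search
```

Since the caller can only supply query and k, a connector built on the returned function cannot be steered to another corpus.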

ir.open_corpus(name: str, *, embedder: Any = None) → Corpus[source]

Reopen a previously built corpus by name.

The embedding model is lazily resolved (see _LazyEmbedder): the returned corpus knows its embedder_id from stored config immediately, but only loads the model when a dense/hybrid query actually embeds. So ir ls, ir info, and lexical-only search open a corpus without the model-load cost. Pass embedder= to override the stored spec.
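
Lazy resolution of this kind can be sketched as a wrapper that knows its id up front and defers the expensive load. The class below is a hypothetical illustration, not ir's _LazyEmbedder; the load callable and the call signature are assumptions.

```python
class LazyEmbedder:
    """Know the embedder id immediately; load the model on first use."""

    def __init__(self, embedder_id, load):
        self.embedder_id = embedder_id  # available without loading anything
        self._load = load               # zero-argument factory for the model
        self._model = None

    def __call__(self, texts):
        if self._model is None:
            # First dense/hybrid query pays the load cost, later ones do not.
            self._model = self._load()
        return self._model(texts)
```

Code paths that never embed, such as listing corpora or lexical-only search, never trigger the factory.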

ir.parent_policy() → Callable[[SearchHit, Sequence[Record]], Sequence[Record]][source]

The whole artifact (small-to-big): every stored surface, plan order.

The mid-granularity analogue of disclose(level="body") — but assembled from the indexed segments rather than dereferencing the pointer, so it works for corpora whose artifacts have no on-disk body.

ir.policy_for(name: str)[source]

The effective ir.policy.MaintenancePolicy for corpus name.

Resolves the registered entry’s maintenance over its kind’s smart default over the global default (see ir.policy.resolve_policy()). An unregistered name resolves to the global default policy.

ir.records_for_artifact(store_or_corpus, artifact_id: str, *, surface_kind: str | None = None) → list[Record][source]

All stored records of artifact_id, ordered by surface_index.

The sibling-addressing primitive beneath retrieval-time context expansion: a SearchHit names its artifact (and, via surface_index, which surface of it matched); this returns every surface of that artifact, in plan order, so an expansion policy can stitch neighbors / parents around the hit.

Resolution is ledger-backed only: the artifact’s ledger entry lists its record_ids. Record ids are never re-derived from a per-kind index like metadata["chunk_index"] — on multi-kind strategies that index differs from the plan-global surface_index baked into the id (see ir.base.Record.make_id()), so derivation would fetch wrong or missing siblings.

Parameters:

store_or_corpus – a CorpusStore, anything carrying one as .store (e.g. a Corpus), or a corpus name — resolved straight to its local store: sibling lookup never embeds, so unlike ir.open_corpus() no embedder is loaded.
artifact_id – the artifact whose surfaces to fetch.
surface_kind – restrict to one surface kind (e.g. "readme_chunk"); a known artifact with no surfaces of that kind yields [].

Raises:

NoLedgerEntry – the ledger has no entry for artifact_id (an unknown artifact, or a corpus built without ir.index.build()’s ledger bookkeeping). A KeyError subclass.
KeyError – an entry exists but lists a record missing from the store (a stale ledger: interrupted build or out-of-band delete_record) — data corruption, named in the message.
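
The ledger-backed lookup and its two failure modes can be sketched with dicts. This is a simplified model under assumptions: the ledger is a {artifact_id: [record_id, ...]} dict, the store is a {record_id: record} dict, records are plain dicts, and records_for is a hypothetical name.

```python
class NoLedgerEntry(KeyError):
    """The ledger has no entry for the requested artifact."""


def records_for(ledger, store, artifact_id, surface_kind=None):
    if artifact_id not in ledger:
        raise NoLedgerEntry(artifact_id)
    records = []
    for record_id in ledger[artifact_id]:
        if record_id not in store:
            # Stale ledger: the entry lists a record the store no longer has.
            raise KeyError(f"stale ledger: record {record_id!r} missing from store")
        records.append(store[record_id])
    # Plan order is the plan-global surface_index, not a per-kind index.
    records.sort(key=lambda record: record["surface_index"])
    if surface_kind is not None:
        records = [r for r in records if r["kind"] == surface_kind]
    return records
```

Catching KeyError covers both cases, while catching NoLedgerEntry alone distinguishes an unknown artifact from a corrupted ledger.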

ir.register(name: str, kind: str, *, embedder: str = 'default', strategy: Any = None, maintenance: Mapping[str, Any] | None = None, storage: Mapping[str, Any] | None = None, **params) → dict[source]

Register (or overwrite) a named corpus definition.

Beyond the v1 kind / embedder / params, an entry may now carry (all optional, with smart per-kind defaults applied at resolution time — see ir.policy):

strategy — an IndexingStrategy (or a {"name", "params"} spec) persisted so the corpus’s segmentation is stable across rebuilds. None keeps the preset’s default strategy.
maintenance — the background-work policy dict (reindex / synopsis; validated here, see ir.policy.MaintenancePolicy).
storage — the persistence backend (default {"backend": "local"}).

Entries written by older ir (none of these keys) keep working unchanged.

ir.resolve_policy(entry: dict | None) → MaintenancePolicy[source]

The effective policy for a registry entry: entry over kind over global.

A v1 entry (no maintenance key) resolves to its kind’s smart default, so existing corpora gain a sensible policy without a migration.
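
The "entry over kind over global" precedence can be sketched with a ChainMap. The default values and the kind name below are invented for illustration; only the layering order comes from the text above, and the real function returns a MaintenancePolicy rather than a dict.

```python
from collections import ChainMap

# Hypothetical defaults, for illustration only.
GLOBAL_DEFAULT = {"reindex": "daily", "synopsis": False}
KIND_DEFAULTS = {"md_reports": {"reindex": "hourly"}}


def resolve(entry):
    """Layer an entry's maintenance over its kind default over the global one."""
    entry = entry or {}
    layers = ChainMap(
        entry.get("maintenance") or {},            # highest precedence
        KIND_DEFAULTS.get(entry.get("kind"), {}),  # per-kind smart default
        GLOBAL_DEFAULT,                            # lowest precedence
    )
    return dict(layers)
```

An entry with no maintenance key contributes an empty top layer, so it picks up its kind's default with no migration step.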

ir.retriever_for(name: str, **search_defaults: Any)[source]

A Retriever bound to the registered corpus name.

Opens the corpus (it must have been built) and wraps it with ir.as_retriever(); search_defaults (e.g. mode="hybrid") bind to every call.

ir.retrievers(**search_defaults: Any) → Mapping[str, Any][source]

A lazy Mapping[name, Retriever] view over the registry (ir_09 §8).

The query-time projection of the build-recipe registry: each value is a ready-to-call Retriever. This is the source-registry facade an orchestration layer (raglab) consumes — it never opens a corpus until the key is accessed, and always reflects the current registered() set. search_defaults apply to every source.
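
A lazy registry view of this shape can be sketched with collections.abc.Mapping. This is an illustrative stand-in, not ir's class: LazyRetrievers, the registered callable, and open_retriever are hypothetical names.

```python
from collections.abc import Mapping


class LazyRetrievers(Mapping):
    """A live name -> retriever view that opens nothing until indexed."""

    def __init__(self, registered, open_retriever):
        self._registered = registered  # callable returning current names
        self._open = open_retriever    # callable: name -> retriever

    def __getitem__(self, name):
        if name not in self._registered():
            raise KeyError(name)
        return self._open(name)  # the only place a corpus gets opened

    def __contains__(self, name):
        # Overridden so membership tests do not open a corpus.
        return name in self._registered()

    def __iter__(self):
        return iter(self._registered())

    def __len__(self):
        return len(list(self._registered()))
```

Because the names are re-read on every call, a corpus registered after the view was created shows up without rebuilding the view.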

ir.search(corpus, query, **kwargs)[source]

Search a Corpus (or a corpus name, reopened lazily).

Thin facade over ir.retrieve.search(); **kwargs are forwarded to it — the useful ones are k (how many hits), mode ("dense" / "lexical" / "hybrid"), filter (a vd Mongo-style metadata filter), surfaces (restrict to surface kinds), and per_artifact (collapse to the best surface per artifact). See ir.retrieve.search() for the full signature and defaults.

ir.search_tool(query: str, *, corpus: Any, k: int = 8, mode: str = 'hybrid', filter: dict | None = None) → dict

Search a named ir corpus and return a JSON-serializable result dict.

Thin agent-callable wrapper over ir.discover() — returns its .to_dict() (committed results, scores, disclosures), fit to hand straight back from an MCP tool or HTTP endpoint. The corpus is a parameter, so this single function serves any corpus.

Parameters:

query – the natural-language query.
corpus – a registered corpus name (str), a list of names (federated search), or a built ir.Corpus.
k – maximum number of results.
mode – "dense" | "lexical" | "hybrid".
filter – optional vd Mongo-style metadata filter (hard pre-filter).

ir.select(hits: Sequence[SearchHit], *, strategy: str | Callable[[Sequence[SearchHit]], list[SearchHit]] = 'conservative', max_k: int = 3, rel: float = 0.9, gap_ratio: float = 0.5, min_score: float | None = None) → Selection[source]

Commit to a subset of ranked hits — the selection stage.

Parameters:

hits – ranked SearchHits (best first), as returned by ir.retrieve.search().
strategy – "conservative" (the default distractor-robust commit), one of "top_k" / "abs_threshold" / "rel_threshold" / "score_gap", or any Selector callable (hits -> subset) — e.g. one built by make_llm_selector().
max_k – never commit to more than this many (caps distractor exposure).
rel – relative-to-top keep threshold for "conservative" / the ratio for "rel_threshold".
gap_ratio – score-gap elbow ratio — used by the "score_gap" strategy only ("conservative" deliberately uses rel alone, not an elbow; see this module’s docstring).
min_score – optional absolute floor; with "conservative" the selector abstains when even the top hit falls below it (also usable as the "abs_threshold" floor).

Returns:

a Selection. abstained is True iff selected is empty.
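
The "conservative" commit can be approximated in a few lines. This sketch assumes the rule is "keep hits scoring at least rel times the top score, capped at max_k, and abstain when the top score is below min_score"; the exact comparison ir uses is not stated here, and the sketch assumes non-negative scores. conservative_select is a hypothetical name that works on bare floats.

```python
def conservative_select(scores, max_k=3, rel=0.9, min_score=None):
    """Return the indices committed from a best-first list of scores."""
    if not scores:
        return []
    top = scores[0]
    if min_score is not None and top < min_score:
        return []  # abstain: even the best hit is below the absolute floor
    # Keep only hits close to the top score, then cap distractor exposure.
    kept = [i for i, score in enumerate(scores) if score >= rel * top]
    return kept[:max_k]
```

An empty return value corresponds to abstention, so a caller can test the result's truthiness instead of inspecting scores.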

ir.sentence_window_policy(k: int = 1) → Callable[[SearchHit, Sequence[Record]], Sequence[Record]][source]

±k same-kind neighbors around the seed (NEXT/PREV expansion).

The window runs over the artifact’s surfaces of the seed’s kind only, in plan order — a readme_chunk window never swallows the description surface. k=0 selects just the seed’s own record.
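
The same-kind window can be sketched over simplified records. Rec is a hypothetical stand-in for ir's Record, and window takes the seed's surface_index directly instead of a SearchHit; both are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rec:  # hypothetical, minimal stand-in for a stored record
    surface_index: int
    kind: str
    text: str


def window(seed_index, records, k=1):
    """Return the seed plus up to k same-kind neighbors on each side."""
    seed = next(r for r in records if r.surface_index == seed_index)
    # Only surfaces of the seed's kind take part, in plan order.
    same_kind = sorted(
        (r for r in records if r.kind == seed.kind),
        key=lambda r: r.surface_index,
    )
    position = same_kind.index(seed)
    return same_kind[max(0, position - k): position + k + 1]
```

At the start or end of an artifact the window simply shrinks, and with k=0 it contains the seed alone.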

ir.tag_source(hits: Sequence[SearchHit], source: str | None) → list[SearchHit][source]

Stamp source on every hit that doesn’t already carry one.

Existing tags win: a hit already attributed to a corpus keeps that attribution (so re-tagging under a different registry key cannot double-count one corpus as two sources). A None source is the untagged pseudo-source — hits pass through unattributed.
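
The "existing tags win" rule can be sketched with an immutable stand-in for SearchHit. Hit and tag are hypothetical names; the real SearchHit has more fields.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Hit:  # hypothetical, minimal stand-in for SearchHit
    artifact_id: str
    score: float
    source: Optional[str] = None


def tag(hits, source):
    """Stamp source on untagged hits; already-tagged hits are left alone."""
    return [
        hit if hit.source is not None else replace(hit, source=source)
        for hit in hits
    ]
```

Tagging with None is a no-op in this sketch, which matches the untagged pseudo-source passing hits through unattributed.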

ir.traverse(query: str, store: Any, *, policy: WalkPolicy, max_depth: int = 2, node_budget: int = 64, k: int = 10) → list[SearchHit][source]

Walk store from query under policy, returning the top-k hits.

The loop — score the frontier → select → commit → expand — is the operator’s; the safety primitives are non-negotiable and enforced here: a node id is committed at most once (the visited-set), expansion stops at max_depth, and no more than node_budget nodes are ever committed. A policy whose expand cycles forever and whose stop never fires still terminates.

Parameters:

query – the user intent.
store – passed to policy verbatim — a Corpus for collapsed_tree_policy(), a CorpusGraph for an artifact-link policy. The operator never inspects it.
policy – the WalkPolicy (e.g. collapsed_tree_policy()).
max_depth – maximum expansion depth from a seed (safety).
node_budget – maximum nodes committed (safety).
k – number of hits to return.

Returns:

the committed hits, best-first, top-k — each a SearchHit with metadata["walk_depth"] / ["seed"].
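
The three safety primitives (visited-set, depth cap, node budget) can be sketched as a bounded best-first walk. This is an illustration under assumptions, not ir's operator: the WalkPolicy interface is replaced by two hypothetical callables, expand(node) and score(node), and nodes are plain hashable values.

```python
def bounded_walk(seeds, expand, score, max_depth=2, node_budget=64, k=10):
    """Best-first walk that terminates even if expand cycles forever."""
    visited = set()
    committed = []  # (score, node, depth)
    frontier = [(seed, 0) for seed in seeds]
    while frontier and len(committed) < node_budget:
        # Score the frontier and take its best node.
        frontier.sort(key=lambda item: score(item[0]), reverse=True)
        node, depth = frontier.pop(0)
        if node in visited:
            continue  # a node id is committed at most once
        visited.add(node)
        committed.append((score(node), node, depth))
        if depth < max_depth:  # expansion stops at max_depth
            frontier.extend(
                (neighbor, depth + 1)
                for neighbor in expand(node)
                if neighbor not in visited
            )
    committed.sort(key=lambda item: item[0], reverse=True)
    return committed[:k]
```

Every loop iteration removes one frontier item, and items are only added when a node is committed, so the number of iterations is bounded by the budget and the fan-out regardless of what the policy does.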

ir.with_synopsis(strategy: IndexingStrategy, *, synthesize: Callable[[Artifact], str] | None = None, synthesizer_id: str | None = None, synopsis_kind: str = 'synopsis') → IndexingStrategy[source]

Wrap strategy to add one LLM-derived synopsis surface per artifact.

Parameters:

strategy – the inner IndexingStrategy (Chunked, Package, …). Its surfaces are kept; the synopsis is prepended.
synthesize – an injectable Artifact -> str (test double / custom summarizer). Omitted → make_llm_synthesizer() (lazy aix).
synthesizer_id – explicit identity stamp for staleness (recommended when injecting an unnamed callable / lambda). Omitted → the synthesizer’s own synthesizer_id / __qualname__.
synopsis_kind – the surface kind (default "synopsis", a summary kind).

Returns:

an IndexingStrategy usable anywhere a strategy is — ir.CorpusSource.from_mapping(docs, name=..., strategy=with_synopsis(...)).

>>> strat = with_synopsis(Chunked(), synthesize=lambda a: "a summary")