CANONICAL EVAL · 29/30 · GO WITH CAVEATS · 20260519-153524
A retrieval system designed around Trust.
Firm Memory is built to refuse what it can't ground, surface what it doesn't know, and cite what it does. The architecture is shaped by that constraint first, performance second.
Stage-1 retrieval, the step that decides which matters to look at, blends three signals: semantic similarity over matter summaries, full-text search over the ledger, and a one-step walk over the matter cross-reference graph.
The three signals are fused with reciprocal rank fusion (RRF) at k = 60. Rank-1 order is RRF-driven; cosine remains attached to every candidate for transparency and downstream routing.
This matters because cosine alone gets the wrong lead matter on a non-trivial fraction of real queries. RRF lets a matter that is moderately similar on text but strongly connected to the right cluster, or one that the ledger surfaces on a partner name or matter type, beat a noisier cosine winner.
Fig 01 · RRF over three signals · k = 60 · top-5 per signal · fused list at right
cosine · matter summaries · w = uniform
01 · Project Meridian · .81
02 · Project Lyra · .79
03 · Project Falcon · .74
04 · NSI practice note (rev. Q4'24) · .68
05 · Project Sable · .61
fts · ledger · bm25
01 · Project Falcon · 12.4
02 · Project Lyra · 11.6
03 · Project Meridian · 10.8
04 · Project Aurora (gap) · 9.4
05 · Project Tern · 7.1
graph · cross-reference walk · 1-hop
01 · Project Lyra · .92
02 · Project Meridian · .88
03 · Project Falcon · .83
04 · Project Aurora · .71
05 · NSI practice note · .62
Fused rank · RRF
01 · Project Lyra · .0364
02 · Project Meridian · .0357
03 · Project Falcon · .0341
04 · Project Aurora (gap) · .0192
05 · NSI practice note · .0184
06 · Project Sable · .0098
rrf_score(matter) = Σ 1 / (k + rank_in_signal_i)
i ∈ {cosine, fts, graph}
k = 60
Every retrieval decision Firm Memory makes starts here. The fused order is what stage 2 reads.
The trade-off is deliberate: RRF accepts that no single signal is authoritative. The system reports all three, ranked and fused, so the shortlist is defensible to the lawyer reading it, not just to the model.
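The fusion formula above can be sketched in a few lines. This is a minimal illustration, not Firm Memory's implementation: the function name `rrf_fuse` and the tie-break by name are assumptions, and the ranked lists are transcribed from Fig 01 with name variants ("Project Aurora (gap)", "NSI practice note (rev. Q4'24)") merged. The sketch reproduces the figure's fused order; the absolute scores the plain formula yields differ from the values displayed in the figure, which may reflect scaling not described here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: score(m) = sum over signals of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in rankings.values():
        for rank, matter in enumerate(ranked, start=1):
            scores[matter] += 1.0 / (k + rank)
    # Sort by score descending; tie-break by name (an assumption) so the
    # order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Top-5 lists transcribed from Fig 01 (name variants merged).
rankings = {
    "cosine": ["Project Meridian", "Project Lyra", "Project Falcon",
               "NSI practice note", "Project Sable"],
    "fts": ["Project Falcon", "Project Lyra", "Project Meridian",
            "Project Aurora", "Project Tern"],
    "graph": ["Project Lyra", "Project Meridian", "Project Falcon",
              "Project Aurora", "NSI practice note"],
}

fused = rrf_fuse(rankings)
for position, (matter, score) in enumerate(fused, start=1):
    print(f"{position:02d} {matter:<20} {score:.4f}")
```

Project Lyra leads the fused list without winning cosine because it sits at rank 1 or 2 in all three signals, which is the behaviour the section describes.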
§ 02 — Two-stage selection
Matters first. Paragraphs second. With a doc-type diversity floor.
Stage 2 is per-matter. For each shortlisted matter, the system retrieves up to M answer paragraphs (default 3) and K reflection paragraphs (default 2) from the matter's documents.
Answer paragraphs exclude reflection document types (post-matter reviews, research memos) and are subject to a doc-type diversity penalty. The selection avoids stacking three paragraphs from the same memo when other documents in the matter could speak to the question.
Reflection paragraphs are pulled separately and force-included regardless of cosine score, because reflection content (limitation language, hindsight notes, “first-of-its-kind” flags) is what feeds the gap detection layer downstream.
Fig 02 · Stage 1 → Stage 2 · Reflection pool selected by a different rule
matter_opening_memo : ¶1 · deal facts & sovereign counterparty
fsr_readiness_assessment : ¶4 · first FSR clearance shape
regulatory_bundle : ¶3 · NSI / FCA s.178 timing
Reflection pool · force-included · feeds T1 gaps
post_matter_review : ¶6 · “first-of-its-kind FSR” flag
research_memo : ¶2 · limitation note on EU NZIA exposure
This is why the home-page demo's bench picture is honest about how many matters were touched: the system structurally cannot collapse five matters into one paragraph from one document. Diversity is enforced at retrieval, not begged for at synthesis.
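The stage-2 rule described above can be sketched as a greedy selection. Everything here is a sketch under stated assumptions: the `Paragraph` type, the function `select_for_matter`, the penalty value of 0.15, the subtractive form of the penalty, and the sample cosine scores are all hypothetical. Only the defaults (M = 3, K = 2), the two reflection document types, and the force-inclusion of reflection paragraphs come from the text.

```python
from dataclasses import dataclass

# Reflection document types named in the text.
REFLECTION_TYPES = {"post_matter_review", "research_memo"}


@dataclass(frozen=True)
class Paragraph:
    doc_type: str
    idx: int
    cosine: float


def select_for_matter(paragraphs, m=3, k=2, penalty=0.15):
    """Pick up to m answer paragraphs and up to k reflection paragraphs."""
    answer_pool = [p for p in paragraphs if p.doc_type not in REFLECTION_TYPES]
    reflection_pool = [p for p in paragraphs if p.doc_type in REFLECTION_TYPES]

    answers, used = [], {}
    while answer_pool and len(answers) < m:
        # Assumed penalty form: subtract a fixed amount for every paragraph
        # already chosen from the same document type.
        best = max(answer_pool,
                   key=lambda p: p.cosine - penalty * used.get(p.doc_type, 0))
        answers.append(best)
        answer_pool.remove(best)
        used[best.doc_type] = used.get(best.doc_type, 0) + 1

    # Reflection paragraphs are force-included: no minimum cosine applies.
    reflections = sorted(reflection_pool, key=lambda p: -p.cosine)[:k]
    return answers, reflections


# Hypothetical scores for one matter; doc types and indices echo Fig 02.
paragraphs = [
    Paragraph("matter_opening_memo", 1, 0.80),
    Paragraph("matter_opening_memo", 2, 0.78),
    Paragraph("matter_opening_memo", 3, 0.76),
    Paragraph("fsr_readiness_assessment", 4, 0.70),
    Paragraph("regulatory_bundle", 3, 0.66),
    Paragraph("post_matter_review", 6, 0.20),
    Paragraph("research_memo", 2, 0.10),
]
answers, reflections = select_for_matter(paragraphs)
```

Without the penalty, all three answer slots would go to the opening memo. With it, the second and third slots go to other document types, and both low-scoring reflection paragraphs still come through.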
§ 03 — The bench picture
What retrieval actually returned, surfaced before the model writes.
The bench picture is six deterministic fields plus three transparency stats. Built from the matters table and the stage-1 shortlist directly. No LLM involvement.
Six fields: matter count, practice distribution, lead-partner concentration, temporal spread, office distribution, deal value range where known.
Three transparency stats: stage-1 cosine span, cosine gap between rank-1 and rank-N, paragraph depth across the bundle. These let the reader judge whether the shortlist is tight or diffuse without asking the model to self-assess.
Fig 03 · Bench response schema · deterministic · no LLM
The fields a Knowledge Director would derive by hand if they had two hours per query.
The bench picture is non-negotiable. Every response carries it. If retrieval returned only one matter, the bench says so plainly, including the degraded-copy convention (“Firm's only matter on this fact pattern”) that makes thin retrieval visible rather than hidden.
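A bench picture of this kind can be built with plain aggregation. The sketch below is illustrative only: the schema in Fig 03 is not reproduced in this text, so every key name, the record layout of `matters`, and the precise definition of each statistic (for example, concentration as the top partner's share, paragraph depth as a total) are assumptions. The sample records are invented for the example.

```python
from collections import Counter


def bench_picture(matters, cosines, paragraph_counts):
    """Six fields plus three stats; `cosines` is in stage-1 rank order."""
    if not matters:
        raise ValueError("bench picture needs at least one shortlisted matter")
    partners = Counter(m["lead_partner"] for m in matters)
    years = [m["year"] for m in matters]
    values = [m["deal_value"] for m in matters if m.get("deal_value") is not None]
    bench = {
        "matter_count": len(matters),
        "practice_distribution": dict(Counter(m["practice"] for m in matters)),
        # Assumed definition: share of matters led by the most frequent partner.
        "lead_partner_concentration": max(partners.values()) / len(matters),
        "temporal_spread": (min(years), max(years)),
        "office_distribution": dict(Counter(m["office"] for m in matters)),
        "deal_value_range": (min(values), max(values)) if values else None,
        "cosine_span": max(cosines) - min(cosines),
        "cosine_gap_rank1_rankN": cosines[0] - cosines[-1],
        # Assumed definition: total paragraphs retrieved across the bundle.
        "paragraph_depth": sum(paragraph_counts),
    }
    if len(matters) == 1:
        # Degraded-copy convention quoted in the text.
        bench["note"] = "Firm's only matter on this fact pattern"
    return bench


# Invented sample records, for illustration only.
matters = [
    {"practice": "M&A", "lead_partner": "Partner A", "year": 2021,
     "office": "London", "deal_value": None},
    {"practice": "M&A", "lead_partner": "Partner A", "year": 2023,
     "office": "London", "deal_value": 450},
    {"practice": "Regulatory", "lead_partner": "Partner B", "year": 2024,
     "office": "Brussels", "deal_value": 1200},
]
bench = bench_picture(matters, cosines=[0.81, 0.79, 0.74], paragraph_counts=[3, 3, 2])
```

Nothing in the function calls a model, so the same shortlist always produces the same bench.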
§ 04 — Four-channel gaps
The system doesn't conflate “we don't know” with “we don't know yet.”
The gaps and limits panel runs every response. It's never omitted. The content comes from four distinct channels, surfaced separately so the reader can tell which kind of gap they're looking at.
T1 · Acknowledged limitations
From the firm's own reflection
Limitation language pulled from post-matter reviews and research memos in the corpus.
Eval example · Q08 Lyra · “Project Lyra's research memo notes 'no prior firm experience with EU FSR in a live transactional context.'”
T2 · Coverage
When retrieval is thin or clustered
An honest note when retrieval is thin or thematically clustered, with no implied external facts.
Eval example · Q19 Russian inbound · “The retrieval slice for 'Russian inbound investment' returned no substantive matter bundles.”
T3 · Corpus adjacency
Named but not present
Matters referenced but not present in the corpus, or gap matters surfaced via cross-reference.
Eval example · Q20 Aurora · “Project Aurora (HC-2018-MA-0034) is named in the ledger but holds no documents.”
T4 · Question coverage
Facets the text doesn't cover
Facets of the question the retrieved text doesn't cover, surfaced by the synthesis layer from its own unsupported_aspects field.
Eval example · Q03 enforcement · “The retrieval set does not cover post-completion enforcement outcomes for the three matters.”
T1 through T3 are built at retrieval time, deterministically, from the data. T4 is the only channel that the language model contributes to, and the model can only write into T4 if it explicitly flagged a question facet as unsupported in its own structured output.
This is the trust architecture's load-bearing decision. The gap panel cannot say something the retrieval layer hasn't already established. The model cannot hallucinate a gap any more than it can hallucinate an answer.
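The channel separation can be made concrete with a small assembly function. This is a sketch, not the production code: `build_gap_panel` and the panel key names are hypothetical. The one property it is meant to show comes from the text: T1 to T3 arrive from retrieval, and the model's only route into the panel is its own `unsupported_aspects` field.

```python
def build_gap_panel(limitations, coverage_notes, adjacency_notes, synthesis_output):
    """Assemble the four channels; T1-T3 come from retrieval, T4 from the model."""
    return {
        "T1_acknowledged_limitations": list(limitations),
        "T2_coverage": list(coverage_notes),
        "T3_corpus_adjacency": list(adjacency_notes),
        # Only the structured unsupported_aspects field is read. Any other
        # gap-like text the model emits has no path into the panel.
        "T4_question_coverage": list(synthesis_output.get("unsupported_aspects", [])),
    }


panel = build_gap_panel(
    limitations=["no prior firm experience with EU FSR in a live transactional context"],
    coverage_notes=[],
    adjacency_notes=["Project Aurora (HC-2018-MA-0034) is named in the ledger but holds no documents."],
    synthesis_output={
        "claims": [],
        "unsupported_aspects": ["post-completion enforcement outcomes for the three matters"],
        "invented_gap": "ignored, because nothing reads this key",
    },
)
```

All four keys are always present, even when a channel is empty, which matches the statement that the panel is never omitted.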
§ 05 — Shape-aware synthesis
Five response shapes. One pipeline.
Firm Memory routes queries to one of five response shapes before synthesis runs. The router reads stage-1 output (RRF rank, cosine, paragraph counts, gap-matter membership) and picks the shape that matches the epistemic reality of what was found.
Fig 05 · Shape router · five shapes · deterministic on stage-1 output
Shape | When it fires | Claim cap
hero | ≥ 3 matters with substance; rank-1 cosine ≥ 0.55 OR rank-1 RRF ≥ 0.032 + ≥ 3 ¶s | 6 claims
thin | 1–2 substantive matters, or tail bundle weak | 3 claims
known_empty | rank-1 is a status=gap matter, or atomic gap fires | 3 claims
known_empty_decomposed | query entity unmatched in retrieved text · theme adjacency available |
The shapes exist because the same prompt can't produce the right answer for both “have we advised on this exact fact pattern” and “have we ever done anything like this.” Stretching one template to fit both is the failure mode behind every confidently wrong RAG demo.
Each shape has its own prompt template and its own claim cap. Hero shape permits breadth across matters with confident multi-source claims. Thin shape says less and signals the narrow base. Known-empty shapes refuse to fabricate substance. Refuse shape stops before synthesis runs at all.
The router is deterministic. The same query in the same corpus state produces the same shape.
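A deterministic router over stage-1 output might look like the following. Treat it as one possible reading rather than the actual router: the `Stage1Summary` fields, the precedence order of the checks, the fallback to thin, and the grouping of the hero condition as "cosine ≥ 0.55 OR (RRF ≥ 0.032 AND ≥ 3 paragraphs)" are assumptions. The thresholds and the three stated claim caps come from Fig 05; caps not stated there are left out.

```python
from dataclasses import dataclass

# Claim caps stated in Fig 05; other shapes have no cap given in the text.
CLAIM_CAPS = {"hero": 6, "thin": 3, "known_empty": 3}


@dataclass(frozen=True)
class Stage1Summary:
    refused: bool             # refuse gate fired before synthesis
    substantive_matters: int
    rank1_cosine: float
    rank1_rrf: float
    rank1_paragraphs: int
    rank1_is_gap: bool        # rank-1 matter has status=gap
    atomic_gap: bool
    entity_unmatched: bool    # query entity absent from retrieved text
    theme_adjacent: bool
    tail_weak: bool


def route(s):
    """Pure function of stage-1 output: same input, same shape."""
    if s.refused:
        return "refuse"
    if s.rank1_is_gap or s.atomic_gap:
        return "known_empty"
    if s.entity_unmatched and s.theme_adjacent:
        return "known_empty_decomposed"
    strong = s.rank1_cosine >= 0.55 or (s.rank1_rrf >= 0.032 and s.rank1_paragraphs >= 3)
    if s.substantive_matters >= 3 and strong and not s.tail_weak:
        return "hero"
    return "thin"  # assumed fallback when no other shape fires


base = dict(refused=False, substantive_matters=3, rank1_cosine=0.81,
            rank1_rrf=0.0364, rank1_paragraphs=3, rank1_is_gap=False,
            atomic_gap=False, entity_unmatched=False, theme_adjacent=False,
            tail_weak=False)
shape = route(Stage1Summary(**base))
print(shape, CLAIM_CAPS.get(shape))
```

Because `route` reads only its argument and has no randomness or model call, repeating a query against the same corpus state returns the same shape.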
§ 06 — Citation enforcement
Every sentence cites a paragraph. The model cannot answer ungrounded.
Synthesis output is a JSON object with two fields: a claims array where each item is one sentence with one or more paragraph citations, and an unsupported_aspects array (which feeds gap channel T4).
Fig 06 · Structured synthesis output · one sentence per claim · one claim, one citation set
{
  "claims": [
    {
      "text": "Three matters establish the relevant precedent.",
      "citations": [
        "HC-2021-MA-0089:matter_opening_memo:1",
        "HC-2023-MA-0142:matter_opening_memo:1",
        "HC-2024-MA-0167:comprehensive_deal_summary:3"
      ]
    }
  ],
  "unsupported_aspects": [
    "post-completion enforcement outcomes for the three matters"
  ]
}
Step 1 · Synthesis emits structured JSON
Step 2 · validate · Each citation key checked against retrieval set
Step 3 · retry · On fail: one retry with stricter prompt reminder
Step 4 · degrade · If retry fails: degraded output, never confidently wrong
One sentence per claim. The key format matter_id:doc_type:paragraph_idx matches the home-page demo's hover tooltips.
Every citation key (matter_id:doc_type:paragraph_idx) is validated against the retrieval set before the response is shown to the user. A hallucinated citation, a key that doesn't exist in the actual retrieved paragraphs, fails validation and triggers one retry with a stricter prompt reminder.
If the retry also fails validation, the system surfaces degraded output rather than confidently wrong output. No query produced degraded output on the last full run of the canonical eval (0 of 30).
The one-sentence-per-claim constraint is enforced both at the prompt level and at the validation layer. It's the smallest unit of citation we found that holds: a sentence is short enough to be grounded specifically, long enough to carry meaning on its own.
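The validate, retry, degrade loop can be sketched as follows. The function names, the `strict` flag standing in for the stricter prompt reminder, and the shape of the degraded result are assumptions; the source describes the behaviour but not the interface. `synthesize` is a placeholder for whatever calls the model.

```python
def invalid_citations(output, retrieved_keys):
    """Return every cited key that is absent from the retrieval set."""
    return [key
            for claim in output.get("claims", [])
            for key in claim.get("citations", [])
            if key not in retrieved_keys]


def is_valid(output, retrieved_keys):
    """Every claim must cite at least one key, and every key must resolve."""
    claims = output.get("claims", [])
    every_claim_cited = all(claim.get("citations") for claim in claims)
    return every_claim_cited and not invalid_citations(output, retrieved_keys)


def synthesize_with_enforcement(synthesize, retrieved_keys):
    """First attempt, then exactly one stricter retry, then degrade."""
    for strict in (False, True):
        output = synthesize(strict=strict)
        if is_valid(output, retrieved_keys):
            return {"status": "ok", "output": output}
    # Assumed degraded form: no claims are shown rather than unverified ones.
    return {"status": "degraded", "output": None}


retrieved_keys = {
    "HC-2021-MA-0089:matter_opening_memo:1",
    "HC-2023-MA-0142:matter_opening_memo:1",
}


def flaky_model(strict):
    # Stand-in for the model: cites a non-existent key unless reminded.
    key = "HC-2021-MA-0089:matter_opening_memo:1" if strict else "HC-1999-XX-0000:memo:9"
    return {"claims": [{"text": "One matter is on point.", "citations": [key]}],
            "unsupported_aspects": []}


result = synthesize_with_enforcement(flaky_model, retrieved_keys)
print(result["status"])
```

The membership check is against the paragraphs retrieved for this query, not the whole corpus, so a real key that was never retrieved fails validation as well.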
§ 07 — Refuse discipline
The system refuses what a knowledge tool shouldn't do.
Three categories of question are blocked before synthesis runs:
Prediction · “Will the FCA accept this filing?” “What outcome should we expect?”
Client-letter drafting · “Draft a response to opposing counsel.” “Write the cover letter for this submission.”
Firm strategy · “Should we expand the FSR practice?” “Which clients should we deprioritise?”
Fig 07 · Refuse gate · pattern-based · deterministic · no LLM
“Firm Memory retrieves and synthesises firm history; it does not predict, draft, or advise on strategy.”
The gate is deterministic, not LLM-based. Patterns are explicit and testable. A query that triggers the gate gets a single sentence back and nothing more.
This is the cheapest trust signal in the system and the most often missed by competing RAG demos. A research tool that confidently drafts a client letter is a research tool with no boundary. The boundary is the product.
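A pattern-based gate of this kind can be written with the standard `re` module. The patterns below are hypothetical, chosen only to cover the six example queries above; the real pattern set is not given in the text. The refusal sentence is the one quoted in Fig 07.

```python
import re

REFUSAL_SENTENCE = (
    "Firm Memory retrieves and synthesises firm history; "
    "it does not predict, draft, or advise on strategy."
)

# Hypothetical patterns, one list per blocked category.
REFUSE_PATTERNS = {
    "prediction": [r"^\s*will\b", r"\bwhat outcome\b"],
    "client_letter_drafting": [r"^\s*(draft|write)\b"],
    "firm_strategy": [r"^\s*should we\b", r"^\s*which clients should we\b"],
}
COMPILED = {
    category: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for category, patterns in REFUSE_PATTERNS.items()
}


def refuse_category(query):
    """Return the first blocked category the query matches, else None."""
    for category, patterns in COMPILED.items():
        if any(pattern.search(query) for pattern in patterns):
            return category
    return None


def gate(query):
    """Return the single refusal sentence, or None to let the query through."""
    return REFUSAL_SENTENCE if refuse_category(query) else None


print(refuse_category("Will the FCA accept this filing?"))
print(gate("Have we advised on EU FSR clearance?"))
```

Explicit patterns are what make the gate testable: each one can carry its own unit test, and a false positive traces back to a single line.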
§ 08 — The evaluation kit
30 queries. Six axes. A regression-grade test the product is built around.
Firm Memory's behaviour is locked by an evaluation kit that runs against the corpus before every release. The kit holds 30 queries spanning hero, mid, gap, out-of-scope (OOS), and adversarial categories. Each query is scored on six binary axes:
1 · Citation integrity · Every claim resolves to a real paragraph.
3 · Bench fidelity · Bench counts match the retrieval set.
4 · Refuse discipline · OOS queries don't get answers.
5 · Gap surfacing · Gaps appear in the right channel.
6 · Opener behaviour · Multi-matter queries open plurally.
The kit also runs architectural invariants, checks that hold across the whole run, not per-query:
INV-01 · no false positive on gap matters
INV-04 · hero confidence under hybrid retrieval
INV-08 · plural opener on multi-matter hero queries
INV-10 · gap matter named in opener
INV-11 · every cited matter has paragraphs in retrieval
Headline score · 29 / 30
Canonical run 20260519-153524 · clears both blocker and high-severity invariant budgets.
The one fail · Q20
System correctly refuses to fabricate Russian inbound precedent.
The opener doesn't name the specific gap matter (Project Aurora). Behaviour is right; the prose is one sentence away from a six-axis pass.
Logged as · TD-001
v2 ticket: corpus-agnostic gap-entity rewrite. Open in the engineering log. Real regression worth fixing; not a false alarm.
We publish the eval state because trust architecture is not a slide. It's a number. 29 of 30 on a 30-query adversarial kit, with the one fail documented, is the version of credibility we'd want to see from a vendor ourselves.
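The headline number follows from the scoring rule: a query passes only if all six binary axes pass. The sketch below shows that arithmetic under assumptions. The axis key names are hypothetical, the second axis is left as a placeholder because it is not named in this text, and which axis Q20 failed is not stated, so `opener_behaviour` is picked purely for illustration.

```python
# Five axis names follow the list above; "axis_2" is a placeholder for the
# axis that is not named in this text.
AXES = ["citation_integrity", "axis_2", "bench_fidelity",
        "refuse_discipline", "gap_surfacing", "opener_behaviour"]


def query_passes(axes):
    """A query passes only when every binary axis passes."""
    return all(axes[name] for name in AXES)


def headline(results):
    passed = sum(1 for axes in results.values() if query_passes(axes))
    return f"{passed} / {len(results)}"


# Illustrative run: every query passes every axis except Q20.
results = {f"Q{n:02d}": {name: True for name in AXES} for n in range(1, 31)}
results["Q20"]["opener_behaviour"] = False  # illustrative choice of axis

print(headline(results))
failed = [query for query, axes in results.items() if not query_passes(axes)]
```

Scoring per query rather than per axis means a single failed axis costs the whole query, which is why one missing name in an opener shows up in the headline.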
Bring three partners. Leave with a working diagnostic.
Forty-five minutes. NDA-first. We come prepared with public information about your firm.
You leave with our written read on where Firm Memory would pay back inside a quarter, and a candid answer on whether we're the right partner for that work.