How Firm Memory works

§ 01 / 08 — TECHNICAL REFERENCE

CANONICAL EVAL · 29/30 · GO WITH CAVEATS · 20260519-153524

A retrieval system designed around trust.

Firm Memory is built to refuse what it can't ground, surface what it doesn't know, and cite what it does. The architecture is shaped by that constraint first, performance second.


Pipeline · five stages

Cosine · FTS · Graph → RRF
Shape-aware synthesis: Hero · Thin · Empty · Refuse
OUT · Three blocks out: Bench · Answer · Gaps

§ 01 — Hybrid retrieval


Three signals, fused. Not keyword search alone.

Stage-1 retrieval, the step that decides which matters to look at, blends three signals: semantic similarity over matter summaries, full-text search over the ledger, and a one-step walk over the matter cross-reference graph.

The three signals are fused with reciprocal rank fusion (RRF) at k = 60. Rank-1 order is RRF-driven; cosine remains attached to every candidate for transparency and downstream routing.

This matters because cosine alone gets the wrong lead matter on a non-trivial fraction of real queries. RRF lets a matter that is moderately similar on text but strongly connected to the right cluster, or one that the ledger surfaces on a partner name or matter type, beat a noisier cosine winner.

Fig 01 · RRF over three signals · k = 60 · top-5 per signal · fused list last

cosine · matter summaries · w = uniform

Rank	Matter	Score
01	Project Meridian	.81
02	Project Lyra	.79
03	Project Falcon	.74
04	NSI practice note (rev. Q4'24)	.68
05	Project Sable	.61

fts · ledger · bm25

Rank	Matter	Score
01	Project Falcon	12.4
02	Project Lyra	11.6
03	Project Meridian	10.8
04	Project Aurora (gap)	9.4
05	Project Tern	7.1

graph · cross-reference walk · 1-hop

Rank	Matter	Score
01	Project Lyra	.92
02	Project Meridian	.88
03	Project Falcon	.83
04	Project Aurora	.71
05	NSI practice note	.62

Fused rank · RRF

Rank	Matter	Score
01	Project Lyra	.0364
02	Project Meridian	.0357
03	Project Falcon	.0341
04	Project Aurora (gap)	.0192
05	NSI practice note	.0184
06	Project Sable	.0098

rrf_score(matter) = Σ  1 / (k + rank_in_signal_i)
                    i ∈ {cosine, fts, graph}

k = 60

Every retrieval decision Firm Memory makes starts here. The fused order is what stage 2 reads.

The trade-off is deliberate: RRF accepts that no single signal is authoritative. The system reports all three, ranked and fused, so the shortlist is defensible to the lawyer reading it, not just to the model.
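The fusion formula above can be sketched in a few lines of Python. This is a minimal illustration, not Firm Memory's implementation: the function name `rrf_fuse`, the shortened matter labels, and the alphabetical tie-break are assumptions. The rank lists are taken from Fig 01. The absolute scores produced by the plain formula differ from the values displayed in the figure, which may be illustrative or scaled differently, but the fused order comes out the same.

```python
from collections import defaultdict

K = 60  # RRF constant stated in the text


def rrf_fuse(rankings: dict[str, list[str]], k: int = K) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(m) = sum over signals of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, matter in enumerate(ranked, start=1):
            scores[matter] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name (an assumption) for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Rank order per signal, copied from Fig 01 (labels shortened).
rankings = {
    "cosine": ["Lyra", "Meridian", "Falcon", "NSI practice note", "Sable"][:0]
    or ["Meridian", "Lyra", "Falcon", "NSI practice note", "Sable"],
    "fts": ["Falcon", "Lyra", "Meridian", "Aurora", "Tern"],
    "graph": ["Lyra", "Meridian", "Falcon", "Aurora", "NSI practice note"],
}

fused = rrf_fuse(rankings)
for position, (matter, score) in enumerate(fused, start=1):
    print(f"{position:02d} {matter:<20} {score:.4f}")
```

A matter that appears in only one signal (Sable, Tern) contributes a single term to the sum, which is why it falls to the bottom of the fused list even when that signal ranks it in its top five.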

§ 02 — Two-stage selection


Matters first. Paragraphs second. With a doc-type diversity floor.

Stage 2 is per-matter. For each shortlisted matter, the system retrieves up to M answer paragraphs (default 3) and K reflection paragraphs (default 2) from the matter's documents.

Answer paragraphs exclude reflection document types (post-matter reviews, research memos) and are subject to a doc-type diversity penalty. The selection avoids stacking three paragraphs from the same memo when other documents in the matter could speak to the question.

Reflection paragraphs are pulled separately and force-included regardless of cosine score, because reflection content (limitation language, hindsight notes, “first-of-its-kind” flags) is what feeds the gap detection layer downstream.

Fig 02 · Stage 1 → Stage 2 · Reflection pool selected by a different rule

Stage 1 · matter shortlist (top 5)

Rank	Matter	Date
01	Project Lyra	2024
02	Project Meridian	2021
03	Project Falcon	2023
04	Project Aurora (gap)	2018
05	NSI practice note	Q4'24

Stage 2 · bundle for rank-1 matter (Project Lyra)

Project Lyra · paragraph bundle · M = 3 answer · K = 2 reflection

Answer pool · cosine-ranked · diversity floor

matter_opening_memo : ¶1 · deal facts & sovereign counterparty
fsr_readiness_assessment : ¶4 · first FSR clearance shape
regulatory_bundle : ¶3 · NSI / FCA s.178 timing

Reflection pool · force-included · feeds T1 gaps

post_matter_review : ¶6 · “first-of-its-kind FSR” flag
research_memo : ¶2 · limitation note on EU NZIA exposure


This is why the home-page demo's bench picture is honest about how many matters were touched: the system structurally cannot collapse five matters into one paragraph from one document. Diversity is enforced at retrieval, not begged for at synthesis.
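One way to picture the per-matter selection is a greedy pick with a penalty for repeating a document type. The sketch below is an assumption-laden illustration: the `Paragraph` class, the `select_bundle` function, the penalty value of 0.1, the choice to order the reflection pool by cosine, and every cosine score in the sample data are hypothetical. Only the defaults M = 3 and K = 2 and the two reflection document types come from the text.

```python
from dataclasses import dataclass

# Reflection document types named in the text.
REFLECTION_TYPES = {"post_matter_review", "research_memo"}


@dataclass(frozen=True)
class Paragraph:
    doc_type: str
    idx: int
    cosine: float


def select_bundle(paragraphs, m=3, k=2, penalty=0.1):
    """Pick up to m answer and k reflection paragraphs for one matter."""
    answer_pool = [p for p in paragraphs if p.doc_type not in REFLECTION_TYPES]
    reflection_pool = [p for p in paragraphs if p.doc_type in REFLECTION_TYPES]

    answers, used = [], {}
    while answer_pool and len(answers) < m:
        # Each paragraph already taken from a doc type lowers that type's score.
        best = max(answer_pool, key=lambda p: p.cosine - penalty * used.get(p.doc_type, 0))
        answers.append(best)
        answer_pool.remove(best)
        used[best.doc_type] = used.get(best.doc_type, 0) + 1

    # Force-included: no cosine threshold is applied, only the cap k.
    reflections = sorted(reflection_pool, key=lambda p: -p.cosine)[:k]
    return {"answer": answers, "reflection": reflections}


# Made-up scores: the opening memo dominates on raw cosine.
lyra = [
    Paragraph("matter_opening_memo", 1, 0.80),
    Paragraph("matter_opening_memo", 2, 0.78),
    Paragraph("matter_opening_memo", 3, 0.77),
    Paragraph("fsr_readiness_assessment", 4, 0.72),
    Paragraph("regulatory_bundle", 3, 0.70),
    Paragraph("post_matter_review", 6, 0.30),
    Paragraph("research_memo", 2, 0.25),
]
bundle = select_bundle(lyra)
print([(p.doc_type, p.idx) for p in bundle["answer"]])
print([(p.doc_type, p.idx) for p in bundle["reflection"]])
```

Without the penalty, all three answer slots would go to the opening memo. With it, the second and third slots go to the readiness assessment and the regulatory bundle, and the low-scoring reflection paragraphs are still included.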

§ 03 — The bench picture


What retrieval actually returned, surfaced before the model writes.

The bench picture is six deterministic fields plus three transparency stats. Built from the matters table and the stage-1 shortlist directly. No LLM involvement.

Six fields: matter count, practice distribution, lead-partner concentration, temporal spread, office distribution, deal value range where known.

Three transparency stats: stage-1 cosine span, cosine gap between rank-1 and rank-N, paragraph depth across the bundle. These let the reader judge whether the shortlist is tight or diffuse without asking the model to self-assess.

Fig 03 · Bench response schema · deterministic · no LLM

{
  "matter_count": 3,
  "practices": ["Corporate / M&A"],
  "lead_partners": {
    "David Okafor": 2,
    "James Whitfield": 1
  },
  "temporal_spread": "2021 – 2024",
  "offices": ["London"],
  "deal_value_range_gbp_m": [180, 285],
  "transparency": {
    "cosine_span_top5": 0.067,
    "cosine_gap_rank1_rank3": 0.041,
    "paragraph_depth": 11
  }
}

The fields a Knowledge Director would derive by hand if they had two hours per query.

The bench picture is non-negotiable. Every response carries it. If retrieval returned only one matter, the bench says so plainly, including the degraded-copy convention (“Firm's only matter on this fact pattern”) that makes thin retrieval visible rather than hidden.
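Because the bench picture is derived from table rows rather than from a model, it reduces to plain aggregation. The following is a rough sketch under stated assumptions: the row keys (`year`, `practice`, `lead_partner`, `office`, `deal_value_gbp_m`, `cosine`), the `note` field carrying the degraded-copy line, and the reading of paragraph depth as a sum across bundles are all hypothetical, and the sample rows are placeholders. The stage-1 cosine span is left out for brevity.

```python
from collections import Counter


def bench_picture(matters: list[dict], paragraph_counts: list[int]) -> dict:
    """Aggregate shortlisted matter rows into bench fields (needs >= 1 matter)."""
    years = [m["year"] for m in matters]
    values = [m["deal_value_gbp_m"] for m in matters if m.get("deal_value_gbp_m") is not None]
    cosines = sorted((m["cosine"] for m in matters), reverse=True)
    bench = {
        "matter_count": len(matters),
        "practices": sorted({m["practice"] for m in matters}),
        "lead_partners": dict(Counter(m["lead_partner"] for m in matters)),
        "temporal_spread": f"{min(years)} – {max(years)}",
        "offices": sorted({m["office"] for m in matters}),
        # "Where known": rows without a deal value are skipped.
        "deal_value_range_gbp_m": [min(values), max(values)] if values else None,
        "transparency": {
            "cosine_gap_rank1_rankN": round(cosines[0] - cosines[-1], 3),
            "paragraph_depth": sum(paragraph_counts),
        },
    }
    if len(matters) == 1:
        # Degraded-copy convention quoted in the text.
        bench["note"] = "Firm's only matter on this fact pattern"
    return bench


# Placeholder rows, not real corpus data.
rows = [
    {"year": 2021, "practice": "Corporate / M&A", "lead_partner": "David Okafor",
     "office": "London", "deal_value_gbp_m": 180, "cosine": 0.81},
    {"year": 2023, "practice": "Corporate / M&A", "lead_partner": "David Okafor",
     "office": "London", "deal_value_gbp_m": 285, "cosine": 0.79},
    {"year": 2024, "practice": "Corporate / M&A", "lead_partner": "James Whitfield",
     "office": "London", "deal_value_gbp_m": None, "cosine": 0.74},
]
bench = bench_picture(rows, paragraph_counts=[4, 4, 3])
print(bench)
```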

§ 04 — Four-channel gaps


The system doesn't conflate “we don't know” with “we don't know yet.”

The gaps and limits panel runs on every response. It's never omitted. The content comes from four distinct channels, surfaced separately so the reader can tell which kind of gap they're looking at.

T1 · Acknowledged limitations

From the firm's own reflection

Limitation language pulled from post-matter reviews and research memos in the corpus.

Eval example · Q08 Lyra: “Project Lyra's research memo notes 'no prior firm experience with EU FSR in a live transactional context.'”

T2 · Coverage

When retrieval is thin or clustered

An honest note when retrieval is thin or thematically clustered, with no implied external facts.

Eval example · Q19 Russian inbound: “The retrieval slice for 'Russian inbound investment' returned no substantive matter bundles.”

T3 · Corpus adjacency

Named but not present

Matters referenced but not present in the corpus, or gap matters surfaced via cross-reference.

Eval example · Q20 Aurora: “Project Aurora (HC-2018-MA-0034) is named in the ledger but holds no documents.”

T4 · Question coverage

Facets the text doesn't cover

Facets of the question the retrieved text doesn't cover, surfaced by the synthesis layer from its own unsupported_aspects field.

Eval example · Q03 enforcement: “The retrieval set does not cover post-completion enforcement outcomes for the three matters.”

T1 through T3 are built at retrieval time, deterministically, from the data. T4 is the only channel that the language model contributes to, and the model can only write into T4 if it explicitly flagged a question facet as unsupported in its own structured output.

This is the trust architecture's load-bearing decision. The gap panel cannot say something the retrieval layer hasn't already established. The model cannot hallucinate a gap any more than it can hallucinate an answer.
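The channel separation can be illustrated with a small assembly function. This is a sketch under assumptions: the function name `build_gap_panel` and the shape of its inputs are hypothetical. What it shows is the rule from the text, that T1 to T3 are copied only from retrieval-time data and T4 only from the model's `unsupported_aspects` field.

```python
RETRIEVAL_CHANNELS = ("T1", "T2", "T3")


def build_gap_panel(retrieval_gaps: dict, synthesis_output: dict) -> dict:
    """Assemble the four-channel panel; the model can reach T4 only."""
    # T1-T3: deterministic, taken from retrieval-time data alone.
    panel = {ch: list(retrieval_gaps.get(ch, [])) for ch in RETRIEVAL_CHANNELS}
    # T4: only facets the model explicitly flagged as unsupported.
    flagged = synthesis_output.get("unsupported_aspects", [])
    panel["T4"] = [a for a in flagged if isinstance(a, str) and a.strip()]
    return panel


retrieval_gaps = {
    "T1": ["no prior firm experience with EU FSR in a live transactional context"],
    "T3": ["Project Aurora (HC-2018-MA-0034) is named in the ledger but holds no documents."],
}
synthesis_output = {
    "claims": [],
    "unsupported_aspects": ["post-completion enforcement outcomes for the three matters"],
    # Anything else the model emits, such as a fake T1 entry, is ignored.
    "T1": ["invented limitation"],
}
panel = build_gap_panel(retrieval_gaps, synthesis_output)
print(panel)
```

The panel always carries all four keys, so an empty channel is shown as empty rather than dropped.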

§ 05 — Shape-aware synthesis


Five response shapes. One pipeline.

Firm Memory routes queries to one of five response shapes before synthesis runs. The router reads stage-1 output (RRF rank, cosine, paragraph counts, gap-matter membership) and picks the shape that matches the epistemic reality of what was found.

Fig 05 · Shape router · five shapes · deterministic on stage-1 output

Shape	When it fires	Claim cap
hero	≥ 3 matters with substance · rank-1 cosine ≥ 0.55 OR rank-1 RRF ≥ 0.032 + ≥ 3 ¶s	6 claims
thin	1–2 substantive matters, or tail bundle weak	3 claims
known_empty	rank-1 is a status=gap matter or atomic gap fires	3 claims
known_empty_decomposed	query entity unmatched in retrieved text · theme adjacency available	3 claims
refuse	OOS intent (prediction, drafting, strategy) triggers pre-synthesis refuse gate	1 claim (templated)


The shapes exist because the same prompt can't produce the right answer for both “have we advised on this exact fact pattern” and “have we ever done anything like this.” Stretching one template to fit both is the failure mode behind every confidently wrong RAG demo.

Each shape has its own prompt template and its own claim cap. Hero shape permits breadth across matters with confident multi-source claims. Thin shape says less and signals the narrow base. Known-empty shapes refuse to fabricate substance. Refuse shape stops before synthesis runs at all.

The router is deterministic. The same query in the same corpus state produces the same shape.
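A deterministic router over stage-1 output is essentially a fixed sequence of checks. The sketch below is one possible reading of Fig 05, not the actual router: the `Stage1Summary` fields, the order of the checks, the grouping of the hero condition as "cosine ≥ 0.55, or RRF ≥ 0.032 together with at least three paragraphs", and the fall-through to `thin` are assumptions. The thresholds and claim caps are the ones in the table.

```python
from dataclasses import dataclass

# Claim caps per shape, from Fig 05.
CLAIM_CAPS = {
    "hero": 6,
    "thin": 3,
    "known_empty": 3,
    "known_empty_decomposed": 3,
    "refuse": 1,
}


@dataclass(frozen=True)
class Stage1Summary:
    oos_intent: bool = False        # refuse gate fired before synthesis
    gap_fired: bool = False         # rank-1 is a status=gap matter, or atomic gap fired
    entity_unmatched: bool = False  # query entity absent from retrieved text
    theme_adjacent: bool = False    # theme adjacency available
    substantive_matters: int = 0
    rank1_cosine: float = 0.0
    rank1_rrf: float = 0.0
    rank1_paragraphs: int = 0


def route(s: Stage1Summary) -> str:
    """Return a shape name; pure function of its input, so repeatable."""
    if s.oos_intent:
        return "refuse"
    if s.gap_fired:
        return "known_empty"
    if s.entity_unmatched and s.theme_adjacent:
        return "known_empty_decomposed"
    strong_lead = s.rank1_cosine >= 0.55 or (s.rank1_rrf >= 0.032 and s.rank1_paragraphs >= 3)
    if s.substantive_matters >= 3 and strong_lead:
        return "hero"
    return "thin"  # 1-2 substantive matters, or a weak tail bundle


shape = route(Stage1Summary(substantive_matters=3, rank1_cosine=0.81, rank1_rrf=0.0364,
                            rank1_paragraphs=3))
print(shape, CLAIM_CAPS[shape])
```

Because `route` reads nothing but its argument, the same stage-1 summary always yields the same shape, which is the property the text relies on.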

§ 06 — Citation enforcement


Every sentence cites a paragraph. The model cannot answer ungrounded.

Synthesis output is a JSON object with two fields: a claims array where each item is one sentence with one or more paragraph citations, and an unsupported_aspects array (which feeds gap channel T4).

Fig 06 · Structured synthesis output · one sentence per claim · one claim, one citation set

{
  "claims": [
    {
      "text": "Three matters establish the relevant precedent.",
      "citations": [
        "HC-2021-MA-0089:matter_opening_memo:1",
        "HC-2023-MA-0142:matter_opening_memo:1",
        "HC-2024-MA-0167:comprehensive_deal_summary:3"
      ]
    }
  ],
  "unsupported_aspects": [
    "post-completion enforcement outcomes for the three matters"
  ]
}

Step 1: Synthesis emits structured JSON
Step 2 · validate: Each citation key checked against retrieval set
Step 3 · retry: On fail, one retry with stricter prompt reminder
Step 4 · degrade: If retry fails, degraded output, never confidently wrong

One sentence per claim. The key format matter_id:doc_type:paragraph_idx matches the home-page demo's hover tooltips.

Every citation key (matter_id:doc_type:paragraph_idx) is validated against the retrieval set before the response is shown to the user. A hallucinated citation, a key that doesn't exist in the actual retrieved paragraphs, fails validation and triggers one retry with a stricter prompt reminder.

If the retry also fails validation, the system surfaces degraded output rather than confidently wrong output. Degraded output did not occur in the canonical eval (0 of 30 queries on the last full run).

The one-sentence-per-claim constraint is enforced both at the prompt level and at the validation layer. It's the smallest unit of citation we found that holds: a sentence is short enough to be grounded specifically, long enough to carry meaning on its own.
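The validate, retry, degrade sequence can be sketched as a small wrapper around the synthesis call. This is an illustration under assumptions: `citations_valid`, `synthesise_checked`, the `strict` flag standing in for the stricter prompt reminder, the `status` field, and the shape of the degraded result are all hypothetical. The one-sentence-per-claim check is omitted here.

```python
from typing import Callable


def citations_valid(output: dict, retrieved_keys: set[str]) -> bool:
    """Every claim needs at least one citation, and every key must be retrieved."""
    return all(
        claim.get("citations") and all(key in retrieved_keys for key in claim["citations"])
        for claim in output.get("claims", [])
    )


def synthesise_checked(synthesise: Callable[[bool], dict], retrieved_keys: set[str]) -> dict:
    """Run synthesis, validate, retry once in strict mode, then degrade."""
    for strict in (False, True):  # first attempt, then exactly one retry
        output = synthesise(strict)
        if citations_valid(output, retrieved_keys):
            return {
                "status": "ok",
                "claims": output["claims"],
                "unsupported_aspects": output.get("unsupported_aspects", []),
            }
    # Both attempts failed: surface degraded output rather than unverified claims.
    return {"status": "degraded", "claims": [], "unsupported_aspects": []}


retrieved = {
    "HC-2021-MA-0089:matter_opening_memo:1",
    "HC-2023-MA-0142:matter_opening_memo:1",
}


def fake_model(strict: bool) -> dict:
    # Stand-in for the model: cites a non-existent key unless reminded.
    key = "HC-2021-MA-0089:matter_opening_memo:1" if strict else "HC-1999-XX-0000:memo:9"
    return {"claims": [{"text": "One matter is on point.", "citations": [key]}],
            "unsupported_aspects": []}


result = synthesise_checked(fake_model, retrieved)
print(result["status"])
```

Checking keys by set membership keeps the validation independent of the model: a key either names a retrieved paragraph or it does not.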

§ 07 — Refuse discipline


The system refuses what a knowledge tool shouldn't do.

Three categories of question are blocked before synthesis runs:

Prediction

“Will the FCA accept this filing?” “What outcome should we expect?”

Client-letter drafting

“Draft a response to opposing counsel.” “Write the cover letter for this submission.”

Firm strategy

“Should we expand the FSR practice?” “Which clients should we deprioritise?”

Fig 07 · Refuse gate · pattern-based · deterministic · no LLM

incoming query → Gate: regex + intent check (OOS patterns explicit, testable)
green path → passes to shape router
red path → templated refusal · no synthesis · no proxy

Refusal: single sentence, never elaborated

“Firm Memory retrieves and synthesises firm history; it does not predict, draft, or advise on strategy.”

The gate is deterministic, not LLM-based. Patterns are explicit and testable. A query that triggers the gate gets a single sentence back and nothing more.

This is the cheapest trust signal in the system and the most often missed by competing RAG demos. A research tool that confidently drafts a client letter is a research tool with no boundary. The boundary is the product.
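A pattern-based gate of this kind can be as small as a dictionary of compiled regular expressions. The patterns below are hypothetical, written only to cover the six example questions above, and stand in for whatever explicit pattern set the real gate uses. The refusal sentence is the one quoted in Fig 07. The intent check mentioned in the figure is not modelled.

```python
import re
from typing import Optional

REFUSAL = ("Firm Memory retrieves and synthesises firm history; "
           "it does not predict, draft, or advise on strategy.")

# Illustrative out-of-scope patterns, one entry per blocked category.
OOS_PATTERNS = {
    "prediction": re.compile(r"^\s*will\b|\bwhat outcome should we expect\b", re.IGNORECASE),
    "drafting": re.compile(r"^\s*(draft|write)\b", re.IGNORECASE),
    "strategy": re.compile(r"^\s*should we\b|\bwhich clients should we\b", re.IGNORECASE),
}


def refuse_gate(query: str) -> Optional[tuple[str, str]]:
    """Return (category, refusal sentence) if the query is out of scope, else None."""
    for category, pattern in OOS_PATTERNS.items():
        if pattern.search(query):
            return category, REFUSAL
    return None  # green path: pass to the shape router


for q in ["Will the FCA accept this filing?",
          "Draft a response to opposing counsel.",
          "Have we advised on an FSR clearance before?"]:
    print(q, "->", refuse_gate(q))
```

Keeping the patterns in plain data makes each one directly testable, which is the property the text asks of the gate.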

§ 08 — The evaluation kit


30 queries. Six axes. A regression-grade test the product is built around.

Firm Memory's behaviour is locked by an evaluation kit that runs against the corpus before every release. 30 queries spanning hero, mid, gap, OOS, and adversarial categories. Each query is scored on six binary axes:

1 · Citation integrity: Every claim resolves to a real paragraph.
2 · Shape routing: Query matches expected response shape.
3 · Bench fidelity: Bench counts match the retrieval set.
4 · Refuse discipline: OOS queries don't get answers.
5 · Gap surfacing: Gaps appear in the right channel.
6 · Opener behaviour: Multi-matter queries open plurally.

The kit also runs architectural invariants, checks that hold across the whole run, not per-query:

INV-01 · no false positive on gap matters
INV-04 · hero confidence under hybrid retrieval
INV-08 · plural opener on multi-matter hero queries
INV-10 · gap matter named in opener
INV-11 · every cited matter has paragraphs in retrieval
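The headline number can be read as a count of queries that pass all six axes. That reading is an inference from the phrase "six-axis pass" and is treated as an assumption here, as are the axis identifiers, the `headline` function, and the choice of which axis the failing query misses in the sample data.

```python
# Axis identifiers are hypothetical names for the six axes listed above.
AXES = (
    "citation_integrity",
    "shape_routing",
    "bench_fidelity",
    "refuse_discipline",
    "gap_surfacing",
    "opener_behaviour",
)


def headline(results: dict[str, dict[str, bool]]) -> tuple[int, int]:
    """Count queries that pass every axis; return (passed, total)."""
    passed = sum(all(axes[a] for a in AXES) for axes in results.values())
    return passed, len(results)


def failing_queries(results: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Map each failing query to the axes it missed."""
    return {
        q: [a for a in AXES if not axes[a]]
        for q, axes in results.items()
        if not all(axes[a] for a in AXES)
    }


# Illustrative run: 30 queries, one of which misses a single axis.
results = {f"Q{n:02d}": {a: True for a in AXES} for n in range(1, 31)}
results["Q20"]["opener_behaviour"] = False  # assumed axis, for illustration only

print(headline(results))
print(failing_queries(results))
```

Scoring on binary axes keeps a regression visible: a query that is right on five axes and wrong on one still counts as a fail, with the missed axis named.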

Headline score

29 / 30

Canonical run 20260519-153524 · clears both blocker and high-severity invariant budgets.

The one fail · Q20

System correctly refuses to fabricate Russian inbound precedent.

The opener doesn't name the specific gap matter (Project Aurora). Behaviour is right; the prose is one sentence away from a six-axis pass.

Logged as

TD-001

v2 ticket: corpus-agnostic gap-entity rewrite. Open in the engineering log. Real regression worth fixing; not a false alarm.

We publish the eval state because trust architecture is not a slide. It's a number. 29 of 30 on a 30-query adversarial kit, with the one fail documented, is the version of credibility we'd want to see from a vendor ourselves.

Bring three partners. Leave with a working diagnostic.

Forty-five minutes. NDA-first. We come prepared with public information about your firm.

You leave with our written read on where Firm Memory would pay back inside a quarter, and a candid answer on whether we're the right partner for that work.
