Most legal AI will answer anyway. The useful one won't.
Ask a legal AI a question it cannot answer, and most of them will answer anyway. Not out of malice. They are built to produce a response, and they do, whether or not the grounding for one exists. The answer arrives fluent, structured and confident, and a busy lawyer has little to tell them it rests on nothing.
It is worth being clear about which tools we are talking about, because there are two kinds and they are often discussed as one. Most legal AI answers over the outside world: the published body of case law, statutes and commentary. A smaller and newer category answers over a firm's own work, its matters, memos and precedents. The failures documented so far sit overwhelmingly with the first kind. The argument here is about what happens when the second kind is built without the same honesty about its limits.
The first kind has already taught the profession an expensive lesson. The AI Hallucination Cases Database maintained by Damien Charlotin, a researcher at HEC Paris, has now identified more than 1,600 legal decisions worldwide that turn on AI-generated hallucinated content, most of them fabricated citations. The figure is worth stating as a moving one, because it is still climbing: it sat in the low hundreds a year earlier. The trajectory, more than any single count, is the story.
The failure underneath all of those cases follows one pattern. A system generates an answer. The answer carries structure, authority and confidence. The grounding is not there. The output that does the damage is rarely the obviously broken one; it is the answer that looks good enough for a busy lawyer to rely on.
§ 02 — Architecture
The lesson is architectural.
I should be precise about what FirmMemory is, because the easy version of this argument overclaims. FirmMemory is not a case-law checker. It does not sit between a lawyer and a court filing hunting for invented authorities. It retrieves over a firm's own archive: prior work product, matter material, notes, precedents, the institutional record. So I am not claiming it would catch the citation failures that have been reaching courts.
The point is architectural. The same failure appears any time a system keeps answering after its grounding has run out, whatever the setting. In a court filing it surfaces as invented authority. Inside a firm's own knowledge it surfaces as something quieter and still costly: a fluent answer that implies the firm holds a precedent, a settled position, or real experience, when the archive holds nothing of the sort. A lawyer does not only need an answer that reads well. They need to know whether it rests on the firm's actual experience or on the model's fluency.
Benchmarks describe a world that isn't legal practice.
Part of why this persists is that these systems are usually judged in conditions that look nothing like real legal work. A benchmark can tell you a system performed well against a fixed task set. It cannot tell you how the same system behaves when a lawyer asks something complex, under-specified, or only partly answerable from what exists.
The marketed numbers reflect that gap. Retrieval-based legal tools have been promoted with language as strong as “hallucination-free.” When Stanford researchers tested two of the leading products, Lexis+ AI and Thomson Reuters' Ask Practical Law AI, they found each hallucinated more than 17% of the time, at least one query in six, even though retrieval did reduce errors relative to a general-purpose model. Retrieval reduces the problem. It does not remove it, and anyone who has worked with firm knowledge will recognise why. Relevance is hard, the material is fragmented and unevenly captured, and the real answer may sit across a precedent, a note, a matter summary and an old email, or may not exist anywhere.
Hallucination cases: 1,600+ (Charlotin database, worldwide)
Retrieval tools: >17% hallucination rate (Lexis+ AI, Ask Practical Law AI)
Failure rate: about 1 in 6 (Stanford evaluation, 2024)
Setting: the firm's archive, not case law
So the question is not only whether the system can find something. It is whether it can tell when the thing it found does not support what is being asked. That is where a firm's trust is won or lost.
Citations are necessary, and not the whole answer.
Citations matter. A system that produces answers without showing its sources asks for too much trust, and in any serious legal setting a user needs to see the material behind a claim and inspect it. Every credible tool now offers this, which is precisely why it has stopped being a differentiator.
A citation can be present and still fail to support the proposition attached to it. A source can be on-topic in general and beside the point in particular. A document can be semantically similar to the query while being legally or commercially inapposite. “Linked to source” is the floor. The harder capability is a system that understands the limits of what its retrieved material can actually support: where support is strong, answer and cite; where it is weak, say so; where there is none, decline to invent. A refusal in that last case is the system doing its job.
“Linked to source” is the floor. Refusal is the capability.
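The three-way policy described above (answer and cite, answer with limits, decline) can be sketched in a few lines. This is a minimal illustration, not FirmMemory's implementation: the names `Passage`, `grade_support` and `respond` are hypothetical, the `supports_claim` flag stands in for whatever separate check decides whether a passage backs the specific proposition, and the threshold of two supporting passages is an arbitrary assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    STRONG = "strong"
    WEAK = "weak"
    NONE = "none"


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    # Hypothetical verdict from a separate check of whether this passage
    # supports the specific proposition, not merely the topic.
    supports_claim: bool


def grade_support(passages, min_strong=2):
    """Grade support by counting passages that back the proposition itself."""
    supporting = [p for p in passages if p.supports_claim]
    if not supporting:
        return Support.NONE
    # Assumed threshold: two or more supporting passages count as strong.
    if len(supporting) >= min_strong:
        return Support.STRONG
    return Support.WEAK


def respond(claim, passages):
    """Answer and cite, answer with limits, or decline, depending on support."""
    support = grade_support(passages)
    citations = [p.doc_id for p in passages if p.supports_claim]
    if support is Support.STRONG:
        return {"action": "answer", "claim": claim, "citations": citations}
    if support is Support.WEAK:
        return {
            "action": "answer_with_limits",
            "claim": claim,
            "citations": citations,
            "note": "Limited internal support; treat as a starting point.",
        }
    return {
        "action": "decline",
        "claim": None,
        "citations": [],
        "note": "No internal material supports this proposition.",
    }
```

The design choice worth noticing is that a retrieved passage which is on-topic but does not support the proposition never reaches the citation list, so a thick pile of related documents cannot by itself produce a confident answer.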
Absence has shapes.
The part that gets least attention is that absence is not one condition. A knowledge system worth trusting should not treat every gap the same way, because no knowledge, thin knowledge and adjacent knowledge are three different states. A good Knowledge Director distinguishes them instinctively. Software usually does not.
The first is the clean gap. The firm has genuinely never done this. The honest response is to say plainly that no relevant internal material was found, which protects the lawyer from assuming an institutional precedent that was never there. Stated well, that is one of the more useful answers a system can give.
The second is the thin touchpoint. One old matter, a passing mention in a note, a clause position that appeared once and was never built into a reusable precedent. Here the system should neither manufacture a full answer nor report nothing. The useful response is more exact: there is limited material, here is what it says, and here is the edge of what it supports. That preserves the lawyer's judgement and gives them a real starting point without overstating the firm's depth.
The third is the subtle one. The firm has a great deal of related work but nothing on the precise point being asked. A conventional search returns a thick set of documents and leaves the lawyer to infer relevance. A careless AI turns that adjacent material into an answer that sounds complete. A system that earns trust separates “we have a lot around this topic” from “we have support for this specific question,” and it does not let the first quietly stand in for the second.
Those are three different knowledge states rather than three points on a single confidence slider, and a system should be able to represent them as such.
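One way to represent those states as distinct values rather than a single score is a small enumeration. The sketch below is illustrative only: `KnowledgeState`, `RetrievalSummary` and `classify` are hypothetical names, the counts of on-point and related documents are assumed to come from an upstream relevance judgement, and the cut-off for "thin" is an arbitrary assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class KnowledgeState(Enum):
    SUPPORTED = auto()        # enough material on the precise point
    CLEAN_GAP = auto()        # the firm has nothing, on point or nearby
    THIN_TOUCHPOINT = auto()  # a little on-point material, no more
    ADJACENT_ONLY = auto()    # related work, nothing on the precise point


@dataclass(frozen=True)
class RetrievalSummary:
    on_point: int  # documents judged to address the precise question
    related: int   # documents on the surrounding topic only


def classify(summary, thin_max=2):
    """Map retrieval counts to a knowledge state (thin_max is an assumption)."""
    if summary.on_point == 0 and summary.related == 0:
        return KnowledgeState.CLEAN_GAP
    if summary.on_point == 0:
        return KnowledgeState.ADJACENT_ONLY
    if summary.on_point <= thin_max:
        return KnowledgeState.THIN_TOUCHPOINT
    return KnowledgeState.SUPPORTED


# Each state gets its own wording, so adjacent material is never
# presented as support for the specific question.
MESSAGES = {
    KnowledgeState.SUPPORTED: "Internal material supports this; sources attached.",
    KnowledgeState.CLEAN_GAP: "No relevant internal material was found.",
    KnowledgeState.THIN_TOUCHPOINT: "Limited material exists; here is what it covers.",
    KnowledgeState.ADJACENT_ONLY: "Related work exists, but nothing on this precise point.",
}


def describe(summary):
    """Return the state and the message a lawyer would see for it."""
    state = classify(summary)
    return state, MESSAGES[state]
```

A single confidence number would have to place "nothing at all" and "forty related documents, none on point" somewhere on the same scale. Keeping them as separate values lets the response say which kind of absence it is.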
§ 06 — Knowledge teams
Why this matters to a knowledge team.
For a knowledge team the consequence runs deeper than safety. A tool that surfaces only positive answers tells you what is already reusable. A system that can name absence tells you where the firm's knowledge is thin, where it is trapped in isolated matters rather than shared resources, and where lawyers keep asking questions the archive cannot yet answer. The silence becomes management information.
That reframes what the tool is for. It stops being only a faster way to find a document and becomes a signal about the state of the firm's knowledge. The question inside a firm was never only whether people can find the right precedent. It is also whether anyone knows where the gaps are. A system honest about absence speaks to both.
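The idea that silence becomes management information can be made concrete with a simple tally. This is a sketch under stated assumptions: the query log, its topic labels and its outcome strings are all hypothetical, and `gap_report` is an illustrative name rather than a feature of any product.

```python
from collections import Counter


def gap_report(query_log, min_repeats=2):
    """List topics where questions repeatedly went unsupported.

    query_log is an iterable of (topic, outcome) pairs. The outcome
    strings are hypothetical; anything other than "supported" counts
    as a question the archive could not fully answer.
    """
    misses = Counter(
        topic for topic, outcome in query_log if outcome != "supported"
    )
    # Most frequently unanswered topics first; drop one-off misses.
    return [(topic, n) for topic, n in misses.most_common() if n >= min_repeats]


# Hypothetical log entries for illustration only.
example_log = [
    ("earn-out clauses", "no_material"),
    ("earn-out clauses", "adjacent_only"),
    ("earn-out clauses", "thin"),
    ("data transfers", "supported"),
    ("data transfers", "no_material"),
    ("escrow", "thin"),
    ("escrow", "thin"),
]
```

Run over the example log, `gap_report(example_log)` ranks the topics lawyers keep asking about without finding support, which is the list a knowledge team would want when deciding which precedents to build next.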
What I'm building toward.
I do not think legal AI should be optimised for answer production alone. The thing worth building toward is grounded usefulness: answers where the material supports them, citations that show the basis, limits where the material is thin, and refusal where the archive does not support the conclusion.
That is the posture I am building into FirmMemory. It retrieves over a firm's own knowledge and produces citation-backed answers tied directly to existing work product, and the refusal behaviour carries as much weight as the answers. If the archive does not support a conclusion, the system says so. If it supports only part of one, it shows the edge. If it holds related material but not the precise point, it keeps that distinction visible rather than collapsing it into a confident summary. For a firm deciding whether to let an AI system near its institutional knowledge, that refusal behaviour is where the trust actually lives.
The integrity of silence.
The market will keep competing on answers: faster, longer, more polished. For firms, the trust that matters will come from somewhere quieter. The most useful thing a legal AI says to a lawyer may well be that the firm has not done this before. Said plainly, at the right moment, that one sentence protects judgement in a way no fluent summary can. In legal work, a system's trustworthiness shows less in the quality of its answers than in the integrity of its silences.
Sources. The count of legal decisions involving AI-generated hallucinated content is drawn from the AI Hallucination Cases Database maintained by Damien Charlotin (damiencharlotin.com/hallucinations), which is updated continuously. The figure for hallucination rates in retrieval-based legal research tools is from V. Magesh, F. Surani, M. Dahl, M. Suzgun, C. Manning and D. E. Ho, “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” (Stanford, 2024), which found that the LexisNexis and Thomson Reuters tools each hallucinated more than 17% of the time.