Anti-hallucination for legal chatbots: 2.8M Romanian documents
How we remove hallucinations from a legal chatbot indexed over 2.8 million Romanian documents
A lawyer sent us a screenshot last year. A generic legal chatbot was assuring him, with citation included, that article 187 paragraph (3) of the Romanian Fiscal Code allowed a deduction his client was claiming. The article does not exist. The paragraph, even less so. The citation was fabricated with a confidence that convinced the client the law was on his side — until the first ANAF notice arrived.
That screenshot was the starting point for the Anti-Hallucination pipeline we built into Leta, the legal assistant CAI Technology operates for the Romanian market. In this article we explain how a legal chatbot ends up lying with confidence, and why the answer is not a larger model but an architecture that refuses to answer without a source.
TL;DR
- Legal hallucinations have a precise cause: generative language models complete statistical patterns, and the format „article X paragraph Y” is one of the most predictable patterns in their corpus.
- A larger model does not solve the problem; it shrinks it but also disguises it. The only robust solution is mandatory citation grounding.
- In Leta we deployed a four-gate validation pipeline that refuses to answer if it cannot attach a verbatim text fragment from an indexed official document.
- On a sample of 1,200 real legal questions, the hallucination rate dropped from 14.2% (generic GPT response) to under 0.3% (Leta with the full pipeline).
- The cost: average response time grows by roughly 40%, and coverage (the rate at which the assistant agrees to answer) drops by 9 points. The trade-off is negotiable with the client; hallucination is not.
Why legal chatbots hallucinate
A language model does not „know” articles of law. It statistically reproduces whichever pattern best fits the question. In law, the format Code X, article Y, paragraph Z appears tens of thousands of times in training corpora. The model learns that this format often follows a legal question. If the question has no clear answer in training data, the model fills in the most plausible numerical combination — not the truth.
This is not an implementation bug. It is the correct behaviour of an autoregressive model, applied to a task it was never built for. The model is not lying; it has no way to distinguish between reproducing a fact and fabricating one, because at the token level they are the same operation.
Concretely, we ran the same list of 1,200 real questions collected from lawyer clients through three configurations:
- Generic generative model, no context: 14.2% answers with fabricated citations
- Generative model + simple RAG (top-5 documents, no grounding): 4.8% hallucination
- Leta full pipeline: 0.28% hallucination, manually validated by a senior lawyer
The difference between simple RAG and our pipeline is not an incremental improvement. It is the difference between a system that tries to answer correctly and one that refuses to answer incorrectly.
Architecture: four gates between question and answer
The Anti-Hallucination pipeline has four components arranged as a funnel. A question does not advance to the next gate unless it satisfies the current condition.
Gate 1 — Scope classification
Before searching the corpus, the question is analysed by a classifier trained on Romanian legal request types. If the question is out of scope (for example, „what will the weather be in Cluj tomorrow?”), the pipeline answers directly: „This question is outside my legal expertise.” No substantive answer token is generated without a source.
Gate 2 — Hybrid retrieval
The Leta corpus contains over 2.8 million documents: consolidated legislation, ÎCCJ jurisprudence, court of appeal decisions, ANAF fiscal rulings, ANRMAP guides. Retrieval uses two methods in parallel:
- vector search over dense embeddings, fine-tuned on a Romanian legal corpus;
- BM25 lexical search with tokenisation specific to Romanian inflection (declined nouns, conjugated verbs, diacritics).
Results are merged with reciprocal rank fusion and the top 20 fragments are kept. This dual search is essential for Romanian: vector search captures synonyms and paraphrases („concediere” vs. „desfacerea contractului”), while BM25 captures the exact form of legal articles, where synonyms are not admitted.
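The fusion step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Leta's implementation: the function name, the fragment identifiers and the constant k = 60 (a value often used with reciprocal rank fusion) are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Merge several ranked lists of fragment ids into one ranked list.

    Each fragment scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value taken
    from the article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, fragment_id in enumerate(ranking, start=1):
            scores[fragment_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda fid: (-scores[fid], fid))
    return ordered[:top_n]


# Hypothetical fragment ids returned by the two searches.
vector_hits = ["frag_a", "frag_b", "frag_c"]
bm25_hits = ["frag_b", "frag_d", "frag_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['frag_b', 'frag_a', 'frag_d', 'frag_c']
```

Fragments found by both searches rise to the top, while a fragment found by only one search is still kept, which is the behaviour the dual search relies on.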
Gate 3 — Mandatory grounding
This is where most hallucinations are caught. The generative model receives the question and the retrieved fragments with an explicit instruction: every factual claim must be marked with a fragment identifier. The answer is then parsed and each claim is validated:
for each claim in the answer:
    if the claim contains a legal reference (article, paragraph, ruling, decision):
        extract the cited text
        search the exact text in the source document
        if no match ≥ 95% text similarity:
            mark the claim as ungrounded
if any claim is ungrounded:
    reject the answer, regenerate with a stricter prompt
    after 3 failed attempts: respond „I do not have a sufficient source to answer”
Validation is textual, not semantic. It checks whether the cited text actually exists in the indicated document. This decision is deliberate: a semantic validation would let the model „interpret” the source, which is precisely the behaviour we want to block.
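The validation loop above could look roughly like this in Python. It is a sketch under several assumptions. The article does not say which similarity measure is used, so the character-level ratio from the standard library's difflib over a sliding window is a stand-in. The `Claim` structure and the `generate` callable are hypothetical. Claims are assumed to be already extracted and limited to those containing a legal reference, and the attempt number passed to `generate` stands in for the stricter prompt.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

REFUSAL = "I do not have a sufficient source to answer"


@dataclass
class Claim:
    text: str         # the claim as written in the answer
    fragment_id: str  # identifier of the fragment the model cited
    quote: str        # text the model says it copied from that fragment


def is_grounded(quote: str, source: str, threshold: float = 0.95) -> bool:
    """True if `quote` occurs in `source` exactly or near-exactly."""
    if not quote:
        return False
    if quote in source:
        return True
    # Slide a window of the quote's length over the source and keep the
    # best character-level similarity ratio (assumed metric).
    width = len(quote)
    best = 0.0
    for start in range(max(1, len(source) - width + 1)):
        window = source[start:start + width]
        ratio = SequenceMatcher(None, quote, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best >= threshold


def answer_with_grounding(
    question: str,
    fragments: Dict[str, str],
    generate: Callable[[str, Dict[str, str], int], Tuple[str, List[Claim]]],
    max_attempts: int = 3,
) -> str:
    """Return an answer only if every cited quote is found in its fragment."""
    for attempt in range(max_attempts):
        answer, claims = generate(question, fragments, attempt)
        # An unknown fragment id maps to an empty source, so it never matches.
        if all(is_grounded(c.quote, fragments.get(c.fragment_id, ""))
               for c in claims):
            return answer
    return REFUSAL
```

Nothing in the check looks at meaning: a quote either appears in the cited fragment, within the similarity threshold, or the whole answer is rejected.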
Gate 4 — Coherence check
The final gate checks whether claims within the answer contradict each other. We use a smaller second model trained to detect contradictions between sentence pairs. If the answer states „X is permitted” in one paragraph and „X is not permitted” in another, the answer is rejected and regenerated.
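The pairwise structure of this check can be sketched as follows. The contradiction model itself is not described in the article, so `toy_contradicts` is only a placeholder that flags sentences differing by the word „not”; the sentence splitter is also deliberately naive and would mis-split legal abbreviations such as „art.”. All names here are hypothetical.

```python
import re
from itertools import combinations
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_contradictions(
    answer: str, contradicts: Callable[[str, str], bool]
) -> List[Tuple[str, str]]:
    """Return every pair of sentences that the detector flags."""
    sentences = split_sentences(answer)
    return [(a, b) for a, b in combinations(sentences, 2) if contradicts(a, b)]


def toy_contradicts(a: str, b: str) -> bool:
    """Placeholder for the contradiction model (assumption): flags two
    sentences that become identical once the word 'not' is removed."""
    def without_not(s: str) -> str:
        return " ".join(w for w in s.lower().split() if w != "not")

    return a.lower() != b.lower() and without_not(a) == without_not(b)


answer = (
    "The deduction is permitted. The deadline is 30 days. "
    "The deduction is not permitted."
)
conflicts = find_contradictions(answer, toy_contradicts)
print(conflicts)
```

In a real pipeline the callable would wrap the trained contradiction model, and a non-empty result would trigger regeneration.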
What we learnt building this
Across two pipeline generations we discarded several assumptions we had taken for granted at the start.
Larger models are not the solution. We tested open-source versions at various scales, fine-tuned on the Romanian legal corpus. Larger models do hallucinate less — but they hallucinate more convincingly. A lawyer easily detects a crude wrong citation; a fabricated citation with plausible numbering, correct syntax and academic tone is almost indistinguishable from a real one. Model size shifts the risk from „obvious error” to „hidden error”. That is worse, not better.
Pure retrieval is not enough. A RAG system without mandatory grounding reduces hallucinations but does not eliminate them. If a retrieved fragment covers a similar but not identical case, the model will „extrapolate” with confidence. The only reliable barrier is forcing the model to quote verbatim and then verifying the quoted text is in the source.
Tokenisation matters enormously for Romanian. Many models split Romanian words into many small sub-word fragments, which degrades the embeddings and therefore the cosine-similarity search built on them. Fine-tuning on a Romanian legal corpus with a dedicated tokeniser improved vector search recall by more than 20 percentage points. Quotations must also match the corpus exactly, diacritics included; any lossy normalisation (such as transformation to ASCII) introduces errors.
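The difference between a safe and a lossy normalisation can be shown with the Python standard library. This is an illustrative sketch, not a description of Leta's preprocessing: the function names are hypothetical, and mapping the legacy cedilla letters (ş, ţ) to the comma-below letters (ș, ț) is an assumption about what a non-lossy normalisation might include.

```python
import unicodedata

# Legacy cedilla forms -> Romanian comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalise_diacritics(text: str) -> str:
    """NFC-compose, then unify cedilla variants; keeps all diacritics."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


def fold_to_ascii(text: str) -> str:
    """Lossy contrast case: decompose and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


face = "fa\u021b\u0103"         # "față" (face)
girl = "fat\u0103"              # "fată" (girl)
legacy_face = "fa\u0163\u0103"  # same word typed with the cedilla letter

# The safe normalisation keeps the two words distinct and unifies spellings.
print(normalise_diacritics(legacy_face) == face)                 # True
print(normalise_diacritics(face) == normalise_diacritics(girl))  # False

# ASCII folding collapses both words into the same string.
print(fold_to_ascii(face), fold_to_ascii(girl))                  # fata fata
```

After ASCII folding, a verbatim-match check can no longer tell the two words apart, which is the kind of error the paragraph above warns about.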
Refusal is a feature. At first we tried to maximise coverage — we wanted Leta to answer as many questions as possible. Six months in we inverted the goal: we want Leta to answer only when the answer is verifiable. A lawyer prefers „I do not have a sufficient source”, with a suggestion to consult the legislation directly, over a fabricated citation. „I do not know” is a valid legal answer; confident fiction is not.
Real trade-offs
This pipeline is not free:
- Latency. Grounding validation and regeneration on failure add roughly 40% to response time on average. For an assistant that responds in 4 seconds without grounding, we end up at 5.5–6 seconds with the full pipeline.
- Coverage. About 9 in 100 questions that previously received a generic answer now get „I do not have a sufficient source”. For a public platform this would be a UX disaster. For a professional legal assistant, it is the correct profile: lawyers prefer clarity over uncertainty.
- Compute cost. The pipeline runs two parallel searches, a generative model, a citation parser, a textual validator, a coherence model. Per-response cost is roughly 3–4 times that of a generic GPT. For enterprise clients who understand the cost of a legal hallucination, this is an acceptable trade.
What this means for a lawyer evaluating an AI assistant
Three questions that, in our experience, separate professionally usable tools from those that simply demo well:
- „Can I see the source for every claim?” If the answer cannot be marked fragment by fragment against an official document, it is not professionally usable. Simulated citations at the end of a paragraph are not enough, because they can be fabricated.
- „What does it do when it does not know?” An assistant that always answers with confidence is a risk. Ask to see its behaviour on loaded questions or on recent legislation it cannot have in its corpus.
- „How do you manage corpus updates?” Romanian legislation changes weekly. A corpus indexed six months ago will be wrong on recent changes. A documented procedure for re-indexing and for retracting repealed documents is a marker of operational maturity.
Operational conclusion
Building a Romanian legal assistant without hallucinations is an engineering problem, not magic. The solution is not a larger model or a new prompting trick; it is a conservative pipeline that prefers silence to fiction. For lawyers, investigative journalists and tax consultants, this profile is non-negotiable. For an SEO platform optimising for engagement, it is wrong.
Leta is built explicitly for the first kind of user. If that precision matters in your practice, we invite you to a 30-minute technical demonstration on your firm’s own corpus — test directly against the real cases that gave you trouble with a generic chatbot.
Next step
If your team is evaluating an AI legal assistant and you need a technical conversation about the architecture, you can contact us directly for a 30-minute consultation at no cost.