CAI Technology
rag · 13 min read

Citation grounding: implementing a 4-gate pipeline

A practical citation-grounding pipeline for legal and procurement RAG: retrieve, answer with citations, validate, return. Full pseudocode included.

CAI Technology · Last reviewed: 4/30/2026

Citation grounding implemented in practice: a 4-gate pipeline for legal RAG

The difference between a demo RAG and a production-ready RAG for regulated sectors comes down to one capability: the guarantee that every factual statement in the answer is anchored to a verbatim text fragment from a verifiable source document. That capability is called citation grounding.

This article describes the practical 4-gate pipeline we use in Leta and in similar client engagements. It includes pseudocode, edge cases, and real measurements on a 1,200-question Romanian legal dataset.

TL;DR

Why textual, not semantic

A common mistake: implementers use a second LLM to check whether the citation "resembles" the source. That validation is semantic. It works in 95% of cases, but it fails precisely on the critical ones: when the model fabricates a quote that "looks like" the official text but contains a subtle change (a swapped number, a replaced key word).

Textual validation — exact lookup of the quote in the source document — eliminates this class of error. It demands more discipline in the prompt (the model must quote verbatim, not paraphrase) but provides a guarantee that semantic validation cannot.
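A small illustration of why a closeness score is not enough. It uses Python's standard difflib ratio as a stand-in for "resemblance" (an assumption: a real semantic validator would use an LLM or embeddings, but the failure mode is the same). A quote with one swapped digit scores very high, yet the exact lookup rejects it.

```python
from difflib import SequenceMatcher

# Hypothetical source sentence and a fabricated quote differing by a single digit
source = "The deadline for applying the disciplinary sanction is 30 days from the date of the finding."
fabricated = "The deadline for applying the disciplinary sanction is 60 days from the date of the finding."

# "Resemblance" check: a character-level similarity ratio in [0, 1]
similarity = SequenceMatcher(None, fabricated, source).ratio()

# Textual check: exact substring lookup of the quote in the source
exact_match = fabricated in source

print(f"similarity={similarity:.3f}, exact_match={exact_match}")
```

The similarity sits well above any plausible acceptance threshold, while the exact lookup correctly returns False.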

The 4-gate architecture

Query
  ↓
[Gate 1: Retrieve with metadata]
  → top-k fragments + document_id + offset_start + offset_end
  ↓
[Gate 2: Generate with strict citation format]
  → answer + list of (claim_id, document_id, verbatim_quote, offset_estimate)
  ↓
[Gate 3: Textual validation]
  → for each claim, look up the quote in the cited document
  → on failure: regenerate with a stricter prompt (up to 3 attempts in total)
  ↓
[Gate 4: Return]
  → answer with active citations (links to exact fragments), OR
  → controlled refusal ("no source sufficient for ...")
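The pseudocode in this article passes a few record types between gates without defining them. A minimal sketch as Python dataclasses: the Fragment fields follow the Gate 1 code, while the Citation type and the exact shape of ValidationResult are assumptions inferred from how the later code uses them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """Gate 1 output: a retrieved chunk plus the metadata validation needs."""
    text: str
    document_id: str
    offset_start: int  # character offset of the chunk in the original document
    offset_end: int    # exclusive end offset
    score: float
    source_url: str


@dataclass(frozen=True)
class Citation:
    """Gate 2 output: one claim and the quote the model says supports it."""
    claim_id: str
    document_id: str
    verbatim_quote: str


@dataclass
class ValidationResult:
    """Gate 3 output: overall verdict plus (claim_id, reason) pairs for failures."""
    passed: bool
    failed_claims: list[tuple[str, str]] = field(default_factory=list)
```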

Gate 1 — Retrieve with metadata

Unlike a simple RAG that returns just text and similarity score, in this pipeline every fragment must carry the metadata required for validation:

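def retrieve(query: str, top_k: int = 20) -> list[Fragment]:
    # encoder, vector_db, bm25_index and rrf_fuse are the deployment's own components
    query_embedding = encoder.encode(query)
    # Over-fetch from both retrievers (3 x top_k) so fusion has candidates to rerank
    vector_results = vector_db.search(query_embedding, k=top_k * 3)
    bm25_results = bm25_index.search(query, k=top_k * 3)
    # Reciprocal Rank Fusion of the two rankings, keeping the best top_k
    fused = rrf_fuse(vector_results, bm25_results, k=top_k)
    return [
        Fragment(
            text=f.text,
            document_id=f.document_id,        # critical for validation
            offset_start=f.offset_start,      # critical for validation
            offset_end=f.offset_end,
            score=f.score,                    # fused score
            source_url=f.source_url,
        )
        for f in fused
    ]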
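rrf_fuse is not defined in the article. A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns items ranked best-first and identified by a hashable key. Note the naming: here `top_n` is the output size and `k` is the usual RRF smoothing constant (60 in the original RRF paper), which is a different meaning from the `k=top_k` argument in the call above.

```python
from collections import defaultdict
from collections.abc import Hashable, Sequence


def rrf_fuse(*ranked_lists: Sequence[Hashable], top_n: int, k: int = 60) -> list[Hashable]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order
    return sorted(scores, key=scores.__getitem__, reverse=True)[:top_n]


vector_hits = ["A", "B", "C"]
bm25_hits = ["C", "A", "D"]
fused = rrf_fuse(vector_hits, bm25_hits, top_n=3)
```

Items found by both retrievers ("A" and "C") rise to the top. In the pipeline, a natural key would be (document_id, offset_start), so the same chunk returned by both retrievers is counted once.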

Fragments must be indexed with offsets in the original document, not just text. This is the most expensive change versus a simple RAG: the ingest pipeline must preserve every chunk’s position in the original document.
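A minimal sketch of an offset-preserving chunker, assuming fixed-size character windows with overlap. The article does not say how Leta chunks (real legal pipelines often split on article or paragraph boundaries), and the sizes below are illustrative. The invariant that Gate 3 relies on is that `document[offset_start:offset_end]` equals the chunk text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    document_id: str
    text: str
    offset_start: int  # inclusive character offset in the original document
    offset_end: int    # exclusive character offset


def chunk_with_offsets(document_id: str, document: str,
                       size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Split a document into overlapping windows, recording each window's position."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks: list[Chunk] = []
    for start in range(0, len(document), size - overlap):
        end = min(start + size, len(document))
        # The stored text is exactly the document slice, so offsets stay verifiable
        chunks.append(Chunk(document_id, document[start:end], start, end))
        if end == len(document):
            break  # the last window already reaches the end of the document
    return chunks


doc = "Art. 1. " * 300  # 2,400 characters of placeholder text
chunks = chunk_with_offsets("DOC_1", doc, size=800, overlap=100)
```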

Gate 2 — Generate with strict citation format

The prompt must force the model to quote verbatim and mark every claim with an identifier. Here is a prompt template that works in production:

System: You are a legal assistant who answers STRICTLY based on the supplied fragments.

RULES:
1. Every factual statement MUST be marked with [c1], [c2], etc.
2. At the end, list citations in JSON format:
   [{"claim_id": "c1", "document_id": "...", "verbatim_quote": "..."}]
3. verbatim_quote MUST be copied verbatim from the indicated fragment, no paraphrase.
4. If you cannot answer based on the fragments, return: {"answer": null, "reason": "no_source"}

Available fragments:
[1] document_id=DOC_123: "If the individual employment contract..."
[2] document_id=DOC_456: "Dismissal for reasons attributable to the employee..."
...

User question: {query}
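A minimal sketch of assembling the prompt above from Gate 1 fragments. The rules text is copied from the template and the `[n] document_id=...: "..."` layout mirrors it; the function name `build_prompt` and the trimmed-down Fragment type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:  # only the fields the prompt needs (hypothetical, trimmed)
    text: str
    document_id: str


SYSTEM_RULES = """System: You are a legal assistant who answers STRICTLY based on the supplied fragments.

RULES:
1. Every factual statement MUST be marked with [c1], [c2], etc.
2. At the end, list citations in JSON format:
   [{"claim_id": "c1", "document_id": "...", "verbatim_quote": "..."}]
3. verbatim_quote MUST be copied verbatim from the indicated fragment, no paraphrase.
4. If you cannot answer based on the fragments, return: {"answer": null, "reason": "no_source"}"""


def build_prompt(query: str, fragments: list[Fragment]) -> str:
    """Render the rules, the numbered fragments and the user question as one prompt."""
    lines = [SYSTEM_RULES, "", "Available fragments:"]
    for i, frag in enumerate(fragments, start=1):
        # Fragment text is passed unchanged so the model can copy it verbatim
        lines.append(f'[{i}] document_id={frag.document_id}: "{frag.text}"')
    lines += ["", f"User question: {query}"]
    return "\n".join(lines)


prompt = build_prompt("What is the deadline?",
                      [Fragment("If the individual employment contract...", "DOC_123")])
```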

The model produces something like:

{
  "answer": "Disciplinary dismissal is allowed under the Labour Code [c1]. The procedure requires written notice within 30 days [c2].",
  "citations": [
    {"claim_id": "c1", "document_id": "DOC_456", "verbatim_quote": "Dismissal for reasons attributable to the employee..."},
    {"claim_id": "c2", "document_id": "DOC_456", "verbatim_quote": "the deadline for applying the disciplinary sanction is 30 days..."}
  ]
}
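Before the textual check of Gate 3, a cheap structural check can catch malformed output: every [cN] marker in the answer should have a citation entry, and every citation should be used. A minimal sketch, assuming the JSON shape shown above; `check_structure` and its problem labels are hypothetical.

```python
import json
import re

MARKER = re.compile(r"\[(c\d+)\]")  # claim markers such as [c1], [c12]


def check_structure(raw_output: str) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_json"]
    if data.get("answer") is None:
        # Rule 4 refusal: {"answer": null, "reason": "no_source"}
        return [] if data.get("reason") == "no_source" else ["null_answer_without_reason"]

    used = set(MARKER.findall(data["answer"]))
    cited = {str(c.get("claim_id")) for c in data.get("citations", [])}
    problems = [f"marker_without_citation:{cid}" for cid in sorted(used - cited)]
    problems += [f"citation_without_marker:{cid}" for cid in sorted(cited - used)]
    return problems
```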

Gate 3 — Textual validate

This is where the magic happens. For every citation we validate that verbatim_quote actually appears in document_id:

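from collections import defaultdict

def validate_citations(answer: dict, fragments: list[Fragment]) -> ValidationResult:
    # A document usually contributes several chunks; keep all of them, not just the last
    texts_by_doc: dict[str, list[str]] = defaultdict(list)
    for f in fragments:
        texts_by_doc[f.document_id].append(f.text)
    failed = []

    # A controlled refusal ({"answer": null, ...}) carries no citations
    for citation in answer.get("citations", []):
        doc_id = citation["document_id"]
        quote = citation["verbatim_quote"]

        if doc_id not in texts_by_doc:
            failed.append((citation["claim_id"], "doc_not_in_retrieved"))
            continue

        if not any(quote_found(quote, text) for text in texts_by_doc[doc_id]):
            failed.append((citation["claim_id"], "quote_not_found"))

    return ValidationResult(
        passed=len(failed) == 0,
        failed_claims=failed,
    )

def quote_found(quote: str, source_text: str) -> bool:
    # Exact validation (case-insensitive, diacritics respected)
    if quote.lower() in source_text.lower():
        return True
    # Fuzzy match if the difference is only whitespace/punctuation
    if normalize_text(quote) in normalize_text(source_text):
        return True
    # HIGH similarity threshold (>= 0.95) for OCR-induced noise only;
    # text_similarity must score the quote against its best-matching window
    # of source_text, since a whole-fragment ratio almost never reaches 0.95
    return text_similarity(quote, source_text) >= 0.95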

The 0.95 threshold is not a semantic measure; it only absorbs technical noise (OCR-confused characters, whitespace, ligatures). Below 0.95 we treat the quote as paraphrased, which is not acceptable.
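The validator calls normalize_text and text_similarity without defining them. One possible sketch, under these assumptions: normalization applies NFKC (which folds ligatures such as "ﬁ"), lowercases, drops punctuation and collapses whitespace while keeping letters (including diacritics) and digits; similarity is difflib's ratio against the best-matching same-length window of the source. The digit guard is an addition not in the original: it rejects a fuzzy match whose numbers are absent from the source, so OCR tolerance cannot reopen the number-swap hole described earlier.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_text(text: str) -> str:
    """NFKC-normalize, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w keeps letters with diacritics and digits
    return " ".join(text.split())


def text_similarity(quote: str, source_text: str) -> float:
    """Best difflib ratio between the quote and any same-length window of the source."""
    q, s = normalize_text(quote), normalize_text(source_text)
    if not q:
        return 0.0
    if len(s) <= len(q):
        return SequenceMatcher(None, q, s).ratio()
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(q)  # difflib caches data about seq2, so reuse it across windows
    best = 0.0
    for i in range(len(s) - len(q) + 1):
        matcher.set_seq1(s[i:i + len(q)])
        best = max(best, matcher.ratio())
    return best


def digits_agree(quote: str, source_text: str) -> bool:
    """Digit guard: every number in the quote must also appear in the source."""
    return set(re.findall(r"\d+", quote)) <= set(re.findall(r"\d+", source_text))


src = "The deadline for applying the disciplinary sanction is 30 days."
ocr = "deadline for applying the discipIinary sanction is 30 days"   # 'I' misread for 'l'
swapped = "deadline for applying the disciplinary sanction is 60 days"
```

With this guard, the fuzzy branch of `quote_found` would read `text_similarity(quote, source_text) >= 0.95 and digits_agree(quote, source_text)`: the OCR variant passes, the swapped number does not. Same-length windows tolerate substitutions better than inserted or dropped characters; that trade-off is a simplification.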

Full pipeline with retry

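def answer_with_grounding(query: str) -> Response:
    fragments = retrieve(query, top_k=20)

    if not fragments:
        return Response.refusal("no_relevant_sources")

    failed_claims = []
    for attempt in range(3):  # 3 attempts in total: the first try plus 2 retries
        # strictness grows with each attempt; claims that failed on the previous
        # attempt are passed back as examples of what NOT to do
        answer = generate(query, fragments, strictness=attempt,
                          failed_claims=failed_claims)

        # The model may itself decline (rule 4 of the prompt)
        if answer.get("answer") is None:
            return Response.refusal("no_source")

        validation = validate_citations(answer, fragments)
        if validation.passed:
            return Response.success(answer, fragments)

        failed_claims = validation.failed_claims

    # After 3 failed attempts, refuse
    return Response.refusal("could_not_ground", failed_claims=failed_claims)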

strictness increases with each attempt: the first uses the standard prompt, the second adds negative examples built from the failed claims, and the third restricts the model to minimal claims.
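A minimal sketch of how the three strictness levels could translate into extra prompt text. The wording of the instructions, the function name and the choice to make the levels cumulative are illustrative assumptions, not the production prompt.

```python
def strictness_suffix(strictness: int, failed_claims: list[tuple[str, str]]) -> str:
    """Extra instructions appended to the base prompt for a given attempt (0, 1, 2)."""
    parts: list[str] = []
    if strictness >= 1 and failed_claims:
        # Second attempt onward: show the failed claims as what NOT to do
        bad = ", ".join(f"{cid} ({reason})" for cid, reason in failed_claims)
        parts.append(f"Your previous citations failed validation: {bad}. "
                     "Do NOT paraphrase; copy quotes character for character.")
    if strictness >= 2:
        # Third attempt: only claims a single quoted fragment directly supports
        parts.append("Make only the minimal claims that a single quoted fragment "
                     "directly supports. Omit anything you cannot quote.")
    return "\n".join(parts)


s1 = strictness_suffix(1, [("c2", "quote_not_found")])
```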

Gate 4 — Controlled return

The final answer must carry its citations with it. In the UI, every claim links to the exact fragment, and the user can see the full context in one click. This discipline is essential for lawyers and auditors: they do not accept "trust me" as an answer, but they do accept "here is the source, see for yourself".
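Because Gate 1 keeps offsets, a validated quote can become a link to an exact character range in the source document. A minimal sketch, assuming the quote matched exactly inside the fragment text, that lowercasing preserves string length (true for Romanian text, not for every script), and a hypothetical `#offset=start-end` anchor that the document viewer understands. Fuzzy matches would need the position of the matched window instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fragment:  # trimmed to the fields this step needs
    text: str
    document_id: str
    offset_start: int  # position of fragment.text in the original document
    source_url: str


def citation_link(quote: str, fragment: Fragment) -> Optional[str]:
    """Map a quote inside a fragment to an absolute character range in the document."""
    pos = fragment.text.lower().find(quote.lower())
    if pos == -1:
        return None  # not an exact match; fuzzy matches need separate handling
    start = fragment.offset_start + pos
    end = start + len(quote)
    return f"{fragment.source_url}#offset={start}-{end}"  # hypothetical anchor format


frag = Fragment("Dismissal for reasons attributable to the employee is possible.",
                "DOC_456", 1000, "https://example.com/docs/DOC_456")
link = citation_link("reasons attributable to the employee", frag)
```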

For refusal cases the message must be actionable:

"I do not have a sufficiently exact source for this question.
Suggestion: rephrase with more specific terms or consult
[link to raw search results for query] directly."

Critical edge cases

Aggregate claims that span multiple sources: "The Labour Code provides 30-day dismissal for disciplinary reasons" combines information from different articles. Solution: force the model to split such statements into granular claims, each with its own citation.

Partially overlapping quotes: two citations share common phrasing, so the same quote can match in more than one place. Solution: index with exact offsets and validate against the cited fragment, not with a fuzzy match on the full document.

Repealed documents: legislation changes. A citation that passes validation today may come from a document that was repealed six months ago. Solution: valid_from / valid_until metadata on every fragment, filtered at retrieve time.

Quotes with redactions or anonymisation: anonymised case law contains markers such as [PARTY NAME]. The model must be instructed to keep those markers exactly as they appear.
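The repealed-documents case reduces to a small deterministic filter. A minimal sketch, assuming each fragment carries valid_from / valid_until as dates, with None meaning open-ended; the date representation and the function name `in_force` are assumptions. Many vector databases can apply such a metadata filter inside the search itself, which avoids fetching fragments only to discard them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fragment:
    text: str
    document_id: str
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still in force


def in_force(fragments: list[Fragment], on: date) -> list[Fragment]:
    """Retrieve-time filter: keep only fragments whose validity interval covers `on`."""
    return [
        f for f in fragments
        if (f.valid_from is None or f.valid_from <= on)
        and (f.valid_until is None or on <= f.valid_until)
    ]


old = Fragment("old wording", "DOC_1", date(2010, 1, 1), date(2025, 10, 31))
new = Fragment("new wording", "DOC_2", date(2025, 11, 1), None)
undated = Fragment("no metadata", "DOC_3")
kept = in_force([old, new, undated], on=date(2026, 4, 30))
```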

Measured costs and benefits

On an internal benchmark of 1,200 Romanian legal questions:

Token cost is roughly 2.3× higher per query (due to retries and longer prompts). For regulated sectors that surcharge is acceptable; for a generic chatbot it is not justified.

Pipeline diagram

Query
  → Retrieve (vector + BM25 + RRF, top-20 with offset metadata)
  → Generate (prompt with JSON citation format)
  → Validate (exact or normalized match, similarity >= 0.95 only for OCR noise; doc_id in retrieved set)
  → if pass: Return with active citations
  → if fail: Regenerate with increasing strictness (max 3 attempts)
  → after 3 failed attempts: Controlled refusal with suggested action

Operational conclusion

Citation grounding is not a technology. It is an engineering discipline that reorganises the RAG pipeline around one guarantee: no substantive statement in an answer leaves the system without a textually verifiable source. The cost is real (latency, tokens, complexity). For sectors where a hallucination means professional or legal liability, the benefit is hard to put a price on.

For CAI Technology clients in legal and procurement verticals, this pipeline is standard. For implementation on your own corpus, we offer structured technical consulting over 4–8 weeks.


Next step

If your team is evaluating a citation-grounding pipeline on its own corpus, we offer a 30-minute technical session for feasibility assessment.
