
Multilingual RAG RO + EN: implementation pattern with BGE-M3

How to build a RAG that answers in Romanian over a mixed RO+EN corpus: cross-lingual retrieval, adaptive prompts, citation language match.

CAI Technology · Last reviewed: 4/30/2026

Many Romanian companies have mixed corpora: contracts and legislation in Romanian, but also technical manuals, white papers, and international standards in English. An assistant serving this reality must answer in a single language (usually the user’s Romanian) but be able to retrieve and synthesise information from both languages.

This article describes the technical pattern for a RO + EN multilingual RAG with BGE-M3 as the encoder. It focuses on the three challenges that show up in production: cross-lingual retrieval, adaptive prompts, and citation language match.

Why BGE-M3 for cross-lingual

BGE-M3 was trained explicitly on 100+ languages with cross-lingual alignment. That means the vectors for “concediere disciplinară” in Romanian and “disciplinary dismissal” in English lie close together in vector space. Cosine similarity between such pairs is typically above 0.85.

In practice, this gives you a single index and a single encoder for both languages, with no translation step at query time.

The alternatives are (a) translating queries to English before retrieval, or (b) keeping two separate indices with a language-detection-based switch. Both are more complex, slower, and less precise.
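The alignment claim is easy to sanity-check on your own terminology: embed RO/EN term pairs and look for pairs that fall below the expected similarity. The sketch below is a minimal illustration under stated assumptions: `embed` is a hypothetical callable standing in for whatever BGE-M3 wrapper you use, and the 0.85 default only mirrors the figure quoted above, not a guaranteed property of the model.

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weakly_aligned_pairs(
    pairs: Iterable[tuple[str, str]],
    embed: Callable[[str], Vector],   # hypothetical: text -> embedding vector
    threshold: float = 0.85,          # assumption: mirrors the figure quoted above
) -> list[tuple[str, str]]:
    """Return the (ro, en) term pairs whose similarity falls below the threshold."""
    return [
        (ro, en)
        for ro, en in pairs
        if cosine_similarity(embed(ro), embed(en)) < threshold
    ]
```

Pairs returned by `weakly_aligned_pairs` are good candidates for a closer look, for example domain terms the encoder has rarely seen.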

Pattern architecture

Query (RO or EN)
  → Language detection (optional, for output instruction)
  → BGE-M3 encoder → query vector (1024d)
  → Vector search on a single mixed RO + EN index → top-50
    + BM25 search per language (two BM25 indices, fused with RRF)
  → Multilingual cross-encoder reranker (BGE-reranker-v2-m3)
  → Top-10 fragments (mix RO + EN)
  → LLM with adaptive prompt (output language instruction + fragments tagged with language)
  → Answer in user's language + citations in the document's original language

Step 1 — Mixed indexing

Fragments are indexed with a lang metadata marking their original language:

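from dataclasses import dataclass


@dataclass
class Fragment:
    """One indexed chunk, tagged with the language of its source document."""
    text: str
    lang: str                # "ro" | "en"
    document_id: str
    document_title: str
    offset_start: int        # character offset of the chunk start in the source document
    offset_end: int          # character offset of the chunk end in the source document
    embedding: list[float]   # from BGE-M3, 1024 dims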

Indexing uses the same encoder (BGE-M3) for both languages. The result: cross-lingually comparable vectors.

Trap: chunking must respect the document structure in its language. Romanian legal documents have a hierarchical structure (title → chapter → section → article → paragraph) different from an English white paper (h1 → h2 → paragraphs). Use semantic per-language chunking, not a universal fixed-length chunker.
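One way to picture per-language chunking is a small dispatcher that picks a chunker from the fragment's `lang` tag. The sketch below is illustrative only: the regular expression assumes Romanian articles start a line with "Art. N" or "Articolul N", the English chunker simply splits on blank lines, and all function names are hypothetical. A production chunker would follow the full title → chapter → section → article hierarchy.

```python
import re

# Assumption: articles begin at the start of a line as "Art. 12" or "Articolul 12".
# The lookahead keeps the heading inside the chunk it introduces.
RO_ARTICLE_START = re.compile(r"(?m)^(?=(?:Art\.|Articolul)\s+\d+)")


def chunk_ro_legal(text: str) -> list[str]:
    """Split a Romanian legal text so that each chunk starts at an article heading."""
    parts = RO_ARTICLE_START.split(text)
    return [part.strip() for part in parts if part.strip()]


def chunk_en_paragraphs(text: str) -> list[str]:
    """Split an English document on blank lines (one chunk per paragraph)."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


CHUNKERS = {"ro": chunk_ro_legal, "en": chunk_en_paragraphs}


def chunk_document(text: str, lang: str) -> list[str]:
    """Dispatch to the chunker registered for the document's language."""
    try:
        chunker = CHUNKERS[lang]
    except KeyError:
        raise ValueError(f"no chunker registered for lang={lang!r}") from None
    return chunker(text)
```

Because the pattern is anchored to the start of a line, an inline reference such as "conform Art. 248" in the middle of a sentence does not trigger a split.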

Step 2 — BM25 per language

BM25 does not work cross-lingually (it is lexical, not semantic). Solution: two separate BM25 indices (one RO, one EN), queried in parallel, fused with RRF together with the dense results.

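from typing import Optional

# Assumed to be initialised elsewhere: encoder (BGE-M3), vector_db,
# bm25_index_ro, bm25_index_en, and the rrf_fuse helper.
def hybrid_search(query: str, lang_hint: Optional[str] = None, top_k: int = 50):
    # lang_hint is not used for retrieval in this function
    dense_results = vector_db.search(encoder.encode(query), k=top_k)

    # BM25 per language; the query is not translated, so EN BM25 will be weak if the query is RO.
    # Accepted trade-off: BM25 lifts documents in the query's language.
    bm25_ro = bm25_index_ro.search(query, k=top_k)
    bm25_en = bm25_index_en.search(query, k=top_k)

    # Fuse the three ranked lists with RRF and keep the top_k results
    return rrf_fuse([dense_results, bm25_ro, bm25_en], k=top_k)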

Trap: if the query is Romanian and you only run BM25 over the RO corpus, you lose recall on English documents. BGE-M3 dense retrieval compensates for this gap.
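The `rrf_fuse` helper is not defined in the snippet above. A minimal sketch of Reciprocal Rank Fusion could look like the following, assuming each result list is a ranked list of fragment ids (best first) and that `k` is the number of results to keep, as in the call above. The `rrf_constant` parameter is a name introduced here; 60 is the value commonly used for RRF.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def rrf_fuse(
    result_lists: Iterable[Sequence[Hashable]],
    k: int = 50,
    rrf_constant: int = 60,
) -> list:
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (rrf_constant + rank)) over the lists it appears in,
    with rank starting at 1. Returns the k best ids, best first.
    """
    scores: dict = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (rrf_constant + rank)
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:k]
```

RRF only looks at ranks, so the dense cosine scores and the BM25 scores never need to be put on a common scale. A fragment that appears in several lists rises above one that appears in a single list at a similar rank.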

Step 3 — Cross-lingual reranking

BGE-reranker-v2-m3 is multilingual by design. It accepts a (RO query, EN document) pair and produces a relevance score. No intermediate translation is needed.

In practice, we have observed that reranker scores for cross-lingual pairs run slightly lower (by 0.05 to 0.10) than for same-language pairs. A simple way to offset this bias is a +0.05 boost on cross-lingual scores. Better still, fine-tune the reranker on cross-lingual pairs from your own corpus.
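The boost can be applied as a small post-processing step on the reranker output. The sketch below assumes the reranker scores have already been normalised to the 0 to 1 range, so that an additive 0.05 is meaningful; `Scored` and `apply_cross_lingual_boost` are hypothetical names, and the boost value should be calibrated on your own benchmark.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    fragment_id: str
    lang: str      # language of the fragment: "ro" | "en"
    score: float   # reranker score, assumed normalised to 0..1


def apply_cross_lingual_boost(
    candidates: list[Scored],
    query_lang: str,
    boost: float = 0.05,
) -> list[Scored]:
    """Add `boost` to fragments whose language differs from the query, then re-sort."""
    adjusted = [
        Scored(
            c.fragment_id,
            c.lang,
            c.score + (boost if c.lang != query_lang else 0.0),
        )
        for c in candidates
    ]
    return sorted(adjusted, key=lambda c: c.score, reverse=True)
```

The function returns new objects instead of mutating the input, so the raw reranker scores stay available for logging and later calibration.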

Step 4 — Adaptive prompts

The prompt to the LLM must explicitly contain:

  1. The desired output language (usually the query’s language).
  2. The fragment language — for each fragment.
  3. The instruction not to “translate” citations but to keep the original text.

Sample prompt template:

System: Legal assistant. You answer STRICTLY based on the supplied fragments.

RULES:
1. Your answer MUST be in {output_language}.
2. Citations (verbatim_quote) must remain IN the fragment's original language.
3. DO NOT translate citations. If the fragment is English and you reply in Romanian,
   the citation stays in English.
4. Add a brief note indicating the fragment's language if it differs from the answer language.

Fragments:
[1] (lang=ro, doc_id=DOC_RO_123): "Concedierea disciplinară conform articolului 248..."
[2] (lang=en, doc_id=DOC_EN_456): "Disciplinary dismissal under EU framework..."

User question (lang: ro): {query}

The model produces the answer in Romanian, with citations in the fragment’s original language. This is the correct pattern: the user reads in their language but can verify the citation in the original document.
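Filling the template programmatically could look like the sketch below. It is an assumption-laden illustration: `PromptFragment`, `LANGUAGE_NAMES`, and `build_prompt` are hypothetical names, rule 3 is worded generically so it works in both directions, and the output is a single string that you would split into system and user messages according to your LLM client.

```python
from dataclasses import dataclass

LANGUAGE_NAMES = {"ro": "Romanian", "en": "English"}


@dataclass
class PromptFragment:
    text: str
    lang: str          # "ro" | "en"
    document_id: str


def build_prompt(query: str, query_lang: str, fragments: list[PromptFragment]) -> str:
    """Render the adaptive prompt: output language rule plus language-tagged fragments."""
    output_language = LANGUAGE_NAMES[query_lang]
    lines = [
        "System: Legal assistant. You answer STRICTLY based on the supplied fragments.",
        "",
        "RULES:",
        f"1. Your answer MUST be in {output_language}.",
        "2. Citations (verbatim_quote) must remain IN the fragment's original language.",
        "3. DO NOT translate citations, even when the fragment language differs "
        "from the answer language.",
        "4. Add a brief note indicating the fragment's language if it differs "
        "from the answer language.",
        "",
        "Fragments:",
    ]
    # Number fragments from 1 so the model can cite them as [1], [2], ...
    for index, frag in enumerate(fragments, start=1):
        lines.append(
            f'[{index}] (lang={frag.lang}, doc_id={frag.document_id}): "{frag.text}"'
        )
    lines += ["", f"User question (lang: {query_lang}): {query}"]
    return "\n".join(lines)
```

Tagging every fragment with `lang` in the prompt gives the model the information it needs to apply rules 2 to 4 without guessing the language from the text.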

Step 5 — Citation language match

This is the subtlest rule. Without it, LLMs tend to translate citations for “consistency”. That breaks citation grounding: textual validation requires the cited text to appear verbatim in the source document, untranslated.

In the validation gate, check:

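def validate_citation_language(citation, fragment) -> bool:
    """Check that the quoted text is still in the fragment's original language."""
    if fragment.lang == "ro":
        return is_likely_romanian(citation.verbatim_quote)
    if fragment.lang == "en":
        return is_likely_english(citation.verbatim_quote)
    # Unknown language tag: fail closed instead of silently returning None
    return False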

is_likely_romanian can be a simple classifier (diacritic frequency, characteristic words). For ambiguous cases (short citations with proper names), accept the ambiguity but log for review.
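A heuristic along those lines might look like the sketch below. The word lists are short illustrative samples, not a vetted lexicon, and `guess_language` is a hypothetical helper. Ambiguous quotes return "unknown", which both checks accept, matching the "accept but log for review" policy described above.

```python
import re

# Comma-below and legacy cedilla forms, both cases
RO_DIACRITICS = set("ăâîșțĂÂÎȘȚşţŞŢ")

# Assumption: small illustrative function-word lists; extend them for your corpus.
# Romanian words are listed without diacritics because text that contains
# diacritics is classified before the word lists are consulted.
RO_WORDS = {"si", "este", "care", "pentru", "din", "cu", "la", "se", "sau", "prin", "conform", "nu"}
EN_WORDS = {"the", "and", "of", "is", "for", "with", "under", "that", "this", "by", "or", "to"}


def guess_language(text: str) -> str:
    """Return "ro", "en", or "unknown" from diacritics and function-word counts."""
    if any(ch in RO_DIACRITICS for ch in text):
        return "ro"
    words = re.findall(r"[a-z]+", text.lower())
    ro_hits = sum(word in RO_WORDS for word in words)
    en_hits = sum(word in EN_WORDS for word in words)
    if ro_hits > en_hits:
        return "ro"
    if en_hits > ro_hits:
        return "en"
    return "unknown"  # ambiguous, e.g. a short quote made of proper names


def is_likely_romanian(text: str) -> bool:
    """True unless the text looks clearly English."""
    return guess_language(text) != "en"


def is_likely_english(text: str) -> bool:
    """True unless the text looks clearly Romanian."""
    return guess_language(text) != "ro"
```

In the validation gate, a call site can log every quote for which `guess_language` returns "unknown" so that a reviewer can spot-check them later.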

Practical traps

Inconsistent diacritics. Public RO documents mix correct diacritics (ș, ț) with approximations (s, t). Diacritic normalisation before indexing improves retrieval. But for textual validation, keep both variants (original + normalised) for matching.
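A sketch of both steps, using only the standard library, is shown below. Besides missing diacritics, it also handles the legacy cedilla characters (ş, ţ) that older documents use in place of the comma-below forms (ș, ț). The function names are hypothetical, and the matching strategy (exact first, then diacritic-insensitive) is one possible reading of "keep both variants".

```python
import unicodedata

# Map legacy cedilla forms to the correct comma-below forms
CEDILLA_TO_COMMA = str.maketrans({"ş": "ș", "ţ": "ț", "Ş": "Ș", "Ţ": "Ț"})


def canonical_diacritics(text: str) -> str:
    """Compose characters (NFC) and replace cedilla s/t with comma-below s/t."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


def strip_diacritics(text: str) -> str:
    """Remove all combining marks, so "ședință" becomes "sedinta"."""
    decomposed = unicodedata.normalize("NFD", canonical_diacritics(text))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def quote_in_source(quote: str, source: str) -> bool:
    """Match the quote verbatim first, then fall back to a diacritic-insensitive match."""
    if quote in source:
        return True
    return strip_diacritics(quote) in strip_diacritics(source)
```

Note that `strip_diacritics` removes accents from any language, so it is suited to matching and indexing keys, not to text shown to the user.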

Proper names. Institution, person, and company names often appear unchanged across languages: a Romanian document may cite “European Commission” alongside “Comisia Europeană”. So BM25 sometimes catches cross-lingual matches on proper names, which is fine and useful.

Official translations. For EU directives, parallel official RO and EN versions exist. If you index both, the RAG will surface both as sources, so the same content appears twice in the context. Solution: dedup by (regulation_id, article_number), keeping the language matching the query.
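The dedup step can be sketched as follows, assuming each fragment carries `regulation_id` and `article_number` metadata (fields that the `Fragment` class shown earlier does not have) and that the input list is already ranked best first. `RegFragment` and `dedup_parallel_versions` are hypothetical names; fragments without a regulation id would bypass this step.

```python
from dataclasses import dataclass


@dataclass
class RegFragment:
    regulation_id: str
    article_number: str
    lang: str   # "ro" | "en"
    text: str


def dedup_parallel_versions(
    fragments: list[RegFragment],
    query_lang: str,
) -> list[RegFragment]:
    """Keep one fragment per (regulation_id, article_number), preferring query_lang.

    The output preserves the order in which each key first appears in the input.
    """
    chosen: dict[tuple[str, str], RegFragment] = {}
    for frag in fragments:
        key = (frag.regulation_id, frag.article_number)
        current = chosen.get(key)
        if current is None:
            chosen[key] = frag
        elif current.lang != query_lang and frag.lang == query_lang:
            # Swap in the version that matches the query language;
            # reassigning an existing key keeps its position in the dict.
            chosen[key] = frag
    return list(chosen.values())
```

When only one language version of an article was retrieved, it is kept as-is, so the dedup never drops an article entirely.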

Vocabulary mismatch. “Procurement” in English and “achiziții publice” in Romanian refer to the same concept, but BM25 does not connect them automatically. Dense retrieval with BGE-M3 makes the connection, but if users put the English term inside a Romanian query, ensure the tokeniser handles it.

Evaluation. Your internal benchmark must contain (RO query, ground truth RO + EN fragments) and (EN query, ground truth RO + EN) pairs. Without this, you do not know whether cross-lingual retrieval works in production.
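One way to make such a benchmark actionable is to report recall@k separately for each (query language, fragment language) combination, so that a weak cross-lingual direction is not hidden by a good average. The sketch below is a minimal illustration: `search` is a hypothetical callable returning ranked fragment ids, `frag_lang` maps each fragment id to its language, and the data shapes are assumptions.

```python
from collections import defaultdict
from typing import Callable, Iterable, Mapping, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the first k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(set(relevant_ids))


def recall_by_language_pair(
    examples: Iterable[tuple[str, str, Sequence[str]]],  # (query, query_lang, relevant ids)
    search: Callable[[str], Sequence[str]],              # hypothetical: query -> ranked ids
    frag_lang: Mapping[str, str],                        # fragment id -> "ro" | "en"
    k: int = 10,
) -> dict[tuple[str, str], float]:
    """Average recall@k per (query_lang, fragment_lang) bucket."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for query, query_lang, relevant_ids in examples:
        retrieved = search(query)
        for lang in ("ro", "en"):
            relevant_in_lang = [i for i in relevant_ids if frag_lang[i] == lang]
            if relevant_in_lang:
                buckets[(query_lang, lang)].append(
                    recall_at_k(retrieved, relevant_in_lang, k)
                )
    return {pair: sum(values) / len(values) for pair, values in buckets.items()}
```

A result where the ("ro", "en") bucket is much lower than ("ro", "ro") points to a cross-lingual retrieval problem that an overall recall figure would mask.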

Full pattern diagram

RO document (structured semantic chunking)
  → BGE-M3 encoder → single multilingual vector space
EN document (adapted chunking)
  → BGE-M3 encoder → same multilingual vector space

Query (RO or EN)
  → Language detect (output instruction)
  → BGE-M3 → dense top-50 cross-lingual
  → BM25 RO + BM25 EN (per language)
  → RRF fusion
  → BGE-reranker-v2-m3 (multilingual cross-encoder)
  → Top-10 (mix RO + EN)
  → LLM with adaptive prompt (output_lang, citation_lang_match)
  → Validation: citations in fragment's original language
  → Answer + active citations linking to fragment

Operational conclusion

A multilingual RO + EN RAG is not a complication; it is a standard requirement for real Romanian corpora. The pattern with BGE-M3 + per-language BM25 + cross-encoder rerank + adaptive prompts removes most cross-lingual problems without requiring duplicate infrastructure.

For CAI Technology clients with mixed corpora (Romanian legal + international standards, RO contracts + EN technical manuals), this pattern is the recommended implementation. The incremental investment over a Romanian-only RAG is below 15% effort but covers 100% of the mixed corpus with no quality loss.

Next step

For an evaluation of your own corpus with the RO + EN multilingual pattern, we can run a POC over 1,000 mixed documents with a quality benchmark in 2 weeks.
