Fine-tuning LLMs on Romanian corpora: challenges, costs, real outcomes
Why Romanian makes up under 0.3% of the pretraining mix of frontier models, and how to do continued pretraining + SFT correctly on a legal and procurement corpus.
When a tax-consulting firm asked us why a US frontier model drafts contracts with the phrase „forța majoră” rendered as „force majeure” in the middle of Romanian text, the technical answer was unpleasant: in that model’s pretraining mix, Romanian sits somewhere below 0.3% of total tokens. For languages with this statistical footprint, the model does not „speak” Romanian; it approximates it. And in legal and procurement text, approximation is error.
This article describes what it actually takes, in practice, to fine-tune an LLM on real Romanian content: the data you collect, the order you use it in, what it costs, and what you gain over an English-language prompt.
TL;DR
- Public frontier models (GPT family, Claude, Gemini, Llama) are estimated to carry well under 1% Romanian tokens in pretraining, roughly 0.3–0.8% at best. For Romanian legal, fiscal, or procurement text, the model „transfers” from English and French, which produces systematic errors.
- The practical answer is not pretraining a model from scratch (prohibitive cost) but continued pretraining on a clean Romanian corpus followed by SFT (supervised fine-tuning) on domain instructions.
- Tokenizers matter: tokenizers trained mostly on English split Romanian words into 2–4× more tokens, which shrinks the effective context window and degrades cosine similarity on embeddings.
- A full pass (CPT 30B tokens + SFT 100K examples) for an open-weight 7–14B model costs in the few-thousand-EUR range on owned or rented GPUs. ROI is positive above ~50K queries/month.
- Measured gain: on an 800-question internal legal benchmark, F1 on correct article extraction rises from 0.61 (generic model) to 0.89 (CPT + SFT on Romanian legal corpus).
Why Romanian is under-represented
Pretraining for frontier models uses predominantly English datasets: filtered Common Crawl, GitHub, ArXiv, Wikipedia. In these corpora, the ratio of Romanian tokens to total tokens is depressed by two forces:
- Volume of public content: far less indexable Romanian text than English, French, German, or Spanish.
- Quality filtering: deduplication and quality filters (English-centred) disproportionately remove Romanian text, because their quality signals (formatting, structure heuristics) were calibrated on English.
The result: a 70B model trained on 15 trillion tokens sees between 30 and 100 billion tokens in Romanian. That sounds large, but it is under 1% of the mix. For comparison, it sees ~10–12 trillion tokens in English. The gap explains why the model „understands” Romanian but produces subtle errors on specialised terminology.
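The arithmetic behind these figures is easy to check. The sketch below only recomputes the shares from the numbers quoted above; the variable names are illustrative.

```python
# Share of Romanian tokens implied by the figures quoted in the text.
TOTAL_TOKENS = 15e12            # 15 trillion tokens in the pretraining run
RO_LOW, RO_HIGH = 30e9, 100e9   # 30 to 100 billion Romanian tokens
EN_LOW = 10e12                  # lower bound of the English estimate

def share_percent(part: float, total: float) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part / total

low = share_percent(RO_LOW, TOTAL_TOKENS)
high = share_percent(RO_HIGH, TOTAL_TOKENS)
en_to_ro = EN_LOW / RO_HIGH     # most favourable case for Romanian

print(f"Romanian share: {low:.2f}% to {high:.2f}%")
print(f"English-to-Romanian ratio: at least {en_to_ro:.0f}x")
```

Even in the most favourable case, the model sees about a hundred times more English than Romanian.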
Three architectural options
When a Romanian company wants an LLM that performs on its corpus, three paths exist:
- Prompt engineering on an English-centric frontier model + RAG over RO documents. Minimal cost, easy to ship, but limited: the model „translates” mentally, mixes French/English terms into output, struggles with inflected forms.
- Continued pretraining (CPT) on an open-weight base (Qwen, Gemma, Mistral, Llama family) using a clean Romanian corpus, followed by SFT on domain instructions. Mid cost, better latency (on-prem model), full control over terminology.
- Pretraining from scratch on a Romanian corpus. Prohibitive cost (>10M EUR for a competitive model), hard to justify for a single company in 2026.
In practice, option 1 works for POCs and small volumes. Option 2 is the industry standard for companies that want consistent quality on Romanian content. Option 3 is academic.
A practical CPT + SFT pipeline
Here is the pipeline we use, without confidential numbers:
Romanian corpus (legal, finance, procurement, ANAP (formerly ANRMAP), OG)
→ cleanup (dedup, toxicity filters, PII filters)
→ extended tokenizer (vocabulary with frequent RO terms)
→ CPT on base model (low LR, 70% RO / 30% EN mix)
→ SFT on domain instructions (10K–100K prompt+answer pairs)
→ DPO or RLHF on preferences (optional)
→ eval on internal benchmark (legal, fiscal, procurement)
Step 1 — Corpus collection
For a Romanian legal or procurement domain, public sources include:
- consolidated legislation from Monitorul Oficial (public; repealed acts are removed automatically);
- jurisprudence from the High Court and courts of appeal (public, after anonymisation);
- guidelines from ANAP (formerly ANRMAP), ANAF, ASF, BNR (public);
- official forms and contract templates;
- contracting-authority decisions (SEAP, CNSC rulings).
Realistic volumes: a 5–30B-token corpus is sufficient for effective CPT. Quality matters more than quantity: aggressive deduplication, removal of boilerplate, preservation of correct diacritics.
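As a minimal sketch of the deduplication step, the snippet below drops exact repeats after normalising whitespace, case, and Unicode composition. It is an illustration, not the pipeline described here: the function names and the sample sentences are made up, and truly aggressive deduplication would also need near-duplicate detection (for example MinHash), which is not shown.

```python
import hashlib
import re
import unicodedata

def fingerprint(text: str) -> str:
    """Hash a whitespace- and case-normalised copy of the text."""
    normalised = unicodedata.normalize("NFC", text)
    normalised = re.sub(r"\s+", " ", normalised).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each document, drop exact repeats."""
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Made-up sample documents; the second differs from the first only in spacing.
docs = [
    "Art. 1. Prezenta lege reglementează achizițiile publice.",
    "Art. 1.  Prezenta lege reglementează   achizițiile publice.",
    "Art. 2. Autoritatea contractantă publică anunțul.",
]
unique_docs = deduplicate(docs)
print(len(unique_docs))
```

Note that the original text, diacritics included, is what gets kept; only the fingerprint is normalised.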
Step 2 — The tokenizer
The most underestimated decision. A tokenizer trained mostly on English splits „desfacerea contractului individual de muncă” into 18–22 tokens; an extended tokenizer with Romanian vocabulary brings it to 9–11. That means:
- larger effective context window (more document fits in the prompt);
- faster training (fewer tokens for the same information);
- more stable embeddings (everyday words don’t fragment into ambiguous sub-tokens).
Extending the tokenizer means adding 1,000–5,000 new tokens covering frequent Romanian words and forms, then initialising new embeddings (mean over existing sub-tokens is a sensible anchor).
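The mean-over-sub-tokens initialisation can be sketched in a few lines. This is a toy version on plain Python lists, assuming a tiny 3-dimensional embedding table and a hypothetical split of the new word into ids 0, 1 and 2; a real setup would do the same on the model's embedding matrix after resizing it.

```python
def mean_init(sub_token_ids, embedding_table):
    """Average the embedding rows of the sub-tokens a new token replaces."""
    rows = [embedding_table[i] for i in sub_token_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

# Toy 3-dimensional embedding table; real tables have thousands of columns.
embedding_table = [
    [0.0, 1.0, 2.0],   # id 0
    [2.0, 1.0, 0.0],   # id 1
    [4.0, 4.0, 4.0],   # id 2
]

# Hypothetical: the old tokenizer split the new word into ids 0, 1 and 2.
new_row = mean_init([0, 1, 2], embedding_table)
embedding_table.append(new_row)  # the new token gets id 3
print(new_row)
```

Starting from the mean puts the new token close to where the model already represents the word, so CPT refines it rather than learning it from random noise.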
Step 3 — Continued pretraining
On the extended base model we run CPT with a 70% RO / 30% EN mix. The ratio is not arbitrary: at 100% RO the model „forgets” English and loses general capabilities (reasoning, code, math). Below 50% RO the gain is too small for the cost.
The learning rate is set below half the standard pretraining LR, because continued pretraining must avoid „destroying” existing representations. Number of epochs: 1–2, never more, since overfitting on a domain corpus is a common trap.
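One simple way to get a 70% RO / 30% EN mix is to pick the source of each training document at random with fixed probabilities. The sketch below assumes two in-memory lists with placeholder document names and mixes by document count; a production mix would normally be weighted by tokens, which is not shown.

```python
import random

def mixed_stream(ro_docs, en_docs, ro_share=0.7, n=1000, seed=0):
    """Yield n documents, drawing from ro_docs with probability ro_share."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    for _ in range(n):
        source = ro_docs if rng.random() < ro_share else en_docs
        yield rng.choice(source)

# Placeholder documents; the prefix marks the language.
ro_docs = ["ro_doc_1", "ro_doc_2"]
en_docs = ["en_doc_1", "en_doc_2"]

batch = list(mixed_stream(ro_docs, en_docs, ro_share=0.7, n=10_000))
ro_fraction = sum(d.startswith("ro_") for d in batch) / len(batch)
print(f"Romanian fraction in the stream: {ro_fraction:.3f}")
```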
Step 4 — Supervised fine-tuning
Here come the domain instructions: clean (prompt, answer) pairs from specialists. For a legal assistant, that means 10,000–100,000 structured examples: a real legal question, a correct response with citation, a consistent format.
SFT quality beats any quantity. 10,000 clean examples from senior lawyers beat 200,000 synthetic examples generated by another LLM. Synthetic datasets are useful for augmentation but cannot replace human supervision.
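A cheap guard for SFT quality is to validate every record before it enters training. The sketch below assumes a hypothetical JSONL schema with `prompt`, `answer` and `citation` fields and placeholder article references; the real schema and checks depend on the project.

```python
import json

REQUIRED_FIELDS = ("prompt", "answer", "citation")  # hypothetical schema

def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Only check the citation once all fields are present.
    if not problems and record["citation"] not in record["answer"]:
        problems.append("answer does not quote its citation")
    return problems

# Made-up records with placeholder article references.
lines = [
    '{"prompt": "Care este termenul de contestare?", '
    '"answer": "Termenul este prevăzut la art. X.", "citation": "art. X"}',
    '{"prompt": "Ce este forța majoră?", "answer": "", "citation": "art. Y"}',
]
results = [validate(json.loads(line)) for line in lines]
for problems in results:
    print(problems or "ok")
```

Automated checks like these catch only formal defects; whether the answer is legally correct still needs a specialist.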
Real costs
For an open-weight 7–14B model, on owned or rented GPUs (A100/H100 family):
- CPT on 20–30B tokens: 800–2,500 GPU-hours
- SFT on 50–100K examples: 30–80 GPU-hours
- Eval + iterations: an extra 20–40%
In absolute numbers, a full pass on a 14B model costs somewhere between 3,000 and 10,000 EUR (depending on provider and utilisation efficiency). On a 70B model, costs scale roughly linearly with parameter count and quickly exceed 30,000 EUR per pass.
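The CPT figures can be sanity-checked with the common rule of thumb that training costs about 6 × parameters × tokens FLOPs. The sketch below rests on two assumptions that are not from the measurements above: roughly 1e15 peak dense BF16 FLOP/s per H100-class GPU, and 30% model FLOPs utilisation.

```python
def gpu_hours(params, tokens, peak_flops, mfu):
    """Estimate training GPU-hours with the 6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    effective_flops_per_second = peak_flops * mfu
    return total_flops / effective_flops_per_second / 3600

# Assumptions: ~1e15 peak dense BF16 FLOP/s per GPU, 30% utilisation.
PEAK, MFU = 1e15, 0.30

small = gpu_hours(7e9, 20e9, PEAK, MFU)    # 7B model, 20B tokens
large = gpu_hours(14e9, 30e9, PEAK, MFU)   # 14B model, 30B tokens
print(f"{small:.0f} to {large:.0f} GPU-hours")
```

Under these assumptions the estimate lands close to the 800–2,500 GPU-hour range quoted above; lower utilisation or older GPUs push it up.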
ROI turns positive against frontier APIs when at least one of the following holds:
- you run over 50,000 queries/month with long prompts (>5,000 tokens);
- you run over 200,000 queries/month with short prompts;
- you have a sub-500ms latency requirement that an external API cannot meet.
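A break-even estimate for the volume thresholds can be sketched as follows. All prices here are hypothetical placeholders (a blended API price of 5 EUR per million tokens, 700 EUR/month hosting, a 6,000 EUR one-off fine-tuning cost); substitute real quotes before drawing conclusions.

```python
def monthly_api_cost(queries, tokens_per_query, eur_per_million_tokens):
    """Monthly API bill for a given query volume and prompt size."""
    return queries * tokens_per_query * eur_per_million_tokens / 1_000_000

def breakeven_months(one_off_cost, monthly_hosting, monthly_api):
    """Months until the saved API spend repays the fine-tune; None if never."""
    saving = monthly_api - monthly_hosting
    if saving <= 0:
        return None
    return one_off_cost / saving

# Hypothetical prices, for illustration only.
EUR_PER_M_TOKENS = 5.0
HOSTING = 700.0
FINE_TUNE = 6000.0

high_volume = monthly_api_cost(50_000, 5_000, EUR_PER_M_TOKENS)
low_volume = monthly_api_cost(10_000, 5_000, EUR_PER_M_TOKENS)

months_high = breakeven_months(FINE_TUNE, HOSTING, high_volume)
months_low = breakeven_months(FINE_TUNE, HOSTING, low_volume)
print(months_high, months_low)
```

With these placeholder prices, 50,000 long-prompt queries per month repay the fine-tune in under a year, while 10,000 never do.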
Practical traps
Catastrophic forgetting. CPT tends to degrade code and math capabilities. Solution: keep 10–15% code and 5% math reasoning in the mix (these can come out of the non-Romanian share), even if your target is legal.
Inconsistent diacritics. Many public Romanian corpora mix two encodings of the same letters: the correct comma-below forms ș/ț (U+0219, U+021B) and the legacy cedilla forms ş/ţ (U+015F, U+0163), plus their uppercase counterparts. Normalisation must happen explicitly, before tokenisation. Otherwise the model learns two variants for the same word.
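A minimal normalisation sketch, using only the Python standard library: compose combining marks with NFC, then map the cedilla letters to their comma-below equivalents. The function name is illustrative.

```python
import unicodedata

# Legacy cedilla forms -> correct Romanian comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

def normalise_diacritics(text: str) -> str:
    """Compose combining marks (NFC), then map cedilla letters to comma-below."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)

legacy = "\u015fi \u0163ar\u0103"          # "şi ţară" written with cedillas
clean = normalise_diacritics(legacy)
decomposed = normalise_diacritics("s\u0326")  # s + combining comma below
print(clean, decomposed)
```

Running NFC first also catches text where the diacritic is stored as a separate combining character.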
SFT with English prompts + Romanian answers. If your SFT has English instructions and Romanian answers, the model learns to answer in Romanian reliably only when it is asked in English, and behaves less predictably on Romanian prompts. Make sure you have enough RO–RO pairs.
Weak eval. Quality public Romanian benchmarks are scarce. Invest in internal eval: 500–2,000 real domain questions with answers verified by specialists. Without eval, you do not know whether the fine-tune worked.
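For an article-extraction benchmark, one way to score a model is micro-averaged F1 over the sets of articles cited per question. The sketch below is an assumption about how such a score could be computed, not a description of how the 0.61 and 0.89 figures were obtained; the article labels are placeholders.

```python
def micro_f1(predictions, gold):
    """Micro-averaged F1 over per-question sets of cited articles."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # articles cited and correct
        fp += len(pred - ref)   # articles cited but wrong
        fn += len(ref - pred)   # correct articles the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Placeholder labels: three questions, one fully right, one partial, one wrong.
gold = [{"art. A"}, {"art. B", "art. C"}, {"art. D"}]
predictions = [{"art. A"}, {"art. B"}, {"art. E"}]
score = micro_f1(predictions, gold)
print(f"{score:.3f}")
```

Running the same script on the generic model and on the fine-tuned one, against the same verified answers, is what makes a before/after comparison meaningful.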
Pipeline diagram
Public RO documents
→ Cleanup + deduplication
→ Diacritic normalisation + PII filters
→ Extended tokenizer (RO vocabulary)
→ CPT (mix 70% RO / 30% EN, low LR)
→ SFT (10K–100K domain pairs)
→ DPO/RLHF (optional)
→ Eval on internal benchmark
→ Deployment with prompt-cache
Operational conclusion
Fine-tuning on a Romanian corpus is not a research project. It is an engineering decision with a calculable ROI when volume justifies it. For companies processing tens of thousands of legal, fiscal, or procurement documents in Romanian, the gap between an English-centric frontier model with RAG and an open-weight model with CPT+SFT on RO content is tangible: terminology precision, lower latency, drift control.
If your team is evaluating such a project, we can deliver a feasibility analysis with real volumes and cost estimation against your existing stack.