CAI Technology

Fine-tuning LLMs on Romanian corpora: real challenges

Why Romanian makes up under 0.3% of the tokens frontier models are pretrained on, and how to do continued pretraining + SFT correctly on a legal and procurement corpus.

CAI Technology · Last reviewed: 4/30/2026

Fine-tuning LLMs on Romanian corpora: challenges, costs, real outcomes

When a tax-consulting firm asked us why a US frontier model drafts contracts with the phrase „forța majoră” rendered as „force majeure” in the middle of Romanian text, the technical answer was unpleasant: in that model’s pretraining mix, Romanian sits somewhere below 0.3% of total tokens. For languages with this statistical footprint, the model does not „speak” Romanian; it approximates it. And in legal and procurement text, approximation is error.

This article describes what it actually takes, in practice, to fine-tune an LLM on real Romanian content: the data you collect, the order you use it in, what it costs, and what you gain over an English-language prompt.

TL;DR

Why Romanian is under-represented

Pretraining for frontier models uses predominantly English datasets: filtered Common Crawl, GitHub, ArXiv, Wikipedia. In these corpora, the ratio Romanian-tokens-to-total is depressed by two forces:

The result: a 70B model trained on 15 trillion tokens sees between 30 and 100 billion tokens in Romanian. That sounds large, but it is only about 0.2–0.7% of the mix. For comparison, the same model sees ~10–12 trillion tokens in English, roughly 100 to 400 times more. The gap explains why the model „understands” Romanian but produces subtle errors on specialised terminology.
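The shares follow directly from the figures quoted above. A minimal sketch of the arithmetic, using only those figures (the variable names are ours, and nothing here is measured on a real model):

```python
# Figures quoted in the paragraph above; nothing here is measured.
TOTAL_TOKENS = 15e12      # 15 trillion pretraining tokens
RO_TOKENS_LOW = 30e9      # lower estimate for Romanian tokens
RO_TOKENS_HIGH = 100e9    # upper estimate for Romanian tokens
EN_TOKENS_LOW = 10e12     # lower estimate for English tokens


def share(part: float, total: float) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part / total


ro_low = share(RO_TOKENS_LOW, TOTAL_TOKENS)
ro_high = share(RO_TOKENS_HIGH, TOTAL_TOKENS)
en_to_ro = EN_TOKENS_LOW / RO_TOKENS_HIGH  # most favourable case for Romanian

print(f"Romanian share: {ro_low:.2f}% to {ro_high:.2f}%")
print(f"English tokens per Romanian token (best case): {en_to_ro:.0f}")
```

Even in the most favourable case for Romanian, the model has seen about a hundred English tokens for every Romanian one.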

Three architectural options

When a Romanian company wants an LLM that performs on its corpus, three paths exist:

  1. Prompt engineering on an English-centric frontier model + RAG over RO documents. Minimal cost, easy to ship, but limited: the model „translates” mentally, mixes French/English terms into output, struggles with inflected forms.
  2. Continued pretraining (CPT) on an open-weight base (Qwen, Gemma, Mistral, Llama family) using a clean Romanian corpus, followed by SFT on domain instructions. Mid cost, better latency (on-prem model), full control over terminology.
  3. Pretraining from scratch on a Romanian corpus. Prohibitive cost (>10M EUR for a competitive model), unjustifiable for anyone in 2026.

In practice, option 1 works for POCs and small volumes. Option 2 is the industry standard for companies that want consistent quality on Romanian content. Option 3 is academic.

A practical CPT + SFT pipeline

Here is the pipeline we use, without confidential numbers:

Romanian corpus (legal, finance, procurement, ANRMAP, OG)
   → cleanup (dedup, toxicity filters, PII filters)
   → extended tokenizer (vocabulary with frequent RO terms)
   → CPT on base model (low LR, 70% RO / 30% EN mix)
   → SFT on domain instructions (10K–100K prompt+answer pairs)
   → DPO or RLHF on preferences (optional)
   → eval on internal benchmark (legal, fiscal, procurement)

Step 1 — Corpus collection

For a Romanian legal or procurement domain, public sources include:

Realistic volumes: a 5–30B-token corpus is sufficient for effective CPT. Quality matters more than quantity: aggressive deduplication, removal of boilerplate, preservation of correct diacritics.
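As an illustration of the cleanup step, here is a minimal sketch of exact deduplication with a crude short-fragment filter. The function names, the 40-character threshold, and the sample strings are assumptions made for this example. A production pipeline would typically add near-duplicate detection on top, which this sketch does not cover.

```python
import hashlib
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hash a document after NFC normalisation, lowercasing and whitespace collapsing."""
    canonical = unicodedata.normalize("NFC", text).lower()
    canonical = re.sub(r"\s+", " ", canonical).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(docs, min_chars=40):
    """Keep the first copy of each document and drop very short fragments."""
    seen = set()
    kept = []
    for doc in docs:
        if len(doc.strip()) < min_chars:
            continue  # likely a header, footer or navigation fragment
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Made-up sample strings, for illustration only.
docs = [
    "Art. 1. Prezenta lege reglementează atribuirea contractelor de achiziție publică.",
    "Art. 1.  Prezenta lege reglementează   atribuirea contractelor de achiziție publică.\n",
    "Pagina 3",
    "Art. 2. Scopul prezentei legi este promovarea concurenței între operatorii economici.",
]
clean_docs = deduplicate(docs)
print(len(docs), "->", len(clean_docs))
```

Note that the fingerprint lowercases and collapses whitespace but leaves diacritics untouched, so the kept documents still carry their original characters.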

Step 2 — The tokenizer

The most underestimated decision. A tokenizer trained mostly on English splits „desfacerea contractului individual de muncă” into 18–22 tokens; an extended tokenizer with Romanian vocabulary brings it to 9–11. That means:

Extending the tokenizer means adding 1,000–5,000 new tokens covering frequent Romanian words and forms, then initialising new embeddings (mean over existing sub-tokens is a sensible anchor).
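The mean-initialisation idea can be sketched on a toy embedding table. Everything below is illustrative: the table values, the token ids, and the `mean_init` helper are hypothetical. In a real framework the same averaging would be applied to the rows of the model's embedding matrix (and to the output projection when it is not tied).

```python
def mean_init(old_piece_ids, embeddings):
    """Average the embedding rows of the sub-tokens that used to spell a new token."""
    dim = len(embeddings[0])
    rows = [embeddings[i] for i in old_piece_ids]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Toy embedding table: 4 existing tokens, dimension 3 (hypothetical values).
embeddings = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [4.0, 4.0, 4.0],
    [1.0, 0.0, 1.0],
]

# Hypothetical: a new Romanian token that the old tokenizer split into ids 0, 1 and 2.
old_piece_ids = [0, 1, 2]
new_row = mean_init(old_piece_ids, embeddings)
embeddings.append(new_row)  # the new token takes the next free id (4)

print(new_row)
```

Starting from the mean keeps the new row inside the region of embedding space the model already knows, which is why it works as an anchor before CPT adjusts it.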

Step 3 — Continued pretraining

On the extended base model we run CPT with a 70% RO / 30% EN mix. The ratio is not arbitrary: at 100% RO the model „forgets” English and loses general capabilities (reasoning, code, math). Below 50% RO the gain is too small for the cost.
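One way to realise such a mix is weighted sampling across sources. The sketch below is a toy under stated assumptions: it samples whole documents rather than tokens, it cycles through each source indefinitely (a real run would cap repetition), and the `mixed_stream` name and the sample documents are ours.

```python
import random
from itertools import cycle


def mixed_stream(sources, weights, n_samples, seed=0):
    """Yield (source_name, document) pairs drawn according to `weights`."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    rng = random.Random(seed)
    names = list(sources)
    iterators = {name: cycle(sources[name]) for name in names}
    probs = [weights[name] for name in names]
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iterators[name])


sources = {"ro": ["doc_ro_1", "doc_ro_2"], "en": ["doc_en_1"]}
weights = {"ro": 0.7, "en": 0.3}

sample = list(mixed_stream(sources, weights, n_samples=10_000))
ro_share = sum(1 for name, _ in sample if name == "ro") / len(sample)
print(f"observed RO share: {ro_share:.3f}")
```

Because documents differ in length, a token-accurate mix would weight by token counts instead of document counts.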

The learning rate is below half the standard pretraining LR (continued pretraining demands caution to avoid „destroying” existing representations). Number of epochs: 1–2, never more — overfitting on a domain corpus is a common trap.
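A hedged sketch of what "below half the pretraining LR" can look like as a schedule: linear warmup followed by cosine decay, with the peak capped at a fraction of the original peak. The 0.4 scale, the warmup length, the decay floor, and the 3e-4 base value are all assumptions chosen for illustration, not the settings of any particular run.

```python
import math


def cpt_learning_rate(step, total_steps, pretrain_peak_lr,
                      scale=0.4, warmup_steps=100, floor_ratio=0.1):
    """Linear warmup, then cosine decay; the peak is scale * pretrain_peak_lr."""
    peak = scale * pretrain_peak_lr
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak * (floor_ratio + (1.0 - floor_ratio) * cosine)


# Hypothetical numbers: a base model pretrained with a 3e-4 peak LR, 1,000 CPT steps.
PRETRAIN_PEAK_LR = 3e-4
schedule = [cpt_learning_rate(s, 1_000, PRETRAIN_PEAK_LR) for s in range(1_000)]
print(f"max LR: {max(schedule):.2e}, final LR: {schedule[-1]:.2e}")
```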

Step 4 — Supervised fine-tuning

Here come the domain instructions: clean (prompt, answer) pairs from specialists. For a legal assistant, that means 10,000–100,000 structured examples: a real legal question, a correct response with citation, a consistent format.
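"Consistent format" is easier to enforce when every example is checked mechanically before training. A minimal sketch, assuming the pairs are stored as JSONL with `prompt`, `answer`, and `citation` fields (a hypothetical schema; the sample values are placeholders, not real legal content):

```python
import json

REQUIRED_FIELDS = ("prompt", "answer", "citation")  # hypothetical schema


def validate_record(line: str):
    """Parse one JSONL line and return (record, list of problems)."""
    record = json.loads(line)
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return record, problems


# Placeholder examples: one complete record, one with an empty answer and no citation.
good = json.dumps({
    "prompt": "Care este termenul de depunere a unei contestații?",
    "answer": "(răspuns redactat de specialist)",
    "citation": "(act normativ și articol)",
}, ensure_ascii=False)
bad = json.dumps({"prompt": "Întrebare fără răspuns", "answer": ""}, ensure_ascii=False)

for line in (good, bad):
    _, problems = validate_record(line)
    print("OK" if not problems else problems)
```

A check like this only guards structure. Whether the answer and its citation are correct still needs a specialist.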

SFT quality beats any quantity. 10,000 clean examples from senior lawyers beat 200,000 synthetic examples generated by another LLM. Synthetic datasets are useful for augmentation but cannot replace human supervision.

Real costs

For an open-weight 7–14B model, on owned or rented GPUs (A100/H100 family):

In absolute numbers, a full pass on a 14B model costs somewhere between 3,000 and 10,000 EUR (depending on provider and utilisation efficiency). On a 70B model, costs scale roughly linearly with parameter count and quickly exceed 30,000 EUR per pass.
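For a rough sanity check on such figures, the common rule of thumb of about 6 FLOPs per parameter per training token gives a back-of-envelope estimate. Every input below is an assumption chosen for illustration (corpus size, GPU peak throughput, sustained utilisation, rental price), and the estimate ignores SFT, evaluation, storage, and failed runs.

```python
def training_cost_eur(params, tokens, peak_flops_per_gpu, utilisation, eur_per_gpu_hour):
    """Estimate GPU-hours and cost from the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6.0 * params * tokens
    effective_flops_per_second = peak_flops_per_gpu * utilisation
    gpu_hours = total_flops / effective_flops_per_second / 3600.0
    return gpu_hours, gpu_hours * eur_per_gpu_hour


# Every number below is an assumption chosen for illustration.
gpu_hours, cost = training_cost_eur(
    params=14e9,              # 14B-parameter model
    tokens=20e9,              # 20B-token CPT corpus, one epoch
    peak_flops_per_gpu=1e15,  # roughly an H100-class BF16 dense peak
    utilisation=0.35,         # fraction of peak actually sustained
    eur_per_gpu_hour=3.0,     # hypothetical rental price
)
print(f"{gpu_hours:,.0f} GPU-hours, about {cost:,.0f} EUR")
```

The estimate is linear in both parameter count and token count, so doubling either doubles the bill under the same assumptions.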

ROI turns positive against frontier APIs when you run:

Practical traps

Catastrophic forgetting. CPT tends to degrade code and math capabilities. Solution: keep 10–15% code and 5% math reasoning in the mix, even if your target is legal.

Inconsistent diacritics. Many public Romanian corpora encode the same diacritic in different ways: the correct comma-below letters ș (U+0219) and ț (U+021B) alongside the legacy cedilla letters ş (U+015F) and ţ (U+0163), plus their uppercase counterparts. Normalisation must happen explicitly, before tokenisation. Otherwise the model learns two variants for the same word.
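A minimal normalisation sketch using only the Python standard library: compose any combining marks with NFC, then map the legacy cedilla letters to their comma-below equivalents. The function name is ours. The mapping assumes the corpus is Romanian only, since cedilla ş is the correct letter in some other languages (Turkish, for example).

```python
import unicodedata

# Legacy cedilla letters mapped to the comma-below letters used in Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla (upper) -> S with comma below
    "\u015F": "\u0219",  # s with cedilla (lower) -> s with comma below
    "\u0162": "\u021A",  # T with cedilla (upper) -> T with comma below
    "\u0163": "\u021B",  # t with cedilla (lower) -> t with comma below
})


def normalise_diacritics(text: str) -> str:
    """Compose combining marks (NFC), then replace cedilla letters with comma-below letters."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


legacy = "Pre\u015Fedintele instan\u0163ei"  # cedilla variants
decomposed = "s\u0326"                        # 's' followed by a combining comma below
clean = normalise_diacritics(legacy)
print(clean, [hex(ord(c)) for c in normalise_diacritics(decomposed)])
```

Running NFC first matters: it folds decomposed sequences into single code points, so the translation table only has to handle four characters.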

SFT with English prompts + Romanian answers. If your SFT set pairs English instructions with Romanian answers, the model tends to respond in Romanian reliably only when asked in English. Make sure you have enough RO–RO pairs.
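A quick audit of the language pairing catches this before training. The sketch assumes each SFT record already carries `prompt_lang` and `answer_lang` labels (hypothetical field names; how the labels are produced is out of scope here).

```python
from collections import Counter


def language_pair_counts(records):
    """Count (prompt language, answer language) combinations in an SFT set."""
    return Counter((r["prompt_lang"], r["answer_lang"]) for r in records)


def ro_ro_share(records):
    """Fraction of pairs where both the prompt and the answer are Romanian."""
    counts = language_pair_counts(records)
    total = sum(counts.values())
    return counts[("ro", "ro")] / total if total else 0.0


# Toy records with hypothetical language labels.
records = [
    {"prompt_lang": "ro", "answer_lang": "ro"},
    {"prompt_lang": "ro", "answer_lang": "ro"},
    {"prompt_lang": "en", "answer_lang": "ro"},
    {"prompt_lang": "en", "answer_lang": "en"},
]
pair_counts = language_pair_counts(records)
print(dict(pair_counts), f"RO-RO share: {ro_ro_share(records):.2f}")
```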

Weak eval. Quality public Romanian benchmarks are scarce. Invest in internal eval: 500–2,000 real domain questions with answers verified by specialists. Without eval, you do not know whether the fine-tune worked.
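An internal eval does not need heavy tooling to start. Below is a minimal harness sketch: it runs a model callable over a list of questions and reports accuracy plus the ids that failed. The benchmark items, the stub model, and the citation-substring check are placeholders. A real setup would plug in the deployed model and a scoring rule agreed with the specialists.

```python
def evaluate(model_fn, benchmark, is_correct):
    """Run model_fn over the benchmark and return (accuracy, failed question ids)."""
    failures = []
    for item in benchmark:
        prediction = model_fn(item["question"])
        if not is_correct(prediction, item["reference"]):
            failures.append(item["id"])
    accuracy = 1.0 - len(failures) / len(benchmark) if benchmark else 0.0
    return accuracy, failures


def cites_reference(prediction: str, reference: str) -> bool:
    """Crude check: the expected citation string appears in the answer."""
    return reference.lower() in prediction.lower()


# Placeholder benchmark and stub model, for illustration only.
benchmark = [
    {"id": "q1", "question": "question 1", "reference": "art. 10"},
    {"id": "q2", "question": "question 2", "reference": "art. 25"},
]


def stub_model(question: str) -> str:
    return "According to Art. 10, the answer is ..."


accuracy, failures = evaluate(stub_model, benchmark, cites_reference)
print(f"accuracy: {accuracy:.2f}, failed: {failures}")
```

Running the same harness on the base model and on the fine-tuned model gives the before/after comparison that tells you whether the fine-tune worked.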

Pipeline diagram

Public RO documents
  → Cleanup + deduplication
  → Diacritic normalisation + PII filters
  → Extended tokenizer (RO vocabulary)
  → CPT (mix 70% RO / 30% EN, low LR)
  → SFT (10K–100K domain pairs)
  → DPO/RLHF (optional)
  → Eval on internal benchmark
  → Deployment with prompt-cache

Operational conclusion

Fine-tuning on a Romanian corpus is not a research project. It is an engineering decision with a calculable ROI when volume justifies it. For companies processing tens of thousands of legal, fiscal, or procurement documents in Romanian, the gap between an English-centric frontier model with RAG and an open-weight model with CPT+SFT on RO content is tangible: terminology precision, lower latency, drift control.

If your team is evaluating such a project, we can deliver a feasibility analysis with real volumes and cost estimation against your existing stack.

Next step

For a technical evaluation on your own corpus (CPT cost estimate, SFT plan, internal benchmark), the CAI Technology team offers a 30-minute consultation at no charge.
