RAG vs BM25 keyword search: when each is the right call
A CTO decision matrix: when fast, cheap BM25 keyword search beats hybrid RAG, and when investing in semantic RAG is justified.
RAG vs BM25 keyword search: when semantic investment pays off and when it does not
One of the most expensive mistakes we see at clients is the implicit assumption that “RAG is superior to classic search”. In specific scenarios, it is. In others, plain BM25 running on Elasticsearch with a decent tokeniser delivers better, faster, cheaper, and more auditable results than a RAG pipeline with embeddings, a vector DB, and a reranker.
This article gives executives a clear matrix: which use case calls for classic keyword search, when to move to hybrid, and when full semantic RAG investment is justified.
TL;DR
- BM25 keyword search is fast (1–10 ms), cheap, deterministic, auditable. It works excellently on queries with exact technical terms and on domains with a standardised vocabulary.
- Semantic RAG (dense + reranker) is better on conversational queries, paraphrases, synonyms, but costs 5–20× more in compute and latency.
- Hybrid search (BM25 + dense fused with RRF) combines the advantages and is the industry standard for corpora >100K documents with varied queries.
- For legal chatbots, hybrid RAG is mandatory. For a corporate-website search bar, BM25 with manually configured synonyms is often enough.
- The decision is financial: the cost-benefit case becomes clear once you measure real query volumes and the quality the business actually requires, rather than chasing “state-of-the-art”.
Three search types, three different profiles
BM25 (classic keyword search)
BM25 is the standard algorithm in lexical search engines (Elasticsearch, OpenSearch, Solr). It runs on statistical principles: TF (term frequency), IDF (inverse document frequency), document-length normalisation. No embeddings, no neural networks.
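To make the three ingredients concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace-tokenised documents, the classic Okapi form with a Lucene-style non-negative IDF, and the commonly used defaults `k1=1.2` and `b=0.75`. The function name `bm25_scores` and the toy documents are illustrative, not part of any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF: rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # TF saturation combined with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (illustrative only).
docs = [
    "article 55 dismissal of the employee".split(),
    "termination of employment by mutual agreement".split(),
    "annual leave and public holidays".split(),
]
scores = bm25_scores("article 55".split(), docs)
```

Only the first document contains the exact query terms, so it is the only one with a non-zero score. That is both the strength (exact identifiers) and the weakness (no notion of synonyms) of lexical scoring.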
Works excellently when:
- Queries contain exact terms that appear in documents (article numbers, IDs, specific names).
- The domain vocabulary is standardised and small.
- You have zero budget for GPU and a tight latency SLA.
- Auditability is a legal requirement — BM25 is fully explainable.
Limitations:
- Fails on synonyms (“dismissal” vs “termination of employment”) without a manually configured dictionary.
- Fails on conversational paraphrases.
- Fails on queries with typos or missing diacritics unless fuzzy matching and diacritic folding are configured in the analyser.
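The synonym and diacritics limitations are usually handled at analysis time. The sketch below shows the idea in plain Python, assuming a small hand-maintained dictionary; `SYNONYMS`, `fold`, and `expand_query` are hypothetical names, and a production engine would do this inside its own analyser rather than in application code.

```python
import unicodedata

# Hypothetical, hand-maintained synonym dictionary.
SYNONYMS = {
    "dismissal": ["termination of employment"],
}


def fold(text):
    """Lowercase and strip diacritics, e.g. 'Café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_query(query):
    """Return the folded query plus one variant per configured synonym."""
    folded = fold(query)
    variants = [folded]
    for term, alternatives in SYNONYMS.items():
        # Naive substring match; a real analyser works on tokens.
        if term in folded:
            variants.extend(folded.replace(term, alt) for alt in alternatives)
    return variants


variants = expand_query("Dismissal notice")
```

The cost of this approach is maintenance: every synonym has to be known and entered by hand, which is why it fits small, standardised vocabularies.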
Semantic RAG (dense embeddings + LLM)
Semantic RAG uses dense embeddings to capture meaning, not letters. An encoder turns query and documents into vectors; cosine similarity measures semantic closeness.
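As a rough illustration of the ranking step, the sketch below computes cosine similarity between a query vector and two document vectors. The three-dimensional vectors are made up for the example; a real encoder produces hundreds or thousands of dimensions, and the document names are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for encoder output.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "termination of employment": [0.8, 0.2, 0.1],
    "annual leave": [0.1, 0.9, 0.2],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
```

Nothing in this computation looks at the words themselves, which is why a dense retriever can match a paraphrase and also why it can confuse two distinct but similar identifiers.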
Works excellently when:
- Queries are conversational, expressed differently from how they appear in documents.
- The domain has many synonyms and paraphrases impossible to enumerate manually.
- You have GPU budget and accept 50–500 ms latency.
- You want to answer “what does X mean for my business” on documents that use different terminology.
Limitations:
- Fails on queries with exact technical terms (a query for one article number can be matched to semantically similar but different articles).
- Significant compute cost — loading and maintaining a vector index is non-trivial.
- Less auditable — explaining the exact ranking of a document requires specialised tooling.
Hybrid (BM25 + dense, fused)
Hybrid search runs both algorithms in parallel and fuses results using Reciprocal Rank Fusion (RRF) or learned weights. In production this is the industry standard for large corpora and varied queries.
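A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document IDs. The constant `k=60` follows the value used in the original RRF paper; the function name and the document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative outputs of the two retrievers for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF uses only ranks, not raw scores, so BM25 scores and cosine similarities never have to be put on a common scale. Documents found by both retrievers rise to the top.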
Works excellently when:
- You have enough volume to justify the complexity.
- Queries are mixed (some technical, some conversational).
- Quality matters more than operational simplification.
A concrete decision matrix
Based on experience across more than 30 client search projects, we offer the following guide:
| Use case | Corpus volume | Recommendation |
|---|---|---|
| Corporate-website search bar | 1K–10K pages | BM25 with manual synonyms |
| Help centre with FAQ | 100–1000 articles | BM25 + synonyms + manual redirect |
| Internal knowledge base (Confluence/SharePoint) | 10K–100K pages | Hybrid (BM25 + dense, no reranker) |
| Legal / fiscal chatbot | 100K–10M documents | Full semantic RAG (hybrid + reranker + grounding) |
| E-commerce product search | 10K–1M products | Hybrid + filters + popularity rerank |
| Procurement / SEAP search | 100K–1M notices | Hybrid + filters + reranker |
| Internal code assistant | 10K–100K files | Semantic RAG + AST chunking |
How to compute the ROI
The RAG-vs-BM25 ROI is not abstract. It is computed on three axes:
1. Quality
- On what share of queries does BM25 return the correct answer?
- How many extra transactions, clicks, or resolved cases does the incremental relevance gain from RAG translate into?
2. Cost
- BM25: nearly zero marginal cost per query (under 0.001 EUR at scale).
- Hybrid: GPU for the encoder + vector index. Typically 20–200 EUR/month at small-to-mid volume.
- Full RAG with LLM: 1–10 EUR per 1000 queries (depending on prompt length and model).
3. Latency
- BM25: 1–10 ms per query.
- Hybrid: 50–150 ms.
- RAG with LLM: 1.5–8 seconds (LLM generation dominates).
For a system running 100,000 queries/month:
- BM25: ~5–20 EUR/month infra, 99% of queries served
- Hybrid: ~150–400 EUR/month
- Full RAG with LLM: ~500–3000 EUR/month
The gap between hybrid and full RAG with LLM is not in search itself — it is in the cost of generating the answer. If users want only the list of relevant documents (classic search), hybrid is enough and cheap. If users want a synthesised answer (chatbot), the LLM cost appears.
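To compare the tiers on equal footing, the monthly ranges quoted for 100,000 queries/month can be normalised to a cost per 1000 queries. The sketch below only rearranges the figures given in this section; the helper name `cost_per_1000_queries` is illustrative.

```python
QUERIES_PER_MONTH = 100_000

# Monthly cost ranges in EUR (low, high), taken from the figures above.
monthly_cost_eur = {
    "BM25": (5, 20),
    "Hybrid": (150, 400),
    "Full RAG with LLM": (500, 3000),
}


def cost_per_1000_queries(monthly_range, queries=QUERIES_PER_MONTH):
    """Convert a (low, high) monthly cost into a cost per 1000 queries."""
    low, high = monthly_range
    return (low * 1000 / queries, high * 1000 / queries)


per_1000 = {name: cost_per_1000_queries(r) for name, r in monthly_cost_eur.items()}

for name, (low, high) in per_1000.items():
    print(f"{name}: {low:.2f} to {high:.2f} EUR per 1000 queries")
```

Rerunning the same calculation with your own query volume and infrastructure quotes is usually more informative than any generic range.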
Signals from real usage
How do you know it is time to move from BM25 to hybrid or RAG? These are the signals we track with clients:
- “No results” rate above 5%: BM25 is missing common phrasings. Hybrid helps.
- Low click-through rate on top-3 results: ranking is poor. A semantic reranker helps.
- Recurring support tickets like “I can’t find X in search”: users are frustrated. Hybrid with auto-synonyms helps.
- Long conversational queries (over 8 words): BM25 fails predictably. Dense embeddings help directly.
- High vocabulary diversity in the corpus: the same concept expressed in 10 different ways. Embeddings catch this.
Conversely, signals you do NOT need RAG:
- Expert users using standard terminology.
- Volumes below 100 queries/day — overhead doesn’t pay back.
- Strict auditability requirements (regulated public sector, financial services).
The “state-of-the-art” trap
A common mistake: technical teams want to deploy full RAG because it is “modern” and because they want to use embeddings. The result: increased cost, operational complexity, no business gain.
Our standard recommendation: start with BM25 plus manual synonyms. Measure no-result rate, top-3 CTR, and user satisfaction. Move to hybrid when the data shows that BM25 is missing relevant results. Move to full RAG only when users ask for synthesised answers, not link lists.
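The measurable signals can be computed from an ordinary search log. The sketch below assumes a hypothetical log record with the query text, the number of results returned, and the rank of the clicked result. `SearchEvent` and `usage_signals` are illustrative names, and the 8-word threshold for long queries mirrors the signal list earlier in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    query: str
    result_count: int
    clicked_rank: Optional[int]  # 1-based rank of the clicked result, None if no click


def usage_signals(events, long_query_words=8):
    """Compute no-result rate, top-3 CTR, and share of long queries."""
    total = len(events)
    no_results = sum(1 for e in events if e.result_count == 0)
    top3_clicks = sum(
        1 for e in events if e.clicked_rank is not None and e.clicked_rank <= 3
    )
    long_queries = sum(1 for e in events if len(e.query.split()) > long_query_words)
    return {
        "no_result_rate": no_results / total,
        "top3_ctr": top3_clicks / total,
        "long_query_share": long_queries / total,
    }


# Illustrative log entries.
events = [
    SearchEvent("article 55", 12, 1),
    SearchEvent("how do I end a contract with an employee who keeps missing work", 0, None),
    SearchEvent("annual leave", 8, 5),
    SearchEvent("vat registration threshold", 3, 2),
]
signals = usage_signals(events)
```

Tracked weekly, these three numbers give a baseline before any upgrade and a way to verify afterwards whether hybrid or RAG actually moved them.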
Decision diagram
Have a search engine?
├── No → BM25 with Elasticsearch/OpenSearch (start here)
├── BM25 + frequent conversational queries
│   → Add dense encoder + RRF (hybrid)
├── Hybrid + users want synthesised answers
│   → Add LLM with citation grounding (full RAG)
└── Full RAG + critical legal accuracy
    → Add reranker + validation gate
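The same diagram can be written as a small function, which is handy when the decision is revisited on a schedule. This is a sketch only: the stack labels, the parameter names, and the fallback message for cases the diagram does not cover are assumptions made for the example.

```python
def next_step(
    current_stack,
    conversational_queries=False,
    wants_synthesised_answers=False,
    critical_legal_accuracy=False,
):
    """Return the next upgrade suggested by the decision diagram.

    current_stack is one of: "none", "bm25", "hybrid", "full_rag".
    """
    if current_stack == "none":
        return "BM25 with Elasticsearch/OpenSearch"
    if current_stack == "bm25" and conversational_queries:
        return "Add dense encoder + RRF (hybrid)"
    if current_stack == "hybrid" and wants_synthesised_answers:
        return "Add LLM with citation grounding (full RAG)"
    if current_stack == "full_rag" and critical_legal_accuracy:
        return "Add reranker + validation gate"
    # Assumed fallback: no signal justifies an upgrade yet.
    return "Keep the current stack and keep measuring"


suggestion = next_step("bm25", conversational_queries=True)
```

Each branch adds exactly one component to the previous stage, so the cost and the benefit of every step can be measured in isolation.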
Operational conclusion
RAG is not superior to classic search. It is an option with a different profile. The decision is made on volume, query type, quality requirements, and budget. Teams that decide “on trend” spend 5–20× more without justification. Teams that decide on real usage data get a favourable cost-benefit.
For CAI Technology client engagements, the standard advice is: prototype rapidly with BM25, measure for 4 weeks, then decide on an upgrade based on the signals. This discipline eliminates most RAG projects that fail at cost-benefit.
Related articles
- Pillar RAG — enterprise architectures
- BGE-M3 vs OpenAI embeddings on Romanian queries
- Hybrid search RRF vs Cohere vs cross-encoder
External sources
- Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond”
- Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
- Cormack et al., “Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods”
- Elastic, “The BM25 algorithm and its variables”
Next step
For a free assessment of your search stack (4-week audit with quantitative metrics), the CAI Technology team offers a 30-minute consultation at no charge.