Hybrid search: RRF vs Cohere Rerank vs cross-encoder
Practical comparison of Reciprocal Rank Fusion, Cohere Rerank, and BGE cross-encoder for hybrid search. Latency, quality, cost — when each one wins.
Hybrid search: RRF vs Cohere Rerank vs BGE cross-encoder — which one wins, when
In a modern RAG pipeline, retrieval rarely relies on a single algorithm. You combine dense embedding retrieval with sparse BM25 retrieval, then send the results through a reranking stage that reorders the top-K candidates before passing them to the LLM. The operational question is: which reranking method do you choose?
This article compares three popular approaches in 2026: Reciprocal Rank Fusion (RRF) as a model-free method, Cohere Rerank as a specialised external API, and the BGE-reranker-v2 family as an open-source cross-encoder. The comparison covers latency, quality, cost, and sovereignty.
TL;DR
- RRF is a model-free mathematical algorithm, extremely cheap (microseconds), works well as a first dense+sparse fusion step, but does not deeply reorder.
- Cohere Rerank is a specialised API with high multilingual quality, 100–300 ms latency, ~1 EUR / 1000 queries for top-100 candidates.
- A self-hosted BGE-reranker-v2-m3 cross-encoder offers near-Cohere quality, 50–150 ms latency on a mid-range GPU, and no per-query fee (the only recurring cost is the GPU).
- On a Romanian legal corpus, NDCG@10 ranking: BGE-reranker-v2-m3 (0.91) ≈ Cohere Rerank v3 (0.92) >> RRF alone (0.85).
- The standard production configuration: RRF fuses the dense and sparse top-100 lists, then a cross-encoder reranks the fused top-50 down to the final top-10.
Three methods, three profiles
Reciprocal Rank Fusion (RRF)
RRF is a simple formula: given multiple rankings (from different retrievers), you score each document from its rank positions alone:
RRF_score(d) = sum_i (1 / (k + rank_i(d)))
Where k is a constant (typically 60) and rank_i(d) is the document’s position in retriever i’s ranking.
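The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rrf_fuse` and the document IDs are hypothetical, and each input ranking is assumed to be a list of document IDs ordered best first.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best first (rank 1 = index 0).
    Returns (doc_id, score) pairs sorted by descending RRF score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score; ties are broken by doc_id so the output is deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical document IDs, for illustration only
dense_ranking = ["d3", "d1", "d7"]
bm25_ranking = ["d1", "d9", "d3"]
fused = rrf_fuse([dense_ranking, bm25_ranking])
```

Documents that appear in both lists (here `d1` and `d3`) accumulate two terms and rise above documents seen by only one retriever, which is the whole effect of the method.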
Advantages:
- Zero training, zero model, zero extra runtime cost
- Works natively over any combination of retrievers (BM25, dense, learned sparse, ColBERT, etc.)
- Sub-millisecond latency
Limitations:
- Adds no relevance signal of its own: it only fuses the rankings the retrievers already produced
- Does not "understand" query-document relevance; it merely combines pre-existing rankings
- On hard queries where all retrievers fail, RRF fixes nothing
Cohere Rerank
Cohere offers a dedicated reranking API. The model is a cross-encoder (input: query + document, output: relevance score). The API accepts a query and up to 1000 documents, returning a score for each.
Advantages:
- High multilingual quality (100+ languages in v3)
- Zero setup — a POST request
- Continuous model updates with no effort on your side
Limitations:
- Recurring cost (~1 EUR / 1000 queries with top-100)
- 100–300 ms p50 latency (region- and congestion-dependent)
- Sends documents to Cohere — privacy implications for regulated sectors
- Vendor lock-in
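As a rough sketch of what the "POST request" looks like through the Cohere Python SDK: the helper below assumes a client created with `cohere.Client(api_key)` and the model name `rerank-multilingual-v3.0`. The helper name `rerank_with_cohere` is hypothetical, and the client is passed in so the function itself has no network dependency.

```python
def rerank_with_cohere(client, query, documents, top_n=10,
                       model="rerank-multilingual-v3.0"):
    """Return (document, relevance_score) pairs, best first.

    client is expected to be a Cohere SDK client, e.g. cohere.Client(api_key).
    documents is a list of strings (the candidate passages).
    """
    response = client.rerank(
        model=model,
        query=query,
        documents=documents,
        top_n=top_n,
    )
    # Each result carries the index of the document in the input list
    return [(documents[r.index], r.relevance_score) for r in response.results]
```

Every candidate text leaves your infrastructure in this call, which is the privacy limitation listed above.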
Self-hosted cross-encoder (BGE-reranker-v2-m3)
BGE-reranker-v2-m3 is BAAI’s open-weight model, trained on a multilingual corpus. It functions as a cross-encoder: concatenated input (query + document), score output.
Advantages:
- Quality close to Cohere across many domains (typically within 2–3 NDCG points)
- Self-hosted, full data residency control
- No per-query fee (the only recurring cost is GPUs)
- Tunable: you can fine-tune on your own corpus for further gains
Limitations:
- Requires GPU (CPU is too slow for production)
- Non-trivial operational setup (model serving, batching, caching)
- Model updates require testing and redeployment
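A minimal sketch of self-hosted scoring, assuming the model is loaded through the `sentence-transformers` library as `CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)`. That loading path is an assumption rather than something this article prescribes, and the helper name `rerank_with_cross_encoder` is hypothetical. The model object is passed in, so any object exposing `predict(pairs)` works.

```python
def rerank_with_cross_encoder(model, query, documents, top_n=10):
    """Score (query, document) pairs with a cross-encoder and keep the best top_n.

    model is any object exposing predict(list_of_pairs) -> scores, such as
    sentence_transformers.CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512).
    """
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)
    # Highest relevance score first
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]
```

Unlike the API route, the query and documents never leave the host running the model.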
Romanian legal corpus benchmark
Setup: 50K legal fragments, 800 real queries, manually annotated ground truth.
| Metric | Dense only (BGE-M3) | + RRF (dense+sparse) | + Cohere Rerank v3 | + BGE-reranker-v2-m3 |
|---|---|---|---|---|
| MRR@10 | 0.79 | 0.83 | 0.91 | 0.90 |
| Recall@10 | 0.85 | 0.88 | 0.93 | 0.92 |
| NDCG@10 | 0.81 | 0.85 | 0.92 | 0.91 |
| Latency p50 | 38 ms | 41 ms | 180 ms | 95 ms |
| Cost/1K queries | 0.05 EUR | 0.05 EUR | 1.10 EUR | 0.20 EUR |
Observations:
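- RRF alone is a cheap improvement (NDCG@10 rises from 0.81 to 0.85), but an NDCG@10 of 0.85 or lower is not enough for regulated sectors.
- Cohere and BGE-reranker-v2-m3 are nearly tied on quality. A gap of about 1 point on NDCG@10, MRR@10, and Recall@10 is not significant on this corpus.
- Self-hosted BGE-reranker is roughly 5× cheaper at scale (0.20 vs 1.10 EUR per 1K queries) and roughly 2× faster at p50 (95 ms vs 180 ms).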
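For readers who want to reproduce the quality rows of the table on their own corpus, here is one way to compute the three metrics for a single query. It assumes binary relevance (a document is either in the annotated ground-truth set or not) and no duplicate IDs in the ranking. The article does not state whether its annotations were graded, so treat this as a sketch under that assumption. The reported figures would be the mean over all queries.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking and ground truth for one query
ranked = ["a", "b", "c"]
relevant = {"b", "c"}
metrics = (mrr_at_k(ranked, relevant),
           recall_at_k(ranked, relevant),
           ndcg_at_k(ranked, relevant))
```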
Recommended standard configuration
Based on production experience, the configuration delivering the best quality/cost ratio is:
Query
├── Retriever 1: dense BGE-M3 (top-100)
├── Retriever 2: BM25 sparse (top-100)
│
▼
RRF fusion → top-50 candidates
│
▼
Cross-encoder rerank (BGE-reranker-v2-m3) → final top-10
│
▼
LLM with citation grounding
The flow has two distinct stages: cheap RRF fusion over each retriever's top-100, then an expensive deep rerank of the fused top-50 with a cross-encoder. The cross-encoder cost applies to only 50 documents rather than the up to 200 retrieved, and final top-10 quality is near optimal.
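The two-stage flow can be sketched as a single function. Everything passed in here is a hypothetical hook standing in for your own components: `retrievers` are callables returning document IDs best first, `score_pairs` wraps the cross-encoder, and `fetch_text` looks up a document's text. The defaults mirror the numbers in the diagram (top-100 per retriever, top-50 after fusion, final top-10).

```python
from collections import defaultdict

def hybrid_search(query, retrievers, score_pairs, fetch_text,
                  per_retriever_k=100, fused_k=50, final_k=10, rrf_k=60):
    """Two-stage retrieval: RRF fusion, then cross-encoder rerank.

    retrievers:  callables (query, k) -> list of doc IDs, best first
    score_pairs: callable list[(query, text)] -> list of relevance scores
    fetch_text:  callable doc_id -> document text
    """
    # Stage 1: cheap fusion over each retriever's top candidates
    fused = defaultdict(float)
    for retrieve in retrievers:
        for rank, doc_id in enumerate(retrieve(query, per_retriever_k), start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    candidates = sorted(fused, key=lambda d: (-fused[d], d))[:fused_k]

    # Stage 2: expensive rerank over the fused candidates only
    scores = score_pairs([(query, fetch_text(d)) for d in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc_id for doc_id, _ in reranked[:final_k]]
```

Keeping the scorer behind a callable means the same function works with a hosted API or a self-hosted model, so the reranker can be swapped without touching the fusion stage.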
When to pick Cohere
Cohere Rerank is the right call when:
- You have small volumes (under 50K queries/month) and don’t want to maintain GPUs.
- You have a POC / MVP and want to validate the hypothesis before investing.
- You operate in non-regulated sectors where transferring to Cohere is fine.
- You handle exotic languages where open-source models have weak coverage.
When to pick self-hosted BGE-reranker
BGE-reranker-v2-m3 is the right call when:
- You have large volumes (above 200K queries/month) — savings come quickly.
- You operate in regulated sectors (legal, medical, financial, public).
- You already have GPUs for other components (embeddings, LLM); the reranker adds only ~10% utilisation.
- You want full stack control and the option to fine-tune on your corpus.
When RRF alone is enough
RRF alone works when:
- The corpus is small and homogeneous (under 10K documents).
- Queries are simple, vocabulary is standardised.
- There is zero budget for specialised reranking.
- You are building a prototype that will be enriched later.
In serious production, RRF alone leaves 5–8 NDCG points on the table compared to a reranker. If those matter, move to a reranker.
Operational traps
GPU batching. A cross-encoder is slow if you score one (query, document) pair at a time. Score pairs in batches of 16–64 for efficient GPU utilisation. For individual queries with tight latency budgets, use request coalescing to merge concurrent requests into one batch.
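A minimal sketch of the batching idea, with `score_batch` as a hypothetical callable that wraps one model forward pass over a list of pairs. Libraries often do this internally (for instance, `CrossEncoder.predict` in `sentence-transformers` takes a `batch_size` argument), so this helper is only needed when you call the model directly.

```python
def score_in_batches(pairs, score_batch, batch_size=32):
    """Score (query, document) pairs in fixed-size batches.

    score_batch takes a list of pairs and returns one score per pair.
    The final batch may be smaller than batch_size.
    """
    scores = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        scores.extend(score_batch(batch))
    return scores
```

With 50 fused candidates and a batch size of 32, the reranker runs two forward passes per query instead of 50.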
Length truncation. Cross-encoder models have a context limit (typically 512 or 1024 tokens). Long documents must be truncated or split into windows. The naive strategy (keeping only the first 512 tokens) loses information in the tail of the document.
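One possible windowing strategy is sketched below: split the tokenised document into overlapping windows, score each one, and keep the best score. Taking the maximum is an assumption made for illustration (mean or first-window weighting are alternatives), and `score_window` is a hypothetical callable that scores one window against the query. In practice the window size should also leave room for the query tokens.

```python
def split_into_windows(tokens, window=512, stride=256):
    """Split a token list into overlapping windows covering the whole document."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the tail of the document is now covered
    return windows

def score_long_document(doc_tokens, score_window, window=512, stride=256):
    """Score every window and keep the best one (max-pooling is an assumption)."""
    return max(score_window(w) for w in split_into_windows(doc_tokens, window, stride))
```

The trade-off is cost: a 1,000-token document becomes three cross-encoder calls instead of one, so windowing multiplies the stage-two budget for long documents.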
Cache reranking scores. For repeating queries, scores can be cached per (query, doc_id) pair. Internal corporate search systems typically see a 20–40% hit rate.
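A minimal in-memory version of such a cache might look like the following. The class name `RerankScoreCache` and the injected `score_pairs` scorer are hypothetical, and the whitespace normalisation of the query is an assumption about what counts as a "repeating" query. A production cache would also need eviction and invalidation when a document or the model changes.

```python
class RerankScoreCache:
    """In-memory cache of reranker scores keyed by (query, doc_id)."""

    def __init__(self, score_pairs):
        self._score_pairs = score_pairs  # callable: list[(query, text)] -> scores
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def score(self, query, docs):
        """docs: dict doc_id -> text. Returns dict doc_id -> score."""
        key_query = " ".join(query.split())  # collapse whitespace only
        missing = [d for d in docs if (key_query, d) not in self._cache]
        self.hits += len(docs) - len(missing)
        self.misses += len(missing)
        if missing:
            # Only uncached pairs reach the (expensive) scorer
            new_scores = self._score_pairs([(query, docs[d]) for d in missing])
            for doc_id, value in zip(missing, new_scores):
                self._cache[(key_query, doc_id)] = value
        return {d: self._cache[(key_query, d)] for d in docs}
```

The `hits` and `misses` counters make it easy to measure the hit rate on your own traffic before deciding whether the cache is worth operating.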
Calibration across retrievers. RRF assumes ranks are at similar scales. If one retriever typically returns 1000 candidates and another 10, the fusion is unbalanced. Cap top-K per retriever before RRF.
Decision diagram
Quality requirement?
├── NDCG > 0.90 mandatory → cross-encoder (Cohere or BGE)
├── NDCG 0.83-0.90 acceptable → RRF + selective cross-encoder
└── NDCG 0.80-0.83 sufficient → RRF alone
Regulated sector?
├── Yes → BGE self-hosted (data residency)
└── No → Cohere fine for POC
Monthly volume?
├── < 50K → Cohere cheaper operationally
├── 50K-200K → break-even, depends on the team
└── > 200K → BGE self-hosted wins decisively
Operational conclusion
In 2026, hybrid search is not a luxury. It is the industry standard for any corpus above 50K documents with varied queries. The reranker choice depends on real constraints (cost, sovereignty, operations), not on "what's hot".
For CAI Technology clients, the default configuration is RRF + self-hosted BGE-reranker-v2-m3, switching to Cohere Rerank for fast POCs. This decision has saved clients tens of thousands of euros annually at volumes above 1M queries.
Related articles
- Pillar RAG — enterprise architectures
- BGE-M3 vs OpenAI embeddings on Romanian queries
- Citation grounding: 4-gate implementation
External sources
- Cormack et al., "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"
- Chen et al., "BGE M3-Embedding"
- Cohere, "Rerank documentation"
- Nogueira & Cho, "Passage Re-ranking with BERT"
Next step
For a reranker benchmark on your corpus, we can run the three approaches (RRF / Cohere / BGE) in parallel on 500 real queries and deliver a report in 2 weeks.