rag · April 30, 2026 · 12 min read

Hybrid search: RRF vs Cohere Rerank vs cross-encoder

Practical comparison of Reciprocal Rank Fusion, Cohere Rerank, and BGE cross-encoder for hybrid search. Latency, quality, cost — when each one wins.

CAI Technology · Last reviewed: 4/30/2026

Hybrid search: RRF vs Cohere Rerank vs BGE cross-encoder — which one wins, when

In a modern RAG pipeline, retrieval rarely relies on a single algorithm. You combine dense embeddings with sparse BM25 retrieval, then send the results through a reranking stage that reorders the top-K before passing them to the LLM. The operational question is: which reranking method do you choose?

This article compares three popular approaches in 2026: Reciprocal Rank Fusion (RRF) as a model-free method, Cohere Rerank as a specialised external API, and the BGE-reranker-v2 family as an open-source cross-encoder. The comparison covers latency, quality, cost, and sovereignty.

TL;DR

RRF is a model-free rank-fusion formula. It is extremely cheap (microseconds) and works well as a first dense+sparse fusion step, but it does not rescore documents against the query.
Cohere Rerank is a specialised API with high multilingual quality, 100–300 ms latency, ~1 EUR / 1000 queries for top-100 candidates.
A self-hosted BGE-reranker-v2-m3 cross-encoder offers near-Cohere quality, 50–150 ms latency on a mid-range GPU, and no per-query API fee (only GPU cost).
On a Romanian legal corpus, NDCG@10 ranking: BGE-reranker-v2-m3 (0.91) ≈ Cohere Rerank v3 (0.92) >> RRF alone (0.85).
The standard production configuration: RRF fuses the dense and sparse top-100 lists into the top-50 candidates, then a cross-encoder reranks those 50 down to the final top-10.

Three methods, three profiles

Reciprocal Rank Fusion (RRF)

RRF is a simple formula: given multiple rankings (from different retrievers), you combine scores as:

RRF_score(d) = sum_i (1 / (k + rank_i(d)))

Where k is a constant (typically 60) and rank_i(d) is the document’s position in retriever i’s ranking.
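The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rrf_fuse`, the 1-based rank convention, the tie-breaking rule, and the sample document IDs are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Ranks are 1-based; a document absent from a list contributes nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


# Hypothetical retriever outputs, best result first.
dense = ["d3", "d1", "d7"]
bm25 = ["d1", "d9", "d3"]
fused = rrf_fuse([dense, bm25])
```

Here `d1` ends up first because it ranks well in both lists (1/62 + 1/61), while `d7` and `d9`, each seen by only one retriever, fall to the bottom.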

Advantages:

Zero training, zero model, zero extra runtime cost
Works natively over any combination of retrievers (BM25 + dense + sparse + ColBERT etc.)
Sub-millisecond latency

Limitations:

Does not re-score documents: it only fuses the rankings the retrievers already produced
Does not "understand" query-document relevance; it merely combines pre-existing rankings
On hard queries where all retrievers fail, RRF fixes nothing

Cohere Rerank

Cohere offers a dedicated reranking API. The model is a cross-encoder (input: query + document, output: relevance score). The API accepts a query and up to 1000 documents, returning a score for each.

Advantages:

High multilingual quality (100+ languages in v3)
Zero setup — a POST request
Continuous model updates with no effort on your side

Limitations:

Recurring cost (~1 EUR / 1000 queries with top-100)
100–300 ms p50 latency (region- and congestion-dependent)
Sends documents to Cohere — privacy implications for regulated sectors
Vendor lock-in
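As a rough sketch of what the "POST request" looks like through Cohere's Python SDK, the helper below wraps `client.rerank`. The helper name `cohere_rerank`, the model identifier `rerank-multilingual-v3.0`, and the sample query are assumptions for illustration; check the current Cohere documentation for the model you actually use. The client is passed in as a parameter so the function can be exercised with a stub.

```python
def cohere_rerank(client, query, documents, top_n=10,
                  model="rerank-multilingual-v3.0"):
    """Return (document, relevance_score) pairs, best first.

    client: expected to be a cohere.Client (assumption); any object with a
    compatible .rerank(...) method works, which keeps this testable offline.
    """
    response = client.rerank(
        model=model, query=query, documents=documents, top_n=top_n
    )
    # Each result carries the index of the document in the input list
    # plus its relevance score.
    return [(documents[r.index], r.relevance_score) for r in response.results]


# Usage (needs `pip install cohere` and an API key; not executed here):
# import cohere
# client = cohere.Client("YOUR_API_KEY")
# top = cohere_rerank(client, "termen de prescriptie", candidate_texts)
```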

Self-hosted cross-encoder (BGE-reranker-v2-m3)

BGE-reranker-v2-m3 is BAAI’s open-weight model, trained on a multilingual corpus. It functions as a cross-encoder: concatenated input (query + document), score output.

Advantages:

Quality close to Cohere across many domains (a gap of under 2–3 NDCG points)
Self-hosted, full data residency control
No per-query fee (GPU cost only)
Tunable: you can fine-tune on your own corpus for further gains

Limitations:

Requires GPU (CPU is too slow for production)
Non-trivial operational setup (model serving, batching, caching)
Model updates require testing and redeployment
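A minimal sketch of cross-encoder reranking follows, assuming the model is served through the `sentence-transformers` library and its `CrossEncoder` class. The function names `rerank` and `load_bge_scorer` and the `max_length=512` setting are illustrative choices, not something the benchmark below prescribes. The scorer is injected so the ranking logic can be tested without downloading the model.

```python
def rerank(query, docs, score_pairs, top_n=10):
    """Rerank docs for a query using any pair scorer.

    score_pairs: callable taking a list of (query, document) pairs and
    returning one relevance score per pair (higher = more relevant).
    """
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]


def load_bge_scorer(model_name="BAAI/bge-reranker-v2-m3", max_length=512):
    """Build a scorer backed by sentence-transformers (downloads weights)."""
    from sentence_transformers import CrossEncoder  # imported lazily

    model = CrossEncoder(model_name, max_length=max_length)
    return lambda pairs: model.predict(pairs).tolist()


# Usage (needs `pip install sentence-transformers`; not executed here):
# top10 = rerank("query text", candidate_texts, load_bge_scorer())
```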

Romanian legal corpus benchmark

Setup: 50K legal fragments, 800 real queries, manually annotated ground truth.

Metric	Dense only (BGE-M3)	+ RRF (dense+sparse)	+ Cohere Rerank v3	+ BGE-reranker-v2-m3
MRR@10	0.79	0.83	0.91	0.90
Recall@10	0.85	0.88	0.93	0.92
NDCG@10	0.81	0.85	0.92	0.91
Latency p50	38 ms	41 ms	180 ms	95 ms
Cost/1K queries	0.05 EUR	0.05 EUR	1.10 EUR	0.20 EUR

Observations:

RRF alone is a cheap improvement, but an NDCG@10 of 0.85 is not enough for regulated sectors.
Cohere and BGE-reranker-v2-m3 are nearly tied on quality. A gap of about 1 NDCG point is not significant on this corpus.
Self-hosted BGE-reranker is about 5× cheaper per query (0.20 vs 1.10 EUR per 1K queries) and about 2× faster at p50 (95 ms vs 180 ms).
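For readers who want to reproduce this kind of table on their own corpus, here is one common way to compute the three quality metrics for a single query. The article does not state which exact variant was used (binary or graded relevance, for example), so treat these definitions as assumptions. Each reported figure would be the mean over all queries.

```python
import math


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant doc in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=10):
    """Share of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, gains, k=10):
    """NDCG with gains given as {doc_id: relevance}; missing docs count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

For a ranking `["a", "b", "c"]` with relevant documents `{"b", "z"}`, MRR@10 and Recall@10 are both 0.5: the first hit is at rank 2, and one of the two relevant documents was retrieved.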

Recommended standard configuration

Based on production experience, the configuration delivering the best quality/cost ratio is:

Query
  ├── Retriever 1: dense BGE-M3 (top-100)
  ├── Retriever 2: BM25 sparse (top-100)
  │
  ▼
RRF fusion → top-50 candidates
  │
  ▼
Cross-encoder rerank (BGE-reranker-v2-m3) → final top-10
  │
  ▼
LLM with citation grounding

The flow has two distinct stages: cheap RRF fusion over each retriever's top-100, then an expensive deep rerank of the fused top-50 with a cross-encoder. The cross-encoder scores only 50 documents per query rather than every retrieved candidate, and final top-10 quality is near optimal.
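The two-stage flow can be sketched as a single function. Everything here is a simplified illustration under stated assumptions: `hybrid_search`, the retriever callables, `get_text`, and `score_pairs` are hypothetical interfaces, not the production code behind this article.

```python
def rrf(rankings, k=60):
    """Return doc IDs ordered by Reciprocal Rank Fusion score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query, retrievers, get_text, score_pairs,
                  per_retriever_k=100, fused_k=50, final_k=10):
    """retrievers: callables (query, k) -> list of doc IDs, best first.
    get_text: callable doc_id -> document text.
    score_pairs: callable list[(query, text)] -> list of scores.
    """
    # Stage 1: cheap, rank-only fusion of the capped candidate lists.
    rankings = [retrieve(query, per_retriever_k) for retrieve in retrievers]
    candidates = rrf(rankings)[:fused_k]
    # Stage 2: the expensive model sees only the fused candidates.
    scores = score_pairs([(query, get_text(d)) for d in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in reranked[:final_k]]
```

One consequence of this design is that `fused_k` bounds the reranker's recall: a relevant document that fusion ranks below the cut-off is never scored by the cross-encoder.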

When to pick Cohere

Cohere Rerank is the right call when:

You have small volumes (under 50K queries/month) and don’t want to maintain GPUs.
You have a POC / MVP and want to validate the hypothesis before investing.
You operate in non-regulated sectors where transferring to Cohere is fine.
You handle exotic languages where open-source models have weak coverage.

When to pick self-hosted BGE-reranker

BGE-reranker-v2-m3 is the right call when:

You have large volumes (above 200K queries/month) — savings come quickly.
You operate in regulated sectors (legal, medical, financial, public).
You already have GPUs for other components (embeddings, LLM); the reranker adds roughly 10% utilisation.
You want full stack control and the option to fine-tune on your corpus.

When RRF alone is enough

RRF alone works when:

The corpus is small and homogeneous (under 10K documents).
Queries are simple, vocabulary is standardised.
You have no budget for specialised reranking.
You are building a prototype that will be enriched later.

In serious production, RRF alone leaves 5–8 NDCG points on the table compared to a reranker. If those matter, move to a reranker.

Operational traps

GPU batching. A cross-encoder is slow if you score one query-document pair at a time. Configure a batch size of 16–64 for efficient GPU utilisation. For individual queries with tight latency, use request coalescing (grouping pairs from concurrent queries into one batch).
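A minimal sketch of fixed-size batching, assuming a hypothetical `score_batch` callable that runs one forward pass per batch. Some serving libraries already do this internally (for example, `CrossEncoder.predict` in sentence-transformers accepts a `batch_size` argument), in which case an explicit loop like this is unnecessary.

```python
def score_in_batches(pairs, score_batch, batch_size=32):
    """Score (query, document) pairs in fixed-size batches.

    score_batch: callable that scores one list of pairs in a single
    forward pass and returns one score per pair, in order.
    """
    scores = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return scores
```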

Length truncation. Cross-encoder models have a context limit (typically 512 or 1024 tokens). Long documents must be truncated or split into windows. The naive strategy (keeping only the first 512 tokens) discards everything in the tail of the document.
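One way to avoid losing the tail is to split the document into overlapping windows and keep the best window score. The sketch below assumes the document is already tokenised; `split_windows`, `score_document`, the stride of 256, and max-pooling as the aggregation rule are illustrative choices. In practice the query and special tokens also consume part of the limit, so the per-window budget for the document is smaller than the model's nominal context.

```python
def split_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def score_document(tokens, score_window, max_len=512, stride=256):
    """Score a long document as the maximum over its window scores."""
    return max(score_window(w) for w in split_windows(tokens, max_len, stride))
```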

Cache reranking scores. For repeating queries, (query, doc_id) scores can be cached. Internal corporate search systems typically see a 20–40% hit rate.
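A simple in-memory version of such a cache might look like the sketch below. The class name `CachedReranker`, the query normalisation (lowercasing and whitespace collapsing), and the unbounded dictionary are assumptions for illustration. A production cache would also need a size limit and invalidation when a document or the model changes.

```python
class CachedReranker:
    """Memoise cross-encoder scores keyed by (normalised query, doc_id)."""

    def __init__(self, score_pairs):
        self._score_pairs = score_pairs  # callable: list[(query, text)] -> scores
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def score(self, query, docs):
        """docs: dict doc_id -> text. Returns dict doc_id -> score."""
        key_q = " ".join(query.lower().split())
        missing = [d for d in docs if (key_q, d) not in self._cache]
        if missing:
            # Only uncached pairs reach the expensive model.
            new_scores = self._score_pairs([(query, docs[d]) for d in missing])
            for doc_id, s in zip(missing, new_scores):
                self._cache[(key_q, doc_id)] = s
        self.hits += len(docs) - len(missing)
        self.misses += len(missing)
        return {d: self._cache[(key_q, d)] for d in docs}
```

The `hits` and `misses` counters make it easy to check whether your own traffic reaches a hit rate that justifies the cache.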

Calibration across retrievers. RRF assumes ranks are at similar scales. If one retriever typically returns 1000 candidates and another 10, the fusion is unbalanced. Cap top-K per retriever before RRF.

Decision diagram

Quality requirement?
  ├── NDCG > 0.90 mandatory → cross-encoder (Cohere or BGE)
  ├── NDCG 0.83-0.90 acceptable → RRF + selective cross-encoder
  └── NDCG < 0.83 sufficient → RRF alone

Regulated sector?
  ├── Yes → BGE self-hosted (data residency)
  └── No → Cohere fine for POC

Monthly volume?
  ├── < 50K → Cohere cheaper operationally
  ├── 50K-200K → break-even, depends on the team
  └── > 200K → BGE self-hosted wins decisively
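The three questions above can be combined into a small helper. This is one possible reading of the diagram, not an official rule: the function name `recommend`, the precedence of the regulated-sector check over volume, and the treatment of the exact boundary values are assumptions.

```python
def recommend(target_ndcg, regulated, monthly_queries):
    """Return (quality_tier, engine) following the decision diagram.

    engine is None when no cross-encoder is needed.
    """
    # Question 1: how much quality is required?
    if target_ndcg < 0.83:
        return ("RRF alone", None)
    tier = "cross-encoder" if target_ndcg > 0.90 else "RRF + selective cross-encoder"

    # Question 2: data residency overrides the volume-based choice.
    if regulated:
        return (tier, "self-hosted BGE")

    # Question 3: monthly volume decides between API and self-hosting.
    if monthly_queries < 50_000:
        return (tier, "Cohere")
    if monthly_queries > 200_000:
        return (tier, "self-hosted BGE")
    return (tier, "either (break-even)")
```

For example, `recommend(0.95, False, 10_000)` returns `("cross-encoder", "Cohere")`, while the same quality target in a regulated sector returns self-hosted BGE regardless of volume.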

Operational conclusion

In 2026, hybrid search is not a luxury. It is the industry standard for any corpus above 50K documents with varied queries. The reranker choice depends on real constraints (cost, sovereignty, operations), not on "what's hot".

For CAI Technology clients, the default configuration is RRF + self-hosted BGE-reranker-v2-m3, with a deviation to Cohere Rerank for fast POCs. This decision has saved clients tens of thousands of euros annually at volumes above 1M queries.

Related articles

Pillar RAG — enterprise architectures
BGE-M3 vs OpenAI embeddings on Romanian queries
Citation grounding: 4-gate implementation

External sources

Cormack et al., "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"
Chen et al., "BGE M3-Embedding"
Cohere, "Rerank documentation"
Nogueira & Cho, "Passage Re-ranking with BERT"

Next step

For a reranker benchmark on your corpus, we can run the three approaches (RRF / Cohere / BGE) in parallel on 500 real queries and deliver a report in 2 weeks.
