CAI Technology
rag · · 12 min read

Hybrid search: RRF vs Cohere Rerank vs cross-encoder

Practical comparison of Reciprocal Rank Fusion, Cohere Rerank, and BGE cross-encoder for hybrid search. Latency, quality, cost — when each one wins.

CAI Technology · Last reviewed: 4/30/2026

Hybrid search: RRF vs Cohere Rerank vs BGE cross-encoder — which one wins, when

In a modern RAG pipeline, retrieval is not done with a single algorithm. You combine dense embeddings with sparse BM25 retrieval, then send the results through a reranking stage that reorders the top-K before passing them to the LLM. The operational question is: which reranking method do you choose?

This article compares three popular approaches in 2026: Reciprocal Rank Fusion (RRF) as a model-free method, Cohere Rerank as a specialised external API, and the BGE-reranker-v2 family as an open-source cross-encoder. The comparison covers latency, quality, cost, and sovereignty.

TL;DR

Three methods, three profiles

Reciprocal Rank Fusion (RRF)

RRF is a simple formula: given multiple rankings (from different retrievers), each document d receives a fused score computed from its ranks, not from the retrievers' raw scores:

RRF_score(d) = sum_i (1 / (k + rank_i(d)))

Where k is a constant (typically 60) and rank_i(d) is the document’s position in retriever i’s ranking.
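The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name rrf_fuse and the sample document IDs are hypothetical, and it assumes that a document missing from a retriever's list simply contributes nothing for that retriever.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by doc_id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["d3", "d1", "d7"]   # hypothetical dense-retriever ranking
sparse = ["d1", "d9", "d3"]  # hypothetical BM25 ranking
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (d1, d3) accumulate two terms and rise above documents seen by only one retriever, which is the whole mechanism: no score normalisation and no model call.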

Advantages:

Limitations:

Cohere Rerank

Cohere offers a dedicated reranking API. The model is a cross-encoder (input: query + document, output: relevance score). The API accepts a query and up to 1000 documents, returning a score for each.

Advantages:

Limitations:

Self-hosted cross-encoder (BGE-reranker-v2-m3)

BGE-reranker-v2-m3 is BAAI’s open-weight model, trained on a multilingual corpus. It functions as a cross-encoder: concatenated input (query + document), score output.

Advantages:

Limitations:

Setup: 50K legal fragments, 800 real queries, manually annotated ground truth.

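| Metric          | Dense only (BGE-M3) | + RRF (dense+sparse) | + Cohere Rerank v3 | + BGE-reranker-v2-m3 |
|-----------------|---------------------|----------------------|--------------------|----------------------|
| MRR@10          | 0.79                | 0.83                 | 0.91               | 0.90                 |
| Recall@10       | 0.85                | 0.88                 | 0.93               | 0.92                 |
| NDCG@10         | 0.81                | 0.85                 | 0.92               | 0.91                 |
| Latency p50     | 38 ms               | 41 ms                | 180 ms             | 95 ms                |
| Cost/1K queries | 0.05 EUR            | 0.05 EUR             | 1.10 EUR           | 0.20 EUR             |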
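For readers who want to reproduce this kind of measurement on their own corpus, here is a hedged sketch of the three quality metrics for a single query. It assumes binary relevance (a document is either in the annotated relevant set or not) and no duplicate IDs in the ranked list; the source does not say whether the ground truth was graded, so treat that as an assumption. Function names and sample data are illustrative. A benchmark figure is the mean of these per-query values across all queries.

```python
import math


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=10):
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]   # hypothetical system output, best first
relevant = {"b", "d"}           # hypothetical annotated ground truth
mrr = mrr_at_k(ranked, relevant)
recall = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```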

Observations:

Based on production experience, the configuration delivering the best quality/cost ratio is:

Query
  ├── Retriever 1: dense BGE-M3 (top-100)
  └── Retriever 2: BM25 sparse (top-100)
        │
        ▼
RRF fusion → top-50 candidates
        │
        ▼
Cross-encoder rerank (BGE-reranker-v2-m3) → final top-10
        │
        ▼
LLM with citation grounding

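The flow has two distinct stages: cheap RRF fusion over the two top-100 lists, then an expensive deep rerank of the top-50 fused candidates with a cross-encoder. The cross-encoder cost therefore applies to only 50 documents rather than to every retrieved candidate (up to 200 after merging the two lists), and final top-10 quality is near optimal.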
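The two-stage flow can be sketched as follows. This is an illustration under stated assumptions, not the production code: two_stage_rerank and its parameters are hypothetical names, and score_fn stands in for whichever cross-encoder you deploy. To keep the example runnable without a model, a toy term-overlap scorer plays the cross-encoder's role.

```python
from collections import defaultdict


def two_stage_rerank(query, dense_hits, sparse_hits, docs, score_fn,
                     fuse_top=50, final_top=10, k=60):
    """Stage 1: RRF over both rankings. Stage 2: score only the survivors."""
    fused = defaultdict(float)
    for ranking in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    candidates = sorted(fused, key=lambda d: (-fused[d], d))[:fuse_top]

    # The expensive scorer runs on at most fuse_top documents.
    scored = [(doc_id, score_fn(query, docs[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda item: -item[1])
    return scored[:final_top]


def toy_overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(text.lower().split())) / len(q_terms)


docs = {  # hypothetical corpus
    "d1": "termination of a lease contract",
    "d2": "lease renewal terms",
    "d3": "tax filing deadlines",
}
result = two_stage_rerank(
    "lease termination",
    dense_hits=["d2", "d1", "d3"],
    sparse_hits=["d1", "d2"],
    docs=docs,
    score_fn=toy_overlap_score,
    fuse_top=2,
    final_top=1,
)
```

With fuse_top=2, document d3 is dropped by the fusion stage and never reaches the scorer, which is where the cost saving comes from.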

When to pick Cohere

Cohere Rerank is the right call when:

When to pick self-hosted BGE-reranker

BGE-reranker-v2-m3 is the right call when:

When RRF alone is enough

RRF alone works when:

In serious production, RRF alone leaves 5–8 NDCG points on the table compared to a reranker. If those matter, move to a reranker.

Operational traps

GPU batching. A cross-encoder is slow if you run one query at a time. Configure batch size 16–64 for efficient GPU utilisation. For individual queries with tight latency, use request coalescing.
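A minimal sketch of the batching idea, assuming a hypothetical score_batch_fn that takes a list of (query, document) pairs and returns one score per pair in a single forward pass. The fake model below only records batch sizes so the example runs without a GPU.

```python
def score_in_batches(pairs, score_batch_fn, batch_size=32):
    """Score (query, document) pairs in fixed-size batches, preserving order."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        scores.extend(score_batch_fn(batch))
    return scores


batch_sizes = []


def fake_model(batch):
    """Stand-in for a cross-encoder forward pass (not a real model)."""
    batch_sizes.append(len(batch))
    return [float(len(doc)) for _, doc in batch]


pairs = [("q", "x" * i) for i in range(70)]  # 70 hypothetical candidates
scores = score_in_batches(pairs, fake_model, batch_size=32)
print(batch_sizes)
```

Seventy candidates become three forward passes instead of seventy. Request coalescing applies the same idea across concurrent queries: pairs from several requests are collected for a few milliseconds and scored in one batch.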

Length truncation. Cross-encoder models have a context limit (typically 512 or 1024 tokens). Long documents must be truncated or split into windows. The naive strategy (keeping only the first 512 tokens) loses any relevant information in the tail.
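One way to avoid losing the tail is overlapping windows. The sketch below is illustrative: the function names are hypothetical, tokens are represented as a plain list, and taking the maximum window score as the document score is an assumption (other aggregations are possible). It also ignores the tokens the query itself consumes, which a real implementation would subtract from max_len.

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


def score_long_document(query, tokens, score_fn, max_len=512, stride=256):
    """Score every window and keep the best one (max-pooling, an assumption)."""
    return max(score_fn(query, w) for w in sliding_windows(tokens, max_len, stride))


tokens = list(range(1200))                              # hypothetical token IDs
windows = sliding_windows(tokens)
toy_score = lambda q, w: 1.0 if 1100 in w else 0.0      # relevant part is near the end
truncated_score = toy_score("q", tokens[:512])
windowed_score = score_long_document("q", tokens, toy_score)
```

In this toy case the relevant token sits near the end of the document, so plain truncation scores 0.0 while the windowed version finds it.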

Cache reranking scores. For repeating queries, (query, doc_id) scores can be cached. Internal corporate search systems typically see a 20–40% hit rate.
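A minimal in-memory sketch of such a cache, keyed by (query, doc_id). The class name RerankCache is hypothetical, lowercasing and stripping the query before lookup is an assumption about what counts as "the same query", and the unbounded dict would need eviction and invalidation (on document or model changes) in a real deployment.

```python
class RerankCache:
    """Memoise reranker scores per (normalised query, doc_id)."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # the expensive scorer being wrapped
        self._store = {}
        self.hits = 0
        self.misses = 0

    def score(self, query, doc_id, text):
        key = (query.strip().lower(), doc_id)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._score_fn(query, text)
        self._store[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = RerankCache(lambda q, t: float(len(t)))  # toy scorer, not a real model
cache.score("Lease terms", "d1", "abc")
cache.score("lease terms ", "d1", "abc")   # same query after normalisation: hit
cache.score("lease terms", "d2", "abcd")   # different document: miss
print(cache.hits, cache.misses)
```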

Calibration across retrievers. RRF assumes ranks are at similar scales. If one retriever typically returns 1000 candidates and another 10, the fusion is unbalanced. Cap top-K per retriever before RRF.

Decision diagram

Quality requirement?
  ├── NDCG > 0.90 mandatory → cross-encoder (Cohere or BGE)
  ├── NDCG 0.83-0.90 acceptable → RRF + selective cross-encoder
  └── NDCG > 0.80 sufficient → RRF alone

Regulated sector?
  ├── Yes → BGE self-hosted (data residency)
  └── No → Cohere fine for POC

Monthly volume?
  ├── < 50K → Cohere cheaper operationally
  ├── 50K-200K → break-even, depends on the team
  └── > 200K → BGE self-hosted wins decisively

Operational conclusion

In 2026, hybrid search is not a luxury. It is the industry standard for any corpus above 50K documents with varied queries. The reranker choice depends on real constraints (cost, sovereignty, operations), not on “what’s hot”.

For CAI Technology clients, the default configuration is RRF + self-hosted BGE-reranker-v2-m3, with a deviation to Cohere Rerank for fast POCs. This decision has saved clients tens of thousands of euros annually at volumes above 1M queries.
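As a rough sanity check on the savings claim, the per-query costs from the benchmark table (1.10 EUR vs 0.20 EUR per 1K queries) can be projected to a yearly figure. This sketch assumes the volume is monthly, as in the decision diagram, and that the table's per-1K figures are the only costs involved; any fixed infrastructure or staffing cost for self-hosting is not modelled because the article gives no number for it.

```python
def annual_cost_eur(queries_per_month, cost_per_1k_eur):
    """Yearly cost for a given monthly volume and per-1K-queries price."""
    return queries_per_month / 1000 * cost_per_1k_eur * 12


COHERE_PER_1K = 1.10  # EUR, from the benchmark table
BGE_PER_1K = 0.20     # EUR, from the benchmark table

volume = 1_000_000    # assumption: 1M queries per month
cohere_yearly = annual_cost_eur(volume, COHERE_PER_1K)
bge_yearly = annual_cost_eur(volume, BGE_PER_1K)
savings = cohere_yearly - bge_yearly
print(f"{savings:.0f} EUR per year")
```

Under these assumptions the difference at 1M queries per month is about 10,800 EUR per year, and it scales linearly with volume.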


Next step

For a reranker benchmark on your corpus, we can run the three approaches (RRF / Cohere / BGE) in parallel on 500 real queries and deliver a report in 2 weeks.
