Retrieval: Dense, Sparse, and Hybrid Search
When to use semantic search, BM25, and hybrid — and how to combine them
The Retrieval Problem
You have 50,000 chunks in your vector database. A user asks a question. You need to find the 5 most relevant chunks in milliseconds. The entire quality of your RAG system rests on getting this right.
There are two fundamentally different approaches to this problem, and they have different strengths and weaknesses. Dense search understands meaning. Sparse search matches exact terms. Neither is always better. This lesson shows you when each fails, and how to combine them.
Dense Retrieval (Semantic Search)
Dense retrieval is what most people mean when they say “RAG search.” You embed both the query and the chunks into vectors, then find chunks whose vectors are nearest to the query vector.
Why “dense”: The embedding vector has a fixed number of dimensions (1536 for text-embedding-3-small), and nearly all of them are non-zero. Contrast this with sparse vectors, where most dimensions are zero.
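To make "nearest to the query vector" concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The 3-dimensional vectors and chunk names are hypothetical stand-ins for real embeddings, and a real vector database would use an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          chunk_vecs: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    # Score every chunk against the query, then keep the k best
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
toy_chunks = {
    "account_recovery": [0.9, 0.1, 0.2],
    "billing_faq": [0.1, 0.9, 0.1],
    "login_issues": [0.8, 0.2, 0.3],
}
toy_query = [0.9, 0.1, 0.1]  # stands in for "How do I reset my password?"

nearest = top_k(toy_query, toy_chunks, k=2)
for chunk_id, score in nearest:
    print(f"{score:.3f}  {chunk_id}")
```

The two chunks whose vectors point in roughly the same direction as the query win, regardless of which words they contain.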
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Dense retrieval: embed query, find nearest vectors
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
results = retriever.invoke("How do I reset my password?")
# Returns chunks about "account recovery", "forgotten credentials",
# "login issues", even if those chunks don't contain "reset" or "password"

The superpower: Dense search understands that “reset my password” and “recover account access” mean the same thing. It finds conceptually relevant content even when there’s no word overlap.
The failure mode: Dense search cannot reliably find specific identifiers, proper nouns, technical terms, or rare phrases.
The Concrete Failure
Consider this query: “the RLHF paper from Ziegler et al. 2019”
A user is looking for a specific academic paper. Let’s trace what happens with dense search:
Query vector: "RLHF paper Ziegler et al. 2019"
Nearest neighbors by cosine similarity:
  0.82: "Reinforcement learning from human feedback has been shown to..."
  0.79: "Human feedback mechanisms in deep reinforcement learning..."
  0.77: "RLHF techniques demonstrated significant improvements over..."
  0.71: "Ziegler et al. proposed a novel approach to fine-tuning..."
  0.68: "Fine-tuning language models from human preferences (Ziegler et al., 2019)..."

The actual paper citation (“Fine-Tuning Language Models from Human Preferences, Ziegler et al., 2019”) comes in at rank 5. It sits behind three generic RLHF chunks that don’t mention Ziegler at all and a fourth that names Ziegler without the title or year. With k=3, the user never sees this citation.
Dense search optimizes for semantic proximity. When the query is asking for a specific thing — a name, a model number, a paper citation, an error code — semantic proximity is the wrong objective.
Sparse Retrieval (BM25)
BM25 (Best Matching 25) is the classic information retrieval algorithm that powered most search engines before neural embeddings existed, and it still powers many hybrid systems today.
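The core idea: BM25 represents documents as sparse vectors where each dimension corresponds to a term in the vocabulary. The value in each dimension is a TF-IDF-style weight: how important is this term in this document, relative to how common it is across all documents? BM25 refines plain TF-IDF in two ways: term frequency saturates (the tenth occurrence of a term adds far less than the first), and scores are normalized by document length so long documents don’t win on size alone.
A document with “Ziegler” gets a very high weight for that dimension (rare term, appears in this document). A document with “the” gets near-zero weight (common word, appears everywhere). For the query “Ziegler et al. 2019”, the rare terms “ziegler” and “2019” contribute the most, and the more common “et” and “al” contribute less. A document’s BM25 score is the sum of its weights for the query’s terms, which is effectively a dot product between the sparse document vector and a query vector with a 1 for each query term.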
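The scoring described above can be sketched from scratch in a few lines. This is an illustrative implementation of the common Okapi BM25 formula with a Lucene-style IDF; the parameter defaults (`k1=1.5`, `b=0.75`) and the toy documents are assumptions, and library implementations such as rank_bm25 may differ in how they compute IDF.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep only runs of word characters
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a high IDF, common terms a low one
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_docs = [
    "Reinforcement learning from human feedback has been shown to help",
    "Fine-tuning language models from human preferences (Ziegler et al., 2019)",
    "Troubleshooting authentication failures in the login service",
]
toy_scores = bm25_scores("Ziegler et al. 2019", toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best, [round(s, 3) for s in toy_scores])
```

Only the second document shares any tokens with the query, so it is the only one with a non-zero score.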
# pip install rank-bm25
import re

from rank_bm25 import BM25Okapi
from langchain_core.documents import Document


def tokenize(text: str) -> list[str]:
    # Lowercase and keep only word characters. Plain whitespace splitting
    # would turn "(Ziegler" and "2019)..." into tokens that never match
    # the query tokens "ziegler" and "2019".
    return re.findall(r"\w+", text.lower())


class BM25Retriever:
    def __init__(self, documents: list[Document]):
        self.documents = documents
        self.tokenized = [tokenize(doc.page_content) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)

    def get_relevant_documents(self, query: str, k: int = 5) -> list[Document]:
        scores = self.bm25.get_scores(tokenize(query))
        # Get top k indices by score
        top_k_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:k]
        return [self.documents[i] for i in top_k_indices]


# Build BM25 index over your chunks (a list of Document objects)
bm25_retriever = BM25Retriever(chunks)

# Now find Ziegler's paper
results = bm25_retriever.get_relevant_documents("Ziegler et al. 2019", k=5)
# Chunk 1: "Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)..."
# This exact match comes in at rank 1, exactly where it belongs

The superpower: BM25 is excellent at exact term matching. Names, version numbers, error codes, product SKUs, paper citations: anything where the exact tokens matter.
The failure mode: BM25 fails on paraphrase. If a user asks “how do I fix a broken login?” and your document says “troubleshooting authentication failures”, BM25 finds zero token overlap and gives this document a score of zero, even though it’s perfectly relevant.
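A quick way to see this failure is to compare token sets directly. The sketch below uses a hypothetical regex tokenizer. It also shows a related pitfall: a tokenizer that only splits on whitespace can miss exact matches when punctuation is attached to the terms.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase and keep only runs of word characters
    return set(re.findall(r"\w+", text.lower()))


# Paraphrase: relevant document, but no shared tokens at all
query = "how do I fix a broken login?"
document = "troubleshooting authentication failures"
shared = tokenize(query) & tokenize(document)
print(shared)  # empty set, so BM25 has nothing to score

# Punctuation pitfall: whitespace splitting leaves "(ziegler" and "2019)"
citation = "(Ziegler et al., 2019)"
citation_query = "Ziegler et al. 2019"
naive_overlap = set(citation.lower().split()) & set(citation_query.lower().split())
regex_overlap = tokenize(citation) & tokenize(citation_query)
print(naive_overlap)
print(regex_overlap)
```

With whitespace splitting, only "et" matches; stripping punctuation recovers all four terms.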
Hybrid Search: The Best of Both Worlds
The insight is simple: dense and sparse search fail in complementary ways. Dense fails on exact terms; sparse fails on paraphrases. Combine them, and you handle both cases.
The standard algorithm for combining ranked lists from two different systems is Reciprocal Rank Fusion (RRF):
RRF_score(document) = Σ 1 / (k + rank_in_each_list)

Where k is a smoothing constant, typically 60 (it is not the number of results to return). For each document, add up 1/(60 + its rank) from each retrieval system, with ranks starting at 1. Documents that rank highly in both systems get the highest combined score.
Why RRF works: A document ranked #1 in dense and #3 in sparse gets:
- From dense: 1/(60+1) = 0.0164
- From sparse: 1/(60+3) = 0.0159
- Total: 0.0323
A document ranked #100 in dense and #1 in sparse (exact term match, not semantically close):
- From dense: 1/(60+100) = 0.0063
- From sparse: 1/(60+1) = 0.0164
- Total: 0.0226 (summed before rounding)
The exact-term-match document gets boosted by its strong sparse performance even though dense search ranks it low.
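The arithmetic above is easy to check in code. This small helper is a sketch that takes the 1-based ranks a document received in each list; the function name is illustrative.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over the 1-based ranks a document received."""
    return sum(1.0 / (k + rank) for rank in ranks)


# Ranked #1 in dense and #3 in sparse
both_high = rrf_score([1, 3])

# Ranked #100 in dense and #1 in sparse
sparse_strong = rrf_score([100, 1])

# Appears only in the sparse list (at #1), so it has a single contribution
sparse_only = rrf_score([1])

print(f"{both_high:.4f} {sparse_strong:.4f} {sparse_only:.4f}")
```

Because each term is bounded by 1/(k + 1), a single top rank in one list is worth about half of the best possible two-list score, which keeps strong exact matches in contention.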
import re

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from rank_bm25 import BM25Okapi
from langchain_core.documents import Document


def tokenize(text: str) -> list[str]:
    # Lowercase and strip punctuation so "(Ziegler" matches "ziegler"
    return re.findall(r"\w+", text.lower())


def reciprocal_rank_fusion(
    ranked_lists: list[list[Document]],
    k: int = 60
) -> list[Document]:
    """
    Combine multiple ranked lists of documents using RRF.
    Returns documents sorted by combined RRF score (highest first).
    """
    scores: dict[str, float] = {}
    doc_map: dict[str, Document] = {}
    for ranked_list in ranked_lists:
        for rank, doc in enumerate(ranked_list):
            # Use the full chunk text as the identifier; a truncated prefix
            # would merge distinct chunks that share the same opening
            doc_id = doc.page_content
            if doc_id not in scores:
                scores[doc_id] = 0.0
                doc_map[doc_id] = doc
            # enumerate() is 0-based, so rank + 1 is the 1-based rank
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Sort by combined score
    sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [doc_map[doc_id] for doc_id in sorted_ids]


class HybridRetriever:
    def __init__(self, documents: list[Document], embeddings):
        self.documents = documents
        # Build dense (vector) retriever
        self.vectorstore = Chroma.from_documents(documents, embeddings)
        # Build sparse (BM25) retriever
        tokenized = [tokenize(doc.page_content) for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query: str, k: int = 5) -> list[Document]:
        # Dense retrieval: top 2k results (we'll fuse and take top k)
        dense_results = self.vectorstore.similarity_search(query, k=k*2)
        # Sparse retrieval: top 2k results
        bm25_scores = self.bm25.get_scores(tokenize(query))
        top_sparse_indices = sorted(
            range(len(bm25_scores)),
            key=lambda i: bm25_scores[i],
            reverse=True
        )[:k*2]
        sparse_results = [self.documents[i] for i in top_sparse_indices]
        # Fuse with RRF
        fused = reciprocal_rank_fusion([dense_results, sparse_results])
        return fused[:k]


# Usage (chunks is your list of Document objects)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
hybrid_retriever = HybridRetriever(chunks, embeddings)

# Test with an exact-term query
results = hybrid_retriever.retrieve("Ziegler et al. 2019 RLHF")
print("Hybrid results for exact citation query:")
for doc in results:
    print(f"  {doc.page_content[:150]}")

# Test with a semantic query
results = hybrid_retriever.retrieve("how do I fix broken login")
print("\nHybrid results for semantic query:")
for doc in results:
    print(f"  {doc.page_content[:150]}")

LangChain’s EnsembleRetriever
LangChain provides a built-in EnsembleRetriever that wraps this pattern:

# pip install rank-bm25
# Import paths as of LangChain 0.x; check them against your installed version
from langchain.retrievers import EnsembleRetriever
# Aliased so it doesn't clash with a hand-written BM25Retriever class
from langchain_community.retrievers import BM25Retriever as LCBm25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Set up dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Set up sparse retriever (LangChain's BM25Retriever)
sparse_retriever = LCBm25Retriever.from_documents(chunks)
sparse_retriever.k = 10

# Combine with configurable weights
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4]  # 60% semantic, 40% keyword matching
)
results = hybrid_retriever.invoke("Ziegler et al. 2019 RLHF paper")

Tuning the weights: The [0.6, 0.4] split favors semantic understanding. Adjust based on your use case:
- General knowledge Q&A: [0.7, 0.3] — semantic understanding dominates
- Technical documentation with specific terms: [0.5, 0.5] — equal weight
- Code search or exact-identifier lookup: [0.3, 0.7] — exact matching dominates
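To build intuition for what the weights do, here is a sketch of weighted rank fusion over plain chunk ids. It assumes each weight multiplies that retriever's 1/(c + rank) contribution, which is a common way to weight RRF; treat the exact behavior of EnsembleRetriever as something to verify against your installed version. The chunk ids and rankings are hypothetical.

```python
def weighted_rrf(ranked_lists: list[list[str]],
                 weights: list[float],
                 c: int = 60) -> list[tuple[str, float]]:
    # Each retriever's contribution is scaled by its weight
    scores: dict[str, float] = {}
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings for an exact-citation query
dense_ranking = ["generic_rlhf", "human_feedback", "ziegler_citation"]
sparse_ranking = ["ziegler_citation", "generic_rlhf", "human_feedback"]

semantic_heavy = weighted_rrf([dense_ranking, sparse_ranking], [0.7, 0.3])
keyword_heavy = weighted_rrf([dense_ranking, sparse_ranking], [0.3, 0.7])

print("semantic-heavy top-1:", semantic_heavy[0][0])
print("keyword-heavy top-1: ", keyword_heavy[0][0])
```

With the same two rankings, shifting weight toward the sparse retriever is enough to move the exact citation to the top.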
Benchmarking the Three Approaches
Here is a side-by-side comparison using two queries that stress different failure modes:
queries = [
    # Semantic query: no exact term overlap with document
    ("semantic", "how do I troubleshoot login problems"),
    # Exact-term query: specific proper noun
    ("exact", "Ziegler et al. 2019 fine-tuning paper"),
]
for query_type, query in queries:
    print(f"\n{'='*60}")
    print(f"Query type: {query_type}")
    print(f"Query: {query}")
    dense_results = dense_retriever.invoke(query)
    sparse_results = sparse_retriever.invoke(query)
    hybrid_results = hybrid_retriever.invoke(query)
    print(f"\nDense top-1: {dense_results[0].page_content[:120]}")
    print(f"Sparse top-1: {sparse_results[0].page_content[:120]}")
    print(f"Hybrid top-1: {hybrid_results[0].page_content[:120]}")

Expected pattern:
- Semantic query: Dense = excellent, Sparse = poor, Hybrid = excellent
- Exact-term query: Dense = poor, Sparse = excellent, Hybrid = excellent
Hybrid is the safe default: it rarely fails catastrophically on either query type.
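Eyeballing top-1 results only goes so far. If you can label even a handful of queries with the chunk that should come back, a hit-rate metric turns the comparison into a number. The sketch below is retriever-agnostic: it assumes a function that maps a query to a ranked list of chunk ids. The stand-in retrievers and their canned results are hypothetical, not measurements.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked chunk ids


def hit_rate_at_k(retrieve: Retriever,
                  labeled_queries: list[tuple[str, str]],
                  k: int = 5) -> float:
    # Fraction of queries whose expected chunk appears in the top k
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(labeled_queries)


# Hypothetical labeled set: (query, id of the chunk that should be found)
labeled = [
    ("how do I troubleshoot login problems", "auth_failures"),
    ("Ziegler et al. 2019 fine-tuning paper", "ziegler_citation"),
]

# Hypothetical canned results standing in for real retrievers
canned_dense = {
    labeled[0][0]: ["auth_failures", "account_recovery"],
    labeled[1][0]: ["generic_rlhf", "human_feedback"],
}
canned_sparse = {
    labeled[0][0]: ["billing_faq"],
    labeled[1][0]: ["ziegler_citation"],
}

dense_rate = hit_rate_at_k(lambda q: canned_dense[q], labeled, k=2)
sparse_rate = hit_rate_at_k(lambda q: canned_sparse[q], labeled, k=2)
print(f"dense hit@2: {dense_rate:.2f}, sparse hit@2: {sparse_rate:.2f}")
```

To use it with real retrievers, wrap each one in a function that returns a stable id per chunk, for example a metadata field you assign at indexing time.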
When to Use What
Use dense-only when:
- Your queries are always natural language questions
- Your documents are conversational or narrative prose
- Latency is critical and you can’t afford two retrievals
- Your users never search for specific identifiers or citations
Use sparse-only when:
- You’re doing legacy keyword search over structured content
- Your queries are mostly product codes, error messages, or exact phrases
- You need to minimize dependencies (no embedding API calls)
Use hybrid (the recommended default) when:
- You don’t know the distribution of user queries in advance
- Your corpus contains a mix of prose and technical content
- You want consistently good performance across query types
- You’re building a production system
The Performance Reality
Hybrid search is not free. You’re running two retrievals instead of one:
| Approach | Latency | Extra Cost | Quality |
|---|---|---|---|
| Dense only | ~10ms | — | Good on semantic queries |
| Sparse (BM25) only | ~1ms | — | Good on exact queries |
| Hybrid | ~12ms | None (BM25 is local) | Good on both |
The 2ms overhead from adding BM25 to a vector search is negligible. The quality improvement — especially eliminating the “specific citation” failure mode — is substantial. There is almost no reason not to use hybrid search in a production RAG system.
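Latency depends heavily on corpus size, hardware, and whether the vector store is local or remote, so it is worth measuring on your own setup. Here is a minimal timing sketch; the helper name and the placeholder retriever are illustrative assumptions.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[str], object],
                      queries: list[str],
                      repeats: int = 5) -> float:
    # Time every call individually and report the median in milliseconds
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def placeholder_retrieve(query: str) -> list[str]:
    # Stand-in for a real retriever call; replace with your own
    return query.lower().split()


latency = median_latency_ms(placeholder_retrieve,
                            ["how do I fix broken login",
                             "Ziegler et al. 2019 RLHF"])
print(f"median latency: {latency:.4f} ms")
```

Pass each retriever's query method (for example, a dense retriever's `invoke` and the hybrid retriever's) with the same query list to compare them on equal terms.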
In the next lesson, we’ll add the second major quality improvement: re-ranking the retrieved chunks to surface the most relevant ones.
