Retrieval: Dense, Sparse, and Hybrid Search

The Retrieval Problem

You have 50,000 chunks in your vector database. A user asks a question. You need to find the 5 most relevant chunks in milliseconds. The entire quality of your RAG system rests on getting this right.

There are two fundamentally different approaches to this problem, and their strengths are complementary. Dense search matches on meaning. Sparse search matches on exact terms. Neither is always better. This lesson shows you where each fails and how to combine them.

Dense Retrieval (Semantic Search)

Dense retrieval is what most people mean when they say “RAG search.” You embed both the query and the chunks into vectors, then find chunks whose vectors are nearest to the query vector.

Why “dense”: An embedding vector (1536 dimensions by default for text-embedding-3-small) has non-zero values in nearly every dimension. Contrast this with sparse vectors, where most dimensions are zero.
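To make "nearest to the query vector" concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The 4-dimensional vectors are made-up toy values standing in for real embeddings, and the helper names (cosine_similarity, nearest) are illustrative, not part of any library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values; real ones have hundreds of dimensions)
query_vector = [0.9, 0.1, 0.3, 0.2]
chunk_vectors = [
    [0.8, 0.2, 0.4, 0.1],  # points in nearly the same direction as the query
    [0.1, 0.9, 0.1, 0.7],  # points somewhere else entirely
    [0.7, 0.0, 0.2, 0.3],  # also close to the query
]

top = nearest(query_vector, chunk_vectors, k=2)
print(top)  # [0, 2]
```

A vector database does the same ranking, but typically with an approximate index so it does not have to compare the query against every stored vector.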

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Dense retrieval: embed query, find nearest vectors
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

results = retriever.invoke("How do I reset my password?")
# Can return chunks about "account recovery", "forgotten credentials", or
# "login issues", even when they contain neither "reset" nor "password"

The superpower: Dense search understands that “reset my password” and “recover account access” mean the same thing. It finds conceptually relevant content even when there’s no word overlap.

The failure mode: Dense search cannot reliably find specific identifiers, proper nouns, technical terms, or rare phrases.

The Concrete Failure

Consider this query: “the RLHF paper from Ziegler et al. 2019”

A user is looking for a specific academic paper. Let’s trace what happens with dense search:

Query: "the RLHF paper from Ziegler et al. 2019"

Nearest neighbors by cosine similarity (illustrative scores):
0.82: "Reinforcement learning from human feedback has been shown to..."
0.79: "Human feedback mechanisms in deep reinforcement learning..."
0.77: "RLHF techniques demonstrated significant improvements over..."
0.71: "Ziegler et al. proposed a novel approach to fine-tuning..."
0.68: "Fine-tuning language models from human preferences (Ziegler et al., 2019)..."

The actual paper citation (“Fine-Tuning Language Models from Human Preferences, Ziegler et al., 2019”) comes in at rank 5, behind three generic RLHF chunks that don’t mention Ziegler at all and a fourth that mentions Ziegler without citing the paper. With k=3, the user never sees the citation they asked for.

Dense search optimizes for semantic proximity. When the query is asking for a specific thing — a name, a model number, a paper citation, an error code — semantic proximity is the wrong objective.

Sparse Retrieval (BM25)

BM25 (Best Match 25) is the classic information retrieval algorithm that powered most search engines before neural embeddings existed, and it still powers many hybrid systems today.

The core idea: BM25 represents documents as sparse vectors where each dimension corresponds to a term in the vocabulary. The value in each dimension is a TF-IDF-style weight: how often the term appears in this document (with diminishing returns for repeats and a correction for document length), scaled by how rare the term is across all documents.

A document with “Ziegler” gets a very high weight for that dimension (rare term, appears in this document). A document with “the” gets near-zero weight (common word, appears everywhere). The query “Ziegler et al. 2019” becomes a vector with high weights for “ziegler” and “2019”, and lower weights for “et” and “al”. BM25 scores each document by summing the weights of the query terms it contains, which amounts to a dot product with this vector.
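The weighting described above can be sketched from scratch. This is one common BM25 variant (with a non-negative IDF), written for illustration only; the rank-bm25 library may differ in details, and the function name okapi_bm25_scores plus the toy documents are assumptions of this sketch.

```python
import math
from collections import Counter

def okapi_bm25_scores(
    query_tokens: list[str],
    tokenized_docs: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score every document against the query with a BM25-style formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # absent terms contribute nothing
            # Rare terms get a large IDF; terms found in most documents a small one
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 makes repeated terms saturate; b normalizes by document length
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "fine tuning language models from human preferences ziegler et al 2019".split(),
    "reinforcement learning from human feedback has been shown to help".split(),
    "the authors et al describe human feedback in the loop".split(),
]
scores = okapi_bm25_scores("ziegler et al 2019".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # 0: the only document containing the rare terms "ziegler" and "2019"
```

The second document shares no term with the query and scores exactly zero; the third matches only the common tokens "et" and "al" and lands in between.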

# pip install rank-bm25
from rank_bm25 import BM25Okapi
from langchain_core.documents import Document

def tokenize(text: str) -> list[str]:
    # Lowercase and turn punctuation into spaces, so "(Ziegler" and
    # "2019)..." match the query tokens "ziegler" and "2019".
    # Splitting on whitespace alone would leave the punctuation attached.
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()

class BM25Retriever:
    def __init__(self, documents: list[Document]):
        self.documents = documents
        self.tokenized = [tokenize(doc.page_content) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)
    
    def get_relevant_documents(self, query: str, k: int = 5) -> list[Document]:
        query_tokens = tokenize(query)
        scores = self.bm25.get_scores(query_tokens)
        
        # Get top k indices by score
        top_k_indices = sorted(
            range(len(scores)), 
            key=lambda i: scores[i], 
            reverse=True
        )[:k]
        
        return [self.documents[i] for i in top_k_indices]

# Build BM25 index over your chunks
bm25_retriever = BM25Retriever(chunks)

# Now find Ziegler's paper
results = bm25_retriever.get_relevant_documents("Ziegler et al. 2019", k=5)
# Chunk 1: "Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)..."
# This exact match comes in at rank 1 — exactly where it belongs

The superpower: BM25 is excellent at exact term matching. Names, version numbers, error codes, product SKUs, paper citations — anything where the exact tokens matter.

The failure mode: BM25 fails on paraphrase. If a user asks “how do I fix a broken login?” and your document says “troubleshooting authentication failures”, the two share no tokens, so BM25 scores the document at zero even though it’s perfectly relevant.
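Both behaviors come down to token overlap, which is easy to inspect directly. The sketch below uses a hypothetical tokenize helper (lowercase, keep runs of letters and digits) and also shows why splitting on whitespace alone is risky for citation-style text.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and keep only runs of letters and digits (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def shared_terms(query: str, document: str) -> set[str]:
    """Terms present in both the query and the document after tokenization."""
    return set(tokenize(query)) & set(tokenize(document))

citation_chunk = "Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019)"

# Paraphrase: relevant document, but nothing for a term matcher to hold on to
paraphrase_overlap = shared_terms(
    "how do I fix a broken login?",
    "Troubleshooting authentication failures",
)

# Exact terms: every query token appears in the citation chunk
citation_overlap = shared_terms("Ziegler et al. 2019", citation_chunk)

# Whitespace-only splitting leaves punctuation attached ("(ziegler", "2019)"),
# so most of the exact matches are silently lost
naive_overlap = set("Ziegler et al. 2019".lower().split()) & set(citation_chunk.lower().split())

print(paraphrase_overlap)  # set()
print(sorted(citation_overlap))  # ['2019', 'al', 'et', 'ziegler']
print(naive_overlap)  # {'et'}
```

Whatever tokenizer you choose, apply the same one to documents and queries; a mismatch between the two sides reduces overlap just as punctuation does.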

Hybrid Search: The Best of Both Worlds

The insight is simple: dense and sparse search fail in complementary ways. Dense fails on exact terms; sparse fails on paraphrases. Combine them, and you handle both cases.

The standard algorithm for combining ranked lists from two different systems is Reciprocal Rank Fusion (RRF):

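RRF_score(document) = Σ over retrieval systems of 1 / (k + rank of the document in that system's list)

Where k is a smoothing constant, typically 60 (it is unrelated to the top-k cutoff used elsewhere in this lesson), and ranks start at 1. For each document, add up 1/(60 + its rank) from each retrieval system; a system that did not return the document contributes nothing. Documents that rank highly in both systems get the highest combined score.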

Why RRF works: A document ranked #1 in dense and #3 in sparse gets:

From dense: 1/(60+1) = 0.0164
From sparse: 1/(60+3) = 0.0159
Total: 0.0323

A document ranked #100 in dense and #1 in sparse (exact term match, not semantically close):

From dense: 1/(60+100) = 0.00625
From sparse: 1/(60+1) = 0.01639
Total: 0.02264

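The exact-term-match document still earns a competitive score from its strong sparse rank, even though dense search ranks it low. Note that a retriever only contributes for documents it actually returns: with the top-2k cutoff used in the code below, a chunk ranked #100 by dense search would get its whole score from the sparse list (0.0164).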
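The arithmetic above is small enough to check directly, and doing so also shows what the constant k controls. This is a minimal sketch; rrf_score is a hypothetical helper that takes the 1-based ranks a document earned in each list.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over the 1-based ranks a document earned in each list."""
    return sum(1.0 / (k + rank) for rank in ranks)

# The two worked examples: (#1 dense, #3 sparse) and (#100 dense, #1 sparse)
both_strong = rrf_score([1, 3])
exact_only = rrf_score([100, 1])
print(round(both_strong, 4), round(exact_only, 4))  # 0.0323 0.0226

# Effect of k: document A is ranked (1, 20), document B is ranked (5, 5).
# With k=60 the consistently good B wins; with a tiny k the single #1 dominates.
a_default, b_default = rrf_score([1, 20]), rrf_score([5, 5])
a_small_k, b_small_k = rrf_score([1, 20], k=1), rrf_score([5, 5], k=1)
print(b_default > a_default, a_small_k > b_small_k)  # True True
```

A larger k flattens the difference between neighboring ranks, so agreement between retrievers matters more than any single top position.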

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from rank_bm25 import BM25Okapi
from langchain_core.documents import Document

def reciprocal_rank_fusion(
    ranked_lists: list[list[Document]], 
    k: int = 60
) -> list[Document]:
    """
    Combine multiple ranked lists of documents using RRF.
    Returns documents sorted by combined RRF score (highest first).
    """
    scores: dict[str, float] = {}
    doc_map: dict[str, Document] = {}
    
    for ranked_list in ranked_lists:
        for rank, doc in enumerate(ranked_list):
            # Key on the full chunk text; a short prefix could collide
            # for chunks that share an opening (e.g. a repeated header)
            doc_id = doc.page_content
            
            if doc_id not in scores:
                scores[doc_id] = 0.0
                doc_map[doc_id] = doc
            
            scores[doc_id] += 1.0 / (k + rank + 1)
    
    # Sort by combined score
    sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [doc_map[doc_id] for doc_id in sorted_ids]


class HybridRetriever:
    def __init__(self, documents: list[Document], embeddings):
        self.documents = documents
        
        # Build dense (vector) retriever
        self.vectorstore = Chroma.from_documents(documents, embeddings)
        
        # Build sparse (BM25) retriever
        tokenized = [self._tokenize(doc.page_content) for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
    
    @staticmethod
    def _tokenize(text: str) -> list[str]:
        # Lowercase and turn punctuation into spaces, so "(Ziegler" and
        # "2019)..." match the query tokens "ziegler" and "2019"
        return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()
    
    def retrieve(self, query: str, k: int = 5) -> list[Document]:
        # Dense retrieval: top 2k results (we'll fuse and take top k)
        dense_results = self.vectorstore.similarity_search(query, k=k*2)
        
        # Sparse retrieval: top 2k results
        query_tokens = self._tokenize(query)
        bm25_scores = self.bm25.get_scores(query_tokens)
        top_sparse_indices = sorted(
            range(len(bm25_scores)),
            key=lambda i: bm25_scores[i],
            reverse=True
        )[:k*2]
        sparse_results = [self.documents[i] for i in top_sparse_indices]
        
        # Fuse with RRF
        fused = reciprocal_rank_fusion([dense_results, sparse_results])
        return fused[:k]


# Usage
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
hybrid_retriever = HybridRetriever(chunks, embeddings)

# Test with an exact-term query
results = hybrid_retriever.retrieve("Ziegler et al. 2019 RLHF")
print("Hybrid results for exact citation query:")
for doc in results:
    print(f"  {doc.page_content[:150]}")

# Test with a semantic query
results = hybrid_retriever.retrieve("how do I fix broken login")
print("\nHybrid results for semantic query:")
for doc in results:
    print(f"  {doc.page_content[:150]}")

LangChain’s EnsembleRetriever

LangChain provides a built-in EnsembleRetriever that wraps this pattern:

# Import paths may differ across LangChain versions
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever as LCBm25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Set up dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Set up sparse retriever (LangChain's BM25Retriever)
# pip install rank-bm25
# The default preprocessing only splits on whitespace, so pass a tokenizer
# that also lowercases and strips punctuation
def tokenize(text: str) -> list[str]:
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()

sparse_retriever = LCBm25Retriever.from_documents(chunks, preprocess_func=tokenize)
sparse_retriever.k = 10

# Combine with configurable weights
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4]   # 60% semantic, 40% keyword matching
)

results = hybrid_retriever.invoke("Ziegler et al. 2019 RLHF paper")

Tuning the weights: The [0.6, 0.4] split favors semantic understanding. Treat the values below as starting points and validate them against your own queries:

General knowledge Q&A: [0.7, 0.3] — semantic understanding dominates
Technical documentation with specific terms: [0.5, 0.5] — equal weight
Code search or exact-identifier lookup: [0.3, 0.7] — exact matching dominates
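To see how weights shift the fused ranking, here is a minimal sketch of weighted rank fusion over plain chunk ids. It assumes each retriever's contribution is weight / (c + rank) with c = 60, which mirrors the RRF formula from earlier in this lesson; the function weighted_rrf and the chunk ids are made up for illustration and are not LangChain's internals.

```python
def weighted_rrf(
    ranked_lists: list[list[str]],
    weights: list[float],
    c: int = 60,
) -> list[str]:
    """Fuse ranked lists of ids; each list's contribution is scaled by its weight."""
    scores: dict[str, float] = {}
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings for a citation-style query
dense = ["generic_rlhf", "human_feedback", "ziegler_citation"]
sparse = ["ziegler_citation", "ziegler_mention", "generic_rlhf"]

semantic_heavy = weighted_rrf([dense, sparse], [0.7, 0.3])
keyword_heavy = weighted_rrf([dense, sparse], [0.3, 0.7])

print(semantic_heavy[0])  # generic_rlhf
print(keyword_heavy[0])   # ziegler_citation
```

The same two input rankings produce a different top result depending only on the weights, which is why the split should follow the kind of queries your users actually send.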

Benchmarking the Three Approaches

Here is a side-by-side comparison using two queries that stress different failure modes:

queries = [
    # Semantic query — no exact term overlap with document
    ("semantic", "how do I troubleshoot login problems"),
    # Exact-term query — specific proper noun
    ("exact", "Ziegler et al. 2019 fine-tuning paper"),
]

for query_type, query in queries:
    print(f"\n{'='*60}")
    print(f"Query type: {query_type}")
    print(f"Query: {query}")
    
    dense_results = dense_retriever.invoke(query)
    sparse_results = sparse_retriever.invoke(query)
    hybrid_results = hybrid_retriever.invoke(query)
    
    print(f"\nDense  top-1: {dense_results[0].page_content[:120]}")
    print(f"Sparse top-1: {sparse_results[0].page_content[:120]}")
    print(f"Hybrid top-1: {hybrid_results[0].page_content[:120]}")

Expected pattern:

Semantic query: Dense = excellent, Sparse = poor, Hybrid = excellent
Exact-term query: Dense = poor, Sparse = excellent, Hybrid = excellent

Hybrid is the safe default: it rarely fails catastrophically on either query type.
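Eyeballing top-1 results works for two queries, but a small labeled set gives a number you can track. The sketch below computes a hit rate at k; the function name, the chunk ids, and the stand-in retrievers are all hypothetical, chosen so the example runs without an index or API key.

```python
from typing import Callable

def hit_rate_at_k(
    retrieve: Callable[[str], list[str]],
    labeled_queries: list[tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected chunk id appears in the top k results."""
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(labeled_queries)

# Each entry pairs a query with the id of the chunk that should be retrieved
labeled = [
    ("how do I troubleshoot login problems", "auth_troubleshooting"),
    ("Ziegler et al. 2019 fine-tuning paper", "ziegler_citation"),
]

# Stand-in retrievers returning fixed id lists (replace with real ones)
def fake_dense(query: str) -> list[str]:
    return ["auth_troubleshooting", "generic_rlhf"]

def fake_hybrid(query: str) -> list[str]:
    return ["auth_troubleshooting", "ziegler_citation"]

dense_rate = hit_rate_at_k(fake_dense, labeled, k=2)
hybrid_rate = hit_rate_at_k(fake_hybrid, labeled, k=2)
print(dense_rate, hybrid_rate)  # 0.5 1.0
```

With real retrievers, wrap each one in a function that maps a query to a list of chunk ids, for example by reading an id you stored in each document's metadata when indexing.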

When to Use What

Use dense-only when:

Your queries are always natural language questions
Your documents are conversational or narrative prose
Latency is critical and you can’t afford two retrievals
Your users never search for specific identifiers or citations

Use sparse-only when:

You’re doing legacy keyword search over structured content
Your queries are mostly product codes, error messages, or exact phrases
You need to minimize dependencies (no embedding API calls)

Use hybrid (the recommended default) when:

You don’t know the distribution of user queries in advance
Your corpus contains a mix of prose and technical content
You want consistently good performance across query types
You’re building a production system

The Performance Reality

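Hybrid search is not free. You’re running two retrievals instead of one. The figures below are rough, illustrative numbers, not measurements; actual latency depends on corpus size, the vector index, and whether the query embedding is computed locally or through an API call: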

Approach	Latency	Extra Cost	Quality
Dense only	~10ms	—	Good on semantic queries
Sparse (BM25) only	~1ms	—	Good on exact queries
Hybrid	~12ms	None (BM25 is local)	Good on both

The 2ms overhead from adding BM25 to a vector search is negligible. The quality improvement — especially eliminating the “specific citation” failure mode — is substantial. There is almost no reason not to use hybrid search in a production RAG system.
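Latency depends heavily on your corpus and infrastructure, so it is worth measuring on your own setup. A minimal timing sketch, assuming any callable that takes a query string; median_latency_ms is a hypothetical helper, and the lambda at the end is only a placeholder for a real retriever call.

```python
import statistics
import time
from typing import Callable

def median_latency_ms(
    fn: Callable[[str], object],
    queries: list[str],
    repeats: int = 5,
) -> float:
    """Median wall-clock time per call in milliseconds over queries x repeats."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

test_queries = [
    "how do I troubleshoot login problems",
    "Ziegler et al. 2019 fine-tuning paper",
]

# Placeholder workload; swap in e.g. a function that calls your retriever
latency = median_latency_ms(lambda q: q.lower().split(), test_queries)
print(f"median latency: {latency:.3f} ms")
```

Run it once per retriever with the same queries; the median is less sensitive than the mean to a slow first call or an occasional network stall.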

In the next lesson, we’ll add the second major quality improvement: re-ranking the retrieved chunks to surface the most relevant ones.
