Advanced RAG: HyDE, Query Expansion, and Self-RAG

The Quality Gap Between Good and Excellent

A baseline RAG system (recursive chunking + dense retrieval + GPT-4o-mini) typically achieves RAGAS scores around:

Faithfulness: 0.82
Answer Relevancy: 0.78
Context Precision: 0.61

That’s decent — better than nothing, useful for many internal tools. But it has clear failure modes:

Terse queries: User types “SSO setup?” — too ambiguous for good retrieval
Asymmetry problem: The question “How do I configure SSO?” has a different embedding signature than a document section titled “Single Sign-On Configuration Walkthrough”
Wrong answers that sound right: The answer sounds confident but is poorly grounded

This lesson covers three techniques that can push those scores toward 0.90 or higher. Each adds some complexity, so each is worth understanding before you decide whether to implement it.

Technique 1: HyDE — Hypothetical Document Embeddings

The Asymmetry Problem

Here’s a subtle issue with standard RAG: the question and the answer live in different parts of embedding space.

When a user asks “How do I configure SSO?”, that’s a question. The relevant document chunk says “To configure SSO, navigate to Settings > Security and enter your Identity Provider metadata.” That’s an answer.

Questions and answers have different linguistic structures. “How do I…?” has a different embedding signature than “To configure…”. This means the query vector and the relevant chunk vector are not as close as they could be, even though they’re about the same topic.

HyDE’s solution: Instead of embedding the question, generate a hypothetical answer and embed that instead.

The Intuition

Imagine a librarian’s research technique. You come in and ask: “I’m looking for information about configuring SSO.” The librarian doesn’t just search for “configuring SSO.” Instead, they think: “An article about this would probably say something like ‘Single Sign-On (SSO) configuration requires an Identity Provider URL, a certificate, and attribute mappings…’ Let me search for articles that sound like that.”

The hypothetical summary the librarian constructed is much closer in embedding space to real documentation than the original question.
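
The asymmetry can be checked before adopting HyDE. The sketch below compares how close a documentation chunk is to the raw question versus a hypothetical answer. It is only an illustration: `make_toy_embed` is a hypothetical bag-of-words stand-in so the example runs offline, and it does not behave like a dense embedding model. With a real model you would pass an embedding function such as `embeddings.embed_query` instead.

```python
import math
import re
from collections import Counter
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_toy_embed(vocab: list[str]) -> Callable[[str], list[float]]:
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    def embed(text: str) -> list[float]:
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        return [float(counts[word]) for word in vocab]
    return embed


def compare(embed: Callable[[str], list[float]],
            question: str, hypothetical: str, chunk: str) -> dict[str, float]:
    """Similarity of the chunk to the raw question and to the hypothetical answer."""
    chunk_vec = embed(chunk)
    return {
        "question": cosine(embed(question), chunk_vec),
        "hypothetical": cosine(embed(hypothetical), chunk_vec),
    }


question = "How do I configure SSO?"
hypothetical = ("To configure SSO, open Settings, go to Security, "
                "and enter the Identity Provider metadata.")
chunk = ("To configure SSO, navigate to Settings > Security "
         "and enter your Identity Provider metadata.")

vocab = sorted(set(re.findall(r"[a-z]+", f"{question} {hypothetical} {chunk}".lower())))
scores = compare(make_toy_embed(vocab), question, hypothetical, chunk)
print(scores)
```

In this toy setup the hypothetical answer scores much closer to the chunk than the question does. Whether the gap is as large with a real embedding model depends on the model and the corpus, so it is worth running the same comparison on a sample of your own queries.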

Implementation

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Setup
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)  # slight temperature for variety
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Step 1: Generate a hypothetical answer
hyde_prompt = ChatPromptTemplate.from_template("""
Generate a concise, informative paragraph that would appear in technical documentation 
and directly answer the following question. 
Write as if you are the documentation — authoritative and specific.
Do not say "the documentation says" — just write the content directly.

Question: {question}

Documentation excerpt:""")

hyde_chain = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str, k: int = 5) -> list:
    """Use HyDE: embed a hypothetical answer instead of the question."""
    
    # Generate hypothetical answer
    hypothetical_answer = hyde_chain.invoke({"question": question})
    print(f"Hypothetical answer: {hypothetical_answer[:200]}...")
    
    # Embed the hypothetical answer (not the question!)
    results = vectorstore.similarity_search(
        hypothetical_answer,  # key difference from standard RAG
        k=k
    )
    
    return results

# Compare standard vs HyDE retrieval
question = "What happens when I exceed rate limits?"

print("=== Standard Retrieval ===")
standard_results = vectorstore.similarity_search(question, k=3)
for doc in standard_results:
    print(f"  {doc.page_content[:150]}")

print("\n=== HyDE Retrieval ===")
hyde_results = hyde_retrieve(question, k=3)
for doc in hyde_results:
    print(f"  {doc.page_content[:150]}")

When HyDE Helps Most

HyDE provides the largest improvement when:

Queries are short and terse (“SSO config?”, “rate limits?”)
The domain has specialized jargon — the hypothetical answer uses the same terminology as the documents
Your documents are structured like documentation (declarative, third-person prose)

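HyDE adds one LLM call per query (~$0.0001 at GPT-4o-mini prices). The latency of that call depends on the model and on the length of the generated paragraph, so treat a fixed figure such as ~50ms as a best case and measure it in your own stack. For documentation-style corpora the trade is usually worth making.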

Technique 2: Query Expansion

The Ambiguity Problem

A user types “can’t connect.” To what? The database? The API? The network? Their VPN?

A single query vector for “can’t connect” will land somewhere in the middle of all these meanings and retrieve chunks that are mediocre for all of them.

Query expansion generates multiple reformulations of the query, retrieves for each, then merges the results. You’re widening the search net across multiple semantic angles.

Implementation

import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

expansion_prompt = ChatPromptTemplate.from_template("""
Generate {n} different phrasings of the following question.
Each rephrasing should capture a different interpretation or emphasis.
Return only the questions, one per line, no numbering.

Original question: {question}

Alternative phrasings:""")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)
expansion_chain = expansion_prompt | llm | StrOutputParser()

def expand_query(question: str, n: int = 4) -> list[str]:
    """Generate multiple query reformulations."""
    response = expansion_chain.invoke({"question": question, "n": n})
    alternatives = [q.strip() for q in response.strip().split('\n') if q.strip()]
    return [question] + alternatives[:n]  # include original + n alternatives

def deduplicate_docs(doc_lists: list[list]) -> list:
    """Remove duplicate documents across multiple retrieval results."""
    seen_content = set()
    unique_docs = []
    
    for doc_list in doc_lists:
        for doc in doc_list:
            content_key = doc.page_content  # full text as the key; a 100-char prefix can collide when chunks share a header
            if content_key not in seen_content:
                seen_content.add(content_key)
                unique_docs.append(doc)
    
    return unique_docs

def query_expansion_retrieve(question: str, k: int = 5) -> list:
    """
    Retrieve with query expansion:
    1. Generate multiple query variants
    2. Retrieve for each variant
    3. Deduplicate and return top results
    """
    # Generate query variants
    queries = expand_query(question, n=3)
    print("Expanded queries:")
    for q in queries:
        print(f"  - {q}")
    
    # Retrieve for each query
    all_results = []
    for query in queries:
        results = vectorstore.similarity_search(query, k=k)
        all_results.append(results)
    
    # Deduplicate
    unique_results = deduplicate_docs(all_results)
    
    # Return up to 2*k unique candidates for re-ranking
    # (ordered by query first, then by rank within each query's results)
    return unique_results[:k * 2]

# Example
question = "can't connect"
results = query_expansion_retrieve(question, k=5)

# With n=3, the expanded queries are the original plus three variants, for example:
# - "can't connect"
# - "connection failure troubleshooting"
# - "how to resolve connection errors"
# - "troubleshooting network connectivity issues"
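
Deduplicating in query order keeps every chunk but throws away one useful signal: a chunk retrieved by several variants is probably more relevant than one retrieved by a single variant. A common alternative for the merge step is reciprocal rank fusion (RRF), which is a different technique from the simple deduplication used above. The sketch below works on plain document IDs so it runs standalone; the `rrf_merge` name, the sample IDs, and the constant `k=60` (a conventional default) are illustrative assumptions.

```python
from collections import defaultdict


def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs with reciprocal rank fusion.

    Each appearance of a document contributes 1 / (k + rank), so documents
    that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # sorted() is stable, so ties keep first-seen order
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up retrieval results for three query variants
results_per_query = [
    ["vpn-setup", "db-timeouts", "api-auth"],
    ["db-timeouts", "firewall-rules", "vpn-setup"],
    ["db-timeouts", "api-auth", "dns-errors"],
]

merged = rrf_merge(results_per_query)
print(merged)
```

Here `db-timeouts` comes first because all three variants retrieved it. To apply the same idea to LangChain documents, you would need a stable ID per chunk, for example a metadata field or a hash of `page_content`.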

The Async Version (For Production Throughput)

Running multiple retrieval queries sequentially adds latency. For production, run them in parallel:

async def query_expansion_retrieve_async(question: str, k: int = 5) -> list:
    """Parallel query expansion retrieval."""
    # expand_query makes a blocking LLM call, so run it in a worker thread
    queries = await asyncio.to_thread(expand_query, question, 3)
    
    async def retrieve_single(query: str) -> list:
        # Most vector stores are synchronous, so run the search in a worker thread
        return await asyncio.to_thread(vectorstore.similarity_search, query, k=k)
    
    # Run all retrievals concurrently
    all_results = await asyncio.gather(*[retrieve_single(q) for q in queries])
    
    unique_results = deduplicate_docs(list(all_results))
    return unique_results[:k * 2]

# Usage
results = asyncio.run(query_expansion_retrieve_async("can't connect"))

Parallel retrieval reduces retrieval latency from roughly n×T to roughly T (the time of a single retrieval). The expansion step itself still costs one LLM call before retrieval can start, so query expansion is cheap in wall-clock terms but not free.
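
The n×T versus T difference is easy to observe with a stand-in for the vector store. The sketch below is an assumption-laden toy: `fake_search` is a hypothetical blocking function that sleeps for 50ms in place of a real `similarity_search` call, and the queries are made up.

```python
import asyncio
import time


def fake_search(query: str, delay: float = 0.05) -> list[str]:
    """Hypothetical stand-in for a blocking vector store search."""
    time.sleep(delay)
    return [f"result for {query}"]


async def retrieve_all(queries: list[str]) -> list[list[str]]:
    """Run each blocking search in a worker thread and wait for all of them."""
    return await asyncio.gather(*(asyncio.to_thread(fake_search, q) for q in queries))


queries = ["can't connect", "connection refused", "vpn timeout", "database unreachable"]

# Sequential: total time grows with the number of queries
start = time.perf_counter()
sequential = [fake_search(q) for q in queries]
sequential_s = time.perf_counter() - start

# Concurrent: total time is close to a single search
start = time.perf_counter()
parallel = asyncio.run(retrieve_all(queries))
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.3f}s, parallel: {parallel_s:.3f}s")
```

`asyncio.gather` returns results in the order the awaitables were passed, so the merged output is the same in both cases. The speedup with a real vector store depends on whether the client and the server actually handle concurrent requests.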

Technique 3: Self-RAG

The Verification Problem

Standard RAG answers the question once and stops. But what if the retrieved context doesn’t actually contain the answer? The LLM will either say “I don’t know” (good) or confabulate something (bad).

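Self-RAG adds a verification loop: after generating an answer, ask the model to evaluate its own answer. If the answer isn’t well-grounded in the context, retrieve again with a refined query. The version in this lesson is a prompt-based approximation of the idea; the original Self-RAG method fine-tunes the model to emit special reflection tokens instead.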

The Self-RAG Loop

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Grounding check prompt
grounding_check_prompt = ChatPromptTemplate.from_template("""
You are a fact-checking assistant. Your job is to determine if an answer is well-grounded 
in the provided context.

Context:
{context}

Question: {question}

Generated Answer: {answer}

Evaluation:
1. Does the answer rely only on information from the context? (Yes/No)
2. Is every specific claim in the answer supported by the context? (Yes/No)
3. If no to either: what information is the answer claiming that isn't in the context?

Verdict (respond with exactly "GROUNDED" or "NOT_GROUNDED"):""")

grounding_chain = grounding_check_prompt | llm | StrOutputParser()

# Query refinement prompt (used when answer is not grounded)
refinement_prompt = ChatPromptTemplate.from_template("""
The following answer to a question was not well-supported by the retrieved context.

Original question: {question}
Poor answer: {answer}
Missing information: {missing_info}

Write a more specific search query that would retrieve the missing information:""")

refinement_chain = refinement_prompt | llm | StrOutputParser()

def self_rag(
    question: str, 
    vectorstore: Chroma,
    max_iterations: int = 3
) -> dict:
    """
    Self-RAG loop: generate → verify → refine → repeat if needed.
    Returns the final answer with iteration count.
    """
    current_query = question
    
    for iteration in range(max_iterations):
        print(f"\nIteration {iteration + 1}: Query = '{current_query}'")
        
        # Retrieve
        docs = vectorstore.similarity_search(current_query, k=5)
        context = "\n\n".join(doc.page_content for doc in docs)
        
        # Generate answer
        answer_prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context. Be specific and cite details from the context.
If the context doesn't contain the answer, say "The context does not contain this information."

Context: {context}
Question: {question}
Answer:""")
        
        answer = (answer_prompt | llm | StrOutputParser()).invoke({
            "context": context,
            "question": question
        })
        
        print(f"Answer: {answer[:200]}...")
        
        # Check grounding
        grounding_result = grounding_chain.invoke({
            "context": context,
            "question": question,
            "answer": answer
        })
        
        print(f"Grounding check: {grounding_result[:100]}")
        
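        # "NOT_GROUNDED" contains "GROUNDED", so rule out the negative verdict first
        if "NOT_GROUNDED" not in grounding_result and "GROUNDED" in grounding_result: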
            print(f"Answer grounded on iteration {iteration + 1}")
            return {
                "answer": answer,
                "iterations": iteration + 1,
                "final_query": current_query,
                "sources": [doc.metadata.get('source') for doc in docs]
            }
        
        # Answer not grounded — refine the query
        # Extract what's missing from the grounding check
        missing_info = grounding_result.split("NOT_GROUNDED")[0].strip()
        
        current_query = refinement_chain.invoke({
            "question": question,
            "answer": answer,
            "missing_info": missing_info
        })
        
        print(f"Refined query: {current_query}")
    
    # Return the last answer after max iterations (it did not pass the grounding check)
    print(f"Reached max iterations ({max_iterations})")
    return {
        "answer": answer,
        "iterations": max_iterations,
        "final_query": current_query,
        "sources": [doc.metadata.get('source') for doc in docs],
        "warning": "Max iterations reached; answer may not be fully grounded"
    }

# Usage
result = self_rag(
    question="What is the SLA for the Enterprise tier?",
    vectorstore=vectorstore
)

print(f"\nFinal answer: {result['answer']}")
print(f"Iterations needed: {result['iterations']}")
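
One detail deserves care when reading the verdict: the string "NOT_GROUNDED" contains "GROUNDED", so a plain substring test for "GROUNDED" accepts both verdicts. The sketch below shows one way to parse the verdict more defensively. The `parse_verdict` helper is a hypothetical name, and treating a missing verdict as not grounded is a conservative assumption, not a rule from any library.

```python
import re


def parse_verdict(grounding_result: str) -> bool:
    """Return True only if the last verdict token in the text is GROUNDED.

    A response with no verdict at all is treated as not grounded.
    """
    verdicts = re.findall(r"\b(NOT_GROUNDED|GROUNDED)\b", grounding_result.upper())
    return bool(verdicts) and verdicts[-1] == "GROUNDED"


# The pitfall: a naive substring check accepts the negative verdict
print("GROUNDED" in "NOT_GROUNDED")

print(parse_verdict("1. Yes\n2. Yes\nVerdict: GROUNDED"))
print(parse_verdict("1. No\n2. No\n3. The answer cites an SLA figure.\nNOT_GROUNDED"))
print(parse_verdict("I am not sure."))
```

Taking the last verdict token matters because the model may mention either word while reasoning through the numbered evaluation questions before it states its final verdict.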

Self-RAG Trade-offs

Aspect	Standard RAG	Self-RAG
Latency	~300ms	~600-1500ms (per iteration)
LLM calls	1	2 or more (2-3 per iteration)
Faithfulness	0.82	~0.91
Cost	Low	2-5x higher
Best use case	Real-time chat	High-stakes queries

Self-RAG is most valuable for high-stakes applications where wrong answers have real consequences: medical information systems, legal research tools, financial advice platforms. For casual question-answering where latency matters more, stick with standard RAG or HyDE.

Combining All Three: The Advanced Pipeline

Here is a pipeline that combines all three techniques:

async def advanced_rag_pipeline(
    question: str,
    vectorstore,
    use_hyde: bool = True,
    use_expansion: bool = True,
    use_self_rag: bool = False  # enable for high-stakes queries
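) -> str:
    """Full advanced RAG pipeline with configurable techniques.

    Note: the LLM and vector store calls below are synchronous, so this
    coroutine blocks the event loop while they run.
    """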
    
    # Step 1: HyDE — Generate hypothetical answer for better query embedding
    if use_hyde:
        hypothetical = hyde_chain.invoke({"question": question})
        retrieval_query = hypothetical
    else:
        retrieval_query = question
    
    # Step 2: Query expansion — multiple retrieval angles
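    if use_expansion:
        # Rephrase the original question: the expansion prompt expects a question,
        # not the paragraph that HyDE produces
        queries = expand_query(question, n=3)
        if use_hyde:
            queries.insert(0, retrieval_query)  # also search with the hypothetical answer
        all_docs = []
        for q in queries:
            docs = vectorstore.similarity_search(q, k=8)
            all_docs.extend(docs)
        candidates = deduplicate_docs([all_docs])[:15]
    else:
        candidates = vectorstore.similarity_search(retrieval_query, k=15)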
    
    # Step 3: Re-rank (assumes the `reranker` object from the previous lesson is in
    # scope and that rerank() returns (doc, score) pairs)
    reranked = reranker.rerank(question, candidates, top_k=5)
    top_docs = [doc for doc, score in reranked]
    
    # Step 4: Generate answer
    context = "\n\n".join(doc.page_content for doc in top_docs)
    answer_prompt = ChatPromptTemplate.from_template("""
Answer based ONLY on the context. Be specific and cite document sections.

Context:
{context}

Question: {question}
Answer:""")
    answer = (answer_prompt | llm | StrOutputParser()).invoke({
        "context": context, "question": question
    })
    
    # Step 5: Self-RAG verification (optional)
    if use_self_rag:
        grounding = grounding_chain.invoke({
            "context": context, "question": question, "answer": answer
        })
        if "NOT_GROUNDED" in grounding:
            result = self_rag(question, vectorstore, max_iterations=2)
            return result["answer"]
    
    return answer

# Usage
answer = asyncio.run(advanced_rag_pipeline(
    "What are the retry policies for API calls?",
    vectorstore=vectorstore,
    use_hyde=True,
    use_expansion=True,
    use_self_rag=False
))
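
The pipeline relies on a `reranker` object that is not defined in this lesson. To run the pipeline without the previous lesson's code, a placeholder with the same call shape is enough. The sketch below is a hypothetical stand-in that scores candidates by word overlap with the question. It is not a cross-encoder and is not meant for production ranking; the `Doc` class only mimics the `page_content` attribute of a LangChain document, and the sample texts are made up.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a document with a page_content attribute."""
    page_content: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class LexicalOverlapReranker:
    """Placeholder reranker: scores by the fraction of question words found in the doc."""

    def rerank(self, question: str, candidates: list, top_k: int = 5) -> list[tuple]:
        question_tokens = _tokens(question)
        scored = []
        for doc in candidates:
            overlap = len(question_tokens & _tokens(doc.page_content))
            score = overlap / len(question_tokens) if question_tokens else 0.0
            scored.append((doc, score))
        # Highest score first; the sort is stable, so ties keep retrieval order
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


reranker = LexicalOverlapReranker()
candidates = [
    Doc("Billing invoices are issued monthly."),
    Doc("API calls are retried with exponential backoff; retry policies are configurable."),
    Doc("The API supports pagination."),
]
reranked = reranker.rerank("What are the retry policies for API calls?", candidates, top_k=2)
for doc, score in reranked:
    print(f"{score:.2f}  {doc.page_content}")
```

Because the placeholder returns `(doc, score)` pairs, the line `top_docs = [doc for doc, score in reranked]` in the pipeline works unchanged when a real reranker is swapped in later.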

When to Implement Each Technique

Technique	RAGAS Improvement	Latency Cost	Complexity	Implement When
HyDE	+0.05-0.10 faithfulness	+1 LLM call	Low	Always (easy win)
Query expansion	+0.05-0.08 context precision	+1 LLM call (retrievals run in parallel)	Medium	Queries are often terse
Self-RAG	+0.08-0.15 faithfulness	+300-1000ms	High	High-stakes accuracy required
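
The table can be turned into a small routing function that sets the pipeline's `use_hyde`, `use_expansion`, and `use_self_rag` flags per query. The sketch below is one possible reading of the table, not a tuned policy: the `choose_techniques` name, the `PipelineFlags` class, the `high_stakes` parameter, and the four-word threshold for a "terse" query are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineFlags:
    use_hyde: bool
    use_expansion: bool
    use_self_rag: bool


def choose_techniques(question: str, high_stakes: bool = False,
                      terse_word_limit: int = 4) -> PipelineFlags:
    """Pick techniques following the table: HyDE always, expansion for terse
    queries, Self-RAG only when the caller marks the query as high-stakes."""
    is_terse = len(question.split()) <= terse_word_limit
    return PipelineFlags(
        use_hyde=True,
        use_expansion=is_terse,
        use_self_rag=high_stakes,
    )


print(choose_techniques("SSO setup?"))
print(choose_techniques("What is the SLA for the Enterprise tier?", high_stakes=True))
```

The resulting flags map directly onto the keyword arguments of `advanced_rag_pipeline`. How a query gets marked as high-stakes (by endpoint, user role, or a classifier) is left open here.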

The capstone project combines all of these into a single coherent system. Let’s build it.
