📖 Lesson ⏱️ 120 minutes

Chunking Strategies That Actually Work

Fixed-size, recursive, semantic, and proposition-level chunking — with benchmarks

The Variable That Controls RAG Quality More Than Any Other

You can spend days choosing the perfect embedding model and vector database. You can fine-tune your prompt template to perfection. But if your chunking strategy is wrong, retrieval quality will be poor — and nothing downstream can fix it.

Here is the core tension: chunks need to be small enough to be retrieved precisely, but large enough to contain meaningful, self-contained information. A chunk that’s too large will match many queries slightly but none perfectly. A chunk that’s too small might contain one sentence with no context — retrieved correctly, but useless to the LLM because it lacks the surrounding explanation.

This lesson walks through four strategies, from simplest to most sophisticated, with code and a concrete benchmark to illustrate the differences.


Strategy 1: Fixed-Size Chunking

The approach: Split every N characters (or tokens), with a fixed overlap.

The implementation:

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=512,    # characters, not tokens
    chunk_overlap=50,
    separator=""       # empty separator: cut purely by character count, ignoring text structure
)

chunks = splitter.split_documents(documents)

The intuition: Imagine you have a 1,000-page book and you cut it into pieces every 4 inches, regardless of where paragraphs or sentences end. Cheap and fast, but you’re going to slice sentences in half constantly.

The failure mode:

Original text:
"The indemnification clause requires the vendor to hold the customer 
harmless for all third-party claims arising from the vendor's 
negligence. This obligation survives termination of the agreement."

Fixed-size chunk boundary falls here:
"...The indemnification clause requires the vendor to hold the customer 
harmless for all third-party claims arising from the vendor's
negligen"  ← cut mid-word

Next chunk:
"ce. This obligation survives termination of the agreement. The 
payment terms specify net-30 from invoice date..."

The word “negligence” is split across two chunks, so neither chunk contains the intact term and both are less likely to be retrieved for a query about negligence. The second chunk also starts with a fragment.
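
To see the mechanics without any library, here is a minimal pure-Python sketch of fixed-size chunking. The function name `fixed_size_chunks` and the tiny sizes are illustrative assumptions, not LangChain's implementation; the point is only to show how a character-count boundary lands mid-word.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text every (chunk_size - overlap) characters, ignoring word boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "The vendor is liable for negligence. This obligation survives termination."
chunks = fixed_size_chunks(text, chunk_size=30, overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

In this toy input the first chunk ends in "negli" and the second begins with "gligence", so no chunk holds the whole word. An overlap of at least 5 characters would happen to carry the full word into the second chunk here, which is one reason overlap exists, but it does not make the boundaries any less arbitrary.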

When to use it: Prototyping and testing when you need a fast baseline. Not for production.


Strategy 2: Recursive Character Splitting

The approach: Try to split at the most natural boundary first (paragraphs → sentences → words), falling back to harder splits only when necessary.

This is the most common production starting point and what LangChain’s RecursiveCharacterTextSplitter implements:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",   # paragraph breaks (try first)
        "\n",     # line breaks
        ". ",     # sentence boundaries
        ", ",     # clause boundaries
        " ",      # word boundaries
        ""        # character boundaries (last resort)
    ]
)

chunks = splitter.split_documents(documents)

The intuition: You’re a copy editor splitting a long article into sections. You first try to find natural paragraph breaks. If a paragraph is still too long, you split at sentences. If a sentence is too long, you split at clauses. You only resort to mid-word splits if absolutely forced. The result is far more readable than fixed-size splitting.
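
The fallback order described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual algorithm: the name `recursive_split` is hypothetical, there is no overlap, and separators at chunk boundaries are dropped rather than kept.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones only if needed."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_len, finer)  # separator not present

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: fall back to finer separators
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is a good deal longer. It has two sentences."
)
chunks = recursive_split(text, max_len=40, separators=["\n\n", ". ", " ", ""])
for chunk in chunks:
    print(repr(chunk))
```

The short paragraph survives intact, and only the long paragraph is broken further, at its sentence boundary rather than mid-word.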

With token-counting for LLM alignment:

LLM context limits are measured in tokens, not characters. A smarter approach uses a tokenizer-aware splitter:

# pip install tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

def token_length(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,           # tokens
    chunk_overlap=40,         # tokens
    length_function=token_length,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)

# Verify actual token counts
token_counts = [token_length(c.page_content) for c in chunks]
print(f"Mean chunk size: {sum(token_counts)/len(token_counts):.0f} tokens")
print(f"Max chunk size: {max(token_counts)} tokens")
print(f"Min chunk size: {min(token_counts)} tokens")

When to use it: This is your default. Start here for every new project. Only upgrade if measurements show that retrieval quality is insufficient.


Strategy 3: Semantic Chunking

The approach: Instead of splitting by size, split where the meaning changes. Embed sentences sequentially and split where the cosine similarity between adjacent sentences drops significantly.

The intuition: Imagine you’re reading a technical document. The first few sentences discuss authentication. Then there’s a clear conceptual shift and the next sentences discuss authorization. Semantic chunking detects that shift and places the boundary there — keeping the authentication discussion together and the authorization discussion together, even if that means chunks are different sizes.

# pip install langchain-experimental
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # split at the top 5% of similarity drops
)

chunks = splitter.split_documents(documents)

# Chunks will vary in size; the word counts below are a rough proxy for token counts
sizes = [len(c.page_content.split()) for c in chunks]
print(f"Chunk sizes (words): min={min(sizes)}, max={max(sizes)}, mean={sum(sizes)/len(sizes):.0f}")
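
To make the boundary rule concrete, here is a dependency-free sketch of the core idea: compare each sentence's embedding with the previous one and start a new chunk when similarity falls below a threshold. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the fixed `threshold` replaces the percentile-based breakpoint that SemanticChunker is configured with above.

```python
import math
import re
from typing import Callable

VOCAB = ["login", "password", "role", "permission"]  # hypothetical toy vocabulary


def toy_embed(sentence: str) -> list[float]:
    """Stand-in embedding: counts of a few vocabulary words (not a real model)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], list[float]],
    threshold: float,
) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev_vec, cur_vec) < threshold:
            groups.append([])  # topic shift detected: open a new chunk
        groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


sentences = [
    "Users login with a password.",
    "The login form locks after five wrong password attempts.",
    "Each role grants one permission set.",
    "A permission can belong to more than one role.",
]
chunks = semantic_chunks(sentences, toy_embed, threshold=0.5)
for chunk in chunks:
    print(repr(chunk))
```

The two authentication sentences stay together and the two authorization sentences stay together, regardless of how long each group is.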

The trade-offs:

| Aspect | Recursive | Semantic |
| --- | --- | --- |
| Speed | Very fast (no API calls) | Slow (must embed every sentence) |
| Cost | Free | ~$0.02 per 1M tokens (OpenAI small) |
| Boundary quality | Good (natural punctuation) | Better (semantic coherence) |
| Chunk size variance | Low (predictable sizes) | High (can be very long or very short) |

When to use it: When your domain has dense, technical text where multiple topics appear in the same paragraph, and you’ve measured that recursive splitting is producing poor retrieval. Legal documents, medical literature, and dense technical specifications benefit most.


Strategy 4: Proposition-Level Chunking

The approach: Use an LLM to decompose each document into atomic factual statements (propositions). Each proposition becomes a chunk.

This technique was introduced in the Dense X Retrieval paper and produces the highest quality chunks — at significant cost.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

decompose_prompt = ChatPromptTemplate.from_template("""
Decompose the following text into a list of simple, self-contained factual propositions.
Each proposition should:
- Express a single fact
- Be understandable without the surrounding context
- Be a complete sentence

Text:
{text}

Return one proposition per line, no numbering, no bullet points.
""")

def propositionize(documents: list[Document]) -> list[Document]:
    propositions = []
    
    for doc in documents:
        # Skip very short chunks (under 20 words); note they are dropped entirely, not passed through
        if len(doc.page_content.split()) < 20:
            continue
            
        response = llm.invoke(
            decompose_prompt.format_messages(text=doc.page_content)
        )
        
        for line in response.content.strip().split('\n'):
            line = line.strip()
            if len(line) > 20:  # filter very short lines
                propositions.append(Document(
                    page_content=line,
                    metadata={
                        **doc.metadata,
                        "proposition_source": doc.page_content[:100]
                    }
                ))
    
    return propositions

# First do recursive splitting, then propositionize
base_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
base_chunks = base_splitter.split_documents(documents)

propositions = propositionize(base_chunks)
print(f"Generated {len(propositions)} propositions from {len(base_chunks)} chunks")

Example of the transformation:

Original chunk:

"The API uses OAuth 2.0 for authentication. Access tokens expire after 
one hour and must be refreshed using the refresh token endpoint at 
/auth/refresh. Rate limits are applied per API key at 1000 requests 
per minute for the Standard tier."

Proposition-level chunks:

"The API uses OAuth 2.0 for authentication."
"Access tokens expire after one hour."
"Expired access tokens must be refreshed using the refresh token endpoint."
"The refresh token endpoint is located at /auth/refresh."
"Rate limits are applied per API key."
"The Standard tier allows 1000 requests per minute."

When a user asks “What is the rate limit for the Standard tier?”, the proposition “The Standard tier allows 1000 requests per minute” will have a very high cosine similarity to the query — much higher than the original 3-sentence chunk.
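
A rough way to see why the single proposition matches better is to measure how much of each text is about the query. The sketch below uses word-set overlap (Jaccard similarity) as a crude lexical stand-in for embedding cosine similarity; that substitution is an assumption made so the example runs without an embedding model, and real embedding scores will differ.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by total distinct tokens across both texts."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


query = "What is the rate limit for the Standard tier?"
proposition = "The Standard tier allows 1000 requests per minute."
original_chunk = (
    "The API uses OAuth 2.0 for authentication. Access tokens expire after "
    "one hour and must be refreshed using the refresh token endpoint at "
    "/auth/refresh. Rate limits are applied per API key at 1000 requests "
    "per minute for the Standard tier."
)

print(f"query vs proposition:    {jaccard(query, proposition):.2f}")
print(f"query vs original chunk: {jaccard(query, original_chunk):.2f}")
```

The original chunk shares slightly more words with the query, but they are diluted by the unrelated OAuth and token-refresh content, so its overall score is lower.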

The cost: Each document chunk requires an LLM call to decompose. For a 500-page corpus, this might be 2,000-5,000 LLM calls. At GPT-4o-mini prices (~$0.15/1M input tokens, with output tokens billed at a higher rate), this typically comes to a few dollars for a medium-sized corpus. Cost scales linearly with corpus size, so for large corpora it can become significant.
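
The arithmetic behind such an estimate is easy to script. In the sketch below, the per-call token counts and the output price are assumptions chosen for illustration (only the ~$0.15/1M input price comes from the text above); substitute your provider's current prices and your own measured token counts.

```python
def estimate_llm_cost(
    num_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Total cost in dollars, with prices given per 1M tokens."""
    input_cost = num_calls * input_tokens_per_call / 1_000_000 * input_price_per_m
    output_cost = num_calls * output_tokens_per_call / 1_000_000 * output_price_per_m
    return input_cost + output_cost


cost = estimate_llm_cost(
    num_calls=5_000,             # upper end of the range quoted above
    input_tokens_per_call=400,   # assumed: prompt template plus one base chunk
    output_tokens_per_call=400,  # assumed: propositions roughly as long as the input
    input_price_per_m=0.15,      # from the text above
    output_price_per_m=0.60,     # assumed output price; check current pricing
)
print(f"Estimated cost: ${cost:.2f}")
```

Because the output is roughly as long as the input and is billed at a higher rate, output tokens tend to dominate the bill in this kind of decomposition task.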


The Benchmark: Same Query, Four Strategies

Let’s make this concrete with a measurable comparison. Using a 50-page technical manual and the query “What happens to API calls when the rate limit is exceeded?”:

Setup:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query = "What happens to API calls when the rate limit is exceeded?"

strategies = {
    # separator="" cuts purely by character count (the default separator is "\n\n")
    "fixed_size": CharacterTextSplitter(chunk_size=512, chunk_overlap=50, separator=""),
    "recursive": RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50),
    "semantic": SemanticChunker(embeddings, breakpoint_threshold_type="percentile"),
}

results = {}
for name, splitter in strategies.items():
    chunks = splitter.split_documents(documents)
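    # Use a separate collection per strategy so chunks from earlier runs don't mix into the results
    db = Chroma.from_documents(chunks, embeddings, collection_name=name)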
    retrieved = db.similarity_search(query, k=3)
    results[name] = retrieved
    print(f"\n{name.upper()}: {len(chunks)} total chunks")
    for i, doc in enumerate(retrieved):
        print(f"  Chunk {i+1}: {doc.page_content[:150]}")

Typical results:

| Strategy | Chunks Created | Relevant Chunks in Top 3 | Top Result Preview |
| --- | --- | --- | --- |
| Fixed-size | 412 | 1/3 | “…limit exceeded. The API returns a 4…” (cut off) |
| Recursive | 387 | 2/3 | “When rate limit is exceeded, the API returns HTTP 429…” |
| Semantic | 298 | 3/3 | “API rate limit exceeded responses return HTTP 429 with a Retry-After header…” |
| Proposition | 1,847 | 3/3 | “HTTP 429 Too Many Requests is returned when the rate limit is exceeded.” |

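In this example the pattern is clear: more sophisticated strategies retrieve more relevant chunks. The improvement from fixed-size to recursive is large, the improvement from recursive to semantic is moderate, and proposition-level chunking adds precision at high cost. A single query is only an illustration, so confirm the trend on your own evaluation set before committing to a strategy.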
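
The “Relevant Chunks in Top 3” column can be computed rather than eyeballed. The sketch below is one minimal way to do it; the helper names, the keyword-based relevance rule, and the sample texts are all hypothetical, and a real evaluation would use labeled relevance judgments instead of a substring check.

```python
from typing import Callable


def relevant_in_top_k(
    retrieved_texts: list[str],
    is_relevant: Callable[[str], bool],
    k: int = 3,
) -> tuple[int, float]:
    """Return (hits, precision) for the first k retrieved texts."""
    top = retrieved_texts[:k]
    hits = sum(1 for text in top if is_relevant(text))
    precision = hits / len(top) if top else 0.0
    return hits, precision


def mentions_429(text: str) -> bool:
    # Hypothetical relevance rule for this one query only
    return "429" in text


# Hypothetical retrieved texts standing in for db.similarity_search results
sample_top_3 = [
    "When rate limit is exceeded, the API returns HTTP 429.",
    "Rate limits are applied per API key.",
    "Clients should wait for the Retry-After interval after an HTTP 429.",
]
hits, precision = relevant_in_top_k(sample_top_3, mentions_429, k=3)
print(f"{hits}/3 relevant (precision@3 = {precision:.2f})")
```

With the benchmark loop above, the same function could be fed `[doc.page_content for doc in results[name]]` for each strategy.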


Choosing the Right Strategy: The Decision Tree

Start with: Recursive Character Splitting
  chunk_size=400 tokens, chunk_overlap=40 tokens

        ↓

Measure RAGAS Context Precision (covered in Lesson 9)

Context Precision > 0.7?
  Yes → You're done. Ship it.
  No → Continue

        ↓

Is your content domain-specific with dense topic shifts?
(Legal docs, medical literature, scientific papers)
  Yes → Try Semantic Chunking
  No → Try reducing chunk_size (experiment with 200-300 tokens)

        ↓

Still below 0.7 Context Precision after tuning?
  Yes → Try Proposition-level chunking
  (Accept the cost; it can raise precision substantially, for example from 0.65 to 0.85)
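
The same decision tree can be written as a small function, which is convenient if you want to log the recommendation alongside each evaluation run. The function name, argument names, and return labels are hypothetical; the 0.7 threshold is the one used in the tree.

```python
def next_chunking_step(
    context_precision: float,
    dense_topic_shifts: bool,
    already_tuned: bool,
    target: float = 0.7,
) -> str:
    """Map a measured Context Precision score to the next thing to try."""
    if context_precision > target:
        return "ship"  # good enough: keep the current strategy
    if already_tuned:
        return "proposition"  # semantic chunking or smaller chunks did not help
    if dense_topic_shifts:
        return "semantic"  # legal, medical, or scientific text
    return "reduce_chunk_size"  # experiment with 200-300 tokens


print(next_chunking_step(0.62, dense_topic_shifts=True, already_tuned=False))
```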

Practical Tips

Tip 1: Set chunk_overlap to ~10% of chunk_size
For 512-character chunks, use an overlap of 50. For 1000-character chunks, use 100. Too little overlap and boundary content gets lost. Too much and you’re creating near-duplicate chunks that waste index space.
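
The index-space cost of overlap is simple arithmetic. The sketch below counts how many sliding-window chunks a document of a given length produces; the function name and the 100,000-character document are assumptions for illustration, and real splitters that respect separators will produce somewhat different counts.

```python
import math


def chunk_count(doc_len: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover doc_len characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_len <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far each new chunk advances
    return math.ceil((doc_len - chunk_size) / step) + 1


doc_len = 100_000  # assumed document length in characters
baseline = chunk_count(doc_len, chunk_size=512, overlap=0)
for overlap in (0, 50, 256):
    n = chunk_count(doc_len, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:>3}: {n} chunks ({n / baseline:.2f}x baseline)")
```

At roughly 10% overlap the index grows by about 11%, while a 50% overlap nearly doubles it.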

Tip 2: Tune chunk_size to your domain
Short, factual content (FAQs, policies): smaller chunks (200-300 tokens) work well. Long explanatory content (tutorials, manuals): larger chunks (500-800 tokens) keep more context together.

Tip 3: Different strategies for different document types
You don’t have to use the same strategy for everything. For example, use recursive splitting for manuals and proposition-level chunking for FAQs, where each question-answer pair is atomic.

Tip 4: Add context headers to chunks
This is a lightweight variant of Anthropic’s “Contextual Retrieval” technique, which uses an LLM to write a short, chunk-specific context and prepends it to each chunk before embedding. Here, a document title and section title are prepended instead. The added context improves retrieval for chunks whose language is vague (“it”, “this”, “the system”) when read in isolation.

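from langchain_core.documents import Document

def add_context_header(chunk: Document, doc_title: str, section: str) -> Document:
    # Prepend document and section labels so the embedding sees the chunk's context
    header = f"Document: {doc_title}\nSection: {section}\n\n"
    # Return a new Document rather than mutating the caller's chunk in place
    return Document(page_content=header + chunk.page_content, metadata=dict(chunk.metadata))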

Summary

| Strategy | Speed | Cost | Quality | Use When |
| --- | --- | --- | --- | --- |
| Fixed-size | Fastest | Free | Low | Never in production |
| Recursive | Fast | Free | Good | Default starting point |
| Semantic | Slow | Low | Better | Dense domain-specific content |
| Proposition | Slowest | Moderate | Best | High-precision critical systems |

Start with recursive. Measure. Upgrade only if your RAGAS scores justify the added complexity and cost.