Chunking Strategies That Actually Work
Fixed-size, recursive, semantic, and proposition-level chunking — with benchmarks
The Variable That Controls RAG Quality More Than Any Other
You can spend days choosing the perfect embedding model and vector database. You can fine-tune your prompt template to perfection. But if your chunking strategy is wrong, retrieval quality will be poor — and nothing downstream can fix it.
Here is the core tension: chunks need to be small enough to be retrieved precisely, but large enough to contain meaningful, self-contained information. A chunk that’s too large will match many queries slightly but none perfectly. A chunk that’s too small might contain one sentence with no context — retrieved correctly, but useless to the LLM because it lacks the surrounding explanation.
This lesson walks through four strategies, from simplest to most sophisticated, with code and a concrete benchmark to illustrate the differences.
Strategy 1: Fixed-Size Chunking
The approach: Split every N characters (or tokens), with a fixed overlap.
The implementation:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=512,    # characters, not tokens
    chunk_overlap=50,
    separator=""       # empty separator: cut purely by character count, ignoring word boundaries
)
chunks = splitter.split_documents(documents)

The intuition: Imagine you have a 1,000-page book and you cut it into pieces every 4 inches, regardless of where paragraphs or sentences end. Cheap and fast, but you’re going to slice sentences in half constantly.
The failure mode:
Original text:
"The indemnification clause requires the vendor to hold the customer
harmless for all third-party claims arising from the vendor's
negligence. This obligation survives termination of the agreement."
Fixed-size chunk boundary falls here:
"...The indemnification clause requires the vendor to hold the customer
harmless for all third-party claims arising from the vendor's
negligen" ← cut mid-word
Next chunk:
"ce. This obligation survives termination of the agreement. The
payment terms specify net-30 from invoice date..."

The word “negligence” is split across two chunks, so neither chunk contains the intact term and both match a query about negligence poorly. The second chunk also starts with a sentence fragment.
When to use it: Prototyping and testing when you need a fast baseline. Not for production.
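To see the failure mode concretely, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not LangChain's implementation; the function name `fixed_size_chunks` and the small `chunk_size=60` are assumptions chosen so the cuts are visible on a short passage.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text every `chunk_size` characters, repeating `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


clause = (
    "The indemnification clause requires the vendor to hold the customer "
    "harmless for all third-party claims arising from the vendor's "
    "negligence. This obligation survives termination of the agreement."
)

chunks = fixed_size_chunks(clause, chunk_size=60, overlap=10)
for chunk in chunks:
    # repr() makes the chunk edges visible, including cuts inside words
    print(repr(chunk))
```

Because the window advances by a fixed number of characters, where a cut lands depends only on arithmetic, never on the text itself.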
Strategy 2: Recursive Character Splitting
The approach: Try to split at the most natural boundary first (paragraphs → sentences → words), falling back to harder splits only when necessary.
This is the most common production starting point and what LangChain’s RecursiveCharacterTextSplitter implements:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",  # paragraph breaks (try first)
        "\n",    # line breaks
        ". ",    # sentence boundaries
        ", ",    # clause boundaries
        " ",     # word boundaries
        ""       # character boundaries (last resort)
    ]
)
chunks = splitter.split_documents(documents)

The intuition: You’re a copy editor splitting a long article into sections. You first try to find natural paragraph breaks. If a paragraph is still too long, you split at sentences. If a sentence is too long, you split at clauses. You only resort to mid-word splits if absolutely forced. The result is far more readable than fixed-size splitting.
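The fallback order can be sketched in plain Python. This is a simplified illustration of the idea, not the code behind RecursiveCharacterTextSplitter: the function name `recursive_split` is hypothetical, overlap is omitted, and a separator that falls on a chunk boundary is dropped rather than kept.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split at the coarsest separator present, recursing only on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first (coarsest) non-empty separator that occurs in the text.
    for i, sep in enumerate(separators):
        if sep and sep in text:
            pieces, remaining = text.split(sep), separators[i + 1:]
            break
    else:
        # Nothing left to try: hard cut by character count (last resort).
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is longer. It has two sentences in it.\n\n"
    "Third."
)
result = recursive_split(text, chunk_size=40, separators=["\n\n", ". ", " "])
for chunk in result:
    print(repr(chunk))
```

Only the second paragraph exceeds the limit, so only that paragraph is split further, and it is split at its sentence boundary rather than mid-word.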
With token-counting for LLM alignment:
LLM context limits are measured in tokens, not characters. A smarter approach uses a tokenizer-aware splitter:
# pip install tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

# Load the encoding once instead of on every call
encoding = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

def token_length(text: str) -> int:
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,     # tokens
    chunk_overlap=40,   # tokens
    length_function=token_length,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

# Verify actual token counts
token_counts = [token_length(c.page_content) for c in chunks]
print(f"Mean chunk size: {sum(token_counts)/len(token_counts):.0f} tokens")
print(f"Max chunk size: {max(token_counts)} tokens")
print(f"Min chunk size: {min(token_counts)} tokens")

When to use it: This is your default. Start here for every new project. Upgrade only if measured retrieval quality is insufficient.
Strategy 3: Semantic Chunking
The approach: Instead of splitting by size, split where the meaning changes. Embed sentences sequentially and split where the cosine similarity between adjacent sentences drops significantly.
The intuition: Imagine you’re reading a technical document. The first few sentences discuss authentication. Then there’s a clear conceptual shift and the next sentences discuss authorization. Semantic chunking detects that shift and places the boundary there — keeping the authentication discussion together and the authorization discussion together, even if that means chunks are different sizes.
# pip install langchain-experimental
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # split at the top 5% of similarity drops
)
chunks = splitter.split_documents(documents)

# Chunks will vary in size: some might be 200 tokens, others 800
sizes = [len(c.page_content.split()) for c in chunks]
print(f"Chunk sizes (words): min={min(sizes)}, max={max(sizes)}, mean={sum(sizes)/len(sizes):.0f}")

The trade-offs:
| Aspect | Recursive | Semantic |
|---|---|---|
| Speed | Very fast (no API calls) | Slow (must embed every sentence) |
| Cost | Free | ~$0.02 per 1M tokens (OpenAI small) |
| Boundary quality | Good (natural punctuation) | Better (semantic coherence) |
| Chunk size variance | Low (predictable sizes) | High (can be very long or very short) |
When to use it: When your domain has dense, technical text where multiple topics appear in the same paragraph, and you’ve measured that recursive splitting is producing poor retrieval. Legal documents, medical literature, and dense technical specifications benefit most.
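The boundary detection described above can be illustrated without an embedding API. The sketch below swaps real embeddings for a bag-of-words vector (a crude stand-in, used only so the example runs offline) and uses a fixed similarity threshold instead of the percentile rule. The names `bow_vector` and `semantic_chunks` and the 0.2 threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(curr)) < threshold:
            chunks.append([curr])    # similarity dropped: topic shift, new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return chunks


sentences = [
    "Authentication verifies the identity of a user.",
    "Authentication of a user relies on a password or a token.",
    "Authorization decides which resources are allowed.",
    "Authorization rules map roles to allowed resources.",
]
grouped = semantic_chunks(sentences, threshold=0.2)
for group in grouped:
    print(group)
```

With real embeddings the similarity scores would reflect meaning rather than shared words, but the control flow is the same: compare neighbours, and cut where the score drops.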
Strategy 4: Proposition-Level Chunking
The approach: Use an LLM to decompose each document into atomic factual statements (propositions). Each proposition becomes a chunk.
This technique was introduced in the Dense X Retrieval paper and produces the highest quality chunks — at significant cost.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

decompose_prompt = ChatPromptTemplate.from_template("""
Decompose the following text into a list of simple, self-contained factual propositions.
Each proposition should:
- Express a single fact
- Be understandable without the surrounding context
- Be a complete sentence
Text:
{text}
Return one proposition per line, no numbering, no bullet points.
""")

def propositionize(documents: list[Document]) -> list[Document]:
    propositions = []
    for doc in documents:
        # Skip chunks too short to decompose (under 20 words); they are dropped from the output
        if len(doc.page_content.split()) < 20:
            continue
        response = llm.invoke(
            decompose_prompt.format_messages(text=doc.page_content)
        )
        for line in response.content.strip().split('\n'):
            line = line.strip()
            if len(line) > 20:  # filter very short lines
                propositions.append(Document(
                    page_content=line,
                    metadata={
                        **doc.metadata,
                        "proposition_source": doc.page_content[:100]
                    }
                ))
    return propositions

# First do recursive splitting, then propositionize
base_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
base_chunks = base_splitter.split_documents(documents)
propositions = propositionize(base_chunks)
print(f"Generated {len(propositions)} propositions from {len(base_chunks)} chunks")

Example of the transformation:
Original chunk:
"The API uses OAuth 2.0 for authentication. Access tokens expire after
one hour and must be refreshed using the refresh token endpoint at
/auth/refresh. Rate limits are applied per API key at 1000 requests
per minute for the Standard tier."

Proposition-level chunks:
"The API uses OAuth 2.0 for authentication."
"Access tokens expire after one hour."
"Expired access tokens must be refreshed using the refresh token endpoint."
"The refresh token endpoint is located at /auth/refresh."
"Rate limits are applied per API key."
"The Standard tier allows 1000 requests per minute."

When a user asks “What is the rate limit for the Standard tier?”, the proposition “The Standard tier allows 1000 requests per minute” will have a very high cosine similarity to the query, much higher than the original three-sentence chunk.
The cost: Each document chunk requires an LLM call to decompose. For a 500-page corpus, this might be 2,000-5,000 LLM calls. At GPT-4o-mini prices (~$0.15/1M input tokens), this is typically $2-10 for a medium-sized corpus. For large corpora, cost can become significant.
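A rough way to budget this is to multiply the call count by the tokens per call and the per-token price. The helper below is a back-of-the-envelope sketch: the function name, the per-call token counts, and the output-token price are assumptions to replace with your own measurements and your provider's current pricing. Only the ~$0.15 per 1M input tokens figure comes from the text above.

```python
def estimate_proposition_cost(
    num_chunks: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated dollars for one decomposition pass (one LLM call per chunk)."""
    input_cost = num_chunks * avg_input_tokens * input_price_per_million / 1_000_000
    output_cost = num_chunks * avg_output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


cost = estimate_proposition_cost(
    num_chunks=5_000,               # upper end of the call range mentioned above
    avg_input_tokens=400,           # assumption: prompt template plus one base chunk
    avg_output_tokens=300,          # assumption: the generated propositions
    input_price_per_million=0.15,   # from the text above
    output_price_per_million=0.60,  # assumption: check current pricing
)
print(f"Estimated cost: ${cost:.2f}")
```

Output tokens are billed as well as input tokens, and propositions are often about as long as the text they restate, so both terms matter in the estimate.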
The Benchmark: Same Query, Four Strategies
Let’s make this concrete with a measurable comparison. Using a 50-page technical manual and the query “What happens to API calls when the rate limit is exceeded?”:
Setup:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query = "What happens to API calls when the rate limit is exceeded?"

strategies = {
    # separator="" cuts purely by character count
    "fixed_size": CharacterTextSplitter(chunk_size=512, chunk_overlap=50, separator=""),
    "recursive": RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50),
    "semantic": SemanticChunker(embeddings, breakpoint_threshold_type="percentile"),
}
# Proposition-level chunking is not a splitter object: to add it to the comparison,
# index the output of propositionize() the same way.

results = {}
for name, splitter in strategies.items():
    chunks = splitter.split_documents(documents)
    # One collection per strategy, so chunks from different strategies are never mixed
    db = Chroma.from_documents(chunks, embeddings, collection_name=name)
    retrieved = db.similarity_search(query, k=3)
    results[name] = retrieved
    print(f"\n{name.upper()}: {len(chunks)} total chunks")
    for i, doc in enumerate(retrieved):
        print(f"  Chunk {i+1}: {doc.page_content[:150]}")

Typical results:
| Strategy | Chunks Created | Relevant Chunks in Top 3 | Top Result Preview |
|---|---|---|---|
| Fixed-size | 412 | 1/3 | “…limit exceeded. The API returns a 4…” (cut off) |
| Recursive | 387 | 2/3 | “When rate limit is exceeded, the API returns HTTP 429…” |
| Semantic | 298 | 3/3 | “API rate limit exceeded responses return HTTP 429 with a Retry-After header…” |
| Proposition | 1,847 | 3/3 | “HTTP 429 Too Many Requests is returned when the rate limit is exceeded.” |
The pattern is consistent: more sophisticated strategies retrieve more relevant chunks. The improvement from fixed to recursive is large. The improvement from recursive to semantic is moderate. Proposition adds precision at high cost.
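The “Relevant Chunks in Top 3” column can be computed with a small helper. The version below is a sketch that treats a chunk as relevant when it mentions every required term, which is a crude keyword proxy for a human relevance judgment. The function name, the sample texts, and the term list are all made up for illustration.

```python
def relevant_in_top_k(retrieved_texts: list[str], required_terms: list[str], k: int = 3) -> int:
    """Count how many of the top-k retrieved chunks mention every required term."""
    return sum(
        all(term.lower() in text.lower() for term in required_terms)
        for text in retrieved_texts[:k]
    )


# Hypothetical retrieval output, standing in for [d.page_content for d in retrieved]
sample_retrieved = [
    "When the rate limit is exceeded, the API returns HTTP 429.",
    "Rate limit counters reset every minute.",
    "Clients that receive HTTP 429 after the rate limit is hit should back off.",
]
hits = relevant_in_top_k(sample_retrieved, required_terms=["rate limit", "429"])
print(f"{hits}/3 relevant")  # prints 2/3 relevant
```

For a real evaluation, replace the keyword check with labeled relevance judgments or an automated metric, and average over many queries rather than one.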
Choosing the Right Strategy: The Decision Tree
Start with: Recursive Character Splitting
chunk_size=400 tokens, chunk_overlap=40 tokens
    ↓
Measure RAGAS Context Precision (covered in Lesson 9)
Context Precision > 0.7?
    Yes → You're done. Ship it.
    No  → Continue
    ↓
Is your content domain-specific with dense topic shifts?
(Legal docs, medical literature, scientific papers)
    Yes → Try Semantic Chunking
    No  → Try reducing chunk_size (experiment with 200-300 tokens)
    ↓
Still below 0.7 Context Precision after tuning?
    Yes → Try Proposition-level chunking
          (Accept the cost; it often jumps precision from 0.65 to 0.85)

Practical Tips
Tip 1: Set chunk_overlap to ~10% of chunk_size
For 512-character chunks, use an overlap of 50. For 1000-character chunks, use 100. Too little overlap and boundary content gets lost. Too much and you’re creating near-duplicate chunks that waste index space.
Tip 2: Tune chunk_size to your domain
Short, factual content (FAQs, policies): smaller chunks (200-300 tokens) work well. Long explanatory content (tutorials, manuals): larger chunks (500-800 tokens) keep more context together.
Tip 3: Different strategies for different document types
You don’t have to use the same strategy for everything. For example, use recursive splitting for manuals and proposition-level chunking for FAQs, where each question-answer pair is atomic.
Tip 4: Add context headers to chunks
Anthropic’s “Contextual Retrieval” research prepends a short, LLM-generated description of where each chunk sits in its document before embedding. A cheap approximation is to prepend the document title and section title to each chunk. This can substantially improve retrieval for chunks whose language is vague (“it”, “this”, “the system”) without context.
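def add_context_header(chunk: Document, doc_title: str, section: str) -> Document:
    # Prepend document and section titles so the embedding captures the missing context
    header = f"Document: {doc_title}\nSection: {section}\n\n"
    chunk.page_content = header + chunk.page_content  # modifies the chunk in place
    return chunk

Summary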
| Strategy | Speed | Cost | Quality | Use When |
|---|---|---|---|---|
| Fixed-size | Fastest | Free | Low | Never in production |
| Recursive | Fast | Free | Good | Default starting point |
| Semantic | Slow | Low | Better | Dense domain-specific content |
| Proposition | Slowest | Moderate | Best | High-precision critical systems |
Start with recursive. Measure. Upgrade only if your RAGAS scores justify the added complexity and cost.
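The escalation path in the decision tree can also be written as a small function. This is a sketch: the function name `next_chunking_step`, the strategy labels, and the string return values are assumptions for illustration, while the 0.7 threshold is the one used in the tree.

```python
def next_chunking_step(
    context_precision: float,
    current_strategy: str,
    dense_topic_shifts: bool,
    target: float = 0.7,
) -> str:
    """Return the next thing to try, following the decision tree above."""
    if context_precision > target:
        return "ship"
    if current_strategy == "recursive":
        # First escalation depends on the kind of content being indexed
        return "semantic" if dense_topic_shifts else "recursive_smaller_chunks"
    # Already tuned (semantic or smaller chunks) and still below target
    return "proposition"


print(next_chunking_step(0.62, "recursive", dense_topic_shifts=True))
```

Each call corresponds to one measure-then-decide loop: re-run the evaluation after every change and feed the new score back in.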
