AI Engineering · 5 min read
📋 Prerequisites
- Completed Domain 2 — Prompt Engineering
- Active Anthropic API access
🎯 What You'll Learn
- Apply the context decision tree to choose between in-context, cached, and RAG approaches
- Implement prompt caching correctly with proper block placement and size thresholds
- Design conversation history management for long-running multi-turn applications
- Estimate the cost and latency impact of caching decisions
Domain 3 Overview
Context management is where the most expensive architectural mistakes happen. Exam weight: approximately 20% (~12 questions). The exam tests whether you understand the real costs of context decisions and can apply the right pattern for a given scenario.
The Context Decision Tree
Start every architecture decision with this tree:
```
Does the task involve a large document or knowledge base?
│
├── YES: How large?
│   ├── Fits in 200K context (< ~150K tokens) AND same session reuses it?
│   │   └── → Pass in context + enable prompt caching
│   │
│   ├── Too large for context OR accessed across many independent sessions?
│   │   └── → RAG (chunk, embed, retrieve relevant passages)
│   │
│   └── Large but accessed once per query with no repetition?
│       └── → Consider summarization first, then in-context
│
└── NO: Just a conversation?
    └── → Manage history with sliding window or periodic summarization
```
The exam's most common trap: defaulting to RAG when the document fits in context. RAG adds latency, retrieval failures, chunking artifacts, and complexity. Use in-context when the document fits; prompt caching eliminates the re-processing cost.
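If it helps to see the routing logic as code, here is a minimal sketch of the tree above. The function name, the 150K-token cutoff, and the reuse flags are illustrative assumptions for demonstration, not part of any API.

```python
# Minimal sketch of the decision tree above. The 150K-token cutoff and the
# reuse heuristics are illustrative assumptions, not official thresholds.
def choose_context_strategy(doc_tokens: int,
                            reused_within_session: bool,
                            many_independent_sessions: bool) -> str:
    if doc_tokens == 0:
        return "conversation only: sliding window or periodic summarization"
    if doc_tokens > 150_000 or many_independent_sessions:
        return "RAG: chunk, embed, retrieve relevant passages"
    if reused_within_session:
        return "in-context + prompt caching"
    return "summarize first, then pass in context"

# A 20K-token document reused across one session routes to in-context + caching.
print(choose_context_strategy(20_000, reused_within_session=True,
                              many_independent_sessions=False))
```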
Prompt Caching — The Highest-Leverage Cost Tool
Prompt caching stores a stable prefix in Anthropic’s infrastructure. Subsequent requests that share the same prefix pay approximately 10% of the normal input token cost for the cached portion.
Rules You Must Know for the Exam
| Rule | Value |
|---|---|
| Minimum cacheable block (Sonnet/Opus) | 1,024 tokens |
| Minimum cacheable block (Haiku) | 2,048 tokens |
| Cache TTL | 5 minutes (refreshed on each cache hit) |
| Cost on cache hit | ~10% of normal input token cost |
| Latency on cache hit | ~15% of normal input token processing time |
| Cache placement | End of the stable prefix — before any dynamic content |
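To estimate the impact of these numbers, a back-of-the-envelope calculation like the following is enough. The per-token prices and the cache-write premium below are placeholder assumptions; check current published pricing before relying on the results.

```python
# Back-of-the-envelope input-cost comparison for a session that reuses one
# stable prefix. All prices are assumed placeholders (dollars per million
# input tokens); cache reads are billed at ~10% of the normal input rate and
# cache writes at a small assumed premium.
INPUT_PRICE = 3.00        # assumed base input price per MTok
CACHE_READ_PRICE = 0.30   # ~10% of the assumed base rate
CACHE_WRITE_PRICE = 3.75  # assumed cache-write premium per MTok

def session_input_cost(prefix_tokens: int, question_tokens: int,
                       requests: int, cached: bool) -> float:
    """Rough input-side cost of `requests` calls sharing one stable prefix."""
    if not cached:
        return requests * (prefix_tokens + question_tokens) * INPUT_PRICE / 1e6
    write = prefix_tokens * CACHE_WRITE_PRICE / 1e6              # first request
    reads = (requests - 1) * prefix_tokens * CACHE_READ_PRICE / 1e6
    questions = requests * question_tokens * INPUT_PRICE / 1e6   # never cached
    return write + reads + questions

# 50K-token document, 100 questions of ~200 tokens each
print(f"uncached: ${session_input_cost(50_000, 200, 100, cached=False):.2f}")
print(f"cached:   ${session_input_cost(50_000, 200, 100, cached=True):.2f}")
```

With these placeholder numbers the cached session comes out at roughly a tenth of the uncached cost, which is the ~90% savings figure quoted in the table.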
What to Cache
Cache anything that:
- Is large (> 1K tokens) AND
- Repeats across multiple requests
Good candidates: system prompts, reference documents, few-shot example blocks, knowledge bases loaded per session.
Do NOT put dynamic content (the user’s current question, per-request variables) inside a cached block.
Implementation
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."        # Your large system prompt (> 1,024 tokens)
REFERENCE_DOCUMENT = "..."   # The document users query against (> 1,024 tokens)

def answer_question(user_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
                "cache_control": {"type": "ephemeral"},  # Cache this prefix
            }
        ],
        messages=[
            {
                "role": "user",
                "content": user_question,  # Dynamic — NOT cached
            }
        ],
    )

    # Check cache status in usage
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")

    return response.content[0].text
```
Reading Cache Metrics
After implementing caching, check the usage field to confirm it’s working:
- `cache_creation_input_tokens` > 0 → cache was written (first request)
- `cache_read_input_tokens` > 0 → cache was hit (subsequent requests) ← this is what saves money
- Both 0 → the block is below the minimum token threshold (caching is not activating)

A small helper for interpreting these fields is sketched below.
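The check can be wrapped in a tiny function so your application can log or alert on it. `cache_status` is a hypothetical helper name; it only reads the usage fields listed above.

```python
# Hypothetical helper for interpreting the usage fields listed above.
def cache_status(usage) -> str:
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "cache hit: cached prefix billed at the reduced rate"
    if created > 0:
        return "cache written: requests within the TTL should now hit it"
    return "no caching: prefix is likely below the minimum token threshold"

# Usage (after any messages.create() call): print(cache_status(response.usage))
```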
When RAG Is Better Than In-Context
RAG (Retrieval-Augmented Generation) is the right choice when:
| Scenario | Why RAG Wins |
|---|---|
| Knowledge base > 150K tokens | Does not fit in context |
| Thousands of independent users, each querying different documents | Caching won’t help across independent sessions |
| Very high retrieval precision needed from a massive corpus | Semantic search on embeddings outperforms asking Claude to find in 200K tokens |
| Documents change frequently | Embedding pipeline handles updates; in-context would need full reload |
When RAG Is Worse Than In-Context
The exam tests this direction too:
| Scenario | Why In-Context Wins |
|---|---|
| 50-page document (~20K tokens), same document used for all queries in a session | Fits in context + caching is cheaper and has no retrieval failures |
| Cross-document reasoning required | Chunked retrieval loses cross-document context; full context preserves it |
| Low latency required | RAG adds embedding lookup + retrieval round-trip latency |
| Precise quotes needed | Chunking artifacts and retrieval misses degrade precision |
Conversation History Management
In a multi-turn application, do not pass the full conversation history forever. It grows without bound, eventually exceeds the context window, and increases cost on every request.
Strategy 1 — Sliding Window
Keep the N most recent message pairs. Simple to implement. Risk: loses important context from early in the conversation.
```python
def trim_history_sliding(messages: list[dict], max_pairs: int = 10) -> list[dict]:
    """Keep the last max_pairs of user/assistant turns."""
    # Keep system prompt (if separate) and last N*2 messages (N user + N assistant)
    return messages[-(max_pairs * 2):]
```
Strategy 2 — Periodic Summarization
Every K turns, ask Claude to summarize the conversation so far. Replace the history with the summary. Preserves important context; costs one extra Claude call per K turns.
```python
def summarize_history(messages: list[dict]) -> str:
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use Haiku — cheap summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation, preserving key facts, decisions, and open questions:\n\n{history_text}"
        }],
    )
    return response.content[0].text

def maybe_summarize(messages: list[dict], summarize_every: int = 20) -> list[dict]:
    if len(messages) >= summarize_every:
        summary = summarize_history(messages)
        return [{"role": "user", "content": f"[Conversation summary]\n{summary}"},
                {"role": "assistant", "content": "Understood. I have the context from our earlier conversation."}]
    return messages
```
Which Strategy for the Exam
| Use Case | Recommended Strategy |
|---|---|
| Short sessions (< 20 turns) | Sliding window |
| Long-running sessions with important early context | Periodic summarization |
| Compliance/audit requirement to preserve full history | Store full history externally; pass sliding window to Claude |
Token Counting
Before building a caching strategy, know your token sizes. Use the Anthropic token counting API:
```python
response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Input tokens: {response.input_tokens}")
```
If your system prompt is only 800 tokens, caching will not activate on Sonnet (minimum: 1,024). You would need to expand it or combine it with a reference document.
Key Facts for the Exam
- 200K tokens ≈ 150,000 words ≈ ~500 pages
- Caching minimum: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku)
- Cache TTL: 5 minutes — refreshed on each hit
- Cache savings: ~90% on cached tokens
- RAG is NOT better by default — in-context + caching often wins for single-document use cases
- Sliding window = simple, loses early context; summarization = preserves context, costs extra call
- `cache_read_input_tokens` in `usage` tells you if caching is working
Proceed to the Domain 3 Lab to implement caching and measure its cost impact.