📖 Lesson ⏱️ 75 minutes

Domain 3 — Context, Memory, and Caching

200K context window strategy, prompt caching, conversation history management, and RAG integration

Domain 3 Overview

Context management is where the most expensive architectural mistakes happen. Exam weight: approximately 20% (~12 questions). The exam tests whether you understand the real costs of context decisions and can apply the right pattern for a given scenario.


The Context Decision Tree

Start every architecture decision with this tree:

Does the task involve a large document or knowledge base?

├── YES: How large?
│   ├── Fits in 200K context (< ~150K tokens) AND same session reuses it?
│   │   └── → Pass in context + enable prompt caching
│   │
│   ├── Too large for context OR accessed across many independent sessions?
│   │   └── → RAG (chunk, embed, retrieve relevant passages)
│   │
│   └── Large but accessed once per query with no repetition?
│       └── → Consider summarization first, then in-context

└── NO: Just a conversation?
    └── → Manage history with sliding window or periodic summarization

The exam’s most common trap: Defaulting to RAG when the document fits in context. RAG adds latency, retrieval failures, chunking artifacts, and complexity. Use in-context when the document fits and prompt caching eliminates the re-processing cost.
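
One way to internalize the tree is to encode it as a function. The sketch below is illustrative only: the threshold and branch labels come from the tree above, and the boolean inputs are simplifications of real architectural judgment calls.

def choose_context_strategy(doc_tokens: int,
                            reused_within_session: bool,
                            many_independent_sessions: bool) -> str:
    """Rough encoding of the decision tree above; thresholds are approximate."""
    if doc_tokens == 0:
        return "conversation only: sliding window or periodic summarization"
    if doc_tokens > 150_000 or many_independent_sessions:
        return "RAG: chunk, embed, retrieve relevant passages"
    if reused_within_session:
        return "pass in context + enable prompt caching"
    return "summarize first, then in-context"

For the 50-page document scenario later in this lesson, choose_context_strategy(20_000, True, False) lands on the in-context + caching branch, exactly the answer the trap above is testing.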


Prompt Caching — The Highest-Leverage Cost Tool

Prompt caching stores a stable prefix in Anthropic’s infrastructure. Subsequent requests that share the same prefix pay approximately 10% of the normal input token cost for the cached portion.

Rules You Must Know for the Exam

| Rule | Value |
| --- | --- |
| Minimum cacheable block (Sonnet/Opus) | 1,024 tokens |
| Minimum cacheable block (Haiku) | 2,048 tokens |
| Cache TTL | 5 minutes (refreshed on each cache hit) |
| Cost on cache hit | ~10% of normal input token cost |
| Latency on cache hit | ~15% of normal input token processing time |
| Cache placement | End of the stable prefix, before any dynamic content |
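
To see why the ~10% hit price matters, here is a back-of-the-envelope for a 50K-token prefix reused across 100 requests. The per-token price is an illustrative placeholder, not a current list price, and the small surcharge on the initial cache write is ignored.

PRICE_PER_MTOK = 3.00      # $/million input tokens -- illustrative placeholder
prefix_tokens = 50_000
requests = 100

uncached = requests * prefix_tokens / 1e6 * PRICE_PER_MTOK
cached = (prefix_tokens / 1e6 * PRICE_PER_MTOK                             # first request writes the cache
          + (requests - 1) * prefix_tokens / 1e6 * PRICE_PER_MTOK * 0.10)  # later hits at ~10%

print(f"Uncached: ${uncached:.2f}   Cached: ${cached:.2f}")   # roughly a 9x reduction on the prefix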

What to Cache

Cache anything that:

  1. Is large (> 1K tokens) AND
  2. Repeats across multiple requests

Good candidates: system prompts, reference documents, few-shot example blocks, knowledge bases loaded per session.

Do NOT put dynamic content (the user’s current question, per-request variables) inside a cached block.

Implementation

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..." # Your large system prompt (> 1,024 tokens)
REFERENCE_DOCUMENT = "..." # The document users query against (> 1,024 tokens)

def answer_question(user_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
                "cache_control": {"type": "ephemeral"},  # Cache this prefix
            }
        ],
        messages=[
            {
                "role": "user",
                "content": user_question,  # Dynamic — NOT cached
            }
        ],
    )

    # Check cache status in usage
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")

    return response.content[0].text

Reading Cache Metrics

After implementing caching, check the usage field to confirm it’s working:

  • cache_creation_input_tokens > 0 → cache was written (first request)
  • cache_read_input_tokens > 0 → cache was hit (subsequent requests) ← this is what saves money
  • Both 0 → caching is not activating: the block is below the minimum token threshold, or no cache_control marker was set (the helper below wraps this check)
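
A small helper can make the check explicit. This sketch reads the same usage fields the implementation above prints:

def cache_status(usage) -> str:
    """Classify a response's cache behavior from its usage block."""
    if getattr(usage, "cache_read_input_tokens", 0) > 0:
        return "hit"       # cached prefix reused -- the ~90% savings case
    if getattr(usage, "cache_creation_input_tokens", 0) > 0:
        return "written"   # first request in the window: cache created
    return "inactive"      # below the minimum block size, or cache_control missing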

When RAG Is Better Than In-Context

RAG (Retrieval-Augmented Generation) is the right choice when:

| Scenario | Why RAG Wins |
| --- | --- |
| Knowledge base > 150K tokens | Does not fit in context |
| Thousands of independent users, each querying different documents | Caching won’t help across independent sessions |
| Very high retrieval precision needed from a massive corpus | Semantic search on embeddings outperforms asking Claude to find in 200K tokens |
| Documents change frequently | Embedding pipeline handles updates; in-context would need full reload |
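
For intuition, here is a minimal retrieval step. The embed() below is a toy character-frequency stand-in, not a real embedding model (the Anthropic SDK does not provide one); chunking and indexing are omitted, and a real system would precompute the chunk vectors.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in embedding; swap in a real embedding model here."""
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0
    return v

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    chunk_vectors = np.stack([embed(c) for c in chunks])
    q = embed(query)
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(sims)[-k:][::-1]
    return [chunks[int(i)] for i in top]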

When RAG Is Worse Than In-Context

The exam tests this direction too:

| Scenario | Why In-Context Wins |
| --- | --- |
| 50-page document (~20K tokens), same document used for all queries in a session | Fits in context + caching is cheaper and has no retrieval failures |
| Cross-document reasoning required | Chunked retrieval loses cross-document context; full context preserves it |
| Low latency required | RAG adds embedding lookup + retrieval round-trip latency |
| Precise quotes needed | Chunking artifacts and retrieval misses degrade precision |

Conversation History Management

In a multi-turn application, do not pass the full conversation history forever. It grows without bound, will eventually exceed the context window, and inflates the cost of every request along the way.

Strategy 1 — Sliding Window

Keep the N most recent message pairs. Simple to implement. Risk: loses important context from early in the conversation.

def trim_history_sliding(messages: list[dict], max_pairs: int = 10) -> list[dict]:
    """Keep the last max_pairs user/assistant turns (the system prompt lives in the separate `system` parameter and is unaffected)."""
    trimmed = messages[-(max_pairs * 2):]
    # The Messages API requires the first message to have role "user";
    # drop a stray leading assistant turn left by an odd-length history.
    if trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed

Strategy 2 — Periodic Summarization

Every K turns, ask Claude to summarize the conversation so far. Replace the history with the summary. Preserves important context; costs one extra Claude call per K turns.

def summarize_history(messages: list[dict]) -> str:
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use Haiku — cheap summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation, preserving key facts, decisions, and open questions:\n\n{history_text}"
        }],
    )
    return response.content[0].text

def maybe_summarize(messages: list[dict], summarize_every: int = 20) -> list[dict]:
    if len(messages) >= summarize_every:
        summary = summarize_history(messages)
        return [{"role": "user", "content": f"[Conversation summary]\n{summary}"},
                {"role": "assistant", "content": "Understood. I have the context from our earlier conversation."}]
    return messages
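
To show where this fits, here is a sketch of a turn loop. It reuses the client from the caching example and assumes message content blocks are plain strings:

def chat_turn(messages: list[dict], user_input: str) -> str:
    """One turn of a managed conversation (sketch)."""
    messages.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    messages[:] = maybe_summarize(messages)   # compact in place once the threshold is hit
    return reply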

Which Strategy for the Exam

| Use Case | Recommended Strategy |
| --- | --- |
| Short sessions (< 20 turns) | Sliding window |
| Long-running sessions with important early context | Periodic summarization |
| Compliance/audit requirement to preserve full history | Store full history externally; pass a sliding window to Claude (sketched below) |
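
The compliance pattern in the last row combines both ideas: persist everything yourself, send only a window to the model. A sketch, with audit_log standing in for whatever durable store you actually use:

audit_log: list[dict] = []   # stand-in for a durable store (database, object storage, ...)

def prepare_request(full_history: list[dict]) -> list[dict]:
    """Persist the complete history for audit; send only a recent window to Claude."""
    audit_log.extend(full_history[len(audit_log):])    # append messages not yet persisted
    return trim_history_sliding(full_history, max_pairs=10)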

Token Counting

Before building a caching strategy, know your token sizes. Use the Anthropic token counting API:

response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "test"}],
)
print(f"Input tokens: {response.input_tokens}")

If your system prompt is only 800 tokens, caching will not activate on Sonnet (minimum: 1,024). You would need to expand it or combine it with a reference document.
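
That check is easy to automate. The guard below continues from the count_tokens snippet above, with MIN_CACHEABLE mirroring the rules table earlier in the lesson:

MIN_CACHEABLE = 1024   # Sonnet/Opus minimum; use 2048 for Haiku
if response.input_tokens < MIN_CACHEABLE:
    print(f"Prefix is {response.input_tokens} tokens, below the caching minimum; "
          "expand it or bundle the reference document into the cached block.")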


Key Facts for the Exam

  • 200K tokens ≈ 150,000 words ≈ 500 pages
  • Caching minimum: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku)
  • Cache TTL: 5 minutes — refreshed on each hit
  • Cache savings: ~90% on cached tokens
  • RAG is NOT better by default — in-context + caching often wins for single-document use cases
  • Sliding window = simple, loses early context; summarization = preserves context, costs extra call
  • cache_read_input_tokens in usage tells you if caching is working

Proceed to the Domain 3 Lab to implement caching and measure its cost impact.