Domain 3 — Context and Caching Lab

Hands-on lab: implement prompt caching on a real document Q&A system and build a basic RAG pipeline. Measure cost impact before and after caching.

🚀 advanced
⏱️ 60 minutes
👤 SuperML Team


📋 Prerequisites

  • Completed Domain 3 — Context, Memory, and Caching lesson
  • Active Anthropic API access
  • Python with anthropic SDK installed

🎯 What You'll Learn

  • Implement prompt caching and verify it is activating via usage metrics
  • Measure the cost difference between cached and uncached requests
  • Build a minimal RAG pipeline and understand when it outperforms in-context
  • Implement conversation history management with both sliding window and summarization

Lab Overview

This lab has three parts. Complete them in order — each builds on the previous.

  1. Part A — Implement prompt caching and verify it works
  2. Part B — Build a minimal RAG pipeline and compare to in-context
  3. Part C — Implement conversation history management

Part A — Prompt Caching Implementation and Verification

Goal: Implement caching on a document Q&A system and confirm cache hits via usage metrics.

Step 1 — Create a test document

import anthropic

client = anthropic.Anthropic()

# Simulate a large reference document (needs at least 1,024 tokens to be cacheable on Sonnet)
# In production this would be loaded from a file
REFERENCE_DOCUMENT = """
[Product Manual — DataFlow Pipeline Platform v3.2]

Chapter 1: Getting Started
DataFlow is a cloud-native data pipeline platform that connects to over 200 data sources...
[Continue writing until you have a document of at least 2,000 words / ~2,500 tokens]
[You can use any real document you have access to — internal docs, a README, a PDF, etc.]
""" * 10  # Repeat to ensure we exceed the 1,024 token threshold

SYSTEM_PROMPT = """You are a technical support assistant for the DataFlow platform.
Answer questions using only the reference documentation provided.
If the answer is not in the documentation, say so clearly.
Keep responses concise and cite the relevant section when possible."""
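
Optional: before caching anything, you can sanity-check that the combined prompt actually clears the 1,024-token minimum. A quick sketch, assuming your version of the anthropic SDK exposes the token-counting endpoint:

# Optional sanity check: confirm the prompt clears the caching minimum.
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
    messages=[{"role": "user", "content": "placeholder"}],
)
print(f"Prompt tokens: {count.input_tokens} (needs at least 1,024 to cache on Sonnet)")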

Step 2 — First request (cache write)

def query_with_caching(user_question: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_question}],
    )

    usage = response.usage
    return {
        "answer": response.content[0].text,
        "input_tokens": usage.input_tokens,
        "cache_creation_tokens": getattr(usage, "cache_creation_input_tokens", 0),
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", 0),
    }

# First call — should write cache
result1 = query_with_caching("How do I connect to Snowflake?")
print("First call (cache write):")
print(f"  Cache creation tokens: {result1['cache_creation_tokens']}")
print(f"  Cache read tokens: {result1['cache_read_tokens']}")

Expected: cache_creation_tokens > 0, cache_read_tokens == 0

Step 3 — Second request (cache hit, within 5 minutes)

import time
time.sleep(2)  # Brief pause but stay within 5-minute TTL

result2 = query_with_caching("What databases does DataFlow support?")
print("\nSecond call (cache read):")
print(f"  Cache creation tokens: {result2['cache_creation_tokens']}")
print(f"  Cache read tokens: {result2['cache_read_tokens']}")

Expected: cache_read_tokens > 0, cache_creation_tokens == 0
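
If you want the lab to fail loudly when caching is not working, a small sanity check (not part of the original steps) can compare the two results:

# Sanity check: the second call should read the cache written by the first.
assert result1["cache_creation_tokens"] > 0, "First call did not write a cache entry"
assert result2["cache_read_tokens"] > 0, "Second call did not hit the cache"
# The read size should roughly match what the first call wrote.
print(f"Cached prefix: wrote {result1['cache_creation_tokens']:,} tokens, "
      f"read {result2['cache_read_tokens']:,} tokens")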

Step 4 — Compute cost savings

SONNET_INPUT_COST_PER_TOKEN = 3.0 / 1_000_000      # $3 per million
SONNET_CACHE_READ_COST_PER_TOKEN = 0.3 / 1_000_000  # $0.30 per million (90% cheaper)

doc_tokens = result1["cache_creation_tokens"]

cost_uncached = doc_tokens * SONNET_INPUT_COST_PER_TOKEN
cost_cached = doc_tokens * SONNET_CACHE_READ_COST_PER_TOKEN
savings_pct = (1 - cost_cached / cost_uncached) * 100

print(f"\nDocument size: {doc_tokens:,} tokens")
print(f"Cost per query uncached: ${cost_uncached:.4f}")
print(f"Cost per query cached:   ${cost_cached:.4f}")
print(f"Savings per query:       {savings_pct:.0f}%")
print(f"Daily savings (500 queries): ${(cost_uncached - cost_cached) * 500:.2f}")

Lab checkpoint: You should see ~90% cost reduction on the document portion of each query.
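
One refinement worth noting: cache writes are billed at a premium over regular input tokens (25% extra at current published Sonnet pricing), so the very first request costs slightly more than an uncached one. A rough break-even sketch, reusing the constants above and assuming the cache stays warm between requests:

# Rough break-even sketch: cache writes cost 1.25x base input on Sonnet.
SONNET_CACHE_WRITE_COST_PER_TOKEN = 3.75 / 1_000_000  # $3.75 per million

def total_cost(num_queries: int, tokens: int) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) for the document portion."""
    uncached = num_queries * tokens * SONNET_INPUT_COST_PER_TOKEN
    cached = tokens * SONNET_CACHE_WRITE_COST_PER_TOKEN \
        + (num_queries - 1) * tokens * SONNET_CACHE_READ_COST_PER_TOKEN
    return uncached, cached

for n in (1, 2, 10, 500):
    u, c = total_cost(n, doc_tokens)
    print(f"{n:>4} queries: uncached ${u:.4f} vs cached ${c:.4f}")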


Part B — Minimal RAG Pipeline

Goal: Build a basic RAG pipeline and understand concretely when it beats in-context and when it doesn’t.

Step 1 — Chunking and embedding

import json

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[dict]:
    """
    In production: use an embedding model (OpenAI, Cohere, Voyage AI, etc.)
    For this lab: use a simple keyword-overlap retrieval as a stand-in.
    """
    return [{"text": chunk, "index": i} for i, chunk in enumerate(chunks)]

chunks = chunk_document(REFERENCE_DOCUMENT)
embedded = embed_chunks(chunks)
print(f"Document split into {len(chunks)} chunks")

Step 2 — Retrieval (keyword-based for the lab)

def retrieve(query: str, chunks: list[dict], top_k: int = 3) -> list[str]:
    """Simple keyword overlap retrieval for lab purposes."""
    query_words = set(query.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk["text"].lower().split())
        overlap = len(query_words & chunk_words)
        scored.append((overlap, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank by overlap count only
    return [text for _, text in scored[:top_k]]

def query_with_rag(user_question: str) -> str:
    retrieved = retrieve(user_question, embedded)
    context = "\n\n".join(f"<passage>\n{p}\n</passage>" for p in retrieved)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Answer questions using only the passages provided. If the answer isn't there, say so.",
        messages=[{
            "role": "user",
            "content": f"{context}\n\nQuestion: {user_question}"
        }],
    )
    return response.content[0].text
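
Likewise, a production retrieve would rank chunks by embedding similarity rather than keyword overlap. A minimal sketch, assuming the vector-bearing chunks and openai_client from the previous sketch plus numpy:

# Production-style retrieval: cosine similarity between the query embedding
# and each chunk embedding (assumes chunks produced by embed_chunks_with_vectors).
import numpy as np

def retrieve_by_similarity(query: str, chunks: list[dict], top_k: int = 3) -> list[str]:
    query_emb = np.array(
        openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=[query],
        ).data[0].embedding
    )
    scored = []
    for chunk in chunks:
        emb = np.array(chunk["embedding"])
        similarity = float(np.dot(query_emb, emb) /
                           (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        scored.append((similarity, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]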

Step 3 — Compare in-context vs. RAG

test_question = "What are the system requirements for DataFlow?"

print("=== IN-CONTEXT ANSWER ===")
in_ctx_result = query_with_caching(test_question)
print(in_ctx_result["answer"])

print("\n=== RAG ANSWER ===")
rag_answer = query_with_rag(test_question)
print(rag_answer)

Observe: For questions that require synthesizing across multiple sections, in-context typically produces more complete answers. For a large corpus where only one section is relevant, RAG reduces cost by sending only the relevant chunk.
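
To make the cost difference concrete, you can compare roughly how many input tokens each approach sends for this one question. A rough sketch, using the common ~0.75-words-per-token heuristic and the pricing constants from Part A:

# Rough token comparison for this single question (heuristic: 1 token ≈ 0.75 words).
def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

rag_context = "\n\n".join(retrieve(test_question, embedded))
in_context_tokens = approx_tokens(SYSTEM_PROMPT + REFERENCE_DOCUMENT)
rag_tokens = approx_tokens(rag_context)

print(f"In-context sends ~{in_context_tokens:,} document tokens per query")
print(f"RAG sends        ~{rag_tokens:,} document tokens per query")
print(f"RAG, uncached:             ${rag_tokens * SONNET_INPUT_COST_PER_TOKEN:.4f} per query")
print(f"In-context with cache hit: ${in_context_tokens * SONNET_CACHE_READ_COST_PER_TOKEN:.4f} per query")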


Part C — Conversation History Management

def chat_with_sliding_window(
    conversation: list[dict],
    new_message: str,
    max_pairs: int = 5,
) -> str:
    """Add new message, trim to last max_pairs turns, get response."""
    conversation.append({"role": "user", "content": new_message})
    trimmed = conversation[-(max_pairs * 2):]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=trimmed,
    )
    reply = response.content[0].text
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Test: after 12 turns, only the last 5 pairs are sent
history = []
for i in range(12):
    user_msg = f"Turn {i+1}: Tell me something interesting about data engineering."
    reply = chat_with_sliding_window(history, user_msg)
    print(f"Turn {i+1}: {reply[:80]}...")

print(f"\nTotal turns in history object: {len(history)//2}")
print(f"Max turns sent to API: 5")

Lab checkpoint: Confirm that after 12 turns the history object grows but only the last 5 pairs are sent to the API.
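
The learning objectives also call for summarization-based history management, which the sliding window alone does not cover. Here is a minimal sketch that uses Claude itself to compress older turns; the helper names are illustrative, not part of the original lab:

def summarize_history(old_messages: list[dict]) -> str:
    """Use the model to compress older turns into a short summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Summarize the key facts and decisions from this conversation "
                       "in a few bullet points:\n\n" + transcript,
        }],
    )
    return response.content[0].text

def chat_with_summarization(
    conversation: list[dict],
    new_message: str,
    keep_recent_pairs: int = 3,
) -> str:
    """Summarize everything older than the last keep_recent_pairs turns, then respond."""
    conversation.append({"role": "user", "content": new_message})
    recent = conversation[-(keep_recent_pairs * 2 - 1):]  # starts with a user message
    older = conversation[:-(keep_recent_pairs * 2 - 1)]

    messages = list(recent)
    if older:
        summary = summarize_history(older)
        # Prepend the summary to the first user message so roles still alternate
        messages[0] = {
            "role": "user",
            "content": f"[Conversation summary so far]\n{summary}\n\n{messages[0]['content']}",
        }

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=messages,
    )
    reply = response.content[0].text
    conversation.append({"role": "assistant", "content": reply})
    return reply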


Lab Completion Checklist

  • Cache hit confirmed via cache_read_input_tokens > 0 on the second request
  • Cost savings computed — approximately 90% on the document portion
  • RAG pipeline retrieves relevant passages and answers the test question
  • You can articulate one scenario where RAG would outperform in-context for this document
  • Sliding window limits API input to max_pairs * 2 messages regardless of history length

Once complete, proceed to Domain 3 Practice Questions.
