Domain 3 — Context and Caching Lab

Hands-on lab: implement prompt caching on a real document Q&A system and build a basic RAG pipeline. Measure cost impact before and after caching.

🚀 advanced
⏱️ 60 minutes
👤 SuperML Team


📋 Prerequisites

  • Completed Domain 3 — Context, Memory, and Caching lesson
  • Active Anthropic API access
  • Python with anthropic SDK installed

🎯 What You'll Learn

  • Implement prompt caching and verify it is activating via usage metrics
  • Measure the cost difference between cached and uncached requests
  • Build a minimal RAG pipeline and understand when it outperforms in-context
  • Implement conversation history management with both sliding window and summarization

Lab Overview

This lab has three parts. Complete them in order — each builds on the previous.

  1. Part A — Implement prompt caching and verify it works
  2. Part B — Build a minimal RAG pipeline and compare to in-context
  3. Part C — Implement conversation history management

Part A — Prompt Caching Implementation and Verification

Goal: Implement caching on a document Q&A system and confirm cache hits via usage metrics.

Step 1 — Create a test document

import anthropic

client = anthropic.Anthropic()

# Simulate a large reference document (needs at least 1,024 tokens to be cacheable on Sonnet)
# In production this would be loaded from a file
REFERENCE_DOCUMENT = """
[Product Manual — DataFlow Pipeline Platform v3.2]

Chapter 1: Getting Started
DataFlow is a cloud-native data pipeline platform that connects to over 200 data sources...
[Continue writing until you have a document of at least 2,000 words / ~2,500 tokens]
[You can use any real document you have access to — internal docs, a README, a PDF, etc.]
""" * 10  # Repeat to ensure we exceed the 1,024 token threshold

SYSTEM_PROMPT = """You are a technical support assistant for the DataFlow platform.
Answer questions using only the reference documentation provided.
If the answer is not in the documentation, say so clearly.
Keep responses concise and cite the relevant section when possible."""
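
Optional: before caching anything, you can sanity-check that the combined prompt actually clears the 1,024-token minimum. A quick sketch, assuming your version of the anthropic SDK exposes the token-counting endpoint:

# Optional sanity check: confirm the prompt clears the caching minimum.
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
    messages=[{"role": "user", "content": "placeholder"}],
)
print(f"Prompt tokens: {count.input_tokens} (needs at least 1,024 to cache on Sonnet)")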

Step 2 — First request (cache write)

def query_with_caching(user_question: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + REFERENCE_DOCUMENT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_question}],
    )

    usage = response.usage
    return {
        "answer": response.content[0].text,
        "input_tokens": usage.input_tokens,
        "cache_creation_tokens": getattr(usage, "cache_creation_input_tokens", 0),
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", 0),
    }

# First call — should write cache
result1 = query_with_caching("How do I connect to Snowflake?")
print("First call (cache write):")
print(f"  Cache creation tokens: {result1['cache_creation_tokens']}")
print(f"  Cache read tokens: {result1['cache_read_tokens']}")

Expected: cache_creation_tokens > 0, cache_read_tokens == 0

Step 3 — Second request (cache hit, within 5 minutes)

import time
time.sleep(2)  # Brief pause but stay within 5-minute TTL

result2 = query_with_caching("What databases does DataFlow support?")
print("\nSecond call (cache read):")
print(f"  Cache creation tokens: {result2['cache_creation_tokens']}")
print(f"  Cache read tokens: {result2['cache_read_tokens']}")

Expected: cache_read_tokens > 0, cache_creation_tokens == 0
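
If you want the lab to fail loudly when caching is not working, a small sanity check (not part of the original steps) can compare the two results:

# Sanity check: the second call should read the cache written by the first.
assert result1["cache_creation_tokens"] > 0, "First call did not write a cache entry"
assert result2["cache_read_tokens"] > 0, "Second call did not hit the cache"
# The read size should roughly match what the first call wrote.
print(f"Cached prefix: wrote {result1['cache_creation_tokens']:,} tokens, "
      f"read {result2['cache_read_tokens']:,} tokens")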

Step 4 — Compute cost savings

SONNET_INPUT_COST_PER_TOKEN = 3.0 / 1_000_000      # $3 per million
SONNET_CACHE_READ_COST_PER_TOKEN = 0.3 / 1_000_000  # $0.30 per million (90% cheaper)

doc_tokens = result1["cache_creation_tokens"]

cost_uncached = doc_tokens * SONNET_INPUT_COST_PER_TOKEN
cost_cached = doc_tokens * SONNET_CACHE_READ_COST_PER_TOKEN
savings_pct = (1 - cost_cached / cost_uncached) * 100

print(f"\nDocument size: {doc_tokens:,} tokens")
print(f"Cost per query uncached: ${cost_uncached:.4f}")
print(f"Cost per query cached:   ${cost_cached:.4f}")
print(f"Savings per query:       {savings_pct:.0f}%")
print(f"Daily savings (500 queries): ${(cost_uncached - cost_cached) * 500:.2f}")

Lab checkpoint: You should see ~90% cost reduction on the document portion of each query.
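
One refinement worth noting: cache writes are billed at a premium over regular input tokens (25% extra at current published Sonnet pricing), so the very first request costs slightly more than an uncached one. A rough break-even sketch, reusing the constants above and assuming the cache stays warm between requests:

# Rough break-even sketch: cache writes cost 1.25x base input on Sonnet.
SONNET_CACHE_WRITE_COST_PER_TOKEN = 3.75 / 1_000_000  # $3.75 per million

def total_cost(num_queries: int, tokens: int) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) for the document portion."""
    uncached = num_queries * tokens * SONNET_INPUT_COST_PER_TOKEN
    cached = tokens * SONNET_CACHE_WRITE_COST_PER_TOKEN \
        + (num_queries - 1) * tokens * SONNET_CACHE_READ_COST_PER_TOKEN
    return uncached, cached

for n in (1, 2, 10, 500):
    u, c = total_cost(n, doc_tokens)
    print(f"{n:>4} queries: uncached ${u:.4f} vs cached ${c:.4f}")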


Part B — Minimal RAG Pipeline

Goal: Build a basic RAG pipeline and understand concretely when it beats in-context and when it doesn’t.

Step 1 — Chunking and embedding

import json

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[dict]:
    """
    In production: use an embedding model (OpenAI, Cohere, Voyage AI, etc.)
    For this lab: use a simple keyword-overlap retrieval as a stand-in.
    """
    return [{"text": chunk, "index": i} for i, chunk in enumerate(chunks)]

chunks = chunk_document(REFERENCE_DOCUMENT)
embedded = embed_chunks(chunks)
print(f"Document split into {len(chunks)} chunks")

Step 2 — Retrieval (keyword-based for the lab)

def retrieve(query: str, chunks: list[dict], top_k: int = 3) -> list[str]:
    """Simple keyword overlap retrieval for lab purposes."""
    query_words = set(query.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk["text"].lower().split())
        overlap = len(query_words & chunk_words)
        scored.append((overlap, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank by overlap count only
    return [text for _, text in scored[:top_k]]

def query_with_rag(user_question: str) -> str:
    retrieved = retrieve(user_question, embedded)
    context = "\n\n".join(f"<passage>\n{p}\n</passage>" for p in retrieved)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system="Answer questions using only the passages provided. If the answer isn't there, say so.",
        messages=[{
            "role": "user",
            "content": f"{context}\n\nQuestion: {user_question}"
        }],
    )
    return response.content[0].text
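
Likewise, a production retrieve would rank chunks by embedding similarity rather than keyword overlap. A minimal sketch, assuming the vector-bearing chunks and openai_client from the previous sketch plus numpy:

# Production-style retrieval: cosine similarity between the query embedding
# and each chunk embedding (assumes chunks produced by embed_chunks_with_vectors).
import numpy as np

def retrieve_by_similarity(query: str, chunks: list[dict], top_k: int = 3) -> list[str]:
    query_emb = np.array(
        openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=[query],
        ).data[0].embedding
    )
    scored = []
    for chunk in chunks:
        emb = np.array(chunk["embedding"])
        similarity = float(np.dot(query_emb, emb) /
                           (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        scored.append((similarity, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]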

Step 3 — Compare in-context vs. RAG

test_question = "What are the system requirements for DataFlow?"

print("=== IN-CONTEXT ANSWER ===")
in_ctx_result = query_with_caching(test_question)
print(in_ctx_result["answer"])

print("\n=== RAG ANSWER ===")
rag_answer = query_with_rag(test_question)
print(rag_answer)

Observe: For questions that require synthesizing across multiple sections, in-context typically produces more complete answers. For a large corpus where only one section is relevant, RAG reduces cost by sending only the relevant chunk.
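
To make the cost difference concrete, you can compare roughly how many input tokens each approach sends for this one question. A rough sketch, using the common ~0.75-words-per-token heuristic and the pricing constants from Part A:

# Rough token comparison for this single question (heuristic: 1 token ≈ 0.75 words).
def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

rag_context = "\n\n".join(retrieve(test_question, embedded))
in_context_tokens = approx_tokens(SYSTEM_PROMPT + REFERENCE_DOCUMENT)
rag_tokens = approx_tokens(rag_context)

print(f"In-context sends ~{in_context_tokens:,} document tokens per query")
print(f"RAG sends        ~{rag_tokens:,} document tokens per query")
print(f"RAG, uncached:             ${rag_tokens * SONNET_INPUT_COST_PER_TOKEN:.4f} per query")
print(f"In-context with cache hit: ${in_context_tokens * SONNET_CACHE_READ_COST_PER_TOKEN:.4f} per query")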


Part C — Conversation History Management

def chat_with_sliding_window(
    conversation: list[dict],
    new_message: str,
    max_pairs: int = 5,
) -> str:
    """Add new message, trim to last max_pairs turns, get response."""
    conversation.append({"role": "user", "content": new_message})
    trimmed = conversation[-(max_pairs * 2):]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=trimmed,
    )
    reply = response.content[0].text
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Test: after 12 turns, only the last 5 pairs are sent
history = []
for i in range(12):
    user_msg = f"Turn {i+1}: Tell me something interesting about data engineering."
    reply = chat_with_sliding_window(history, user_msg)
    print(f"Turn {i+1}: {reply[:80]}...")

print(f"\nTotal turns in history object: {len(history)//2}")
print(f"Max turns sent to API: 5")

Lab checkpoint: Confirm that after 12 turns the history object grows but only the last 5 pairs are sent to the API.
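
The learning objectives also call for summarization-based history management, which the sliding window alone does not cover. Here is a minimal sketch that uses Claude itself to compress older turns; the helper names are illustrative, not part of the original lab:

def summarize_history(old_messages: list[dict]) -> str:
    """Use the model to compress older turns into a short summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Summarize the key facts and decisions from this conversation "
                       "in a few bullet points:\n\n" + transcript,
        }],
    )
    return response.content[0].text

def chat_with_summarization(
    conversation: list[dict],
    new_message: str,
    keep_recent_pairs: int = 3,
) -> str:
    """Summarize everything older than the last keep_recent_pairs turns, then respond."""
    conversation.append({"role": "user", "content": new_message})
    recent = conversation[-(keep_recent_pairs * 2 - 1):]  # starts with a user message
    older = conversation[:-(keep_recent_pairs * 2 - 1)]

    messages = list(recent)
    if older:
        summary = summarize_history(older)
        # Prepend the summary to the first user message so roles still alternate
        messages[0] = {
            "role": "user",
            "content": f"[Conversation summary so far]\n{summary}\n\n{messages[0]['content']}",
        }

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=messages,
    )
    reply = response.content[0].text
    conversation.append({"role": "assistant", "content": reply})
    return reply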


Lab Completion Checklist

  • Cache hit confirmed via cache_read_input_tokens > 0 on the second request
  • Cost savings computed — approximately 90% on the document portion
  • RAG pipeline retrieves relevant passages and answers the test question
  • You can articulate one scenario where RAG would outperform in-context for this document
  • Sliding window limits API input to max_pairs * 2 messages regardless of history length

Once complete, proceed to Domain 3 Practice Questions.
