AI Engineering · 7 min read
Instructions
Attempt each question before reading the answer. Target: 8/10 or better.
Q1. A legal tech platform processes the same 80-page contract template (~32K tokens) for 500 client queries per day. The system prompt is 8K tokens. What is the most cost-efficient architecture?
A) Use RAG — retrieve only the relevant contract sections per query
B) Summarize the contract once and pass the summary in context
C) Pass the full contract in context with prompt caching on the system prompt + contract
D) Use Claude Opus — it handles long documents more accurately
Answer and Explanation
Answer: C
32K tokens fits comfortably within the 200K context window, and the contract is reused across all 500 daily queries, which is an ideal caching scenario. The system prompt (8K) and contract (32K) together form a 40K-token stable prefix, well above the 1,024-token minimum. At 500 queries per day the gap between queries averages a few minutes, so the 5-minute TTL keeps the cache warm through business hours; after the first request, subsequent queries read the prefix at ~10% of normal input cost. RAG (A) adds unnecessary complexity and retrieval latency. Summarization (B) loses precision. Model selection (D) is irrelevant to cost efficiency here.
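A back-of-envelope calculation shows why C wins. The sketch below assumes Sonnet-class pricing of $3 per million input tokens, with cache writes billed at 1.25x the base rate and cache reads at 0.1x; verify the figures against current pricing before relying on them.

```python
# Daily cost of the 40K-token stable prefix (8K system prompt + 32K contract),
# excluding per-query dynamic tokens.
# Pricing assumptions to verify against current rates: $3/MTok base input,
# cache writes billed at 1.25x, cache reads at 0.1x.
BASE_PER_TOKEN = 3.00 / 1_000_000
PREFIX_TOKENS = 40_000
QUERIES_PER_DAY = 500

without_cache = QUERIES_PER_DAY * PREFIX_TOKENS * BASE_PER_TOKEN
with_cache = (PREFIX_TOKENS * BASE_PER_TOKEN * 1.25            # one cache write
              + (QUERIES_PER_DAY - 1) * PREFIX_TOKENS * BASE_PER_TOKEN * 0.10)

print(f"without caching: ${without_cache:,.2f}/day")  # $60.00
print(f"with caching:    ${with_cache:,.2f}/day")     # ~$6.14
```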
Q2. A developer adds cache_control: {"type": "ephemeral"} to a system prompt block that is 800 tokens. On the second request (within 5 minutes), the cache_read_input_tokens field in the response is 0. What is the most likely cause?
A) The cache TTL expired before the second request
B) The system prompt block is below the 1,024-token minimum for caching on Sonnet
C) Prompt caching requires enabling a feature flag in the API settings
D) The second request used a different model
Answer and Explanation
Answer: B
The minimum cacheable block for Sonnet and Opus is 1,024 tokens. An 800-token system prompt does not meet the threshold, so no cache is written and no cache read occurs. The fix is to expand the system prompt or combine it with a reference document to exceed 1,024 tokens. The 5-minute TTL (A) is not the issue, since the second request arrived within 5 minutes. There is no feature flag to enable (C). A model switch (D) would also prevent a hit, because caches are model-specific, but the 800-token block means no cache was ever written in the first place.
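The usage block on each response is how you verify this. A minimal sketch using the Anthropic Python SDK (the model name and prompt text are illustrative placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = "You are a support assistant for Acme Corp. ..."  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": system_prompt,  # must reach 1,024 tokens for the cache to engage
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

usage = response.usage
print(f"written to cache: {usage.cache_creation_input_tokens}")
print(f"read from cache:  {usage.cache_read_input_tokens}")
# Both fields staying at 0 across requests means the marked block is below
# the minimum and is being billed as ordinary input tokens.
```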
Q3. What is the cache TTL for prompt caching, and what resets it?
A) 1 hour; reset by any API call to the account
B) 5 minutes; reset by each cache hit
C) 24 hours; reset at midnight UTC
D) 5 minutes; it does not reset — you must re-create the cache
Answer and Explanation
Answer: B
Cache TTL is 5 minutes. Each cache hit resets the 5-minute timer, so a cache that is actively used stays warm indefinitely. A cache that receives no hits for 5 minutes expires and must be recreated on the next request (which becomes a cache write, not a cache read).
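Because each hit resets the timer, bursty workloads sometimes keep a cache warm with a lightweight periodic request. A sketch of that pattern, assuming a minimal ping request is enough to register a hit (model name illustrative):

```python
import time

def keep_cache_warm(client, cached_system, interval_s=240):
    """Ping the cached prefix before the 5-minute TTL lapses.

    Assumes `cached_system` is the same system-block list (with cache_control)
    used by real traffic, so the ping reads the cache and resets the timer.
    """
    while True:
        client.messages.create(
            model="claude-sonnet-4-20250514",  # illustrative model name
            max_tokens=1,
            system=cached_system,
            messages=[{"role": "user", "content": "ping"}],
        )
        time.sleep(interval_s)
```

Whether the ping cost is worth it depends on the prefix size; for small prefixes it is usually cheaper to just eat the occasional cache rewrite.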
Q4. A customer asks about a 10-page internal policy document during a support session. The session involves 8 follow-up questions about the same document. What is the optimal strategy?
A) Use RAG — retrieve the relevant policy section for each of the 8 questions
B) Pass the full document in context on the first question; enable prompt caching
C) Summarize the document and pass the summary for all 8 questions
D) Pass the full document on every request — do not use caching
Answer and Explanation
Answer: B
A 10-page document (~4K tokens) fits easily in context and clears the 1,024-token caching minimum. Reusing it across one session is the ideal caching use case: the first request writes the cache, and each of the 8 follow-ups reads it at ~10% input token cost (assuming they arrive within the 5-minute TTL, as they would in a live session). RAG (A) adds latency and retrieval complexity for a document this size. Summarization (C) reduces accuracy. Passing without caching (D) pays full input cost on every request.
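A sketch of the session loop, assuming the Anthropic Python SDK (file name, model name, and questions are placeholders):

```python
import anthropic

client = anthropic.Anthropic()
policy_text = open("policy.txt").read()  # ~4K tokens, above the 1,024-token minimum

session_questions = [
    "Summarize the remote-work policy.",
    "How many days per week can I work from home?",
    # ...the remaining follow-ups arrive as the session progresses
]

history = []
for question in session_questions:
    history.append({"role": "user", "content": question})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": policy_text,
            "cache_control": {"type": "ephemeral"},  # stable prefix, written once per session
        }],
        messages=history,  # growing history stays after the breakpoint
    )
    history.append({"role": "assistant", "content": response.content[0].text})
```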
Q5. When is RAG the better choice over passing a full document in context?
A) When the document is less than 50 pages
B) When the knowledge base is too large to fit in the context window
C) When using Claude Haiku instead of Sonnet
D) When the user is asking a specific factual question
Answer and Explanation
Answer: B
RAG is the correct choice when the knowledge base exceeds the context window (200K tokens): you cannot pass it all in-context regardless of caching. RAG is also appropriate for large corpora accessed by many independent users, where caching provides no benefit. Document size alone (A) is not the deciding factor; what matters is whether the document fits in context and whether caching makes the in-context approach economical. Model tier (C) does not determine the RAG vs. in-context decision. Factual questions (D) can be answered well in both architectures.
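The decision tree reduces to a few lines. This is a sketch of the reasoning above, not an official rule; the function name and thresholds are assumptions:

```python
def choose_architecture(corpus_tokens: int, reuse_count: int,
                        context_limit: int = 200_000) -> str:
    """Rough heuristic for the RAG vs. in-context decision."""
    if corpus_tokens > context_limit:
        return "RAG"  # cannot fit in context, regardless of caching
    if reuse_count > 1:
        return "in-context + prompt caching"  # reuse amortizes the prefix cost
    return "in-context"  # fits and is used once; caching buys nothing
```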
Q6. A conversational agent has been running for 50 turns. The context is approaching its limit. The session contains important decisions made in the first 5 turns. What is the correct history management strategy?
A) Sliding window — keep the last 20 turns
B) Truncate all history — start fresh
C) Periodic summarization — condense old turns into a summary that preserves key decisions
D) Increase max_tokens to allow more history
Answer and Explanation
Answer: C
When important context exists early in a long conversation, periodic summarization preserves it while freeing token budget. Sliding window (A) would drop the critical first-5-turn decisions. Truncating (B) loses all context. max_tokens (D) controls output length, not context input — it is irrelevant to history management.
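A minimal summarization sketch (thresholds, prompt wording, and model name are assumptions; production code would also cut at user/assistant pair boundaries):

```python
SUMMARY_TRIGGER = 40  # compact once history exceeds this many messages
KEEP_RECENT = 10      # keep the most recent messages verbatim

def compact_history(client, history, model="claude-sonnet-4-20250514"):
    """Condense old turns into one summary message that preserves key decisions."""
    if len(history) <= SUMMARY_TRIGGER:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, preserving every decision, "
                       "constraint, and open question:\n\n" + transcript,
        }],
    ).content[0].text
    return [{"role": "user", "content": f"[Summary of earlier turns]\n{summary}"}] + recent
```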
Q7. Where should the cache breakpoint be placed in a prompt?
A) At the beginning of the system prompt
B) At the end of the dynamic user message
C) At the end of the stable prefix — after all static content, before any dynamic content
D) Immediately before the most recent user message
Answer and Explanation
Answer: C
The cache captures everything from the start of the prompt up to (and including) the cache breakpoint. Placing it at the end of the stable prefix (system prompt + reference docs + few-shot examples) means all of that static content is cached. Dynamic content (the user’s current question) must come after the breakpoint so it is not cached — it changes on every request.
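In code, the ordering looks like this (a sketch; the variable contents are placeholders):

```python
# Stable prefix: identical on every request, so it can be cached.
instructions = "You are a contracts analyst..."
reference_docs = "<full contract text>"
few_shot_examples = "<worked examples>"
user_question = "Which clauses govern early termination?"  # dynamic

system = [
    {"type": "text", "text": instructions},
    {"type": "text", "text": reference_docs},
    {"type": "text", "text": few_shot_examples,
     "cache_control": {"type": "ephemeral"}},  # breakpoint on the LAST stable block
]
messages = [
    {"role": "user", "content": user_question},  # after the breakpoint, never cached
]
```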
Q8. A developer implements sliding window history management with max_pairs=10. After 25 conversation turns, how many tokens of history are sent to the API on turn 26?
A) All 25 turns of history
B) Exactly 10 user messages and 10 assistant messages (20 messages total)
C) The last 5 turns only
D) It depends on the token length of each message
Answer and Explanation
Answer: B
The sliding window with max_pairs=10 keeps the last 10 user/assistant pairs, i.e. 20 messages. It trims by message count, not token count (in production you would add a token budget guard as well). Strictly, the token total depends on message length, which gives D some merit, but the intended answer is the fixed message count of 20.
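The trim itself is a one-line slice; a sketch:

```python
def trim_history(history, max_pairs=10):
    """Keep the last max_pairs user/assistant pairs (trims by message count)."""
    return history[-2 * max_pairs:]

# After 25 turns the full history holds 50 messages;
# trim_history(...) sends only the most recent 20 to the API.
```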
Q9. A team is debating whether to use RAG or in-context for a task that requires comparing and finding contradictions across three different documents (each ~15K tokens, total ~45K tokens). Which architecture is correct?
A) RAG — retrieve the most relevant chunks from each document
B) In-context — pass all three documents in full, enable prompt caching
C) Summarize each document first, then compare the summaries
D) Use Opus — it can handle contradictions better than Sonnet
Answer and Explanation
Answer: B
Cross-document contradiction analysis requires seeing all three documents simultaneously. RAG (A) would chunk each document independently, destroying the cross-document context needed to find contradictions. 45K total tokens fits comfortably within 200K. Prompt caching is appropriate if this same document set is queried multiple times. Summarization (C) compresses away the very detail needed to find contradictions. Model selection (D) is independent of the architecture choice here.
Q10. Which of the following is a valid use of the cache_creation_input_tokens field in the API response?
A) It tells you how many tokens were read from cache on this request
B) It tells you how many tokens were written to cache on this request
C) It tells you the total cost of the request
D) It tells you whether the model supports caching
Answer and Explanation
Answer: B
cache_creation_input_tokens is the count of tokens written to cache on the current request (cache miss — first time this prefix is seen). cache_read_input_tokens is what you check to confirm a cache hit. Neither field gives you the direct cost (C) — you calculate cost from token counts and pricing. D is not a valid use of this field.
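Since these fields report token counts rather than dollars, cost is derived from them. A sketch, with Sonnet-class prices as assumptions to verify against current rates:

```python
PRICE_PER_TOKEN = {
    "input": 3.00 / 1e6,        # base input tokens
    "cache_write": 3.75 / 1e6,  # 1.25x base
    "cache_read": 0.30 / 1e6,   # 0.1x base
    "output": 15.00 / 1e6,
}

def request_cost(usage) -> float:
    """Dollar cost of one request, computed from its usage block."""
    return ((usage.input_tokens or 0) * PRICE_PER_TOKEN["input"]
            + (usage.cache_creation_input_tokens or 0) * PRICE_PER_TOKEN["cache_write"]
            + (usage.cache_read_input_tokens or 0) * PRICE_PER_TOKEN["cache_read"]
            + (usage.output_tokens or 0) * PRICE_PER_TOKEN["output"])
```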
Score Interpretation
| Score | Readiness |
|---|---|
| 9–10 / 10 | Domain 3 ready — move to Domain 4 |
| 7–8 / 10 | Review caching mechanics and the RAG vs. in-context decision tree |
| < 7 / 10 | Complete the Domain 3 lab — measure real cache hit rates before retesting |