Prompt Caching

Prompt caching is a provider-side optimization that reuses computation from shared prompt prefixes, reducing latency and inference cost for subsequent requests that start with the same token sequence. Unlike semantic caching, which returns stored responses for semantically similar queries without invoking the model, prompt caching still runs inference but skips redundant processing of the shared prefix.

Details

During inference, the model first processes all input tokens (the prefill phase), building up internal key-value states (the KV-cache) at each layer. Prompt caching stores this KV-cache keyed by the exact token prefix. When a later request starts with the same prefix, the inference provider loads the cached state instead of recomputing it, yielding a faster time to first token and lower cost. Cache hits typically require an exact token-level prefix match (even a small change early in the prompt invalidates the cache), though some providers apply more flexible matching strategies.
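
The mechanism can be pictured as a map from exact token prefixes to saved prefill state. The toy sketch below uses a token list as a stand-in for the per-layer key/value tensors a real provider would store; all names are illustrative, not any provider's API.

```python
# Toy model of prefix-keyed KV-cache reuse. The "cached state" here is just
# a token list standing in for real per-layer key/value tensors.
from typing import Dict, List, Tuple

kv_store: Dict[Tuple[int, ...], List[int]] = {}


def prefill(tokens: List[int], cacheable_len: int) -> int:
    """Prefill `tokens`, caching the state for the first `cacheable_len`
    tokens, and return how many tokens actually had to be recomputed."""
    # Longest cached prefix that exactly matches the start of this request.
    hit = max(
        (len(p) for p in kv_store if p == tuple(tokens[: len(p)])),
        default=0,
    )
    # Save the designated prefix so later requests can reuse it.
    kv_store[tuple(tokens[:cacheable_len])] = tokens[:cacheable_len]
    return len(tokens) - hit


if __name__ == "__main__":
    system = [1, 2, 3, 4, 5]                        # stable prefix (system prompt, tools)
    print(prefill(system + [10], cacheable_len=5))  # 6 -> cold cache, full prefill
    print(prefill(system + [11], cacheable_len=5))  # 1 -> only the new suffix is computed
    print(prefill([9] + system, cacheable_len=5))   # 6 -> early change invalidates the prefix
```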

This prefix-matching behavior has practical implications for context engineering: placing stable content (system prompt, tool definitions, static instructions) at the beginning of the prompt and variable content (user messages, retrieved documents) at the end maximizes cache reuse. Some providers apply prompt caching automatically; others require explicit opt-in or cache-control hints in the API request.
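
As a concrete sketch of that ordering, the request below keeps tool definitions and the system prompt at the front and puts per-request retrieval and the user question at the end. The cache_control hint follows Anthropic's Messages API convention and is an assumption about your provider; others (OpenAI, for example) cache long shared prefixes automatically without any hint, and the model id here is a placeholder.

```python
# Illustrative request layout: stable content first, variable content last.
static_instructions = "You are a support agent. ... (long, rarely changing text)"
tool_definitions = [
    {
        "name": "lookup_order",
        "description": "Look up an order by id",
        "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    }
]
retrieved_docs = "...per-request retrieval results..."
user_question = "Where is my order?"

request = {
    "model": "claude-sonnet-4-20250514",      # placeholder model id
    "max_tokens": 1024,
    "tools": tool_definitions,                # stable: part of the cached prefix
    "system": [
        {
            "type": "text",
            "text": static_instructions,              # stable: end of the cached prefix
            "cache_control": {"type": "ephemeral"},   # hint: cache everything up to here
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": retrieved_docs + "\n\n" + user_question,  # variable: keep last
        }
    ],
}
```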

Examples

  • A system prompt and tool definitions cached across many user requests, so only the per-user message triggers new prefill computation.
  • A multi-turn conversation where the growing message history shares a common prefix with the previous turn, allowing incremental prefill (see the sketch after this list).
  • A prompt restructured to move a frequently changing retrieval block from the middle to the end, improving cache hit rates.
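
The multi-turn case can be made concrete with a toy check that each turn's serialized history begins with the previous turn's tokens verbatim, so only the new suffix needs prefill. Word-level "tokens" are used here for readability; real matching is on actual token ids.

```python
# Toy illustration of incremental prefill across conversation turns: turn 2's
# history starts with turn 1's tokens, so a prefix cache reuses that work.
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


turn_1 = ["<sys>", "Be", "concise.", "<user>", "Hi"]
turn_2 = turn_1 + ["<assistant>", "Hello!", "<user>", "Refund", "status?"]

reused = shared_prefix_len(turn_1, turn_2)
print(f"reused {reused} of {len(turn_2)} tokens; prefill only {len(turn_2) - reused}")
# -> reused 5 of 10 tokens; prefill only 5
```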

Synonyms

context caching, KV-cache reuse, prefix caching