Latency

Latency in LLM systems refers to the time-based performance dimensions of inference: how quickly a model begins responding and how fast it generates output.

Details

The two primary metrics are time to first token (TTFT), the delay between sending a request and receiving the first generated token, and inter-token latency, the gap between successive tokens during generation (its reciprocal is the commonly reported tokens-per-second generation speed). TTFT is dominated by the prefill phase, in which the model processes all input tokens, so it scales with both context length and model size. Generation speed is set by the per-token decode step and is relatively constant for a given model configuration. End-to-end latency combines both phases and additionally depends on output length.
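
As a rough model, end-to-end latency is TTFT plus the decode time for the remaining output tokens. The sketch below (the function name and figures are illustrative) ignores network overhead and server-side queueing:

    def end_to_end_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
        """Simplified latency model: prefill (TTFT) plus decode time."""
        # The first token arrives at TTFT; the rest stream at the decode rate.
        decode_s = max(output_tokens - 1, 0) / tokens_per_s
        return ttft_s + decode_s

    # Example: 300 ms TTFT, 500-token response at 60 tokens/s.
    print(end_to_end_latency(0.3, 500, 60))  # ~8.62 seconds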

Key factors that affect latency include model size and architecture, input length, hardware (GPU type and especially memory bandwidth, since the decode step is typically memory-bound), and quantization (lower precision reduces the bytes moved and the compute per token).
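
Because decoding is typically memory-bandwidth-bound, a common back-of-envelope estimate caps generation speed at memory bandwidth divided by the bytes of weights read per token. The sketch below uses illustrative hardware numbers and ignores KV-cache traffic, batching, and kernel overhead:

    def decode_speed_ceiling(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
        """Upper bound on tokens/s if every weight is read once per token."""
        model_size_gb = params_b * bytes_per_param
        return bandwidth_gb_s / model_size_gb

    # Illustrative: a 7B-parameter model on a GPU with ~2000 GB/s bandwidth.
    print(decode_speed_ceiling(2000, 7, 2.0))  # FP16:  ~143 tokens/s ceiling
    print(decode_speed_ceiling(2000, 7, 0.5))  # 4-bit: ~571 tokens/s ceiling

Under this model, halving bytes per parameter roughly doubles the decode-speed ceiling, which is why quantization helps latency.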

Optimization techniques that reduce latency include prompt caching (skipping redundant prefill computation), semantic caching (returning cached responses for semantically similar queries, bypassing inference entirely), prompt compaction (reducing input token count), and speculative decoding (using a fast draft model to propose tokens that the target model verifies in parallel, increasing effective tokens per forward pass; see the sketch below). Provider-side batching and current load also shape observed latency. Streaming does not reduce actual generation time but lowers perceived latency by delivering tokens as they are produced.

Latency, inference cost, and output quality form the primary tradeoff space for model routing and inference-provider choice; observability systems track these metrics in production to detect regressions and inform optimization.
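
To make the speculative-decoding idea concrete, here is a toy sketch of the greedy draft-and-verify loop. draft_step and target_greedy are hypothetical callables standing in for real model calls, and production systems use a probabilistic acceptance rule over full token distributions rather than exact match:

    from typing import Callable, List

    def speculative_decode(
        prompt: List[int],
        draft_step: Callable[[List[int]], int],
        target_greedy: Callable[[List[int], List[int]], List[int]],
        k: int,
        max_new: int,
    ) -> List[int]:
        """Toy greedy speculative decoding.

        draft_step(context) -> next token from the fast draft model.
        target_greedy(context, proposal) -> the target model's greedy choice
        after each of the k + 1 prefixes, computed in ONE forward pass.
        """
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            # 1. The cheap draft model proposes k tokens autoregressively.
            proposal: List[int] = []
            for _ in range(k):
                proposal.append(draft_step(tokens + proposal))
            # 2. A single target pass scores all proposed positions at once.
            choices = target_greedy(tokens, proposal)
            # 3. Keep drafted tokens up to the first disagreement; at the
            #    mismatch, substitute the target's own token. The output is
            #    identical to plain greedy decoding by the target alone.
            accepted: List[int] = []
            for drafted, choice in zip(proposal, choices):
                accepted.append(choice)
                if drafted != choice:
                    break
            else:
                accepted.append(choices[k])  # all k accepted: bonus token
            tokens.extend(accepted)
        return tokens[: len(prompt) + max_new]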

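Streaming also makes both metrics easy to observe from the client. The sketch below measures TTFT and decode-phase generation speed from any token iterator; stream_completion in the usage comment is a hypothetical stand-in for a provider's streaming API:

    import time
    from typing import Iterable, Optional

    def measure_stream(token_stream: Iterable[str]) -> dict:
        """Record TTFT and decode-phase generation speed for a token stream."""
        start = time.perf_counter()
        ttft: Optional[float] = None
        count = 0
        for _token in token_stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            count += 1
        total = time.perf_counter() - start
        decode_s = total - (ttft or 0.0)
        return {
            "ttft_s": ttft,
            "output_tokens": count,
            # Rate over the decode phase only (prefill excluded).
            "tokens_per_s": (count - 1) / decode_s if count > 1 and decode_s > 0 else None,
        }

    # Usage with a hypothetical streaming client:
    # stats = measure_stream(stream_completion(prompt))
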
Examples

  • A chatbot optimized for low TTFT so users see a response begin within a few hundred milliseconds.
  • A coding agent where end-to-end latency matters more than TTFT because the full response is needed before the next tool call can execute.
  • Switching to a small language model or quantized model to reduce latency at an acceptable quality cost.

Synonyms

inference latency, response time, time to first token, TTFT