Inference

Inference is running a trained LLM on inputs to produce outputs (for example, generating tokens from a prompt) without updating the LLM's weights.

Details

In typical text generation, a tokenizer converts the input text into tokens, the model produces a probability distribution over possible next tokens, and generated tokens are decoded back into text. Sampling parameters (such as temperature and top-p) control how tokens are selected from this distribution at generation time. APIs can optionally return logprobs for generated tokens, exposing the model's probability assignments for downstream uses such as confidence estimation and classification.
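
To make this loop concrete, here is a minimal sketch of tokenize → predict → sample → decode using the Hugging Face transformers library; gpt2 stands in for any causal LM, and the temperature value is an arbitrary choice:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")    # gpt2: small stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

    for _ in range(10):                                  # generate 10 tokens
        logits = model(input_ids).logits[0, -1]          # distribution over next tokens
        probs = torch.softmax(logits / 0.8, dim=-1)      # temperature = 0.8 (arbitrary)
        next_id = torch.multinomial(probs, num_samples=1)
        logprob = torch.log(probs[next_id])              # this token's logprob
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=1)

    print(tokenizer.decode(input_ids[0]))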

In standard generation, compute per token is roughly fixed: the model performs one forward pass per output token. Speculative decoding improves on this by having a fast draft model propose multiple tokens that the target model then verifies in a single forward pass, increasing effective throughput without changing the output distribution.
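
The key to speculative decoding is an acceptance rule that preserves the target model's output distribution exactly. Below is a toy sketch of that rule over a made-up four-token vocabulary; p and q are invented stand-ins for the target and draft models' next-token distributions:

    import numpy as np

    rng = np.random.default_rng(0)

    def verify(p, q, drafted):
        """Accept or resample one drafted token.

        p: target model's next-token distribution (checked in one forward pass)
        q: draft model's distribution, from which `drafted` was sampled
        The returned token is distributed exactly according to p.
        """
        if rng.random() < min(1.0, p[drafted] / q[drafted]):
            return drafted                        # accept the draft token
        residual = np.maximum(p - q, 0.0)         # otherwise resample from the
        return rng.choice(len(p), p=residual / residual.sum())  # leftover mass

    p = np.array([0.5, 0.2, 0.2, 0.1])            # toy target distribution
    q = np.array([0.25, 0.25, 0.25, 0.25])        # toy draft distribution
    drafted = rng.choice(4, p=q)                  # draft model "proposes" a token
    print(verify(p, q, drafted))

Applying this rule token by token to a whole drafted block lets several tokens be accepted per target-model forward pass, which is where the throughput gain comes from.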

Test-time compute (also called test-time scaling) is a strategy where the model spends additional inference-time compute to improve output quality, typically by generating intermediate reasoning steps (chain-of-thought) before producing the final answer. Models trained for this approach (often called reasoning models) produce reasoning traces that can be substantially longer than the visible response, making inference cost and latency variable and dependent on problem difficulty. This creates a direct tradeoff between compute budget and output quality at inference time, managed in practice through model routing, reasoning-effort parameters, or token budgets that cap how much thinking a model can do per request.
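
In API terms, the compute budget is often exposed as a request parameter. Here is a sketch of capping reasoning via an OpenAI-style reasoning_effort parameter; exact parameter names and model identifiers vary by provider, so treat these as illustrative:

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="o3-mini",          # a reasoning model; substitute your own
        reasoning_effort="low",   # trade answer quality for cost and latency
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )
    print(response.choices[0].message.content)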

In production, inference is commonly provided via an API by an inference provider or run on self-hosted infrastructure using open-weight models; it is distinct from training, which updates the model's weights to change its behavior (for example, via fine-tuning). Inference can be synchronous (real-time, low-latency responses) or asynchronous via batch inference, where requests are submitted in bulk and processed on a flexible schedule at reduced cost.
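
As an example of the asynchronous path, here is a sketch of batch inference following the OpenAI-style batch API (a JSONL file of requests plus a batch job); other providers expose similar but not identical interfaces:

    import json
    from openai import OpenAI

    client = OpenAI()

    # Each line of the JSONL file is one independent request.
    with open("requests.jsonl", "w") as f:
        for i, prompt in enumerate(["Summarize ...", "Classify ..."]):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": "gpt-4o-mini",
                         "messages": [{"role": "user", "content": prompt}]},
            }) + "\n")

    batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",   # processed on a flexible schedule, at reduced cost
    )
    print(batch.id, batch.status)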

Examples

  • Calling a chat completion API to generate an answer from a prompt
  • Running a local LLM to generate text for an agent

Synonyms

serving, model serving (these terms sometimes refer more broadly to the infrastructure of hosting and operating models, not just the forward pass)