Inference Cost

Inference cost is the monetary cost of running LLM requests, determined primarily by token volume, model pricing tier, and whether tokens are input, output, or reasoning tokens. Because every token of inference costs money, cost pressure shapes decisions from model selection to user-facing access controls, favoring architectures that minimize unnecessary inference.

Details

Inference providers typically charge per token, with separate rates for input tokens (the prompt), output tokens (the generated response), and - for reasoning models - reasoning tokens (intermediate thinking steps that may not appear in the final output). Output tokens are generally more expensive than input tokens, and reasoning tokens can dominate total cost on complex tasks because reasoning traces are often much longer than the visible response. Cached input tokens (via prompt caching) are charged at a reduced rate, creating a direct incentive to structure prompts for cache reuse.

Cost scales with model capability: frontier models are significantly more expensive per token than small language models or distilled variants. This creates a cost-quality-latency tradeoff triangle that shapes most production architecture decisions. For a given quality target, the goal is to minimize cost and latency; for a given budget, the goal is to maximize quality. Batch inference offers a further discount (often 50% or more) for workloads that tolerate asynchronous processing, making it a significant cost lever for offline tasks like evals and synthetic data generation. In agent systems, these costs compound across multi-step loops where each reasoning step, tool call interpretation, and result evaluation requires additional token generation.

The most direct architectural response to this cost pressure is model routing: directing requests to the cheapest model that can handle them. Simple tasks go to small language models or distilled variants; complex tasks escalate to frontier models. The large cost gap between tiers means routing even a fraction of traffic to cheaper models produces significant savings. The high relative cost of reasoning tokens creates additional pressure to avoid reasoning models for tasks that do not require multi-step deduction, to cap reasoning effort via model parameters where available, or to use distillation to bake reasoning patterns into cheaper non-reasoning models.

Token volume reduction is another recurring pattern. Prompt compaction condenses context to use fewer input tokens. Prompt caching reduces the effective cost of repeated prefixes. Structuring prompts so that stable content (system instructions, tool definitions) sits at the front maximizes cache reuse across requests.

Semantic caching goes further by skipping inference entirely for queries that are semantically similar to previously cached responses, eliminating token costs altogether for cache hits. Speculative decoding does not reduce token cost directly but improves throughput per GPU, which lowers the serving cost per token for inference providers and can translate into lower pricing or higher capacity on the same hardware.

Offloading work from inference to tools is a cost optimization with structural consequences. CPU-based computation (arithmetic, text search, data retrieval, code execution) is orders of magnitude cheaper per unit of work than GPU-based inference. An agent that delegates arithmetic to a calculator or data processing to a code execution runtime spends a small number of tokens orchestrating the tool call instead of many tokens reasoning through the answer.

On the operational side, cost pressure drives user-facing controls: per-user or per-tenant budgets, usage tiers, rate limiting, and throttling. These are typically enforced through AI gateways or application-level logic. Without these controls, denial of service attacks (including "denial of wallet" variants) can inflict significant financial damage.

Examples

A reasoning model spending 10x more tokens on its thinking trace than on the visible response, making reasoning tokens the dominant cost factor.
Cached input tokens costing 90% less than uncached tokens, incentivizing stable prompt prefixes.
A frontier model costing 30-60x more per token than a small model, making model routing a primary cost optimization lever.
A coding agent using a fast, cheap model for file reads and reformatting, and a reasoning model only for architectural decisions.
Capping reasoning effort to "low" for classification tasks where chain-of-thought adds cost without improving accuracy.
An agent using a calculator tool for arithmetic instead of spending tokens on multi-step multiplication via inference.
An agent loop accumulating costs across dozens of inference calls per user request, making per-request cost monitoring essential.
A SaaS product offering tiered access where free users get a small model and paying users get a frontier model.
Structuring all prompts with a stable system prefix to maximize prompt caching hit rates across requests.

Synonyms

token cost, LLM cost, API cost, token economics