Batch Inference
Batch inference is an inference mode where multiple requests are submitted as a single group and processed asynchronously, trading latency for lower inference cost.
Details
Unlike real-time inference where each request is processed individually with low-latency expectations, batch inference allows inference providers to schedule work flexibly - filling idle GPU capacity, optimizing memory usage, and processing requests during off-peak periods. This scheduling flexibility is the primary reason providers offer batch pricing at a significant discount (often 50% or more) compared to synchronous API calls.
Batch inference is well suited to workloads where results are not needed immediately: running evals across large datasets, generating synthetic data, bulk classification or extraction tasks, pre-computing embeddings, and offline analysis. Turnaround times are typically measured in hours rather than milliseconds, with providers committing to completion within a defined window (e.g. 24 hours). Requests within a batch are independent - each has its own prompt and parameters - but they are submitted and tracked as a single job.
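The independence of requests within a batch is visible in the wire format: several providers (OpenAI's Batch API among them) accept a JSONL file in which each line is one self-contained request tagged with a caller-chosen ID. A minimal sketch of building such a file - the field names (`custom_id`, `method`, `url`, `body`) follow OpenAI's batch input format, and the model name is a placeholder:

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", path="batch_input.jsonl"):
    # One JSON object per line; each line is an independent request.
    # custom_id lets results, which may return out of order, be matched
    # back to the request that produced them.
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path
```

Because each line carries its own parameters, a single batch can mix prompts, models, or sampling settings, subject to whatever limits the provider imposes.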
Most major inference providers expose batch APIs (OpenAI Batch API, Anthropic Message Batches, Google Gemini batch endpoints) that accept a file or list of requests and return results asynchronously. These APIs support the same model capabilities as their real-time counterparts (structured output, tool definitions, vision inputs) but do not support streaming.
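Because these APIs are asynchronous, the client's job after submission is to poll until the batch reaches a terminal state. A sketch of that polling loop, written against an injected `get_status` callable rather than any particular provider's SDK (the status names shown are assumptions modeled on OpenAI's batch lifecycle):

```python
import time

def wait_for_batch(get_status, poll_interval=60.0, max_wait=86400.0, sleep=time.sleep):
    # get_status() is any zero-argument callable returning the batch's
    # current status string, e.g. a wrapper around a provider SDK call.
    # Providers commit to completion within a window (e.g. 24 hours),
    # so the wait is capped at max_wait.
    terminal = {"completed", "failed", "expired", "cancelled"}
    waited = 0.0
    while waited < max_wait:
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("batch did not finish within the completion window")
```

Injecting `sleep` and `get_status` keeps the loop testable; in production the callable would wrap the provider's batch-retrieval endpoint.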
Examples
- Running an eval suite across thousands of test cases overnight at half the per-token cost.
- Generating synthetic training data by processing a large prompt set in a single batch job.
- Pre-computing embeddings for a document corpus before indexing into a vector database.
- Bulk-classifying support tickets using an LLM where results are needed within hours, not seconds.
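For bulk jobs like those above, results typically come back as a JSONL file whose lines may be in any order, so the final step is joining each result to its originating request by ID. A sketch, again assuming OpenAI-style output fields (`custom_id`, `response`):

```python
import json

def parse_batch_results(results_jsonl):
    # Index results by custom_id so each one can be joined back to the
    # request it answers, regardless of the order lines arrive in.
    results = {}
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        results[rec["custom_id"]] = rec.get("response")
    return results
```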
Synonyms
batch API, batch processing, offline inference, asynchronous inference