Inference Provider

An inference provider is an organization that runs LLMs to generate outputs (inference) and exposes that capability via an API or hosted service. It may serve its own LLMs, third-party LLMs, or both.

Details

Inference providers sit between model developers (who create model weights) and application developers (who integrate models into end-user products). Many inference providers host popular open-weight models alongside proprietary ones, offering application developers the flexibility to switch between provider-hosted and self-hosted deployments of the same model.
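Because many providers expose an OpenAI-compatible API, switching between a provider-hosted and a self-hosted deployment of the same open-weight model often amounts to changing a base URL and credentials. Below is a minimal sketch using the openai Python SDK; the URLs, environment variables, and model name are illustrative assumptions, not any particular provider's actual values.

```python
import os

from openai import OpenAI

# Hypothetical endpoints: a hosted provider vs. a self-hosted
# OpenAI-compatible server running the same open-weight model.
# Names and URLs are placeholders.
PROVIDER_BASE_URL = "https://api.example-provider.com/v1"
SELF_HOSTED_BASE_URL = "http://localhost:8000/v1"

USE_SELF_HOSTED = os.environ.get("USE_SELF_HOSTED") == "1"

client = OpenAI(
    base_url=SELF_HOSTED_BASE_URL if USE_SELF_HOSTED else PROVIDER_BASE_URL,
    api_key=os.environ.get("LLM_API_KEY", "not-needed-for-local"),
)

# The application code is identical regardless of where the model runs.
response = client.chat.completions.create(
    model="example-open-weight-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what an inference provider does."}],
)
print(response.choices[0].message.content)
```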

They typically handle API access, scaling, rate limiting, billing, and latency optimization. Provider APIs commonly expose sampling parameters, structured output constraints, streaming, prompt caching, and batch inference endpoints for asynchronous bulk processing at reduced cost. Pricing is usually per-token, with different rates for input and output tokens, making provider choice a key factor in inference cost.
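To make the pricing point concrete, here is a small sketch that estimates the cost of a single request from its token counts. The rates used are placeholder assumptions, not any provider's actual prices.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Estimate the cost of one request under per-token pricing.

    Rates are in USD per million tokens, a common convention, with
    separate rates for input (prompt) and output (completion) tokens.
    """
    return (
        input_tokens * input_rate_per_million
        + output_tokens * output_rate_per_million
    ) / 1_000_000


# Placeholder rates for two hypothetical providers; output tokens are
# usually priced higher than input tokens.
print(estimate_cost_usd(5_000, 1_000, input_rate_per_million=3.0, output_rate_per_million=15.0))
print(estimate_cost_usd(5_000, 1_000, input_rate_per_million=0.5, output_rate_per_million=1.5))
```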

Note: many organizations are both model developers and inference providers; the two are overlapping roles rather than mutually exclusive categories.

Examples

  • OpenAI - serves proprietary models (GPT series, o-series); pioneered the chat completions API format widely adopted by other providers (see the request sketch after this list).
  • Anthropic - serves proprietary Claude models; emphasizes safety-oriented design.
  • Google (Gemini API) - serves Gemini models; deep integration with Google Cloud.
  • AWS Bedrock - multi-model gateway offering models from Anthropic, Meta, Mistral, and others under a unified AWS API with IAM-based access control.
  • Together AI - specializes in hosting open-weight models with performance-optimized inference.
  • Fireworks AI - specializes in low-latency inference for open-weight models; offers function calling and structured output optimizations.
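The chat-completions-style request format mentioned above is, roughly, a model identifier, a list of role-tagged messages, and optional sampling parameters sent as JSON over HTTP. The sketch below shows that shape; the endpoint URL, model name, and credentials are placeholders, and individual providers may differ in the exact fields they accept.

```python
import json
import os
import urllib.request

# Chat-completions-style request body. Field names follow the widely
# copied format; values here are placeholders.
payload = {
    "model": "example-model",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is an inference provider?"},
    ],
    "temperature": 0.2,
    "max_tokens": 200,
}

request = urllib.request.Request(
    "https://api.example-provider.com/v1/chat/completions",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['LLM_API_KEY']}",
    },
)

with urllib.request.urlopen(request) as response:
    body = json.load(response)

# Responses typically include the generated message and token usage counts.
print(body["choices"][0]["message"]["content"])
print(body.get("usage"))
```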