Rate Limiting

Rate limiting is a control mechanism that restricts the frequency or volume of requests to an AI system within a defined time window, preventing abuse, controlling costs, and ensuring fair resource distribution across consumers.

Details

In AI engineering, rate limits are commonly expressed in both requests-per-minute and tokens-per-minute, reflecting the token-based cost structure of LLM inference. A single large request can consume far more resources than many small ones, so token-based rate limiting provides a more accurate proxy for actual resource consumption than request counting alone.
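
As a rough illustration of enforcing both units at once, the sketch below pairs two token buckets, one counting requests and one counting tokens, and admits a request only when both have room. The names TokenBucket, DualRateLimiter, and try_acquire are hypothetical, and the continuous per-minute refill is an assumption; providers may use different windowing schemes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Bucket holding up to `capacity` units, refilled at `capacity` per minute."""
    capacity: float
    level: float = field(init=False)

    def __post_init__(self):
        self.level = self.capacity  # start full

    def refill(self, elapsed_seconds: float) -> None:
        gained = self.capacity * elapsed_seconds / 60.0
        self.level = min(self.capacity, self.level + gained)


class DualRateLimiter:
    """Hypothetical limiter: a request must fit both the RPM and TPM budgets."""

    def __init__(self, requests_per_minute, tokens_per_minute, clock=time.monotonic):
        self.requests = TokenBucket(requests_per_minute)
        self.tokens = TokenBucket(tokens_per_minute)
        self.clock = clock
        self.last = clock()

    def try_acquire(self, token_cost: int) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.requests.refill(elapsed)
        self.tokens.refill(elapsed)
        # Reject without charging anything if either budget is short.
        if self.requests.level < 1 or self.tokens.level < token_cost:
            return False
        self.requests.level -= 1
        self.tokens.level -= token_cost
        return True
```

With a limiter built as DualRateLimiter(10, 1000), a single 900-token request leaves room for nine more requests but only 100 more tokens, which is the asymmetry that request counting alone would miss.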

Rate limiting is enforced at multiple layers. Inference providers impose per-API-key or per-organization limits on their model endpoints. AI gateways add a centralized enforcement point where operators can apply rate limits across multiple providers, with model-aware policies that account for token volume and model pricing tier. Rate limiting at the application level and in the agent runtime adds finer-grained controls such as per-user or per-feature limits.
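
The finer-grained, application-level controls can be sketched as a limiter keyed by scope. The example below is a minimal fixed-window counter under assumed names (KeyedFixedWindowLimiter, allow); the key can be any hashable scope, such as a (user, feature) pair. A fixed window is the simplest scheme, not necessarily what a given gateway or runtime uses.

```python
import time


class KeyedFixedWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per key per window."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.state = {}  # key -> (window index, count in that window)

    def allow(self, key) -> bool:
        index = int(self.clock() // self.window)
        seen_index, count = self.state.get(key, (index, 0))
        if seen_index != index:
            count = 0  # a new window has started for this key
        if count >= self.limit:
            self.state[key] = (index, count)
            return False
        self.state[key] = (index, count + 1)
        return True
```

Because each key has its own counter, one user exhausting a limit on one feature does not affect other users or other features.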

Agentic workloads challenge traditional rate limiting designs. Agent traffic is bursty, high-concurrency, and machine-speed, with request patterns that resemble automated attacks more than human browsing. Rate limiting built around human signals (session cookies, CAPTCHAs) or fixed integration patterns becomes ineffective against agent traffic. A common adaptation is agent-aware rate policies keyed on API keys, OAuth scopes, or agent identity metadata.

Rate limiting is distinct from quota management, which enforces resource budgets over longer periods (daily, monthly, billing cycle). Rate limiting controls throughput within short time windows; quotas control cumulative consumption. The two are often enforced together: a user may have both a tokens-per-minute rate limit and a tokens-per-month quota.
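
A minimal sketch of the two checks running side by side, assuming a per-minute counter that is reset by some external scheduler and a monthly counter that is reset at the billing boundary. UsagePolicy, UsageState, and check are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsagePolicy:
    tokens_per_minute: int  # short-window rate limit
    tokens_per_month: int   # cumulative quota


@dataclass
class UsageState:
    minute_used: int = 0
    month_used: int = 0

    def reset_minute(self) -> None:
        """Assumed to be called when a new one-minute window starts."""
        self.minute_used = 0


def check(policy: UsagePolicy, state: UsageState, token_cost: int) -> str:
    """Return "ok", "rate_limited", or "quota_exceeded" and charge on success."""
    # The quota is checked first: waiting a minute will not help once it is spent.
    if state.month_used + token_cost > policy.tokens_per_month:
        return "quota_exceeded"
    if state.minute_used + token_cost > policy.tokens_per_minute:
        return "rate_limited"
    state.minute_used += token_cost
    state.month_used += token_cost
    return "ok"
```

The two outcomes call for different client behavior: a rate-limited caller can retry after the window rolls over, while a caller that has exhausted its quota cannot proceed until the budget resets or is raised.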

Examples

  • An inference provider enforcing 100,000 tokens-per-minute per API key, returning HTTP 429 responses when the limit is exceeded.
  • An AI gateway applying different rate limits per model tier, with higher limits for cheaper models and stricter limits for frontier models.
  • An eval runner implementing rate-limit backoff to avoid exceeding provider limits during large evaluation runs.
  • A web service blocking legitimate agent access because agent request patterns trigger abuse detection designed for human traffic.
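
The backoff behavior in the eval-runner example might look like the following sketch. It assumes a hypothetical send callable that returns a (status_code, headers, body) tuple, retries only on HTTP 429, honors a Retry-After header when it is given in seconds, and otherwise falls back to exponential backoff with full jitter. The parameter names and defaults are illustrative, not taken from any particular client library.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Only the delay-in-seconds form is handled; the HTTP-date form
            # falls through to the computed backoff below.
            delay = float(retry_after)
        else:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = rng() * min(max_delay, base_delay * 2 ** attempt)
        sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")
```

Jitter matters most when many workers share one API key, since it keeps them from retrying in lockstep and hitting the limit again at the same instant.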

Synonyms

throttling, request throttling