
Rate limiting

The gateway supports rate limiting in three layers.

HTTP middleware (per IP)

Enable per-IP limits with environment variables:

export RATE_LIMIT_RPS=20
export RATE_LIMIT_BURST=40

Requests that exceed the limit receive an HTTP 429 response with an OpenAI-style error body.
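On the client side, a 429 from the gateway is usually handled by reading the error message and retrying after a delay. The following is a minimal sketch, not the gateway's own client code. It assumes the error body follows the common OpenAI shape `{"error": {"message": ...}}`; the exact fields the gateway returns are not specified here. The helper names `rate_limit_message` and `backoff_delay` are hypothetical.

```python
import json
from typing import Optional


def rate_limit_message(status: int, body: str) -> Optional[str]:
    """Return the error message if the response is a 429, else None.

    Assumes (not confirmed by the gateway docs) an OpenAI-style body:
    {"error": {"message": "...", "type": "..."}}
    """
    if status != 429:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A 429 with an unparseable body is still treated as a rate limit.
        return "rate limited"
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict) and isinstance(error.get("message"), str):
        return error["message"]
    return "rate limited"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))
```

A caller would check `rate_limit_message(...)` after each response and, when it is not `None`, sleep for `backoff_delay(attempt)` before retrying. The base and cap values are illustrative defaults, not recommendations from the gateway.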

Plugin rate limiting (per request)

Add the rate-limit plugin to apply global token-bucket limits before traffic hits a provider:

plugins:
  - name: rate-limit
    type: ratelimit
    stage: before_request
    enabled: true
    config:
      requests_per_second: 50
      burst: 100
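To make the two settings concrete, here is a minimal token-bucket sketch. It illustrates the general technique only and is not the plugin's actual implementation. It assumes `requests_per_second` is the refill rate and `burst` is the bucket capacity; the class name `TokenBucket` and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are hypothetical.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate                # assumed to map to requests_per_second
        self.capacity = float(burst)    # assumed to map to burst
        self.tokens = float(burst)      # start with a full bucket
        self.last = now                 # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# With the values from the config above: rate 50, burst 100.
bucket = TokenBucket(rate=50, burst=100)
```

Under this model, a full bucket absorbs a spike of up to `burst` requests at once, after which sustained throughput is bounded by the refill rate.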

Per-key and per-user rate limiting

Extend the rate-limit plugin with key_rpm and user_rpm to enforce per-identity limits in addition to (or instead of) the global bucket:

plugins:
  - name: rate-limit
    type: ratelimit
    stage: before_request
    enabled: true
    config:
      requests_per_second: 100  # global limit (optional)
      burst: 200
      key_rpm: 60               # max 60 req/min per API key
      user_rpm: 30              # max 30 req/min per user ID

Rate checks execute in order: global → per-key → per-user. The request is rejected at the first limiter that is exceeded, with a distinct reason string so you can tell from your logs which limit was hit.

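| Config key                  | Granularity          | Source field              |
|-----------------------------|----------------------|---------------------------|
| requests_per_second + burst | Global (all traffic) | (none)                    |
| key_rpm                     | Per API key          | pctx.Metadata["api_key"]  |
| user_rpm                    | Per user ID          | Request.User              |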

Requests without an API key skip the per-key check. Requests without a user field skip the per-user check. All three limits are independent; configure any combination.
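The check order and the skip rules above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the per-minute limits are modelled with a simple fixed 60-second window (the algorithm the plugin uses for `key_rpm` and `user_rpm` is not stated here), and the reason strings and the names `FixedWindowCounter` and `check_request` are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class FixedWindowCounter:
    """Allow at most `limit` requests per identity per 60-second window.

    Assumption: a fixed window stands in for whatever the plugin really uses.
    Old windows are never evicted here, which a real limiter would need to do.
    """

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Dict[Tuple[str, int], int] = {}

    def allow(self, identity: str, now: float) -> bool:
        window = (identity, int(now // 60))
        count = self.counts.get(window, 0)
        if count >= self.limit:
            return False
        self.counts[window] = count + 1
        return True


def check_request(
    now: float,
    api_key: Optional[str],
    user: Optional[str],
    global_allow: Optional[Callable[[float], bool]] = None,
    key_limiter: Optional[FixedWindowCounter] = None,
    user_limiter: Optional[FixedWindowCounter] = None,
) -> Optional[str]:
    """Return a rejection reason (hypothetical strings) or None if allowed."""
    # 1. Global limit, if configured.
    if global_allow is not None and not global_allow(now):
        return "global_limit_exceeded"
    # 2. Per-key limit; skipped when the request carries no API key.
    if key_limiter is not None and api_key and not key_limiter.allow(api_key, now):
        return "key_limit_exceeded"
    # 3. Per-user limit; skipped when the request has no user field.
    if user_limiter is not None and user and not user_limiter.allow(user, now):
        return "user_limit_exceeded"
    return None
```

In this sketch a request that passes the per-key check but fails the per-user check still counts against the key's window. Whether the plugin behaves the same way is not specified.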