
Concepts

OpenAI-compatible API

The gateway speaks the OpenAI wire format for chat completions, embeddings, images, and model listing. Any client that works with OpenAI will work with the gateway after changing only the `base_url`. Provider credentials, model routing, and policy enforcement happen inside the gateway, so your application code is unaffected.
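As a sketch of what "only change the base_url" means in practice, the snippet below builds a standard OpenAI-style chat completion request against a hypothetical local gateway address using only the Python standard library. The URL, port, and key scheme are assumptions for illustration; any OpenAI SDK pointed at the same base URL works the same way.

```python
import json
import urllib.request

# Hypothetical gateway address; substitute your deployment's base URL.
GATEWAY_BASE_URL = "http://localhost:8080/v1"

# The request body is the same JSON the OpenAI API accepts.
payload = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
}

request = urllib.request.Request(
    f"{GATEWAY_BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Real provider credentials live inside the gateway; this header
        # carries whatever key scheme your deployment uses (assumed here).
        "Authorization": "Bearer YOUR_GATEWAY_KEY",
    },
)
# urllib.request.urlopen(request) would send it; omitted here since no
# gateway is running in this sketch.
```

The same payload works unchanged whichever provider the gateway routes it to.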

Routing strategies

The strategy controls which provider target receives each request. Configure it with the `strategy.mode` key in your config file.

| Strategy | Description |
| --- | --- |
| `single` | Always route to the first target. Simplest setup. |
| `fallback` | Try targets in order; retry the next on failure with exponential backoff. |
| `loadbalance` | Distribute requests across targets by weight. |
| `conditional` | Evaluate rules (model name, model prefix) to pick a target per request. |
| `least-latency` | Route to the target with the lowest P50 latency, using a rolling tracker. |
| `cost-optimized` | Estimate input cost from the model catalog and route to the cheapest compatible target. |
| `content-based` | Route based on prompt content using substring match or regex. First rule match wins. |
| `ab-test` | Split traffic across labeled variants by weight for comparison testing. |
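A minimal config sketch showing where `strategy.mode` sits; the target fields here are illustrative and the exact schema may differ in your version:

```yaml
strategy:
  mode: fallback        # one of the modes listed above
targets:
  - provider: openai    # tried first
  - provider: anthropic # tried on failure
```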

See Routing policies for YAML examples of each strategy.

Providers and targets

A provider is a registered AI API backend (e.g., OpenAI, Anthropic). The gateway supports 29 providers. Providers are enabled by setting environment variables; no code changes are needed.

A target is a reference to a provider in your config. Targets carry optional fields: weight (for load balancing), retry (attempts + status codes), and circuit_breaker (thresholds + timeout).
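To make the optional target fields concrete, here is a hedged YAML sketch; the field names follow the concepts above (`weight`, `retry`, `circuit_breaker`), but the nested option names are assumptions for illustration:

```yaml
targets:
  - provider: openai
    weight: 3                  # used by loadbalance / ab-test strategies
    retry:
      attempts: 2              # retry up to twice...
      on_status: [429, 500, 502, 503]  # ...on these status codes
    circuit_breaker:
      failure_threshold: 5     # open the breaker after 5 failures
      timeout_seconds: 30      # stay open for 30s before probing again
  - provider: anthropic
    weight: 1
```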

Model aliases

Aliases map short names to full model IDs. They are resolved before routing, so `cheap` can map to `gemini-1.5-flash` and every request to `model: cheap` is transparently sent to Gemini.
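The alias example above might look like this in config; the `model_aliases` key name is an assumption for illustration:

```yaml
model_aliases:
  cheap: gemini-1.5-flash   # requests for "cheap" resolve before routing
```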

Plugins

Plugins run at three lifecycle stages: `before_request`, `after_request`, and `on_error`.

OSS plugins

These 6 plugins ship with the open-source gateway:

| Plugin | Stage | Purpose |
| --- | --- | --- |
| `word-filter` | `before_request` | Block requests containing banned words |
| `max-token` | `before_request` | Enforce token and message count limits |
| `response-cache` | `before_request` | Cache exact-match responses in memory |
| `request-logger` | `before_request` | Emit structured logs, optionally persist to SQLite/Postgres |
| `rate-limit` | `before_request` | Token-bucket rate limiting (global, per-key, per-user) |
| `budget` | `before_request` + `after_request` | Track and enforce per-key spend limits |
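As a rough sketch of enabling OSS plugins, assuming a top-level `plugins` list; the per-plugin option names here are illustrative, not the documented schema:

```yaml
plugins:
  - name: rate-limit
    scope: per-key              # token bucket tracked per API key
    requests_per_minute: 60
  - name: max-token
    max_input_tokens: 8000      # reject oversized prompts early
```

See Plugins for the actual configuration reference.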

Ferro Labs Managed plugins

These 5 plugins require Ferro Labs Managed because they depend on ML inference services that run server-side:

| Plugin | Stage | Purpose |
| --- | --- | --- |
| `pii-redact` | `before_request` | Detect and redact PII entities using NER models |
| `secret-scan` | `before_request` | Block requests containing leaked API keys or credentials |
| `prompt-shield` | `before_request` | Score and block prompt injection attempts |
| `schema-guard` | `after_request` | Validate model JSON output against a JSON Schema |
| `regex-guard` | `before_request` | Block requests matching configurable regex patterns |
Ferro Labs Managed feature: these 5 enterprise plugins require a Ferro Labs Managed account. Join the waitlist →

See Plugins for full configuration examples.

MCP integration

Model Context Protocol (MCP) lets you connect external tool servers to the gateway. When `mcp_servers` are configured, the gateway injects available tools into every chat completion request and runs an agentic loop when the model returns `tool_calls`. This works transparently: clients receive the final text response without needing to implement the tool loop themselves.
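A hypothetical `mcp_servers` entry might look like the following; the entry fields (`name`, `command`, `args`) are assumptions based on common MCP server launch conventions, not the gateway's documented schema:

```yaml
mcp_servers:
  - name: filesystem
    # Launch an MCP server as a subprocess; the gateway discovers its
    # tools and injects them into chat completion requests.
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "/data"]
```

When the model responds with `tool_calls` for one of these tools, the gateway executes them and feeds results back until a final text response is produced.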

MCP Phase 1 (tool injection + agentic loop) shipped in v0.8.0. Streaming support for MCP requests was added in v1.0.0.

See MCP integration for setup and examples.

Ferro Labs Managed

Ferro Labs Managed is the managed, multi-tenant version of the AI Gateway hosted by Ferro Labs. It wraps the same OSS engine with per-tenant isolation, a management dashboard, durable billing, semantic caching, SSO/SAML, audit logs, and the 5 enterprise security plugins listed above. See OSS vs Ferro Labs Managed for a full comparison.

Observability

  • Prometheus metrics: scraped at `/metrics`. Includes request counts, latency histograms, token usage, and cache hit rates.
  • Structured JSON logs: emitted to stdout with a per-request `trace_id` for log correlation.
  • Health endpoint: `GET /health` returns per-provider availability with latency measurements.

See Observability and Monitoring for details.