Frequently Asked Questions

Setup & Compatibility

Do I need to change my SDK or application code?

No. The gateway speaks the OpenAI wire format, so you change only the base_url in your client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

The api_key field is required by the OpenAI SDK but the gateway ignores it — provider credentials are set via environment variables on the server.

Which model names should I use?

Use the model IDs native to each provider — claude-3-5-sonnet-20241022 for Anthropic, gemini-1.5-pro for Gemini, gpt-4o for OpenAI, etc. You can also define model aliases like fast or smart in your config to decouple your application from specific model names.
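A hypothetical alias section might look like the following. The key name model_aliases and the layout are assumptions for illustration, not the gateway's documented schema; check the config reference for the real shape.

```yaml
# Illustrative only: key names and layout are assumed, not documented.
model_aliases:
  fast: gpt-4o-mini
  smart: claude-3-5-sonnet-20241022
```

Clients would then send model: "fast" and can be repointed to a different underlying model by editing the config, with no application changes.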

Can I run it locally?

Yes. Use Docker or build from source:

# Docker
docker run -p 8080:8080 -e OPENAI_API_KEY=sk-... ghcr.io/ferro-labs/ai-gateway:latest

# From source
git clone https://github.com/ferro-labs/ai-gateway && cd ai-gateway && make run

Does it support streaming?

Yes. Set stream: true in your request body. The gateway streams the response from the provider to your client using Server-Sent Events (SSE), identical to the OpenAI streaming format. This works across all supported providers.
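For clients not using an SDK, the SSE frames can be decoded by hand. A minimal sketch, assuming OpenAI-format data: lines with a data: [DONE] terminal frame:

```python
import json

def iter_deltas(sse_lines):
    """Yield text deltas from OpenAI-format SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Example frames in the shape the gateway would emit:
frames = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(frames)))  # Hello
```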

What Go version is required?

Go 1.21 or later, and only if you build from source. The gateway binary is statically compiled and has no runtime dependencies; pre-built binaries and a Docker image are available on the GitHub releases page.

Providers

How do I enable a provider?

Set the provider's environment variable before starting the gateway. For example:

export ANTHROPIC_API_KEY=sk-ant-...
export GROQ_API_KEY=gsk_...

No config file changes are needed to enable providers — setting the environment variable is sufficient. See Provider configuration for all variables.
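The discovery behavior can be pictured as a scan over known credential variables. This is an illustration of the idea, not the gateway's actual code, and the variable mapping shown is partial:

```python
import os

# Partial, illustrative mapping of provider -> credential variable.
PROVIDER_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}

def enabled_providers(env=os.environ):
    """Return providers whose credential variable is set and non-empty."""
    return [name for name, var in PROVIDER_ENV.items() if env.get(var)]

# With only an Anthropic key set:
print(enabled_providers({"ANTHROPIC_API_KEY": "sk-ant-..."}))  # ['anthropic']
```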

Can I use multiple providers at the same time?

Yes. Set credentials for as many providers as you want. The gateway discovers enabled providers at startup based on which environment variables are set. Use routing strategies to control which provider gets each request — combine multiple providers in a single route for fallback, load balancing, or cost optimization.

Can I use Ollama or other self-hosted models?

Yes. Set OLLAMA_HOST=http://localhost:11434 and OLLAMA_MODELS=llama3.2,mistral. Ollama requires no API key. The gateway supports any provider that exposes an OpenAI-compatible HTTP API via the openai-compatible provider type — set CUSTOM_PROVIDER_BASE_URL and optionally CUSTOM_PROVIDER_API_KEY.

Can I force a specific provider for one request?

Yes. Add the X-Provider header to your request:

curl http://localhost:8080/v1/chat/completions \
-H "X-Provider: anthropic" \
-d '{"model": "claude-3-5-sonnet-20241022", "messages": [...]}'

This overrides the routing strategy for that single request.
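The same override from Python, sketched here with the standard library so the request shape is explicit. Only the X-Provider header comes from the gateway docs; nothing is sent over the network until the request object is actually opened:

```python
import json
import urllib.request

def chat_request(provider, model, messages):
    """Build a chat completion request pinned to one provider."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Provider": provider,  # overrides routing for this request
        },
        method="POST",
    )

req = chat_request("anthropic", "claude-3-5-sonnet-20241022",
                   [{"role": "user", "content": "hi"}])
print(req.get_full_url())  # http://localhost:8080/v1/chat/completions
```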

Routing

What happens if a provider is down?

With the fallback strategy, the gateway automatically retries the next configured target with exponential backoff. With single, the error is returned to the client immediately. Circuit breakers automatically exclude failing providers across all strategies once the error threshold is exceeded — they self-recover after the configured timeout.
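The fallback behavior can be sketched as a loop over targets with per-target retries. This illustrates the strategy with stand-in provider functions; it is not the gateway's implementation, and the retry counts and delays are arbitrary:

```python
import time

def call_with_fallback(targets, request, base_delay=0.5, attempts=2):
    """Try each target in order; retry failures with exponential
    backoff before moving on to the next target."""
    last_err = None
    for target in targets:
        for attempt in range(attempts):
            try:
                return target(request)
            except Exception as err:
                last_err = err
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, ...
        # target exhausted its retries; fall through to the next one
    raise last_err

def flaky(_):  # stand-in provider that always fails
    raise RuntimeError("provider down")

def healthy(request):  # stand-in provider that succeeds
    return f"ok:{request}"

print(call_with_fallback([flaky, healthy], "req-1", base_delay=0.0))  # ok:req-1
```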

How does cost-optimized routing work?

The gateway ships with a catalog of 2,500+ models with input/output cost data per million tokens. For each request, it estimates the cost for each configured target based on the model name and routes to the cheapest compatible provider. No external API calls are made for cost data — it ships embedded in the binary.
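The selection amounts to a per-target cost estimate and a minimum. A sketch of that arithmetic, with made-up prices standing in for the embedded catalog:

```python
# Illustrative per-million-token prices (input, output) in USD.
# Real values come from the gateway's embedded catalog, not these numbers.
CATALOG = {
    ("openai", "gpt-4o"): (2.50, 10.00),
    ("anthropic", "claude-3-5-sonnet-20241022"): (3.00, 15.00),
}

def estimated_cost(provider, model, in_tokens, out_tokens):
    in_price, out_price = CATALOG[(provider, model)]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def cheapest(targets, in_tokens, out_tokens):
    """Pick the target with the lowest estimated cost for this request."""
    return min(targets, key=lambda t: estimated_cost(*t, in_tokens, out_tokens))

print(cheapest(list(CATALOG), 1000, 500))  # ('openai', 'gpt-4o')
```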

What is the difference between least-latency and fallback?

fallback is reactive — it switches providers only after a failure or timeout. least-latency is proactive — it continuously measures P50 latency from successful requests and always prefers the fastest available provider, even when all providers are healthy. Use least-latency when minimizing response time is more important than cost.
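The P50 preference reduces to a median over recent samples. A sketch with invented latency numbers, purely to show the selection rule:

```python
import statistics

# Observed latencies (ms) from recent successful requests, per provider.
# The numbers are made up for illustration.
samples = {
    "openai": [120, 180, 150, 900],  # one slow outlier barely moves the median
    "groq": [40, 55, 48, 60],
}

def fastest(latencies):
    """Prefer the provider with the lowest P50 (median) latency."""
    return min(latencies, key=lambda p: statistics.median(latencies[p]))

print(fastest(samples))  # groq
```

Using the median rather than the mean keeps a single slow request from flipping the routing decision.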

Can I route different request types to different providers?

Yes. The conditional strategy allows routing based on request attributes — model name, request headers, or custom metadata. For example, you can route requests with X-Tier: premium to GPT-4o and all others to a cheaper model. See Routing policies for full examples.
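The decision itself is just a predicate over request attributes. A toy sketch of the X-Tier example above (the rule shape is illustrative; real rules live in config, per Routing policies):

```python
def route_target(headers):
    """Toy conditional rule: the X-Tier request header picks the model."""
    if headers.get("X-Tier") == "premium":
        return "gpt-4o"
    return "llama-3.1-8b-instant"  # cheaper default for everyone else

print(route_target({"X-Tier": "premium"}))  # gpt-4o
```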

Plugins & Safety

Do plugins affect latency?

Guardrail plugins (before_request) add minimal overhead — typically under 1ms for pattern-matching plugins (word-filter, regex-guard). pii-redact and prompt-shield may add 1–5ms depending on content size. response-cache can dramatically reduce latency on cache hits by returning responses without hitting any provider.

Can I write custom plugins?

Yes. Implement the plugin.Plugin interface in Go and register it with the plugin manager. The interface requires Name(), Type(), Init(), and Execute() methods. See the examples directory for a working example.

Will plugins block my requests in production?

Yes, if a guardrail with action: block is triggered, the request is rejected immediately with a 400 or 403 response. You can also configure action: warn to log without blocking. Test your guardrail configuration in a staging environment and review logs at /admin/logs before enabling in production.

MCP (Model Context Protocol)

What is MCP?

Model Context Protocol (MCP) is an open standard for connecting AI models to external tools and data sources. The gateway implements MCP as a client — it connects to your MCP tool servers, injects available tools into chat completion requests, and runs the agentic tool-calling loop automatically. See the MCP guide.

Do my clients need to implement the tool loop?

No. The gateway handles the full agentic loop internally. Your client sends a standard chat completion request and receives a final text response. All intermediate tool calls happen transparently inside the gateway. This means any OpenAI-compatible client automatically gains tool-use capability without code changes.
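The loop the gateway runs internally can be pictured like this. Everything here uses stand-in functions and invented message shapes; it illustrates the control flow, not the gateway's implementation:

```python
def tool_loop(call_model, call_tool, messages, max_rounds=5):
    """Call the model; while it requests tools, execute them, append
    the results, and call again, until it returns plain text."""
    for _ in range(max_rounds):
        reply = call_model(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # final text answer for the client
        for call in reply["tool_calls"]:
            result = call_tool(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    raise RuntimeError("tool loop did not converge")

# Stand-in model: asks for one tool call, then answers.
state = {"round": 0}
def fake_model(messages):
    state["round"] += 1
    if state["round"] == 1:
        return {"tool_calls": [{"name": "add", "args": (2, 3)}]}
    return {"content": f"sum is {messages[-1]['content']}", "tool_calls": None}

def fake_tool(name, args):
    return str(sum(args))

print(tool_loop(fake_model, fake_tool, []))  # sum is 5
```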

Which MCP server implementations are supported?

The gateway supports MCP servers using the 2025-11-25 Streamable HTTP transport. Popular compatible servers include the MCP filesystem server, PostgreSQL server, and Fetch server. See MCP integration for full setup instructions.

Operations

How do I know which providers are active?

Call GET /health for per-provider health status, latency, and model counts. Call GET /v1/models for all available models grouped by provider. Both endpoints are unauthenticated and suitable for readiness probes.

Is there an admin API?

Yes. The gateway exposes an admin API at /admin/* for managing API keys, querying request logs, viewing provider status, and hot-reloading config. Protect it with ADMIN_API_KEY. See the Admin auth guide and interactive API reference.

Can I use PostgreSQL instead of SQLite?

Yes. For request logs, set backend: postgres and dsn: postgres://... in the request-logger plugin config. For the admin API key store, set STORE_BACKEND=postgres and STORE_DSN=postgres://.... See Server settings for all environment variables.
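As a sketch, the request-logger side might look like this. Only the backend and dsn keys come from the prose above; the surrounding plugins layout is an assumption and the DSN value is a placeholder:

```yaml
# Layout illustrative; backend/dsn keys are documented, the rest is assumed.
plugins:
  - name: request-logger
    config:
      backend: postgres
      dsn: postgres://user:pass@db:5432/gateway   # placeholder DSN
```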

Is it production-ready?

Yes. The gateway is used in production with circuit breakers, retries with exponential backoff, Prometheus monitoring, structured request logging, and an admin API. See Monitoring for recommended Prometheus alert rules and a Grafana dashboard layout.