Routing policies
Most AI gateways offer 2–3 routing modes. Ferro Labs ships 8, covering everything from simple single-provider setups to content-aware routing and live A/B testing. Set strategy.mode in config.yaml to choose one.
Single
Always routes to the first target. Best for single-provider setups or when you want explicit control.
Use this when: you have one provider and want the simplest possible config.
```yaml
strategy:
  mode: single
targets:
  - virtual_key: openai
```
Single is the lightest strategy, with zero overhead beyond the proxy hop. Ideal for latency-sensitive single-provider deployments.
Fallback
Tries targets in order. On failure (error or retryable status code), the next target is attempted with exponential backoff. Use this for high-availability setups.
Use this when: uptime matters above all. Your chatbot must always respond, even if the primary provider is down.
```yaml
strategy:
  mode: fallback
targets:
  - virtual_key: openai
    retry:
      attempts: 3
      retry_on_status: [429, 502, 503, 504]
  - virtual_key: anthropic
    retry:
      attempts: 2
  - virtual_key: gemini
```
If all targets fail, the last error is returned to the client.
Combine fallback with circuit breakers to skip providers that are consistently failing, rather than waiting for retries to time out on every request.
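The try-retry-advance loop above can be sketched as follows. This is an illustrative Go sketch, not the gateway's source: the attempt type and the tryTarget/routeFallback names are hypothetical, and a production backoff would also add jitter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// attempt stands in for one request to a target: it returns the HTTP
// status and an error for transport-level failures.
type attempt func() (status int, err error)

// Statuses the doc lists as retryable.
var retryable = map[int]bool{429: true, 502: true, 503: true, 504: true}

// tryTarget issues up to attempts calls against one target, sleeping
// with exponential backoff (base, 2*base, 4*base, ...) between tries.
func tryTarget(do attempt, attempts int, base time.Duration) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		status, err := do()
		if err == nil && !retryable[status] {
			return status, nil // success or non-retryable status
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("status %d", status)
		}
		time.Sleep(base << i)
	}
	return 0, lastErr
}

// routeFallback walks targets in order and returns the last error
// if every target is exhausted, matching the documented behavior.
func routeFallback(targets []attempt, attempts int) (int, error) {
	lastErr := errors.New("no targets configured")
	for _, t := range targets {
		status, err := tryTarget(t, attempts, time.Millisecond)
		if err == nil {
			return status, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

func main() {
	calls := 0
	flaky := func() (int, error) { calls++; if calls < 3 { return 503, nil }; return 200, nil }
	status, err := routeFallback([]attempt{flaky}, 3)
	fmt.Println(status, err, calls) // succeeds on the third attempt
}
```

Note how a 503 on the first two attempts is absorbed by the per-target retry budget before the request would ever advance to the next target.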
Weighted load balancing
Distributes requests across targets by weight. Weights are relative: weights of 70 and 30 send 70% of requests to the first target and 30% to the second.
Use this when: you want to spread load across providers for cost or capacity reasons.
```yaml
strategy:
  mode: loadbalance
targets:
  - virtual_key: openai
    weight: 70
  - virtual_key: anthropic
    weight: 30
```
Only targets that support the requested model are candidates for selection.
Weight evaluation adds negligible overhead: a single random number generation per request. For practical purposes, latency is equivalent to the single strategy.
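Relative-weight selection amounts to one random draw walked across cumulative weight buckets. A minimal sketch, with the random value passed in explicitly so the bucket math is easy to verify (pickWeighted and target are illustrative names, not the gateway's API):

```go
package main

import "fmt"

// target is a stand-in for a gateway routing target.
type target struct {
	Key    string
	Weight int
}

// pickWeighted selects a target by relative weight. r is a random
// value in [0, 1); in a real router it would come from a PRNG.
func pickWeighted(targets []target, r float64) string {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	// Scale r onto the cumulative weight line and walk the buckets:
	// with weights 70/30, [0, 70) picks the first, [70, 100) the second.
	threshold := r * float64(total)
	acc := 0.0
	for _, t := range targets {
		acc += float64(t.Weight)
		if threshold < acc {
			return t.Key
		}
	}
	return targets[len(targets)-1].Key
}

func main() {
	ts := []target{{"openai", 70}, {"anthropic", 30}}
	fmt.Println(pickWeighted(ts, 0.10)) // lands in openai's bucket
	fmt.Println(pickWeighted(ts, 0.95)) // lands in anthropic's bucket
}
```

Because weights are only summed, 70/30 and 7/3 produce the same distribution.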
Conditional
Evaluates rules in order. The first matching rule determines the target. Use model for exact match or model_prefix for prefix match.
Use this when: different models should route to specific providers, e.g. all GPT models to OpenAI, all Claude models to Anthropic.
```yaml
strategy:
  mode: conditional
  conditions:
    - key: model
      value: gpt-4o
      target_key: openai
    - key: model
      value: gpt-4o-mini
      target_key: openai
    - key: model_prefix
      value: claude
      target_key: anthropic
    - key: model_prefix
      value: gemini
      target_key: gemini
targets:
  - virtual_key: openai
  - virtual_key: anthropic
  - virtual_key: gemini
```
If no rule matches, the request falls through to the first target.
Conditional routing pairs well with model aliases. Alias smart → claude-3-5-sonnet-20241022, then add a conditional rule for model_prefix: claude.
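First-match rule evaluation with exact and prefix matching can be sketched in a few lines. The rule type and matchTarget are hypothetical names for illustration only:

```go
package main

import (
	"fmt"
	"strings"
)

// rule mirrors one entry in the conditions list: Key is "model" or
// "model_prefix", Value the match text, Target the routing key.
type rule struct {
	Key, Value, Target string
}

// matchTarget returns the target of the first matching rule, or
// fallback (the first configured target) when nothing matches.
func matchTarget(rules []rule, model, fallback string) string {
	for _, r := range rules {
		switch r.Key {
		case "model":
			if model == r.Value {
				return r.Target
			}
		case "model_prefix":
			if strings.HasPrefix(model, r.Value) {
				return r.Target
			}
		}
	}
	return fallback
}

func main() {
	rules := []rule{
		{"model", "gpt-4o", "openai"},
		{"model_prefix", "claude", "anthropic"},
		{"model_prefix", "gemini", "gemini"},
	}
	fmt.Println(matchTarget(rules, "claude-3-5-sonnet-20241022", "openai"))
	fmt.Println(matchTarget(rules, "mistral-large", "openai")) // no rule matches: falls through
}
```

Order matters: put exact-match rules before broader prefix rules so a specific model is never swallowed by a prefix.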
Least-latency
Routes to the target with the lowest P50 latency as measured by a rolling latency tracker. On a cold start (no latency data yet) it picks a target randomly.
Use this when: you have multiple fast providers and want to minimise time-to-first-token automatically.
```yaml
strategy:
  mode: least-latency
targets:
  - virtual_key: openai
  - virtual_key: groq
  - virtual_key: anthropic
```
Adds a mutex read on the rolling latency map per request, typically under 1 µs. The latency tracker updates asynchronously after each response, so it does not add to request latency.
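A rolling P50 tracker of this shape can be sketched as below. This is an assumption-laden illustration (latencyTracker and its window size are invented here); the gateway's actual data structure may differ:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// latencyTracker keeps a rolling window of response times per target.
type latencyTracker struct {
	mu      sync.RWMutex
	window  int
	samples map[string][]float64 // target -> recent latencies (ms)
}

func newTracker(window int) *latencyTracker {
	return &latencyTracker{window: window, samples: map[string][]float64{}}
}

// Record appends a sample, evicting the oldest beyond the window.
// Called after the response completes, off the request hot path.
func (t *latencyTracker) Record(target string, ms float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := append(t.samples[target], ms)
	if len(s) > t.window {
		s = s[1:]
	}
	t.samples[target] = s
}

// p50 returns the median of a sorted copy of the samples.
func p50(s []float64) float64 {
	c := append([]float64(nil), s...)
	sort.Float64s(c)
	return c[len(c)/2]
}

// Pick returns the target with the lowest rolling P50, or "" when no
// target has data yet: the cold-start case, where the gateway falls
// back to a random choice.
func (t *latencyTracker) Pick(targets []string) string {
	t.mu.RLock() // the per-request cost: one read lock on the map
	defer t.mu.RUnlock()
	best, bestP50 := "", 0.0
	for _, tgt := range targets {
		s := t.samples[tgt]
		if len(s) == 0 {
			continue
		}
		if p := p50(s); best == "" || p < bestP50 {
			best, bestP50 = tgt, p
		}
	}
	return best
}

func main() {
	tr := newTracker(100)
	tr.Record("openai", 420)
	tr.Record("groq", 95)
	tr.Record("groq", 110)
	fmt.Println(tr.Pick([]string{"openai", "groq"})) // groq has the lower P50
}
```

The read lock in Pick is the only synchronization on the request path, which is why the per-request cost stays so small.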
Cost-optimized
Uses the built-in model catalog (2,500+ entries with pricing data) to estimate the input token cost for each target, then routes to the cheapest compatible provider. Falls back to the first compatible target if cost data is unavailable.
Use this when: you want to minimize spend without manually choosing models, letting the catalog handle it.
```yaml
strategy:
  mode: cost-optimized
targets:
  - virtual_key: openai
  - virtual_key: together
  - virtual_key: deepseek
  - virtual_key: gemini
```
Cost estimation uses the model name in the request and the catalog's input_cost_per_token field. The target with the lowest estimated cost for the matched model wins.
Combine cost-optimized with fallback by adding retry to each target: if the cheapest provider fails, the gateway retries with the next cheapest.
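The cost comparison reduces to a minimum over input_cost_per_token times the estimated token count. A sketch under stated assumptions: catalogEntry, cheapest, and the prices below are all made up for illustration, and only input cost is modeled:

```go
package main

import "fmt"

// catalogEntry is a hypothetical slice of the model catalog, keeping
// only the fields this sketch needs.
type catalogEntry struct {
	Provider          string
	Model             string
	InputCostPerToken float64 // the catalog's input_cost_per_token field
}

// cheapest returns the provider with the lowest estimated input cost
// for the requested model; fallback covers missing pricing data,
// mirroring the documented behavior.
func cheapest(catalog []catalogEntry, model string, estTokens float64, fallback string) string {
	best, bestCost := "", 0.0
	for _, e := range catalog {
		if e.Model != model {
			continue // only providers serving the requested model compete
		}
		if cost := e.InputCostPerToken * estTokens; best == "" || cost < bestCost {
			best, bestCost = e.Provider, cost
		}
	}
	if best == "" {
		return fallback
	}
	return best
}

func main() {
	// Illustrative prices, not real catalog data.
	catalog := []catalogEntry{
		{"together", "llama-3.1-70b", 0.88e-6},
		{"deepseek", "llama-3.1-70b", 0.90e-6},
	}
	fmt.Println(cheapest(catalog, "llama-3.1-70b", 1000, "openai")) // together
	fmt.Println(cheapest(catalog, "unknown-model", 1000, "openai")) // fallback: openai
}
```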
Content-based
Routes based on the content of the user's messages. Rules are evaluated in order; the first match wins. If no rule matches, the request falls through to the first target.
Use this when: different types of queries should go to different specialized models, e.g. code to a coding model, translation to a translation service, general chat to a cost-efficient default.
Three condition types are supported:
| Type | Behavior |
|---|---|
| `prompt_contains` | Case-insensitive substring match on any user message |
| `prompt_not_contains` | Matches when no user message contains the value |
| `prompt_regex` | Go regular-expression match on any user message |
Regex patterns are compiled at gateway startup. An invalid regex causes a startup error, so there is no silent misrouting.
```yaml
strategy:
  mode: content-based
  content_conditions:
    - type: prompt_contains
      value: "translate"
      target_key: deepl-provider
    - type: prompt_regex
      value: "(?i)(code|function|class|def |import )"
      target_key: openai
    - type: prompt_contains
      value: "summarize"
      target_key: anthropic
targets:
  - virtual_key: deepl-provider
  - virtual_key: openai
  - virtual_key: anthropic
```
Substring matching (prompt_contains) is near-zero cost. Regex matching adds overhead proportional to pattern complexity, but patterns are pre-compiled at startup so the hot path is a single regexp.MatchString call.
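The compile-at-startup, match-per-request split can be sketched as below. An illustrative Go sketch only (contentRule, compileRules, and route are invented names); it shows why the hot path is a single MatchString per regex rule:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// contentRule mirrors one content_conditions entry.
type contentRule struct {
	Type   string // prompt_contains | prompt_not_contains | prompt_regex
	Value  string
	Target string
	re     *regexp.Regexp // compiled once at startup
}

// compileRules pre-compiles regex rules; an invalid pattern fails
// immediately, mirroring the gateway's fail-at-startup behavior.
func compileRules(rules []contentRule) ([]contentRule, error) {
	for i := range rules {
		if rules[i].Type == "prompt_regex" {
			re, err := regexp.Compile(rules[i].Value)
			if err != nil {
				return nil, err
			}
			rules[i].re = re
		}
	}
	return rules, nil
}

// route returns the first matching rule's target, checking every user
// message, or fallback when nothing matches.
func route(rules []contentRule, userMsgs []string, fallback string) string {
	for _, r := range rules {
		matched := false
		switch r.Type {
		case "prompt_contains":
			for _, m := range userMsgs {
				if strings.Contains(strings.ToLower(m), strings.ToLower(r.Value)) {
					matched = true
					break
				}
			}
		case "prompt_not_contains":
			matched = true
			for _, m := range userMsgs {
				if strings.Contains(strings.ToLower(m), strings.ToLower(r.Value)) {
					matched = false
					break
				}
			}
		case "prompt_regex":
			for _, m := range userMsgs {
				if r.re.MatchString(m) { // pre-compiled: no per-request compile
					matched = true
					break
				}
			}
		}
		if matched {
			return r.Target
		}
	}
	return fallback
}

func main() {
	rules, err := compileRules([]contentRule{
		{Type: "prompt_contains", Value: "translate", Target: "deepl-provider"},
		{Type: "prompt_regex", Value: `(?i)(code|function|class)`, Target: "openai"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(route(rules, []string{"Please translate this to French"}, "anthropic"))
	fmt.Println(route(rules, []string{"Write a Python function"}, "anthropic"))
}
```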
A/B test
Splits traffic across variants by configured weights. Every routed request is tagged with the label field for downstream observability (e.g., in request-logger output or analytics pipelines).
Use this when: you want to compare quality, latency, or cost between two providers on live traffic without client-side changes.
```yaml
strategy:
  mode: ab-test
  ab_variants:
    - target_key: openai
      weight: 70
      label: control
    - target_key: anthropic
      weight: 30
      label: challenger
targets:
  - virtual_key: openai
  - virtual_key: anthropic
```
Weights are relative: 70 and 30 send 70% of traffic to openai and 30% to anthropic. If a variant's weight is 0, it is treated as weight 1, so it still receives an equal share alongside the remaining variants. Negative weights are rejected at gateway startup.
The label field is emitted in every gateway.request.completed event so you can aggregate results per variant.
Combine A/B test with the request-logger plugin persisting to Postgres, then query SELECT ab_variant, AVG(latency_ms), AVG(total_tokens) FROM requests GROUP BY ab_variant to compare variants.
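Variant selection works like weighted load balancing, with the zero-weight rule applied and the chosen label carried along for observability. A sketch with invented names (variant, pickVariant); the random value is a parameter here to keep the bucket math explicit:

```go
package main

import "fmt"

// variant mirrors one ab_variants entry.
type variant struct {
	Target, Label string
	Weight        int
}

// effectiveWeight applies the documented rule: weight 0 counts as 1.
func effectiveWeight(w int) int {
	if w == 0 {
		return 1
	}
	return w
}

// pickVariant selects by cumulative effective weight; r is in [0, 1).
// The returned variant's Label is what gets tagged on the request.
func pickVariant(vs []variant, r float64) variant {
	total := 0
	for _, v := range vs {
		total += effectiveWeight(v.Weight)
	}
	threshold := r * float64(total)
	acc := 0.0
	for _, v := range vs {
		acc += float64(effectiveWeight(v.Weight))
		if threshold < acc {
			return v
		}
	}
	return vs[len(vs)-1]
}

func main() {
	vs := []variant{{"openai", "control", 70}, {"anthropic", "challenger", 30}}
	v := pickVariant(vs, 0.5)
	fmt.Println(v.Target, v.Label) // tag the request with v.Label downstream
}
```

With all weights at 0, every variant gets effective weight 1 and traffic splits evenly, which is the "equal distribution" behavior described above.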
Combining strategies with circuit breakers
All strategies respect per-target circuit breakers. A target whose circuit breaker is open is excluded from selection.
```yaml
targets:
  - virtual_key: openai
    circuit_breaker:
      failure_threshold: 5
      success_threshold: 2
      timeout: "30s"
```
The circuit breaker opens after failure_threshold consecutive failures, stays open for timeout, then enters half-open state where it allows one probe request. After success_threshold successes it closes again.
Related pages
- Configuration reference – full YAML reference for all strategy modes
- Use cases – recipe-style configurations for common scenarios
- Benchmarks – performance data for different routing strategies
- Plugins – combine routing with safety and observability plugins