Routing policies
Most AI gateways offer 2–3 routing modes. Ferro Labs ships 8, covering everything from simple single-provider setups to content-aware routing and live A/B testing. Set strategy.mode in config.yaml to choose one.
Single
Always routes to the first target. Best for single-provider setups or when you want explicit control.
Use this when: you have one provider and want the simplest possible config.
```yaml
strategy:
  mode: single
targets:
  - virtual_key: openai
```
Single is the lightest strategy, with zero overhead beyond the proxy hop. Ideal for latency-sensitive single-provider deployments.
Fallback
Tries targets in order. On failure (error or retryable status code), the next target is attempted with exponential backoff. Use this for high-availability setups.
Use this when: uptime matters above all. Your chatbot must always respond, even if the primary provider is down.
```yaml
strategy:
  mode: fallback
targets:
  - virtual_key: openai
    retry:
      attempts: 3
      retry_on_status: [429, 502, 503, 504]
  - virtual_key: anthropic
    retry:
      attempts: 2
  - virtual_key: gemini
```
If all targets fail, the last error is returned to the client.
Combine fallback with circuit breakers to skip providers that are consistently failing, rather than waiting for retries to time out on every request.
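The try-retry-advance loop above can be sketched as follows. This is an illustrative Go sketch, not the gateway's source: the attempt type and the tryTarget/routeFallback names are hypothetical, and a production backoff would also add jitter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// attempt stands in for one request to a target: it returns the HTTP
// status and an error for transport-level failures.
type attempt func() (status int, err error)

// Statuses the doc lists as retryable.
var retryable = map[int]bool{429: true, 502: true, 503: true, 504: true}

// tryTarget issues up to attempts calls against one target, sleeping
// with exponential backoff (base, 2*base, 4*base, ...) between tries.
func tryTarget(do attempt, attempts int, base time.Duration) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		status, err := do()
		if err == nil && !retryable[status] {
			return status, nil // success or non-retryable status
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("status %d", status)
		}
		time.Sleep(base << i)
	}
	return 0, lastErr
}

// routeFallback walks targets in order and returns the last error
// if every target is exhausted, matching the documented behavior.
func routeFallback(targets []attempt, attempts int) (int, error) {
	lastErr := errors.New("no targets configured")
	for _, t := range targets {
		status, err := tryTarget(t, attempts, time.Millisecond)
		if err == nil {
			return status, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

func main() {
	calls := 0
	flaky := func() (int, error) { calls++; if calls < 3 { return 503, nil }; return 200, nil }
	status, err := routeFallback([]attempt{flaky}, 3)
	fmt.Println(status, err, calls) // succeeds on the third attempt
}
```

Note how a 503 on the first two attempts is absorbed by the per-target retry budget before the request would ever advance to the next target.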
Weighted load balancing
Distributes requests across targets by weight. Weights are relative: weights of 70 and 30 send 70% of requests to the first target and 30% to the second.
Use this when: you want to spread load across providers for cost or capacity reasons.
```yaml
strategy:
  mode: loadbalance
targets:
  - virtual_key: openai
    weight: 70
  - virtual_key: anthropic
    weight: 30
```
Only targets that support the requested model are candidates for selection.
Weight evaluation adds negligible overhead: a single random number generation per request. For practical purposes, latency is equivalent to the single strategy.
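Relative-weight selection amounts to one random draw walked across cumulative weight buckets. A minimal sketch, with the random value passed in explicitly so the bucket math is easy to verify (pickWeighted and target are illustrative names, not the gateway's API):

```go
package main

import "fmt"

// target is a stand-in for a gateway routing target.
type target struct {
	Key    string
	Weight int
}

// pickWeighted selects a target by relative weight. r is a random
// value in [0, 1); in a real router it would come from a PRNG.
func pickWeighted(targets []target, r float64) string {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	// Scale r onto the cumulative weight line and walk the buckets:
	// with weights 70/30, [0, 70) picks the first, [70, 100) the second.
	threshold := r * float64(total)
	acc := 0.0
	for _, t := range targets {
		acc += float64(t.Weight)
		if threshold < acc {
			return t.Key
		}
	}
	return targets[len(targets)-1].Key
}

func main() {
	ts := []target{{"openai", 70}, {"anthropic", 30}}
	fmt.Println(pickWeighted(ts, 0.10)) // lands in openai's bucket
	fmt.Println(pickWeighted(ts, 0.95)) // lands in anthropic's bucket
}
```

Because weights are only summed, 70/30 and 7/3 produce the same distribution.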
Conditional
Evaluates rules in order. The first matching rule determines the target. Use model for exact match or model_prefix for prefix match.
Use this when: different models should route to specific providers, e.g. all GPT models to OpenAI, all Claude models to Anthropic.
```yaml
strategy:
  mode: conditional
  conditions:
    - key: model
      value: gpt-4o
      target_key: openai
    - key: model
      value: gpt-4o-mini
      target_key: openai
    - key: model_prefix
      value: claude
      target_key: anthropic
    - key: model_prefix
      value: gemini
      target_key: gemini
targets:
  - virtual_key: openai
  - virtual_key: anthropic
  - virtual_key: gemini
```
If no rule matches, the request falls through to the first target.
Conditional routing pairs well with model aliases. Alias smart → claude-3-5-sonnet-20241022, then add a conditional rule for model_prefix: claude.
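First-match rule evaluation with exact and prefix matching can be sketched in a few lines. The rule type and matchTarget are hypothetical names for illustration only:

```go
package main

import (
	"fmt"
	"strings"
)

// rule mirrors one entry in the conditions list: Key is "model" or
// "model_prefix", Value the match text, Target the routing key.
type rule struct {
	Key, Value, Target string
}

// matchTarget returns the target of the first matching rule, or
// fallback (the first configured target) when nothing matches.
func matchTarget(rules []rule, model, fallback string) string {
	for _, r := range rules {
		switch r.Key {
		case "model":
			if model == r.Value {
				return r.Target
			}
		case "model_prefix":
			if strings.HasPrefix(model, r.Value) {
				return r.Target
			}
		}
	}
	return fallback
}

func main() {
	rules := []rule{
		{"model", "gpt-4o", "openai"},
		{"model_prefix", "claude", "anthropic"},
		{"model_prefix", "gemini", "gemini"},
	}
	fmt.Println(matchTarget(rules, "claude-3-5-sonnet-20241022", "openai"))
	fmt.Println(matchTarget(rules, "mistral-large", "openai")) // no rule matches: falls through
}
```

Order matters: put exact-match rules before broader prefix rules so a specific model is never swallowed by a prefix.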
Least-latency
Routes to the target with the lowest P50 latency as measured by a rolling latency tracker. On a cold start (no latency data yet) it picks a target randomly.
Use this when: you have multiple fast providers and want to minimise time-to-first-token automatically.
```yaml
strategy:
  mode: least-latency
targets:
  - virtual_key: openai
  - virtual_key: groq
  - virtual_key: anthropic
```
Adds a mutex read on the rolling latency map per request, typically under 1 µs. The latency tracker updates asynchronously after each response, so it does not add to request latency.
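A rolling P50 tracker of this shape can be sketched as below. This is an assumption-laden illustration (latencyTracker and its window size are invented here); the gateway's actual data structure may differ:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// latencyTracker keeps a rolling window of response times per target.
type latencyTracker struct {
	mu      sync.RWMutex
	window  int
	samples map[string][]float64 // target -> recent latencies (ms)
}

func newTracker(window int) *latencyTracker {
	return &latencyTracker{window: window, samples: map[string][]float64{}}
}

// Record appends a sample, evicting the oldest beyond the window.
// Called after the response completes, off the request hot path.
func (t *latencyTracker) Record(target string, ms float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := append(t.samples[target], ms)
	if len(s) > t.window {
		s = s[1:]
	}
	t.samples[target] = s
}

// p50 returns the median of a sorted copy of the samples.
func p50(s []float64) float64 {
	c := append([]float64(nil), s...)
	sort.Float64s(c)
	return c[len(c)/2]
}

// Pick returns the target with the lowest rolling P50, or "" when no
// target has data yet: the cold-start case, where the gateway falls
// back to a random choice.
func (t *latencyTracker) Pick(targets []string) string {
	t.mu.RLock() // the per-request cost: one read lock on the map
	defer t.mu.RUnlock()
	best, bestP50 := "", 0.0
	for _, tgt := range targets {
		s := t.samples[tgt]
		if len(s) == 0 {
			continue
		}
		if p := p50(s); best == "" || p < bestP50 {
			best, bestP50 = tgt, p
		}
	}
	return best
}

func main() {
	tr := newTracker(100)
	tr.Record("openai", 420)
	tr.Record("groq", 95)
	tr.Record("groq", 110)
	fmt.Println(tr.Pick([]string{"openai", "groq"})) // groq has the lower P50
}
```

The read lock in Pick is the only synchronization on the request path, which is why the per-request cost stays so small.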
Cost-optimized
Uses the built-in model catalog (2,500+ entries with pricing data) to estimate the input token cost for each target, then routes to the cheapest compatible provider. Falls back to the first compatible target if cost data is unavailable.
Use this when: you want to minimize spend without manually choosing models, letting the catalog handle it.
```yaml
strategy:
  mode: cost-optimized
targets:
  - virtual_key: openai
  - virtual_key: together
  - virtual_key: deepseek
  - virtual_key: gemini
```
Cost estimation uses the model name in the request and the catalog's input_cost_per_token field. The target with the lowest estimated cost for the matched model wins.
Combine cost-optimized with fallback by adding retry to each target: if the cheapest provider fails, the gateway retries with the next cheapest.
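The cost comparison reduces to a minimum over input_cost_per_token times the estimated token count. A sketch under stated assumptions: catalogEntry, cheapest, and the prices below are all made up for illustration, and only input cost is modeled:

```go
package main

import "fmt"

// catalogEntry is a hypothetical slice of the model catalog, keeping
// only the fields this sketch needs.
type catalogEntry struct {
	Provider          string
	Model             string
	InputCostPerToken float64 // the catalog's input_cost_per_token field
}

// cheapest returns the provider with the lowest estimated input cost
// for the requested model; fallback covers missing pricing data,
// mirroring the documented behavior.
func cheapest(catalog []catalogEntry, model string, estTokens float64, fallback string) string {
	best, bestCost := "", 0.0
	for _, e := range catalog {
		if e.Model != model {
			continue // only providers serving the requested model compete
		}
		if cost := e.InputCostPerToken * estTokens; best == "" || cost < bestCost {
			best, bestCost = e.Provider, cost
		}
	}
	if best == "" {
		return fallback
	}
	return best
}

func main() {
	// Illustrative prices, not real catalog data.
	catalog := []catalogEntry{
		{"together", "llama-3.1-70b", 0.88e-6},
		{"deepseek", "llama-3.1-70b", 0.90e-6},
	}
	fmt.Println(cheapest(catalog, "llama-3.1-70b", 1000, "openai")) // together
	fmt.Println(cheapest(catalog, "unknown-model", 1000, "openai")) // fallback: openai
}
```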
Content-based
Routes based on the content of the user's messages. Rules are evaluated in order; the first match wins. If no rule matches, the request falls through to the first target.
Use this when: different types of queries should go to different specialized models, e.g. code to a coding model, translation to a translation service, general chat to a cost-efficient default.
Three condition types are supported:
| Type | Behavior |
|---|---|
| `prompt_contains` | Case-insensitive substring match on any user message |
| `prompt_not_contains` | Matches when no user message contains the value |
| `prompt_regex` | Go regular-expression match on any user message |
Regex patterns are compiled at gateway startup. An invalid regex causes a startup error, so there is no silent misrouting.
```yaml
strategy:
  mode: content-based
  content_conditions:
    - type: prompt_contains
      value: "translate"
      target_key: deepl-provider
    - type: prompt_regex
      value: "(?i)(code|function|class|def |import )"
      target_key: openai
    - type: prompt_contains
      value: "summarize"
      target_key: anthropic
targets:
  - virtual_key: deepl-provider
  - virtual_key: openai
  - virtual_key: anthropic
```
Substring matching (prompt_contains) is near-zero cost. Regex matching adds overhead proportional to pattern complexity, but patterns are pre-compiled at startup so the hot path is a single regexp.MatchString call.
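The compile-at-startup, match-per-request split can be sketched as below. An illustrative Go sketch only (contentRule, compileRules, and route are invented names); it shows why the hot path is a single MatchString per regex rule:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// contentRule mirrors one content_conditions entry.
type contentRule struct {
	Type   string // prompt_contains | prompt_not_contains | prompt_regex
	Value  string
	Target string
	re     *regexp.Regexp // compiled once at startup
}

// compileRules pre-compiles regex rules; an invalid pattern fails
// immediately, mirroring the gateway's fail-at-startup behavior.
func compileRules(rules []contentRule) ([]contentRule, error) {
	for i := range rules {
		if rules[i].Type == "prompt_regex" {
			re, err := regexp.Compile(rules[i].Value)
			if err != nil {
				return nil, err
			}
			rules[i].re = re
		}
	}
	return rules, nil
}

// route returns the first matching rule's target, checking every user
// message, or fallback when nothing matches.
func route(rules []contentRule, userMsgs []string, fallback string) string {
	for _, r := range rules {
		matched := false
		switch r.Type {
		case "prompt_contains":
			for _, m := range userMsgs {
				if strings.Contains(strings.ToLower(m), strings.ToLower(r.Value)) {
					matched = true
					break
				}
			}
		case "prompt_not_contains":
			matched = true
			for _, m := range userMsgs {
				if strings.Contains(strings.ToLower(m), strings.ToLower(r.Value)) {
					matched = false
					break
				}
			}
		case "prompt_regex":
			for _, m := range userMsgs {
				if r.re.MatchString(m) { // pre-compiled: no per-request compile
					matched = true
					break
				}
			}
		}
		if matched {
			return r.Target
		}
	}
	return fallback
}

func main() {
	rules, err := compileRules([]contentRule{
		{Type: "prompt_contains", Value: "translate", Target: "deepl-provider"},
		{Type: "prompt_regex", Value: `(?i)(code|function|class)`, Target: "openai"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(route(rules, []string{"Please translate this to French"}, "anthropic"))
	fmt.Println(route(rules, []string{"Write a Python function"}, "anthropic"))
}
```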
A/B test
Splits traffic across variants by configured weights. Every routed request is tagged with the label field for downstream observability (e.g., in request-logger output or analytics pipelines).
Use this when: you want to compare quality, latency, or cost between two providers on live traffic without client-side changes.
```yaml
strategy:
  mode: ab-test
  ab_variants:
    - target_key: openai
      weight: 70
      label: control
    - target_key: anthropic
      weight: 30
      label: challenger
targets:
  - virtual_key: openai
  - virtual_key: anthropic
```
Weights are relative: 70 and 30 send 70% of traffic to openai and 30% to anthropic. If a variant's weight is 0, it is treated as weight 1, so it still receives an equal share alongside the remaining variants. Negative weights are rejected at gateway startup.
The label field is emitted in every gateway.request.completed event so you can aggregate results per variant.
Combine A/B test with the request-logger plugin persisting to Postgres, then query SELECT ab_variant, AVG(latency_ms), AVG(total_tokens) FROM requests GROUP BY ab_variant to compare variants.
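Variant selection works like weighted load balancing, with the zero-weight rule applied and the chosen label carried along for observability. A sketch with invented names (variant, pickVariant); the random value is a parameter here to keep the bucket math explicit:

```go
package main

import "fmt"

// variant mirrors one ab_variants entry.
type variant struct {
	Target, Label string
	Weight        int
}

// effectiveWeight applies the documented rule: weight 0 counts as 1.
func effectiveWeight(w int) int {
	if w == 0 {
		return 1
	}
	return w
}

// pickVariant selects by cumulative effective weight; r is in [0, 1).
// The returned variant's Label is what gets tagged on the request.
func pickVariant(vs []variant, r float64) variant {
	total := 0
	for _, v := range vs {
		total += effectiveWeight(v.Weight)
	}
	threshold := r * float64(total)
	acc := 0.0
	for _, v := range vs {
		acc += float64(effectiveWeight(v.Weight))
		if threshold < acc {
			return v
		}
	}
	return vs[len(vs)-1]
}

func main() {
	vs := []variant{{"openai", "control", 70}, {"anthropic", "challenger", 30}}
	v := pickVariant(vs, 0.5)
	fmt.Println(v.Target, v.Label) // tag the request with v.Label downstream
}
```

With all weights at 0, every variant gets effective weight 1 and traffic splits evenly, which is the "equal distribution" behavior described above.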
Combining strategies with circuit breakers
All strategies respect per-target circuit breakers. A target whose circuit breaker is open is excluded from selection.
```yaml
targets:
  - virtual_key: openai
    circuit_breaker:
      failure_threshold: 5
      success_threshold: 2
      timeout: "30s"
```
The circuit breaker opens after failure_threshold consecutive failures, stays open for timeout, then enters half-open state where it allows one probe request. After success_threshold successes it closes again.
Related pages
- Configuration reference – full YAML reference for all strategy modes
- Use cases – recipe-style configurations for common scenarios
- Benchmarks – performance data for different routing strategies
- Plugins – combine routing with safety and observability plugins