Performance Benchmarks
Go-native means fast. Here's the proof.
Every number on this page comes from reproducible, open-source benchmarks. No synthetic micro-benchmarks; just real gateway overhead measured under sustained load against a mock upstream with constant latency.
Methodology
- Hardware: dedicated Linux VM, 4 vCPU, 8 GB RAM
- Load generator: k6 (Grafana)
- Upstream: mock server returning a fixed response after a constant 50 ms delay, which isolates gateway overhead from provider variance (sketched below)
- RPS levels tested: 100, 200, 500, 1 000, 2 000
- Duration: 60 seconds sustained at each level
- What was measured: gateway-added overhead only (total response time minus the 50 ms upstream latency); memory sampled via `docker stats` at 1 s intervals
Each test was run five times; the tables report the median of the five p50 values and the median of the five p99 values.
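For concreteness, here is a minimal sketch of what such a mock upstream looks like in Go. The port, route, and response body here are illustrative assumptions, not the benchmark repo's actual code:

```go
// mock-upstream sketch: sleep a constant 50 ms, then return a fixed body.
// Any latency beyond 50 ms measured at the load generator is gateway overhead.
package main

import (
	"log"
	"net/http"
	"time"
)

const upstreamLatency = 50 * time.Millisecond // constant, so provider variance is zero

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(upstreamLatency)
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}]}`)) // fixed response
	})
	log.Fatal(http.ListenAndServe(":9090", nil)) // port chosen for illustration
}
```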
Ferro Labs AI Gateway vs Python-Based Alternatives at 500 RPS
Head-to-head at 500 requests per second, sustained for 60 seconds. Python-based gateways begin showing consistent p99 degradation beyond 200 RPS, so 500 RPS puts them well past that threshold.
| Metric | AI Gateway (Go) | Python-based gateways |
|---|---|---|
| p50 overhead | <0.5 ms | ~3 ms |
| p99 overhead | <1 ms | ~15 ms+ |
| Throughput | 500 RPS sustained | Degraded (drops under target) |
| Memory | ~120 MB | ~400 MB+ |
| Success rate | 100% | <99.5% |
The Go gateway's latency advantage compounds at scale. At 500 RPS the p99 gap is over 15x, and the Python alternatives start dropping requests entirely.
Scaling Behavior: AI Gateway Across RPS Levels
The fastest AI gateway should add near-zero overhead regardless of load. Here is how AI Gateway scales from 100 to 2 000 RPS:
| RPS | p50 overhead | p99 overhead | Throughput | Memory | Success rate |
|---|---|---|---|---|---|
| 100 | 0.2 ms | 0.4 ms | 100 RPS sustained | ~45 MB | 100% |
| 200 | 0.3 ms | 0.5 ms | 200 RPS sustained | ~60 MB | 100% |
| 500 | 0.4 ms | 0.8 ms | 500 RPS sustained | ~120 MB | 100% |
| 1 000 | 0.5 ms | 0.9 ms | 1 000 RPS sustained | ~180 MB | 100% |
| 2 000 | 0.6 ms | 1.1 ms | 2 000 RPS sustained | ~250 MB | 100% |
Sub-millisecond p99 overhead holds through 1 000 RPS. Even at 2 000 RPS the gateway adds just over 1 ms at the tail, well within noise for any LLM API call that takes 200 ms to 2 s on the provider side: against a 200 ms provider response, 1.1 ms of overhead is roughly 0.5% of total latency.
Why Go?
- Goroutines for concurrency: thousands of in-flight requests multiplexed onto a small thread pool with near-zero scheduling cost. No thread-per-request overhead.
- No GIL: every CPU core does real work in parallel. Python's Global Interpreter Lock serializes CPU-bound gateway logic (auth, routing, logging) across all requests.
- Low GC overhead with careful allocation: arena-style buffering and `sync.Pool` reuse keep heap churn minimal, so GC pauses stay under 0.5 ms even at 2 000 RPS (see the sketch after this list).
- Single static binary, zero runtime dependencies: no interpreter, no virtualenv, no pip install at deploy time. One `COPY` in your Dockerfile.
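To make the goroutine and allocation points concrete, here is a minimal sketch of the pattern, with assumed names, routes, and ports rather than the gateway's actual source: `net/http` already runs each request handler on its own goroutine, and a `sync.Pool` recycles copy buffers so the proxy hot path allocates almost nothing per request.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// bufPool hands out 32 KiB copy buffers and takes them back after each
// request, so sustained load produces almost no heap churn for the GC.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 32*1024) },
}

// proxyBody streams an upstream response to the client through a pooled
// buffer instead of allocating a fresh one on every request.
func proxyBody(w http.ResponseWriter, upstream io.Reader) error {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // return the buffer for the next request
	_, err := io.CopyBuffer(w, upstream, buf)
	return err
}

func main() {
	// net/http serves each request on its own goroutine: no thread-pool
	// tuning, no per-request OS thread.
	http.HandleFunc("/v1/chat", func(w http.ResponseWriter, r *http.Request) {
		resp, err := http.Get("http://localhost:9090/") // the mock upstream above
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		if err := proxyBody(w, resp.Body); err != nil {
			log.Printf("proxy copy failed: %v", err)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Pointing k6 at :8080 with the mock upstream on :9090 reproduces the shape of the methodology above, though not the hardened gateway itself.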
Reproduce the Benchmarks
The full benchmark suite is open source. Clone it, run it, verify every number on this page. This is a LiteLLM alternative performance comparison you can audit yourself.
```bash
# Clone the benchmark repository
git clone https://github.com/ferro-labs/ai-gateway-performance-benchmarks.git
cd ai-gateway-performance-benchmarks

# Start the mock upstream and gateway containers
docker compose up -d

# Run the full benchmark suite (100–2000 RPS, 60 s each)
./run-benchmarks.sh

# Or run a single RPS level
k6 run --env RPS=500 --env DURATION=60s scripts/gateway-overhead.js
```
Results are written to `results/` as JSON. The included `plot.py` script generates comparison charts.
```bash
# Generate comparison plots
python3 plot.py results/
```
Related pages
- Why Ferro Labs AI Gateway? – architecture decisions behind these numbers
- Quickstart – deploy AI Gateway in under 5 minutes
- Routing policies – configure the routing layer benchmarked above