
Performance Benchmarks

Go-native means fast. Here's the proof.

Every number on this page comes from reproducible, open-source benchmarks. No synthetic micro-benchmarks: real gateway overhead measured under sustained load against a mock upstream with constant latency.


Methodology

Test Environment & Parameters

Hardware: Dedicated Linux VM (4 vCPU, 8 GB RAM)

Load generator: k6 (Grafana)

Upstream: Mock server returning a fixed response after a constant 50 ms latency. This isolates gateway overhead from provider variance (a sketch of such an upstream follows this section).

RPS levels tested: 100, 200, 500, 1 000, 2 000

Duration: 60 seconds sustained at each level

What was measured: Gateway-added overhead only (total response time minus the 50 ms upstream latency). Memory sampled via docker stats at 1 s intervals.

All tests were run five times; tables report the median of medians (p50) and the median of p99s.
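
For concreteness, here is a minimal sketch of such a constant-latency mock upstream, assuming a plain Go net/http server (illustrative only; the port and response body are assumptions, and the benchmark repository's actual mock may differ):

// mock_upstream.go: fixed-latency upstream for isolating gateway overhead.
package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(50 * time.Millisecond) // constant upstream latency
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(`{"choices":[{"message":{"content":"ok"}}]}`))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

With a fixed 50 ms floor, any response time above 50 ms is attributable to the gateway, which is exactly the overhead the tables below report.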


Ferro Labs AI Gateway vs Python-Based Alternatives at 500 RPS

Head-to-head at 500 requests per second sustained for 60 seconds. Python-based gateways begin showing consistent p99 degradation beyond roughly 200 RPS, so 500 RPS sits well past that threshold.

Metric          AI Gateway (Go)     Python-based gateways
p50 overhead    <0.5 ms             ~3 ms
p99 overhead    <1 ms               ~15 ms+
Throughput      500 RPS sustained   Degraded (drops under target)
Memory          ~120 MB             ~400 MB+
Success rate    100%                <99.5%

The Go AI gateway latency advantage compounds at scale. At 500 RPS, the p99 gap is over 15x, and Python alternatives start dropping requests entirely.


Scaling Behavior: AI Gateway Across RPS Levels

The fastest AI gateway should add near-zero overhead regardless of load. Here is how AI Gateway scales from 100 to 2 000 RPS:

RPS     p50 overhead   p99 overhead   Throughput            Memory    Success rate
100     0.2 ms         0.4 ms         100 RPS sustained     ~45 MB    100%
200     0.3 ms         0.5 ms         200 RPS sustained     ~60 MB    100%
500     0.4 ms         0.8 ms         500 RPS sustained     ~120 MB   100%
1 000   0.5 ms         0.9 ms         1 000 RPS sustained   ~180 MB   100%
2 000   0.6 ms         1.1 ms         2 000 RPS sustained   ~250 MB   100%

Sub-millisecond p99 overhead holds through 1 000 RPS. Even at 2 000 RPS the gateway adds just over 1 ms at the tail; 1.1 ms is roughly 0.5% of even a fast 200 ms provider call, well within noise for any LLM API call that takes 200 ms–2 s on the provider side.


Why Go?

Why a Go AI gateway outperforms Python-based LLM proxies:
  • Goroutines for concurrency: thousands of in-flight requests multiplexed onto a small thread pool with near-zero scheduling cost. No thread-per-request overhead.
  • No GIL: every CPU core does real work in parallel. Python's Global Interpreter Lock serializes CPU-bound gateway logic (auth, routing, logging) across all requests.
  • Low GC overhead with careful allocation: arena-style buffering and sync.Pool reuse keep heap churn minimal. GC pauses stay under 0.5 ms even at 2 000 RPS (see the sketch after this list).
  • Single static binary, zero dependencies at runtime: no interpreter, no virtualenv, no pip install at deploy time. One COPY in your Dockerfile.
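
Below is a minimal sketch of the allocation pattern in the third bullet, assuming a bare net/http reverse proxy (illustrative only, not the gateway's actual source; the upstream address, port, and buffer size are assumptions):

package main

import (
    "io"
    "log"
    "net/http"
    "sync"
)

// Pooled 32 KB scratch buffers: requests reuse memory rather than
// allocating fresh slices, keeping heap churn and GC pauses flat.
var bufPool = sync.Pool{
    New: func() any { return make([]byte, 32*1024) },
}

func proxy(w http.ResponseWriter, r *http.Request) {
    // net/http already runs each request on its own goroutine, so
    // thousands of in-flight calls need no extra concurrency machinery.
    resp, err := http.Get("http://upstream:8080" + r.URL.Path)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()

    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // hand the buffer back for the next request

    w.WriteHeader(resp.StatusCode)
    io.CopyBuffer(w, resp.Body, buf) // stream the body through the pooled buffer
}

func main() {
    log.Fatal(http.ListenAndServe(":9090", http.HandlerFunc(proxy)))
}

The same idea extends to header maps and log records: steady-state request handling allocates very little, so the collector rarely has work to do.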

Reproduce the Benchmarks

The full benchmark suite is open source. Clone it, run it, and verify every number on this page: a performance comparison against LiteLLM-style Python alternatives that you can audit yourself.

# Clone the benchmark repository
git clone https://github.com/ferro-labs/ai-gateway-performance-benchmarks.git
cd ai-gateway-performance-benchmarks
# Start the mock upstream and gateway containers
docker compose up -d
# Run the full benchmark suite (100–2 000 RPS, 60 s each)
./run-benchmarks.sh
# Or run a single RPS level
k6 run --env RPS=500 --env DURATION=60s scripts/gateway-overhead.js

Results are written to results/ as JSON. The included plot.py script generates comparison charts.

# Generate comparison plots
python3 plot.py results/