Semantic Caching
The Problem with Exact-Match Caching
Traditional caches key on the exact request body. That means two queries asking the same thing in slightly different words, such as "What's the weather?" and "Tell me the weather", produce separate cache misses and separate billable calls to the upstream provider.
Semantic caching solves this by comparing the meaning of the query, not its literal text.
How It Works
- Embed the query: The incoming prompt is passed through a lightweight embedding model to produce a dense vector.
- Similarity search: The vector is compared against cached entries using an HNSW index in PostgreSQL with pgvector.
- Serve or forward: If a cached entry meets or exceeds the similarity threshold, the cached response is returned immediately. Otherwise the request is forwarded to the AI provider and the response is cached for future hits.
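The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's implementation: it swaps the HNSW index in pgvector for a linear scan over an in-memory list, and the names `SemanticCache`, `get_or_forward`, `fake_embed`, and `FAKE_VECTORS` are hypothetical. The hand-picked vectors stand in for real embedding model output, and skipping expired entries during lookup is an assumption.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    """Toy in-memory semantic cache (hypothetical; linear scan, no HNSW)."""
    embed: Callable[[str], List[float]]
    similarity_threshold: float = 0.92
    ttl_seconds: int = 3600
    # Each entry is (embedding vector, cached response, time stored).
    entries: List[Tuple[List[float], str, float]] = field(default_factory=list)

    def lookup(self, prompt: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        query = self.embed(prompt)                      # step 1: embed the query
        best_score, best_response = -1.0, None
        for vector, response, stored_at in self.entries:
            if now - stored_at > self.ttl_seconds:
                continue                                # assumption: never serve expired entries
            score = cosine_similarity(query, vector)    # step 2: similarity search
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.similarity_threshold:
            return best_response
        return None

    def store(self, prompt: str, response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        # Evict expired entries as part of the write, then add the new one.
        self.entries = [e for e in self.entries if now - e[2] <= self.ttl_seconds]
        self.entries.append((self.embed(prompt), response, now))

    def get_or_forward(self, prompt: str, forward: Callable[[str], str],
                       now: Optional[float] = None) -> Tuple[str, bool]:
        """Step 3: return (response, was_cache_hit)."""
        cached = self.lookup(prompt, now)
        if cached is not None:
            return cached, True
        response = forward(prompt)                      # cache miss: call the provider
        self.store(prompt, response, now)
        return response, False


# Hand-picked 2-D vectors standing in for real embeddings (illustrative only).
FAKE_VECTORS = {
    "What's the weather?": [1.0, 0.0],
    "Tell me the weather": [0.98, 0.2],
    "What's the stock price?": [0.0, 1.0],
}


def fake_embed(prompt: str) -> List[float]:
    return FAKE_VECTORS[prompt]


if __name__ == "__main__":
    cache = SemanticCache(embed=fake_embed)
    upstream = lambda prompt: f"answer to: {prompt}"
    print(cache.get_or_forward("What's the weather?", upstream))  # miss, forwarded upstream
    print(cache.get_or_forward("Tell me the weather", upstream))  # hit, served from cache
```

With these stand-in vectors the two weather prompts have a cosine similarity of about 0.98, so the second call is served from the cache, while the stock-price prompt scores far below the threshold and would be forwarded.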
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| similarity_threshold | float | 0.92 | Minimum cosine similarity (0.0 to 1.0) required to treat a cached entry as a hit. Higher values are stricter. |
| ttl_seconds | int | 3600 | Time-to-live for cached entries in seconds. Expired entries are evicted on the next write cycle. |
| vector_dimensions | int | 1536 | Dimensionality of the embedding vectors. Must match the embedding model output. |
| embedding_model | string | text-embedding-3-small | The model used to generate query embeddings. |
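To make the constraints in the table concrete, the options can be modeled as a small validated settings object. This is only a sketch: the class name `SemanticCacheConfig` and the method `check_embedding` are hypothetical, the real configuration format may differ, and the requirement that `ttl_seconds` be positive is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SemanticCacheConfig:
    """Hypothetical container for the options above, using the documented defaults."""
    similarity_threshold: float = 0.92
    ttl_seconds: int = 3600
    vector_dimensions: int = 1536
    embedding_model: str = "text-embedding-3-small"

    def __post_init__(self) -> None:
        # Cosine similarity threshold is documented as lying in 0.0 to 1.0.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")
        # Assumption: a zero or negative TTL is treated as invalid here.
        if self.ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        if self.vector_dimensions <= 0:
            raise ValueError("vector_dimensions must be positive")

    def check_embedding(self, vector: Sequence[float]) -> None:
        """Reject vectors whose length does not match vector_dimensions."""
        if len(vector) != self.vector_dimensions:
            raise ValueError(
                f"expected {self.vector_dimensions} dimensions, got {len(vector)}"
            )


if __name__ == "__main__":
    config = SemanticCacheConfig(similarity_threshold=0.95)
    config.check_embedding([0.0] * 1536)  # passes silently
```

Checking the vector length up front surfaces a mismatch between `vector_dimensions` and the embedding model early, rather than at query time.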
Cost Savings
Semantic caching directly reduces the number of billable tokens sent to upstream providers.
Example: If 30% of your queries are semantically similar and you serve 10,000 requests per day, that is 3,000 cached responses per day. At an average cost of $0.002 per request, you save approximately $6 per day, or roughly $180 per month, without any change to your application code.
Actual savings depend on your traffic patterns, similarity threshold, and the cost of the models you use.
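The arithmetic in the example can be reproduced with a short helper for plugging in your own numbers. The function name `estimate_savings` is hypothetical. Like the example, it assumes every semantically similar query becomes a cache hit, and it assumes a 30-day month.

```python
from typing import Tuple


def estimate_savings(requests_per_day: int,
                     cache_hit_rate: float,
                     cost_per_request: float,
                     days_per_month: int = 30) -> Tuple[float, float, float]:
    """Return (cached requests per day, daily savings, monthly savings)."""
    cached_per_day = requests_per_day * cache_hit_rate
    daily_savings = cached_per_day * cost_per_request
    monthly_savings = daily_savings * days_per_month
    return cached_per_day, daily_savings, monthly_savings


if __name__ == "__main__":
    # Figures from the example: 10,000 requests/day, 30% hits, $0.002 per request.
    cached, daily, monthly = estimate_savings(10_000, 0.30, 0.002)
    print(f"{cached:.0f} cached responses per day")
    print(f"${daily:.2f} saved per day, ${monthly:.2f} saved per month")
```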
Semantic caching is available on the Pro plan and above. It is not included in the open-source gateway or the Community and Starter plans.
Join the waitlist to get access.