Slash LLM API Costs with Semantic Caching in Spring AI and pgvector

AI-driven applications often make redundant LLM API calls simply because users phrase the same request differently. For example, “How do I reset my password?” and an earlier “Password reset instructions” express the same intent, yet the second query is still sent to the model and processed from scratch. This redundancy drains budgets and inflates latency. Semantic caching addresses it by matching queries on meaning rather than literal text. Implemented with Spring AI and pgvector, it can cut API costs substantially while keeping responses fast.

Why Traditional Caching Fails for Semantic Queries

Most caching strategies rely on exact string matching. Redis or Memcached store responses under keys derived from the precise wording of a user’s prompt. This works for static data but breaks down when the phrasing varies while the intent stays the same. Developers often compound the problem by computing vector similarity in application code, running heavy computations outside the database layer. Others send queries to external embedding services before checking the cache, adding network latency that defeats the purpose of caching.

These inefficiencies stem from three common mistakes:

Using literal key-value stores that ignore semantic equivalence.
Offloading vector math to the JVM instead of leveraging PostgreSQL’s native pgvector engine.
Embedding user input through a remote service before checking the cache, which adds a network round trip to every lookup, including cache hits.
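
The first mistake is easy to reproduce. The following is a minimal sketch, not taken from any real caching library: `ExactMatchCache` is a hypothetical class that stores responses under a normalized version of the prompt text. Normalization handles casing and whitespace, but a paraphrase with the same intent still misses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo class: an exact-match cache keyed on the normalized prompt text.
final class ExactMatchCache {
    private final Map<String, String> entries = new HashMap<>();

    // Normalization only removes case and surrounding whitespace; it cannot see meaning.
    static String key(String prompt) {
        return prompt.trim().toLowerCase(Locale.ROOT);
    }

    void put(String prompt, String response) {
        entries.put(key(prompt), response);
    }

    // Returns the cached response, or null on a miss.
    String get(String prompt) {
        return entries.get(key(prompt));
    }

    public static void main(String[] args) {
        ExactMatchCache cache = new ExactMatchCache();
        cache.put("Password reset instructions", "Open Settings, then Security, then Reset password.");

        // Same wording with different casing: served from the cache.
        System.out.println(cache.get("password reset instructions"));
        // Same intent with different wording: a miss, so the LLM would be called again.
        System.out.println(cache.get("How do I reset my password?"));
    }
}
```

A key-value store has no notion of distance between keys, which is why the lookup has to move to vector similarity.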

How Semantic Caching Works in Spring AI

The solution intercepts LLM calls at the framework level using Spring AI’s Advisor pattern. Before a query is sent to the external API, the system checks a pgvector-backed cache for semantically similar past requests. If a match is found, the cached answer is returned immediately; otherwise the request goes to the LLM and the new response is stored for future lookups. Cache hits therefore avoid both the API cost and the model’s response time.
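
The lookup logic itself is small. Here is a minimal in-memory sketch of it, assuming embeddings are already available as `float[]` arrays. `InMemorySemanticCache` is a hypothetical stand-in for the pgvector-backed store, and it scans linearly where the database would use an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for the pgvector-backed cache.
final class InMemorySemanticCache {
    private record Entry(float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    InMemorySemanticCache(double threshold) {
        this.threshold = threshold;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    void put(float[] embedding, String response) {
        entries.add(new Entry(embedding.clone(), response));
    }

    // Finds the most similar entry and returns it only if it meets the threshold.
    Optional<String> lookup(float[] queryEmbedding) {
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry e : entries) {
            double score = cosine(queryEmbedding, e.embedding());
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        return (best != null && bestScore >= threshold)
                ? Optional.of(best.response())
                : Optional.empty();
    }

    public static void main(String[] args) {
        InMemorySemanticCache cache = new InMemorySemanticCache(0.96);
        // Toy 3-dimensional vectors; a real embedding model produces hundreds of dimensions.
        cache.put(new float[] {1f, 0f, 0f}, "cached answer");

        System.out.println(cache.lookup(new float[] {0.99f, 0.05f, 0f}).isPresent()); // nearly parallel vector
        System.out.println(cache.lookup(new float[] {0.6f, 0.8f, 0f}).isPresent());   // clearly different direction
    }
}
```

On a miss the caller would invoke the LLM and then call `put` with the query embedding and the fresh response.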

To implement semantic caching effectively:

Deploy a custom CallAroundAdvisor to transparently intercept prompts before they hit the LLM.
Run embeddings locally using a lightweight ONNX model like all-MiniLM-L6-v2, which generates vectors in under 5 milliseconds without network calls.
Store embeddings in PostgreSQL with an HNSW index to enable fast cosine similarity searches at scale.
Apply a strict similarity threshold—typically above 0.96—to ensure cache hits reflect identical intent, not just close phrasing.
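
To make the storage step concrete, the sketch below shows roughly what the database side could look like if written by hand. The table name `semantic_cache`, its columns, and the `SemanticCacheSql` holder class are illustrative assumptions; Spring AI’s PgVectorStore manages its own table, so this is only meant to show what an HNSW-indexed cosine search involves. The dimension of 384 matches the output size of all-MiniLM-L6-v2.

```java
import java.util.StringJoiner;

// Hypothetical holder for hand-written pgvector SQL; all table and column names are illustrative.
final class SemanticCacheSql {
    // all-MiniLM-L6-v2 produces 384-dimensional embeddings.
    static final int DIMENSIONS = 384;

    // Table plus an HNSW index using cosine distance as the operator class.
    static final String CREATE_SCHEMA = """
        CREATE EXTENSION IF NOT EXISTS vector;
        CREATE TABLE IF NOT EXISTS semantic_cache (
            id        BIGSERIAL PRIMARY KEY,
            prompt    TEXT NOT NULL,
            response  TEXT NOT NULL,
            embedding vector(%d) NOT NULL
        );
        CREATE INDEX IF NOT EXISTS semantic_cache_embedding_idx
            ON semantic_cache USING hnsw (embedding vector_cosine_ops);
        """.formatted(DIMENSIONS);

    // <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
    // Both parameters take the same query vector; the caller compares the returned
    // similarity against its threshold (for example 0.96).
    static final String NEAREST = """
        SELECT response, 1 - (embedding <=> ?::vector) AS similarity
        FROM semantic_cache
        ORDER BY embedding <=> ?::vector
        LIMIT 1
        """;

    // Formats a vector as the text literal pgvector accepts, e.g. [0.5,-1.0,2.0].
    static String toVectorLiteral(float[] v) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float x : v) {
            joiner.add(Float.toString(x));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(CREATE_SCHEMA);
        System.out.println(NEAREST);
        System.out.println(toVectorLiteral(new float[] {0.5f, -1.0f, 2.0f}));
    }
}
```

Ordering by the distance operator is what allows PostgreSQL to use the HNSW index; the similarity threshold is then applied to the single row that comes back.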

Code Example: Building a Reusable Semantic Cache

Below is a concise implementation of a semantic cache advisor for Spring AI. It uses PgVectorStore to manage embeddings and intercepts prompts with a custom advisor.

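// Sketch against the Spring AI 1.0.0 milestone API that ships CallAroundAdvisor.
// Accessor and builder names changed between milestones; verify against your version.
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.AdvisedRequest;
import org.springframework.ai.chat.client.advisor.api.AdvisedResponse;
import org.springframework.ai.chat.client.advisor.api.CallAroundAdvisor;
import org.springframework.ai.chat.client.advisor.api.CallAroundAdvisorChain;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private static final String RESPONSE_KEY = "cached_response";
    private final VectorStore vectorStore; // a PgVectorStore, injected via its interface
    private final double similarityThreshold = 0.96;

    public SemanticCacheAdvisor(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public String getName() { return "SemanticCacheAdvisor"; }

    @Override
    public int getOrder() { return 0; } // assumed value; lower values run earlier

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        // Nearest cached prompt, accepted only above the similarity threshold.
        List<Document> matches = vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withSimilarityThreshold(similarityThreshold)
                .withTopK(1)
        );

        if (!matches.isEmpty()) {
            // Cache hit: wrap the stored text in a ChatResponse and skip the LLM call.
            String cached = matches.get(0).getMetadata().get(RESPONSE_KEY).toString();
            ChatResponse hit = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
            return new AdvisedResponse(hit, request.adviseContext());
        }

        // Cache miss: call the LLM, then store its answer keyed by the query text.
        AdvisedResponse response = chain.nextAroundCall(request);
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of(RESPONSE_KEY, answer))));
        return response;
    }
}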

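This advisor handles the full lifecycle: a similarity search against the cache, returning the cached answer on a hit, and storing the new response on a miss. The vector store embeds the query text itself using its configured embedding model, so the advisor never handles raw vectors. Because it plugs into Spring AI’s advisor chain, it stays decoupled from business logic and can be reused across services.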

Best Practices for Production-Grade Semantic Caching

To scale semantic caching without compromising accuracy or speed, follow these guidelines:

Enable HNSW Indexing: PostgreSQL’s pgvector supports Hierarchical Navigable Small World indexes for sub-10ms similarity searches even with millions of cached queries.
Set High Similarity Thresholds: Keep the threshold at 0.95 or higher (the example above uses 0.96) to minimize false positives, where unrelated intents are incorrectly matched.
Monitor Cache Hit Rates: Track how often cached responses are served versus new LLM calls. A hit rate above 60% indicates strong efficiency gains.
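Rotate Embedding Models Gradually: Update local ONNX models only after validating their performance on your specific query set to avoid drift. Embeddings produced by different models are not comparable, so re-embed or invalidate existing cache entries when you switch.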
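
For the hit-rate guideline, a pair of counters is enough to get started. This is a minimal sketch using only the JDK; `CacheHitStats` is a hypothetical helper, and in a Spring application the same numbers would more likely be published through a metrics library.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: thread-safe counters for cache hits and misses.
final class CacheHitStats {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    void recordHit() {
        hits.increment();
    }

    void recordMiss() {
        misses.increment();
    }

    // Fraction of lookups served from the cache; 0.0 before any lookup is recorded.
    double hitRate() {
        long h = hits.sum();
        long total = h + misses.sum();
        return total == 0 ? 0.0 : (double) h / total;
    }

    public static void main(String[] args) {
        CacheHitStats stats = new CacheHitStats();
        stats.recordHit();
        stats.recordHit();
        stats.recordHit();
        stats.recordMiss();
        stats.recordMiss();
        System.out.printf("hit rate: %.2f%n", stats.hitRate());
    }
}
```

An advisor would call `recordHit()` on the branch that returns a cached answer and `recordMiss()` on the branch that falls through to the LLM.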

As LLM usage grows across enterprises, optimizing API consumption becomes critical. Semantic caching with Spring AI and pgvector offers a practical way to reduce costs and latency at scale without giving up accuracy or control. The next evolution may involve federated search across distributed caches, but for now, this pattern delivers immediate value to teams looking to optimize their AI infrastructure.

