
How pgvector slashes LLM costs with intelligent caching

Repeated LLM analysis drains margins faster than growth can compensate. Discover how embedding caching in Postgres with pgvector turns every customer query into a cost-saving opportunity without sacrificing speed or accuracy.


AI-powered features often carry a hidden tax: repeated LLM calls on the same data. A lead qualification tool, a weekly pipeline report, and a monthly audit might all process identical customer messages, each generating fresh charges while yielding little new insight. This duplication doesn’t just inflate costs; it erodes the very margins that AI promises to boost.

Enter pgvector, a Postgres extension that turns your existing database into a cost-efficient embedding cache. By storing vector representations of incoming messages once and reusing them across multiple features, teams can reduce LLM spend by 60–80% without sacrificing performance or accuracy. The approach isn’t experimental; it’s a production-grade strategy already saving companies thousands per month in cloud bills.

From wasteful re-analysis to embedded intelligence

The problem starts innocently enough. You build a feature that analyzes customer messages using an LLM, then charge users for access. Revenue scales with usage, but so do costs—often faster. A single message might be processed by multiple tools: a real-time intent classifier, a weekly report generator, and a monthly deep-dive audit. Each call incurs a charge, even though the underlying text hasn’t changed.

The solution lies in decoupling analysis from processing. Instead of running the LLM every time a message needs to be understood, pre-generate a vector embedding when the message arrives and store it alongside the raw text.

INSERT INTO dm_analysis (dm_id, embedding)
VALUES ($1, $2);

This one-time cost—just $0.000003 per message using text-embedding-3-small—creates a reusable asset. The embedding becomes the foundation for future queries, enabling semantic search, intent scoring, and summarization without additional LLM calls.
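As a rough sketch, the write path can live in whatever service ingests messages. The snippet below assumes the official openai Node client and a node-postgres pool; storeMessageEmbedding is a hypothetical helper named for illustration, not something defined elsewhere in this article.

import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pool = new Pool();     // reads PG* connection settings from the environment

// Hypothetical ingestion hook: runs once per incoming message.
async function storeMessageEmbedding(dmId, text) {
  // One cheap embedding call (text-embedding-3-small, 1536 dimensions).
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });

  // pgvector accepts a '[x,y,...]' text literal for the vector column.
  await pool.query(
    "INSERT INTO dm_analysis (dm_id, embedding) VALUES ($1, $2)",
    [dmId, `[${data[0].embedding.join(",")}]`]
  );
}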

Designing a cache that lives in the database

The schema mirrors this strategy: a single table stores both the vector and the derived insights, ensuring consistency and eliminating synchronization overhead.

CREATE TABLE dm_analysis (
  analysis_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  dm_id UUID NOT NULL UNIQUE REFERENCES dms(dm_id),
  
  -- Expensive fields, populated only when LLM runs:
  intent VARCHAR(50),
  qualification_score DECIMAL(3,2),
  fit_label VARCHAR(20),
  urgency_level VARCHAR(20),
  summary TEXT,
  
  -- Always populated, free:
  embedding VECTOR(1536),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_analysis_embedding ON dm_analysis USING hnsw (embedding vector_cosine_ops);

The table’s structure enforces a clear contract: if summary is null, the message hasn’t been enriched. If it’s populated, the LLM has already analyzed it once, and the result is cached for reuse. A single query can filter by structured fields and vector similarity at the same time.

Target vectors: guiding queries without redundant analysis

When generating a report—say, “Hot Leads This Week”—the goal isn’t to feed every message back into an LLM. That would replicate the inefficiency you’re trying to eliminate. Instead, craft a target vector that represents the ideal message for the report’s intent.

const reportVector = await embed(
  "I'm ready to buy this week. We have budget approved. Need to make a decision soon."
);

This target vector acts as a semantic filter. Postgres uses the HNSW index to quickly retrieve the 50 most relevant messages, comparing their embeddings to the target using cosine distance.

SELECT d.dm_id, da.summary, da.qualification_score
FROM dms d
JOIN dm_analysis da ON da.dm_id = d.dm_id
WHERE d.tenant_id = $1
  AND d.received_at >= $2 -- this week
ORDER BY da.embedding <=> $3::vector
LIMIT 50;

The query executes in under 50 milliseconds at scale, costing nothing beyond standard database operations. Only the messages that haven’t been enriched require LLM calls.
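Wired together from application code, the whole report path is one embedding call plus one indexed query. The sketch below reuses the embed() helper and the SQL from above; fetchHotLeads and the node-postgres pool parameter are assumptions for illustration.

// Hypothetical report helper: one embedding call, one indexed query.
async function fetchHotLeads(pool, tenantId, weekStart) {
  const reportVector = await embed(
    "I'm ready to buy this week. We have budget approved. Need to make a decision soon."
  );

  const { rows } = await pool.query(
    `SELECT d.dm_id, da.summary, da.qualification_score
       FROM dms d
       JOIN dm_analysis da ON da.dm_id = d.dm_id
      WHERE d.tenant_id = $1
        AND d.received_at >= $2
      ORDER BY da.embedding <=> $3::vector
      LIMIT 50`,
    [tenantId, weekStart, `[${reportVector.join(",")}]`]
  );

  return rows;
}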

The compounding effect: cheaper queries over time

The strategy’s real power emerges over repeated usage. Each time a customer runs a report, the system retrieves messages and splits them into two groups:

  • Enriched messages: Already processed by a previous feature. Their intent, score, and summary are available instantly.
  • Raw messages: Never analyzed. Each one triggers a single LLM call, and the result is written back to the cache.

Early on, most messages are raw, so costs remain high. But as users engage more frequently, the same messages get reused across multiple features. A message analyzed for a real-time alert might later power a weekly report—without any additional LLM expense.
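In application code, that split is just a filter on the summary column. The sketch below assumes the report query also selects the raw message text for cache misses, and that analyzeWithLLM is a hypothetical wrapper around your LLM provider returning intent, score, fit, urgency, and summary.

async function enrichReportRows(pool, rows) {
  // Cache misses are the rows the LLM has never seen; everything else is free.
  const raw = rows.filter((r) => r.summary === null);

  for (const row of raw) {
    const result = await analyzeWithLLM(row.text); // the only LLM spend in this report

    // Persist the result so every later feature gets a cache hit for this message.
    await pool.query(
      `UPDATE dm_analysis
          SET intent = $2, qualification_score = $3, fit_label = $4,
              urgency_level = $5, summary = $6
        WHERE dm_id = $1`,
      [row.dm_id, result.intent, result.score, result.fit, result.urgency, result.summary]
    );

    row.summary = result.summary;
    row.qualification_score = result.score;
  }

  return rows; // cached rows plus freshly analyzed ones
}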

The result is a downward cost curve. A customer’s first report might cost $1.00 in LLM calls. Three reports later, with 60% of messages pre-enriched, the cost drops to $0.40. By month three, a monthly deep dive pulling 200 messages might only require $1.00 instead of $4.00, because 75% of the data is already cached.
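Those figures follow from a simple back-of-the-envelope model. The $0.02 per enrichment below is an assumed unit cost chosen to match the numbers above, not a quoted price.

// Assumed unit cost: about $0.02 per LLM enrichment (200 raw messages = $4.00).
function reportCost(messageCount, cacheHitRate, costPerEnrichment = 0.02) {
  const rawMessages = messageCount * (1 - cacheHitRate);
  return rawMessages * costPerEnrichment;
}

reportCost(50, 0.0);   // 1.00: first weekly report, nothing cached yet
reportCost(50, 0.6);   // 0.40: a few reports later
reportCost(200, 0.75); // 1.00: the month-three deep dive, 75% cached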

This sublinear cost growth creates a sustainable flywheel: more usage drives more reuse, which reduces per-feature costs while increasing customer value.

Why pgvector outperforms dedicated vector databases

Three technical advantages make pgvector the right choice for this use case:

  • Single-system architecture: The vector lives in the same row as the enriched data. No need for two-phase commits or eventual consistency between storage systems. Tenant-level filtering happens in one query.
  • Cache as a column: The presence or absence of summary serves as a built-in cache flag. No separate metadata store, no synchronization logic—just a fast UPDATE when enrichment completes.
  • HNSW performance at scale: For datasets up to a million messages per tenant, the HNSW index delivers top-50 results in single-digit milliseconds. Vector search isn’t the bottleneck—your application logic, rate limits, or downstream integrations are.

These benefits eliminate the operational overhead of managing a separate vector database while keeping costs predictable and performance consistent.

A strategy for sustainable AI scaling

The lesson isn’t just about saving money; it’s about aligning AI’s promise with business reality. LLM-powered features don’t have to scale costs linearly with every query. By treating embeddings as first-class infrastructure, teams can unlock reuse without sacrificing accuracy or responsiveness.

The next time you deploy an AI feature, ask: What if this data could power the next feature too? With pgvector, the answer isn’t hypothetical—it’s already in your database.

AI summary

Repeated LLM analysis in AI-powered products erodes margins. With this simple pgvector approach, analyzing the same data only once can cut costs by up to 70%.
