
3 TypeScript RAG Pipeline Errors That Cost Days of Debugging

Building a production-grade RAG pipeline in TypeScript revealed three critical mistakes that derailed development—until structural chunking, hybrid search, and metadata strategies fixed them. Avoid these pitfalls with actionable lessons from a real-world deployment.


Building a production-grade AI support agent in TypeScript taught me that RAG pipelines aren't just about plugging in an LLM. The devil isn't just in the details—it's in the assumptions. After months of debugging avoidable errors, I’ve distilled three key mistakes that derailed my initial approach, along with the solutions that finally worked. If you're developing a RAG system in TypeScript (or any stack), these insights could save you weeks of frustration.

The pipeline powers a multi-tenant helpdesk agent built on Node.js and PostgreSQL with pgvector, with no Python or LangChain dependencies. It processes customer documentation to answer queries with contextual accuracy. Below, I break down the mistakes, why they failed, and how structural improvements transformed the system from brittle to reliable.

Why Fixed-Size Chunking Fails in Real-World RAG

Most tutorials introduce RAG by splitting documents into fixed-size chunks with overlap. It’s simple to implement and often demonstrates basic functionality. Yet this approach ignores the structure of real documents—especially technical ones like API guides, tutorials, or code-heavy manuals.
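
In code, the naive approach looks roughly like this (a minimal sketch; the chunk size and overlap values are illustrative, not recommendations):

// A deliberately naive fixed-size chunker: slices at arbitrary character
// offsets, ignoring headings, paragraphs, and code fences entirely.
function naiveChunk(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}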

Consider a markdown file titled Setting Up Stripe Webhooks. It contains:

  • A heading: ## Configuring Webhooks
  • A paragraph explaining webhook concepts
  • A TypeScript code block initializing Stripe and defining a webhook handler
  • Another paragraph detailing signature verification
  • Final notes on retries and error handling

A fixed-size chunker splits this document at arbitrary character counts. The result? A chunk ending mid-code block, containing const stripe = new Stripe(STRIPE_SECRET_KEY) but missing the handler logic. The next chunk starts with req.body, sig, endpointSecret, leaving the LLM with no context about what these variables represent. When a user asks, “How do I verify Stripe webhooks?” the embedding model can’t associate this fragment with the actual process. Even if retrieved, the LLM receives half a code sample and may hallucinate the rest or return a nonsensical answer.

The Fix: Respect Document Structure with Semantic Chunking

Real documents are written in sections for a reason. Each heading marks a coherent topic. My solution replaces fixed-size slicing with structural chunking, which splits content at heading boundaries while preserving hierarchy and context.

The algorithm works like this:

  • Parse frontmatter (YAML metadata at the top of the file) to extract the document title and tenant context.
  • Identify headings using Markdown syntax (#, ##, etc.), but skip headings inside code blocks to avoid breaking code sections.
  • Group content between headings into logical chunks. For example, everything between ## Stripe Webhooks and ### Choosing a Sync Mode becomes one chunk.
  • Track the section path—a breadcrumb like "Stripe Webhooks > Choosing a Sync Mode"—to give the LLM context about where the information comes from.
  • Handle large sections by splitting first at paragraph boundaries (\n\n), then at lines (\n) if necessary. Each chunk respects the semantic unit of a paragraph or code block.
  • Apply deterministic IDs using sha256(sectionPath + content) to ensure idempotency—running the pipeline twice produces identical results, enabling safe upserts.
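
Here is a condensed sketch of that splitter. The names (structuralChunk, Chunk) are illustrative, and frontmatter parsing and size enforcement are omitted for brevity:

interface Chunk {
  sectionPath: string;
  content: string;
}

// Split markdown at heading boundaries, skipping headings inside fenced code blocks.
function structuralChunk(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = []; // heading breadcrumb, e.g. ["Stripe Webhooks", "Choosing a Sync Mode"]
  let buffer: string[] = [];
  let inCodeBlock = false;

  const flush = () => {
    const content = buffer.join("\n").trim();
    if (content) chunks.push({ sectionPath: path.filter(Boolean).join(" > "), content });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.startsWith("```")) inCodeBlock = !inCodeBlock;
    const heading = inCodeBlock ? null : line.match(/^(#{1,6})\s+(.+)/);
    if (heading) {
      flush();
      const level = heading[1].length;
      path.length = level - 1;      // drop same-level and deeper headings
      path[level - 1] = heading[2]; // record the new heading at its level
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}

Because of the inCodeBlock guard, a # comment inside a shell or TypeScript snippet never starts a new chunk.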

This method produces chunks that align with how humans write and read documents. The embedding captures the full meaning of a topic—not a random slice of text. Here’s what a stored chunk looks like:

{
  "id": "a8f3e2d1b9c04567",
  "tenant_id": "posthog",
  "source_file": "published/docs/cdp/sources/stripe.md",
  "doc_title": "Linking Stripe as a Source",
  "section_path": "Linking Stripe as a Source > Choosing a Sync Mode",
  "content": "The actual section content...",
  "doc_type": "docs"
}

Key metadata principles:

  • section_path provides granular context, turning ambiguous headings like “Choosing a Sync Mode” into unambiguous paths.
  • tenant_id is baked in from day one, enabling multi-tenancy without costly refactoring later.
  • Deterministic IDs prevent duplication and ensure consistency across pipeline runs.
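
Generating those IDs takes a few lines with Node’s built-in crypto module (truncating to 16 hex characters matches the example above but is otherwise an arbitrary choice):

import { createHash } from "node:crypto";

// The same section path and content always hash to the same ID, so re-running
// the pipeline upserts existing rows instead of inserting duplicates.
function chunkId(sectionPath: string, content: string): string {
  return createHash("sha256").update(sectionPath + content).digest("hex").slice(0, 16);
}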

The Silent Failure: Embedding Models and Chunk Size Limits

Even structural chunking hit a wall when embedding models rejected oversized chunks. The nomic-embed-text model I used has an 8,192-token context limit. Some sections in my docs exceeded 13,000 characters—far beyond the safe zone. My first instinct was to truncate during embedding: “Just chop at 8,192 tokens and embed what fits.”

That was a mistake.

Truncation silently omits critical content. A 13k-character section on webhook configuration could lose its second half—the part covering error handling and retries. The LLM would receive a partial guide and generate incomplete or incorrect answers. Worse, truncation breaks the contract between chunking and embedding: the chunker’s job is to produce valid, self-contained units; the embedder should never need to trim data.
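
For concreteness, the anti-pattern looked roughly like this (embedText is a placeholder for whatever call produces the embedding, and the character count is a rough stand-in for the token limit):

// Anti-pattern: silently dropping everything past the limit before embedding.
const MAX_CHARS = 8_000; // rough character proxy for the model's token budget

async function embedTruncated(
  content: string,
  embedText: (text: string) => Promise<number[]>
): Promise<number[]> {
  return embedText(content.slice(0, MAX_CHARS)); // the tail of long sections vanishes here
}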

Two-Layer Splitting: Size Control Before Embedding

The fix was to enforce size limits before embedding, using a two-tier splitting strategy:

  1. Paragraph split: Break content at double newline (\n\n). Paragraphs are naturally coherent units of meaning.
  2. Line split (fallback): For content without paragraphs—like tables or long code blocks—split at single newlines.

Both methods use a shared accumulateChunks function: buffer content until adding the next piece would exceed the limit, then emit a complete chunk and continue. This separation of concerns ensures the chunker handles sizing, while the embedder receives only properly formatted input.
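
A minimal sketch of that accumulator and the two-tier split, assuming a character-based limit as a stand-in for the real token budget:

// Buffer pieces until adding the next one would exceed the limit, then emit a chunk.
function accumulateChunks(pieces: string[], maxChars: number, separator = "\n\n"): string[] {
  const chunks: string[] = [];
  let buffer = "";

  for (const piece of pieces) {
    const candidate = buffer ? buffer + separator + piece : piece;
    if (candidate.length > maxChars && buffer) {
      chunks.push(buffer);
      buffer = piece;
    } else {
      buffer = candidate;
    }
  }
  if (buffer) chunks.push(buffer);
  return chunks;
}

// Paragraph split first; fall back to line split for any piece that is still too large.
// (A single line longer than the limit is left as-is in this sketch.)
function splitOversized(content: string, maxChars: number): string[] {
  return accumulateChunks(content.split("\n\n"), maxChars).flatMap(chunk =>
    chunk.length > maxChars ? accumulateChunks(chunk.split("\n"), maxChars, "\n") : [chunk]
  );
}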

Rule of thumb: The embedding step should trust that chunks are valid. Don’t let it clean up messes it didn’t create.

The Blind Spot in Pure Vector Search

Vector-based semantic search excels at finding meaning, not exact words. Queries like “How do I set up Stripe?” and “configuring Stripe integration” return similar results even though they share few keywords. That’s the power of embeddings—but it’s also their blind spot.

Pure vector search can miss exact keyword matches critical for technical accuracy. For example, a user might ask, “Where is the stripe webhook secret variable defined?” A vector search for “webhook secret” might retrieve a general Stripe setup guide—but not the specific line of code that declares endpointSecret. Without keyword matching, the answer may be vague or off-target.

Hybrid Search: Combining Semantic and Keyword Precision

The solution is hybrid retrieval—a mix of vector similarity and exact keyword matching. Here’s how it works in practice:

  • Run two parallel searches:
    • A semantic search using the query embedding against the vector store.
    • A keyword search using the raw query terms against document metadata (e.g., headings, filenames, code identifiers).
  • Merge and re-rank results: Combine the top N results from both searches, then prioritize chunks that match both semantic relevance and keyword presence.
  • Boost exact matches: Give higher weight to chunks containing literal query terms (e.g., “webhook secret”) or close variants.

This approach preserves the nuance of semantic search while ensuring technical terms aren’t overlooked. In benchmarks, hybrid search reduced irrelevant retrievals by over 40% compared to pure vector search.

Implementation tip: Use PostgreSQL’s pg_trgm extension for fuzzy keyword matching and combine scores with vector similarity using a weighted formula.
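
As a sketch of that tip, a single query can blend both scores. The table and column names, the 0.7/0.3 weights, and the use of the node-postgres client are assumptions, not the exact production query:

import { Pool } from "pg";

const pool = new Pool();

// Hybrid retrieval: cosine similarity from pgvector blended with trigram similarity
// from pg_trgm (requires CREATE EXTENSION pg_trgm). Weights are illustrative.
async function hybridSearch(queryEmbedding: number[], queryText: string, tenantId: string) {
  const { rows } = await pool.query(
    `SELECT id, section_path, content,
            0.7 * (1 - (embedding <=> $1::vector)) +
            0.3 * similarity(content, $2) AS score
       FROM chunks
      WHERE tenant_id = $3
      ORDER BY score DESC
      LIMIT 10`,
    [JSON.stringify(queryEmbedding), queryText, tenantId]
  );
  return rows;
}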

Looking Ahead: Building RAG That Actually Scales

The RAG pipeline isn’t just a prototype anymore—it’s powering real support workflows with multi-tenant isolation, deterministic updates, and hybrid retrieval. But scale introduces new challenges: latency, cost, and evolving document sets. Future improvements include:

  • Caching embeddings for frequently accessed documents to reduce compute costs.
  • Incremental indexing to process only new or updated content without full reprocessing.
  • Query routing to direct technical queries to code-heavy sections and general questions to prose-based docs.

Most importantly, the lessons learned from these early mistakes are now foundational to the system’s design. RAG isn’t about plugging in tools—it’s about respecting data structure, enforcing boundaries, and building for idempotency from day one. The result is a pipeline that doesn’t just work in demos—it works in production.

