Pedram Agand
ML / AI

Why Your RAG Pipeline Fails at Retrieval, Not Generation

Most RAG debugging focuses on the LLM. The real problem is almost always in the retrieval step. Here's how to diagnose and fix it using production-validated techniques.

2026-03-10·7 min read·RAG, LLM, retrieval, embeddings, production

Most teams building retrieval-augmented generation systems spend their debugging time on the generation step. They tweak prompts, swap models, adjust temperature. The LLM is the most visible part, so it gets blamed first.

In my experience, the retrieval step causes at least 70% of RAG failures. The generator can't produce good output from bad context — no matter how capable the model. This holds whether you're running GPT-4o, Claude, or Gemini.

Production Failure Modes at a Glance

| Failure Mode | Root Cause | Impact | Leading Fix |
|---|---|---|---|
| Embedding drift | Model trained on different domain | Low recall on domain-specific queries | Domain fine-tuning or better base model |
| Fixed-size chunking | No structural awareness | Mid-table or mid-function splits destroy context | Semantic / structure-aware chunking |
| Query-document asymmetry | Casual query ≠ formal document language | Low similarity scores → wrong chunk retrieved | HyDE or multi-query expansion |
| Dense-only retrieval | Keyword matches missed by dense model | Recall gaps on named entities, IDs, codes | BM25 + dense hybrid |
| No reranking | Top-k from vector DB is ordered by similarity, not relevance | Low-quality context passed to generator | Cross-encoder reranker (e.g. Cohere Rerank) |
| Evaluation by vibe | No systematic retrieval metric | Can't distinguish retrieval vs. generation failures | Ragas or DeepEval RAG Triad |

The rest of this post walks through the three failures I see most often and how to address them.

Failure 1: Embedding Drift

Your embedding model was trained on a different data distribution than your documents. This is invisible on general benchmarks and catastrophic in domain-specific retrieval.

A financial AI system using text-embedding-ada-002 or a default Sentence-BERT model will see this constantly. Terms like "principal" (loan principal vs. investment principal), "margin" (profit margin vs. trading margin), and "exposure" carry domain-specific weight that general embeddings flatten into similar vectors.

Diagnosis: Take your 20 worst RAG responses. Look at the retrieved chunks. Are they topically relevant at all? If you're seeing off-topic retrievals consistently, your embedding space doesn't map to your domain.

Measurement: Use the MTEB leaderboard (its retrieval tasks report nDCG@10) to shortlist embedding models on the closest available domain. The real test, though, is retrieval recall@k on your own query-answer pairs.
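A minimal sketch of that recall@k check, assuming you already have, for each test query, the ranked chunk IDs your retriever returned and the set of chunk IDs that count as correct. The function name and the ID format are illustrative, not taken from any particular library. "Recall" here means the fraction of queries with at least one correct chunk in the top k.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of queries whose top-k results contain at least one relevant chunk.

    retrieved_ids: one ranked list of chunk IDs per query (best first)
    relevant_ids:  one set of correct chunk IDs per query, in the same order
    """
    if not retrieved_ids:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        # A query counts as a hit if any correct chunk appears in its top k
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(retrieved_ids)
```

Run it once per candidate embedding model on the same query set. The difference in recall@k is the number to compare, since nothing else in the pipeline changes.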

Fix: Fine-tune embeddings on domain data, or switch to a domain-specific base. For finance, FinBERT is a common starting point; for legal, legal-bert-base-uncased. Both are domain-pretrained BERT encoders rather than retrieval-tuned embedding models, so confirm any gain with recall@k on your own queries before committing.

When building VeNRA's retrieval layer for financial document QA, we found that swapping the embedding model — with no other changes — improved recall@5 from 0.63 to 0.81 on a held-out set of analyst report queries. The LLM prompt stayed identical. The model stayed identical. The only change was the embedding space.

Failure 2: Chunking Strategy

Most implementations use fixed-size chunking with some overlap. This works acceptably for continuous prose but fails for structured content:

  • Tables and structured data — splitting mid-table destroys the relationship between headers and values
  • Code — splitting mid-function loses the context about what the function does
  • Regulatory documents — section headers appear on a different chunk than the section body, so retrieved chunks lack their own title

Diagnosis: Sample 50 retrieved chunks from your worst-performing queries. How many are complete, coherent units of information? How many are fragments of something larger?

Fix: Use semantic or structure-aware chunking. For code: chunk by function/class boundary. For regulatory docs: chunk by section, include the header in each chunk as a prefix. For tables: either embed the full table or convert to structured text before chunking.
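As an illustration of the regulatory-document case, here is a rough sketch of section-based chunking that repeats the section header at the top of every chunk. It assumes headers look like "Section 4.2 Capital Requirements"; that pattern, the function name, and the `max_chars` limit are all assumptions to adapt to your corpus.

```python
import re

# Assumption: section headers start with "Section <number>", e.g. "Section 4.2 Capital".
# Replace this pattern with your documents' actual heading convention.
HEADER_RE = re.compile(r"^Section\s+\d+(\.\d+)*\b.*$")

def chunk_by_section(text: str, max_chars: int = 1000) -> list[str]:
    """Split on section headers and prefix every chunk with its header."""
    chunks: list[str] = []
    header = ""
    body: list[str] = []

    def flush() -> None:
        # Emit the current section; long sections are split on line boundaries,
        # and each piece still carries the section header as a prefix.
        current = ""
        for line in body:
            if current and len(current) + len(line) + 1 > max_chars:
                chunks.append(f"{header}\n{current}".strip())
                current = ""
            current = f"{current}\n{line}".strip()
        if current:
            chunks.append(f"{header}\n{current}".strip())

    for raw in text.splitlines():
        line = raw.strip()
        if HEADER_RE.match(line):
            flush()                 # close the previous section
            header, body = line, []
        elif line:
            body.append(line)
    flush()                         # close the final section
    return chunks
```

The design choice that matters is the prefix: a chunk retrieved on its own still says which section it came from, so neither the embedding nor the generator sees an orphaned fragment.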

The awesome-rag-production guide covers this in detail — the key finding is that document-type-specific parsing is not optional for enterprise RAG. It's the difference between a prototype and a system that holds up under load.

Failure 3: Query-Document Asymmetry (and How to Fix It)

User queries are short and conversational. Documents are long and formal. The semantic distance between casual language and formal document language is real, and embedding models don't always bridge it.

"How do I get my money back" should match chunks discussing "refund policy" and "reimbursement procedures" — but the cosine similarity between those embeddings may be surprisingly low.

Diagnosis: Compute cosine similarity between query embeddings and embeddings of chunks you know should match. If similarities on correct pairs are consistently below roughly 0.6 (the exact cutoff depends on the embedding model), you have an asymmetry problem.
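One way to run this check, sketched with NumPy. It assumes you have already embedded each query and its known-correct chunk with your own model. The function names are illustrative, and the 0.6 default simply mirrors the rule of thumb above.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def asymmetry_report(pairs, threshold: float = 0.6) -> dict:
    """Summarize similarity over (query_vector, chunk_vector) pairs known to match."""
    sims = [cosine_similarity(q, c) for q, c in pairs]
    below = sum(s < threshold for s in sims)
    return {
        "mean_similarity": float(np.mean(sims)),
        "fraction_below": below / len(sims),  # share of correct pairs under the cutoff
    }
```

A high `fraction_below` on pairs you know are correct suggests the problem is the query-to-document gap, not the index.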

Two fixes:

1. Hypothetical Document Embedding (HyDE)

Generate a hypothetical ideal answer first, then use that as your retrieval query. This converts the query into document-like language.

# llm, embed, and vector_store are placeholders for your own async LLM client,
# embedding function, and vector index; swap in the real calls from your stack.
async def hyde_retrieve(query: str, k: int = 5) -> list[str]:
    # Generate a hypothetical document that would answer the query
    hypothetical = await llm.complete(
        f"Write a brief factual passage that answers: {query}"
    )
    # Embed the hypothetical (not the original query)
    embedding = embed(hypothetical.text)
    # Retrieve using the hypothetical embedding
    return vector_store.similarity_search(embedding, k=k)

HyDE typically improves recall@5 by 15–25% on domain-specific tasks without touching the LLM or the index. The tradeoff: one extra LLM call per query.

2. Hybrid BM25 + Dense Retrieval

Dense retrieval captures semantic similarity; BM25 captures keyword overlap. Named entities, product codes, ticker symbols, and regulatory identifiers are retrieved well by BM25 and poorly by dense models. Running both and fusing the results covers the gaps.

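import numpy as np
from rank_bm25 import BM25Okapi

def normalize(scores) -> np.ndarray:
    # Min-max scale to [0, 1] so dense and sparse scores are comparable
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_retrieve(query: str, chunks: list[str], k: int = 5, alpha: float = 0.5) -> list[str]:
    # Dense retrieval: expects one score per chunk, in the same order as `chunks`
    # (embed and vector_store.similarity_scores are placeholders for your own stack)
    dense_scores = vector_store.similarity_scores(embed(query), k=len(chunks))

    # BM25 sparse retrieval (lowercase both sides so tokens match);
    # in production, build the BM25 index once rather than per query
    tokenized = [c.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    sparse_scores = bm25.get_scores(query.lower().split())

    # Normalize and fuse; alpha is the weight on the dense score
    combined = alpha * normalize(dense_scores) + (1 - alpha) * normalize(sparse_scores)

    # Indices of the k highest combined scores, best first
    indices = combined.argsort()[-k:][::-1]
    return [chunks[i] for i in indices]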

alpha=0.5 is a reasonable starting point. Tune it on your validation set. Because alpha is the weight on the dense score, finance and legal content typically benefits from a lower alpha (0.3–0.4), which gives BM25 the larger share, due to the importance of exact entity names.
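Tuning alpha can be as simple as a grid sweep. The sketch below assumes you have precomputed, for each validation query, normalized dense and sparse scores for every chunk, plus the index of the correct chunk. The function name, the input layout, and the grid values are illustrative.

```python
import numpy as np

def sweep_alpha(dense, sparse, gold, k=5, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {alpha: recall@k} over a validation set.

    dense, sparse: shape (n_queries, n_chunks), each already scaled to [0, 1]
    gold:          index of the correct chunk for each query
    """
    dense = np.asarray(dense, dtype=float)
    sparse = np.asarray(sparse, dtype=float)
    results = {}
    for alpha in alphas:
        # Same fusion rule as the hybrid retriever: alpha weights the dense score
        combined = alpha * dense + (1 - alpha) * sparse
        top_k = np.argsort(-combined, axis=1)[:, :k]
        hits = [g in row for g, row in zip(gold, top_k)]
        results[alpha] = float(np.mean(hits))
    return results
```

Pick the alpha with the highest recall@k, for example `max(results, key=results.get)`, and re-check it whenever the corpus or the embedding model changes.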

The Diagnostic Protocol

Before touching prompts or models, run this sequence:

  1. Oracle experiment: Give the LLM the ground-truth chunk directly (bypassing retrieval). Does it answer correctly? If yes, your retrieval is the bottleneck. If no, fix generation first. This single experiment localizes the failure in 30 minutes and saves weeks of prompt engineering.

  2. Retrieval recall@k: For 20 known query-answer pairs, what fraction retrieve the correct chunk in top-k? If recall@5 < 0.7, retrieval is failing.

  3. Context precision: Of retrieved chunks, what fraction are actually relevant? High recall with low precision means noisy context, which confuses generation even when the right chunk was retrieved.
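The oracle experiment and the context-precision check can be sketched as two small helpers. `answer_with` and `is_correct` are hypothetical callables you would supply (your own generation call and your own grading rule), and the dictionary keys are illustrative rather than a fixed schema.

```python
from typing import Callable

def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(cid in relevant_ids for cid in retrieved_ids) / len(retrieved_ids)

def triage(
    example: dict,
    answer_with: Callable[[str, list[str]], str],
    is_correct: Callable[[str, str], bool],
) -> str:
    """Classify one example as a 'generation' failure, a 'retrieval' failure, or 'ok'."""
    # Oracle run: bypass retrieval and hand the model the ground-truth chunk
    oracle_answer = answer_with(example["query"], [example["gold_chunk"]])
    if not is_correct(oracle_answer, example["expected"]):
        return "generation"  # wrong even with perfect context
    # Normal run: use whatever the retriever actually returned
    retrieved_answer = answer_with(example["query"], example["retrieved_chunks"])
    if not is_correct(retrieved_answer, example["expected"]):
        return "retrieval"   # right with gold context, wrong with retrieved context
    return "ok"
```

Counting the labels across your worst queries gives a direct split between retrieval and generation failures before any prompt work starts.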

For systematic measurement, Ragas provides automated metrics (context precision, context recall, faithfulness, answer relevancy) that cover the same ground as the RAG Triad of context relevance, groundedness, and answer relevance. It's not a replacement for domain-specific test sets, but it gives you a reproducible retrieval signal without human annotation at every iteration.

What Actually Matters

Good RAG is 80% data engineering and 20% model selection. The real work is:

  • Understanding your document corpus well enough to chunk it structurally, not arbitrarily
  • Knowing your user's query patterns well enough to bridge the lexical gap with HyDE or hybrid retrieval
  • Adding a reranker (Cohere Rerank is the standard choice for mid-scale production) to re-order top-k results before passing to the generator
  • Measuring retrieval quality explicitly with recall@k — not inferring it from end-to-end accuracy

The LLM is the easy part. Swap models freely. Retrieval is where the precision work happens, and it's domain-specific enough that no off-the-shelf configuration substitutes for testing on your actual data.
