Why Your RAG Pipeline Fails at Retrieval, Not Generation
Most RAG debugging focuses on the LLM. The real problem is almost always in the retrieval step. Here's how to diagnose and fix it using production-validated techniques.

Most teams building retrieval-augmented generation systems spend their debugging time on the generation step. They tweak prompts, swap models, adjust temperature. The LLM is the most visible part, so it gets blamed first.
In my experience, the retrieval step causes at least 70% of RAG failures. The generator can't produce good output from bad context — no matter how capable the model. This holds whether you're running GPT-4o, Claude, or Gemini.
Production Failure Modes at a Glance
| Failure Mode | Root Cause | Impact | Leading Fix |
|---|---|---|---|
| Embedding drift | Model trained on different domain | Low recall on domain-specific queries | Domain fine-tuning or better base model |
| Fixed-size chunking | No structural awareness | Mid-table or mid-function splits destroy context | Semantic / structure-aware chunking |
| Query-document asymmetry | Casual query ≠ formal document language | Low similarity scores → wrong chunk retrieved | HyDE or multi-query expansion |
| Dense-only retrieval | Keyword matches missed by dense model | Recall gaps on named entities, IDs, codes | BM25 + dense hybrid |
| No reranking | Top-k from vector DB is ordered by similarity, not relevance | Low-quality context passed to generator | Cross-encoder reranker (e.g. Cohere Rerank) |
| Evaluation by vibe | No systematic retrieval metric | Can't distinguish retrieval vs. generation failures | Ragas or DeepEval RAG Triad |
The rest of this post walks through the three failures I see most often and how to address them.
Failure 1: Embedding Drift
Your embedding model was trained on a different data distribution than your documents. This is invisible on general benchmarks and catastrophic in domain-specific retrieval.
A financial AI system using text-embedding-ada-002 or a default Sentence-BERT model will see this constantly. Terms like "principal" (loan principal vs. investment principal), "margin" (profit margin vs. trading margin), and "exposure" carry domain-specific weight that general embeddings flatten into similar vectors.
Diagnosis: Take your 20 worst RAG responses. Look at the retrieved chunks. Are they topically relevant at all? If you're seeing off-topic retrievals consistently, your embedding space doesn't map to your domain.
Measurement: Use the MTEB leaderboard (nDCG@10 on its retrieval tasks) to shortlist embedding models on the closest available domain. The real test, though, is retrieval recall@k on your own query-answer pairs.
Fix: Fine-tune embeddings on domain data, or switch to a domain-specific base. For finance, FinBERT embeddings consistently outperform general models. For legal, legal-bert-base-uncased is a reasonable starting point.
When building VeNRA's retrieval layer for financial document QA, we found that swapping the embedding model — with no other changes — improved recall@5 from 0.63 to 0.81 on a held-out set of analyst report queries. The LLM prompt stayed identical. The model stayed identical. The only change was the embedding space.
Failure 2: Chunking Strategy
Most implementations use fixed-size chunking with some overlap. This works acceptably for continuous prose but fails for structured content:
- Tables and structured data — splitting mid-table destroys the relationship between headers and values
- Code — splitting mid-function loses the context about what the function does
- Regulatory documents — section headers appear on a different chunk than the section body, so retrieved chunks lack their own title
Diagnosis: Sample 50 retrieved chunks from your worst-performing queries. How many are complete, coherent units of information? How many are fragments of something larger?
Fix: Use semantic or structure-aware chunking. For code: chunk by function/class boundary. For regulatory docs: chunk by section, include the header in each chunk as a prefix. For tables: either embed the full table or convert to structured text before chunking.
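As a rough illustration of the regulatory-document case, here is a minimal sketch of section-based chunking that repeats the section header at the top of every chunk. The heading pattern (`Section 4.2 Title`), the `chunk_by_section` name, and the `max_chars` limit are assumptions made for the example, not part of any particular library; adapt the regex to however your corpus marks section boundaries.

```python
import re

# Hypothetical heading pattern: lines such as "Section 4.2 Capital Requirements".
HEADING = re.compile(r"^(Section[ \t]+\d+(?:\.\d+)*[ \t]+.+)$", re.MULTILINE)

def chunk_by_section(text: str, max_chars: int = 1200) -> list[str]:
    """Split on section headings and prefix every chunk with its heading.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING.finditer(text))
    chunks: list[str] = []
    for i, match in enumerate(matches):
        header = match.group(1).strip()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        # Long sections are split on paragraph boundaries, never mid-paragraph.
        piece = ""
        for para in (p.strip() for p in body.split("\n\n") if p.strip()):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append(f"{header}\n{piece}")
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        if piece:
            chunks.append(f"{header}\n{piece}")
    return chunks
```

Every chunk produced this way carries its own title, so a retrieved fragment from the middle of a long section still tells the generator which section it came from.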
The awesome-rag-production guide covers this in detail — the key finding is that document-type-specific parsing is not optional for enterprise RAG. It's the difference between a prototype and a system that holds up under load.
Failure 3: Query-Document Asymmetry (and How to Fix It)
User queries are short and conversational. Documents are long and formal. The semantic distance between casual language and formal document language is real, and embedding models don't always bridge it.
"How do I get my money back" should match chunks discussing "refund policy" and "reimbursement procedures" — but the cosine similarity between those embeddings may be surprisingly low.
Diagnosis: Compute cosine similarity between query embeddings and embeddings of chunks you know should match. If similarities are consistently below 0.6 on correct pairs, you have an asymmetry problem.
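The check above can be sketched in a few lines. This assumes you have a list of (query, chunk) pairs you know should match and an `embed` callable that maps text to a vector; both, along with the `asymmetry_report` name, are placeholders for illustration. The 0.6 default mirrors the threshold mentioned above, though absolute similarity values vary between embedding models, so treat it as a starting point.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def asymmetry_report(pairs, embed, threshold: float = 0.6) -> dict:
    """pairs: list of (query, chunk_that_should_match); embed: text -> vector."""
    sims = [cosine_similarity(embed(q), embed(c)) for q, c in pairs]
    below = sum(s < threshold for s in sims)
    return {
        "mean_similarity": float(np.mean(sims)),
        "fraction_below": below / len(sims),  # share of correct pairs under the threshold
        "similarities": sims,
    }
```

A high `fraction_below` on pairs that are known to be correct points at the query-document gap rather than at the index or the generator.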
Two fixes:
1. Hypothetical Document Embedding (HyDE)
Generate a hypothetical ideal answer first, then use that as your retrieval query. This converts the query into document-like language.

    # `llm`, `embed`, and `vector_store` are placeholders for your own clients.
    async def hyde_retrieve(query: str, k: int = 5) -> list[str]:
        # Generate a hypothetical document that would answer the query
        hypothetical = await llm.complete(
            f"Write a brief factual passage that answers: {query}"
        )
        # Embed the hypothetical (not the original query)
        embedding = embed(hypothetical.text)
        # Retrieve using the hypothetical embedding
        return vector_store.similarity_search(embedding, k=k)

HyDE typically improves recall@5 by 15–25% on domain-specific tasks without touching the LLM or the index. The tradeoff: one extra LLM call per query.
2. Hybrid BM25 + Dense Retrieval
Dense retrieval captures semantic similarity; BM25 captures keyword overlap. Named entities, product codes, ticker symbols, and regulatory identifiers are retrieved well by BM25 and poorly by dense models. Running both and fusing the results covers the gaps.
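
    import numpy as np
    from rank_bm25 import BM25Okapi

    def normalize(scores) -> np.ndarray:
        # Min-max scale to [0, 1]; a constant score vector maps to all zeros
        scores = np.asarray(scores, dtype=float)
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

    def hybrid_retrieve(query: str, chunks: list[str], k: int = 5, alpha: float = 0.5) -> list[str]:
        # Dense retrieval: assumed to return one score per chunk, in `chunks` order
        # (`vector_store` and `embed` are placeholders for your own stack)
        dense_scores = vector_store.similarity_scores(embed(query), k=len(chunks))
        # BM25 sparse retrieval
        tokenized = [c.split() for c in chunks]
        bm25 = BM25Okapi(tokenized)
        sparse_scores = bm25.get_scores(query.split())
        # Normalize and fuse; alpha weights the dense score
        combined = alpha * normalize(dense_scores) + (1 - alpha) * normalize(sparse_scores)
        # Indices of the k highest fused scores, best first
        indices = combined.argsort()[-k:][::-1]
        return [chunks[i] for i in indices]

alpha=0.5 is a reasonable starting point. Tune it on your validation set: alpha weights the dense score, so finance and legal content typically benefits from a higher BM25 weight (alpha 0.3–0.4) because exact entity names matter so much.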
The Diagnostic Protocol
Before touching prompts or models, run this sequence:
- Oracle experiment: Give the LLM the ground-truth chunk directly (bypassing retrieval). Does it answer correctly? If yes, your retrieval is the bottleneck. If no, fix generation first. This single experiment localizes the failure in 30 minutes and saves weeks of prompt engineering.
- Retrieval recall@k: For 20 known query-answer pairs, what fraction retrieve the correct chunk in top-k? If recall@5 < 0.7, retrieval is failing.
- Context precision: Of retrieved chunks, what fraction are actually relevant? High recall with low precision means noisy context, which confuses generation even when the right chunk was retrieved.
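The two retrieval metrics in this sequence are simple enough to compute by hand. The sketch below assumes each chunk has a stable ID, that `retrieved` holds the ranked IDs returned per query, and that you have labeled one gold chunk (for recall) and a set of relevant chunks (for precision) per query. The function names are illustrative, and the precision here is plain precision@k as described above, not the rank-weighted variant some evaluation libraries report.

```python
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold chunk ID appears in the top-k retrieved IDs."""
    hits = sum(g in r[:k] for r, g in zip(retrieved, gold))
    return hits / len(gold)

def context_precision(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean fraction of the top-k retrieved IDs that are labeled relevant."""
    per_query = [
        sum(c in rel for c in r[:k]) / max(len(r[:k]), 1)
        for r, rel in zip(retrieved, relevant)
    ]
    return sum(per_query) / len(per_query)
```

Tracking both numbers per release makes it obvious whether a change helped retrieval find the right chunk, reduced the noise around it, or did neither.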
For systematic measurement, Ragas provides an automated evaluation harness whose metrics (context precision and recall, faithfulness, answer relevancy) cover the same ground as the RAG Triad (context relevance, groundedness, answer relevance). It's not a replacement for domain-specific test sets, but it gives you a reproducible retrieval signal without human annotation at every iteration.
What Actually Matters
Good RAG is 80% data engineering and 20% model selection. The real work is:
- Understanding your document corpus well enough to chunk it structurally, not arbitrarily
- Knowing your user's query patterns well enough to bridge the lexical gap with HyDE or hybrid retrieval
- Adding a reranker (Cohere Rerank is the standard choice for mid-scale production) to re-order top-k results before passing to the generator
- Measuring retrieval quality explicitly with recall@k — not inferring it from end-to-end accuracy
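To make the reranking step concrete, here is a minimal sketch of where it sits in the pipeline. The `rerank` function and its `score_fn` parameter are hypothetical: in production `score_fn` would wrap a cross-encoder or a hosted reranker, and the token-overlap scorer below is only a stand-in so the example runs without a model.

```python
from typing import Callable

def rerank(
    query: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order retrieved chunks by a query-chunk relevance score, best first."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]

def token_overlap(query: str, chunk: str) -> float:
    # Stand-in scorer: counts shared lowercase tokens. Replace with a real reranker.
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))
```

The usual pattern is to over-retrieve (say, the top 20 to 50 candidates) and pass only the reranked top few to the generator.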
The LLM is the easy part. Swap models freely. Retrieval is where the precision work happens, and it's domain-specific enough that no off-the-shelf configuration substitutes for testing on your actual data.