Leveraging RAG for AI Development
Module 4 of 8
Intro

RAG Fails Silently

2 min read

Most RAG failures don't look like failures. The model answers confidently. The response sounds plausible. The user nods and moves on. What they don't see is that the chunk that came back from the vector store was the wrong one — or half of the right one — and the model filled in the gap from its training data without any indication it had done so.

Retrieval quality depends less on the embedding model than on how you cut the knowledge base before you ever embed it. Chunking strategy is the most underengineered part of most RAG systems, and it's the part that determines whether retrieval returns something useful or something confidently wrong.

This module teaches how retrieval quality is actually measured and improved — starting with the chunk, not the model. You'll build a real specification for a real knowledge domain: chunk size, overlap, embedding approach, and an evaluation set that tells you whether the retrieval layer is working before you connect it to the LLM.

Your artifact — Skill
A chunking and embedding specification for a real knowledge base — chunk size, overlap, embedding model, and retrieval method chosen and justified with quality metrics and failure analysis.
  • Explain why chunking strategy affects retrieval quality more than embedding model choice
  • Choose chunk size and overlap values with explicit justification for a real knowledge domain
  • Distinguish semantic embedding from keyword search and identify when hybrid retrieval applies
  • Define retrieval precision and recall and explain why NIST MEASURE requires both before production
  • Build a 5-query evaluation set to test retrieval before connecting the LLM
Scenario

The Answer That Sounds Right

3 min read

A developer has spent three weekends building a RAG pipeline over their Obsidian vault. They've embedded five years of decision notes, architecture records, and project retrospectives. They've built a vector search layer using a local embedding model, connected it to an LLM, and deployed it as a personal assistant that can answer questions about their own work history.

The answers come back fast. They sound authoritative. They cite note titles. They feel correct. Then the developer asks about a key technology decision — which database they chose and why — and gets an answer that names the right database but gives entirely the wrong reason. The cited note exists. The answer sounds plausible. The reasoning is fabricated.

The root cause is chunking. The developer used a simple 500-token fixed split with no overlap. One of the most important notes in the vault — the decision record for the database choice — was bisected mid-sentence. The chunk that got embedded ended like this: "…we chose PostgreSQL over MongoDB because of the" — and then the chunk ended. The embedding captured the topic and the choice, but not the reasoning. The retrieval layer returned that chunk as the most relevant match. The LLM received half a decision and completed the other half from its training data.

The model had no way to know the chunk was incomplete. The developer had no way to know the retrieval layer was returning half-sentences. The system produced a confident, wrong answer about real past decisions — and the developer nearly used it in a project proposal.

The fix isn't a better embedding model. It's a chunking strategy that doesn't cut sentences in half.

Lesson

Chunking Determines Retrieval Quality

4 min read

The single most important insight in retrieval engineering is this: chunking strategy determines retrieval quality more than embedding model choice. You can swap in the best embedding model available and still get garbage retrieval if your chunks are bisecting the knowledge at the wrong places. The inverse is also true — a well-chunked knowledge base retrieves correctly even with a modest embedding model.

Chunk size is a tension between precision and context. Small chunks (under 150 tokens) embed a tight semantic unit — a single claim, a single step, a single assertion. Retrieval precision is high because the embedding represents exactly one thing. But the retrieved chunk often lacks the surrounding context the LLM needs to answer correctly. You get the right fact without the explanation.

Large chunks (over 600 tokens) embed multiple ideas. The embedding becomes, in effect, a blend of those ideas, so a query that matches only one of them may not score high enough to retrieve the chunk. You get more context when the chunk does come back, but retrieval recall suffers because the blended embedding dilutes the signal for any single idea.

A practical starting point for most prose knowledge bases is 200–400 tokens with 50-token overlap. This range captures enough context for the LLM to reason, while keeping the embedding tight enough to score consistently on relevant queries. Treat it as a baseline to measure against, not a guarantee.
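A minimal sketch of fixed-size chunking with overlap may make the size and overlap parameters concrete. The function name `chunk_tokens` is hypothetical, and the token list is synthetic; a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens, chunk_size=300, overlap=50)
print([len(c) for c in chunks])  # [300, 300, 200]
```

Each chunk after the first begins with the final 50 tokens of the one before it, so the step between chunk starts is `chunk_size - overlap`.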

Semantic embeddings — OpenAI text-embedding-3-small, Cohere embed, or local models like nomic-embed-text — map text to a vector space where meaning proximity corresponds to vector proximity. A query about "database selection rationale" will score highly against a chunk discussing "why we chose PostgreSQL over MongoDB," even if the exact words don't overlap. Semantic search wins for conceptual queries where the user doesn't know the exact terminology used in the source documents.
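To illustrate what "vector proximity" means in practice, here is a small sketch of cosine similarity ranking. The three-dimensional vectors are made up for illustration and do not come from any of the embedding models named above; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors: in a real system these come from the embedding model.
query_vector = [0.9, 0.1, 0.2]  # "database selection rationale"
chunk_vectors = {
    "why we chose PostgreSQL over MongoDB": [0.8, 0.2, 0.1],
    "team offsite logistics": [0.1, 0.9, 0.3],
}

# Rank chunks by similarity to the query, most similar first.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked[0])  # why we chose PostgreSQL over MongoDB
```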

Keyword search (BM25 and its variants) scores chunks based on term frequency and inverse document frequency. It wins when the query uses exact terminology — searching for a specific function name, a configuration key, or a proper noun that wouldn't appear in a paraphrase. Keyword search fails on synonyms and conceptual queries.
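The scoring idea can be sketched with a compact BM25 implementation. This is an illustrative version using whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the sample documents are invented, and a production system would normally rely on a search library instead.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document adds nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Invented sample chunks for illustration.
docs = [
    "set max_connections in postgresql.conf before the load test",
    "we picked a relational store for the billing service",
    "notes from the quarterly planning meeting",
]
exact = bm25_scores("max_connections", docs)
paraphrase = bm25_scores("database selection rationale", docs)
print(exact[0] > 0, paraphrase)  # True [0.0, 0.0, 0.0]
```

The exact configuration key matches the first chunk, while the conceptual query scores zero everywhere because none of its words appear in the text, which is the synonym failure described above.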

Hybrid retrieval combines both: run the semantic search and the keyword search independently, then merge the result sets using reciprocal rank fusion or a weighted score. Hybrid retrieval outperforms either approach alone on most real knowledge bases because real queries are a mix of conceptual and terminological. The cost is complexity — two retrieval pipelines to maintain and a fusion step to calibrate.
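The fusion step can be sketched as reciprocal rank fusion over two ranked lists. The chunk IDs are hypothetical, and `k=60` is a conventional default rather than a value taken from this module; ties are broken alphabetically so the output is deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; chunk ID breaks ties.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from the two independent retrievers.
semantic_results = ["c3", "c1", "c7"]
keyword_results = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists rise to the top, which is why fusion needs only ranks and not comparable raw scores from the two retrievers.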

Retrieval quality has two components. Retrieval precision asks: of the chunks that came back, what fraction were actually relevant? A retrieval layer that returns five chunks, two of which are relevant, has 40% precision. Retrieval recall asks: of all the relevant chunks in the knowledge base, what fraction came back? A retrieval layer that misses three out of five relevant chunks has 40% recall.

Both metrics matter. High precision with low recall means you're returning good chunks but missing important ones. High recall with low precision means the LLM is drowning in noise. As a working target for this module, get both above 70% on your evaluation set before connecting the retrieval layer to an LLM.
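The two definitions translate directly into a few lines of code. This sketch uses hypothetical chunk IDs and reproduces the 40% figures from the examples above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's retrieved chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Five chunks came back, two of them relevant; five relevant chunks exist.
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = ["c1", "c2", "c6", "c7", "c8"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 0.4
```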

NIST MEASURE — Define retrieval quality metrics before deployment

The MEASURE function of the NIST AI Risk Management Framework calls for testing AI system behavior against its intended purpose. For RAG, the intended purpose of the retrieval layer is to return the right chunks. Before connecting retrieval to an LLM, define retrieval precision and recall targets, build a set of known-good query/answer pairs, run retrieval against those queries, and measure whether the right chunks come back. If you skip this step, you have no baseline to compare against when the system behaves unexpectedly in production.

NIST MAP — Failure modes when chunks are wrong

MAP calls for identifying failure modes before deployment. For chunking, the primary failure modes are:
  • Bisected context: a chunk ends mid-sentence or mid-argument, causing the LLM to hallucinate the missing portion.
  • Embedding dilution: an oversized chunk's embedding averages across too many concepts, causing relevant chunks to score below the retrieval threshold.
  • Boundary mismatch: chunks are cut at token boundaries rather than semantic boundaries, fragmenting complete ideas across two chunks, neither of which is independently useful.
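The first failure mode is cheap to screen for before embedding anything. The following is a rough heuristic sketch, not a complete detector: the helper name `ends_mid_sentence` and the list of terminators are assumptions, and headings or bullet items without closing punctuation would be flagged as false positives.

```python
# Endings treated as a finished sentence (a heuristic assumption).
TERMINATORS = (".", "!", "?", '."', '!"', '?"', ".)")


def ends_mid_sentence(chunk: str) -> bool:
    """Flag a non-empty chunk whose last character run is not sentence-final."""
    stripped = chunk.rstrip()
    return bool(stripped) and not stripped.endswith(TERMINATORS)


chunks = [
    "We evaluated two databases for the billing service.",
    "we chose PostgreSQL over MongoDB because of the",
]

# Indexes of chunks that appear to be cut off.
bisected = [i for i, chunk in enumerate(chunks) if ends_mid_sentence(chunk)]
print(bisected)  # [1]
```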

EU AI Act Art. 13 — Source attribution requires the right chunk

Article 13 requires high-risk AI systems to be transparent enough that the people deploying them can interpret their outputs and use them appropriately. For RAG systems, source attribution is the mechanism: citing the specific document or chunk that grounded the answer. But attribution is only meaningful if the cited chunk actually contains the information the LLM used. A RAG system that returns half-sentences and attributes answers to those half-sentences is providing attribution that is technically present but substantively misleading. Retrieval quality is a prerequisite for attribution integrity.

Context

Three Things to Know Before You Chunk

2 min read

In the lab you'll design a chunking and embedding specification for a real knowledge domain. Three things will determine whether your specification is defensible.

Understand the natural unit of your knowledge before chunking it

Before you choose a chunk size, read twenty documents from your knowledge base and ask: what is the natural unit of a complete thought here? In an Obsidian vault of decision records, the natural unit might be a paragraph — each paragraph makes one claim. In a codebase, it might be a function. In a policy document, it might be a numbered section. Chunking at the natural boundary almost always outperforms fixed-token splitting because the embedding then represents a semantically complete unit. If you can't identify a natural unit, start with 300 tokens and 75-token overlap as a neutral baseline, then measure.
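As one possible way to chunk at a natural boundary, the sketch below packs whole paragraphs into chunks up to a token budget. It assumes paragraphs are separated by blank lines and counts whitespace-separated words as tokens; the function name `chunk_by_paragraph` is hypothetical, and a paragraph larger than the budget is kept whole rather than split.

```python
def chunk_by_paragraph(text, max_tokens=300):
    """Group whole paragraphs into chunks without exceeding max_tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_tokens = len(paragraph.split())  # word count as a token estimate
        if current and current_len + n_tokens > max_tokens:
            # Adding this paragraph would overflow, so close the chunk first.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


note = (
    "We chose PostgreSQL today.\n\n"
    "The team knows SQL.\n\n"
    "Migration starts Monday."
)
chunks = chunk_by_paragraph(note, max_tokens=8)
print(len(chunks))  # 2
```

No chunk produced this way ends inside a paragraph, so the bisection failure cannot occur at paragraph granularity.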

Always use overlap — a chunk that starts with context almost never hurts

Overlap means the first 50–100 tokens of each chunk repeat the last 50–100 tokens of the previous chunk. An idea that straddles a chunk boundary then appears whole in at least one chunk, provided it is no longer than the overlap. A chunk that opens with a sentence fragment from the previous chunk almost never causes a retrieval problem: the embedding captures the full text, and the fragment contributes little to its semantic signal. A chunk that cuts off mid-sentence almost always causes a retrieval problem, because the embedding captures an incomplete idea and the LLM fills in the rest from its training data. Default to 50–75 tokens of overlap and increase it only if you're seeing bisection failures in your evaluation set.
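A toy check can show the effect of overlap on a sentence that straddles a boundary. Integers stand in for tokens, the helper names are hypothetical, and the sketch assumes the overlap is smaller than the window size.

```python
def windows(tokens, size, overlap):
    """Fixed-size windows; assumes 0 <= overlap < size."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


def contains_span(chunk, span):
    """True if `span` appears contiguously inside `chunk`."""
    n = len(span)
    return any(chunk[i:i + n] == span for i in range(len(chunk) - n + 1))


tokens = list(range(20))   # integers stand in for tokens
sentence = [8, 9, 10, 11]  # crosses the boundary at position 10

no_overlap = windows(tokens, size=10, overlap=0)
with_overlap = windows(tokens, size=10, overlap=4)

whole_without = any(contains_span(c, sentence) for c in no_overlap)
whole_with = any(contains_span(c, sentence) for c in with_overlap)
print(whole_without, whole_with)  # False True
```

Without overlap the four-token sentence is split across two chunks; with four tokens of overlap it appears intact in the second window.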

Build the evaluation set before you build the pipeline

Write 10–20 query/answer pairs against your knowledge base before you embed anything (the lab uses a 5-query starter set). Choose queries that span the range of what the system will be asked: factual lookups, conceptual questions, terminology searches. For each query, identify which chunks in the knowledge base should come back. This set becomes your retrieval precision and recall benchmark. Without it, you cannot tell whether a change to chunk size or embedding model made things better or worse. NIST MEASURE calls for this kind of pre-deployment baseline, and it's the difference between engineering a retrieval layer and guessing at one.
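An evaluation set of this kind can be represented as plain data and run against any retriever. In the sketch below, `EvalCase`, `evaluate`, the chunk IDs, and the canned retriever are all hypothetical stand-ins; a real run would pass in the actual search function.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    expected_chunks: set  # chunk IDs that should come back for this query


def evaluate(cases, retrieve, k=5):
    """Return per-query (query, precision, recall) rows and the two means."""
    rows = []
    for case in cases:
        got = set(retrieve(case.query, k))
        hits = got & case.expected_chunks
        precision = len(hits) / len(got) if got else 0.0
        recall = len(hits) / len(case.expected_chunks) if case.expected_chunks else 0.0
        rows.append((case.query, precision, recall))
    mean_precision = sum(row[1] for row in rows) / len(rows)
    mean_recall = sum(row[2] for row in rows) / len(rows)
    return rows, mean_precision, mean_recall


cases = [
    EvalCase("which database did we choose and why", {"adr-012#2", "adr-012#3"}),
    EvalCase("max_connections setting", {"ops-notes#4"}),
]

# Canned results stand in for a real retrieval layer (an assumption).
canned = {
    "which database did we choose and why": ["adr-012#2", "retro-2021#1"],
    "max_connections setting": ["ops-notes#4"],
}
rows, mean_precision, mean_recall = evaluate(cases, lambda query, k: canned[query][:k])
print(mean_precision, mean_recall)  # 0.75 0.75
```

Rerunning the same cases after each change to chunk size, overlap, or embedding model gives the before-and-after comparison the paragraph describes.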

In the lab you'll apply all three to a real knowledge domain you choose. The AI reviewer will challenge your chunk size and ask what you'd do if retrieval quality is poor after deployment.

⚙ Skill Lab
Chunking and Embedding Specification
~25 minutes · 4 decisions + evaluation set
What you're doing
Choose a real knowledge domain you own — your Obsidian vault, a codebase, a document set, a set of notes. Work through four decisions: chunk size, overlap, embedding approach, and retrieval method. Then define 5 queries you'd use to verify retrieval quality before connecting the LLM.
Roles
🔧
You — Retrieval Engineer
You're designing the chunking layer for a real knowledge base. Every decision needs a justification and a failure mode.
🔍
AI — Retrieval Reviewer
I'll challenge your chunk size choice and ask what you'd do if retrieval quality is poor after deployment. I expect NIST MEASURE reasoning.
Four decisions to make
Chunk size — tokens per chunk and why
Overlap — tokens of overlap and why
Embedding approach and retrieval method — embedding model, plus semantic, keyword, or hybrid retrieval
5 evaluation queries — what you'd test retrieval against
Framework reminders
NIST MEASURE — define retrieval precision and recall targets before going live
NIST MAP — name the failure mode for your chunk size choice
Art. 13 — source attribution is only valid if the right chunk comes back
Success criteria
A complete specification with every decision justified. An evaluation set of 5 queries with expected chunks named. A clear answer to what you'd change first if retrieval quality is poor.
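One way to keep the specification honest is to write it as structured data and check it for gaps. The sketch below is only a suggested shape: the class names, field names, and sample values are hypothetical, and the checks mirror the success criteria above (every decision justified with a failure mode, and 5 evaluation queries).

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    value: str
    justification: str
    failure_mode: str


@dataclass
class RetrievalSpec:
    chunk_size: Decision
    overlap: Decision
    embedding: Decision
    retrieval_method: Decision
    eval_queries: dict = field(default_factory=dict)  # query -> expected chunk IDs

    def problems(self):
        """List what is still missing from the specification."""
        issues = []
        for name in ("chunk_size", "overlap", "embedding", "retrieval_method"):
            decision = getattr(self, name)
            if not decision.justification.strip():
                issues.append(f"{name}: missing justification")
            if not decision.failure_mode.strip():
                issues.append(f"{name}: missing failure mode")
        if len(self.eval_queries) < 5:
            issues.append("need at least 5 evaluation queries")
        return issues


# Sample values are illustrative only.
spec = RetrievalSpec(
    chunk_size=Decision("300 tokens", "paragraph-sized decision records", "embedding dilution"),
    overlap=Decision("50 tokens", "", "bisected context"),
    embedding=Decision("nomic-embed-text", "runs locally", "weak on exact identifiers"),
    retrieval_method=Decision("hybrid", "queries mix concepts and exact terms", "fusion miscalibrated"),
    eval_queries={"which database did we choose and why": ["adr-012#2"]},
)
print(spec.problems())
```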