Most RAG failures don't look like failures. The model answers confidently. The response sounds plausible. The user nods and moves on. What they don't see is that the chunk that came back from the vector store was the wrong one — or half of the right one — and the model filled in the gap from its training data without any indication it had done so.
Retrieval quality is not a property of the embedding model. It's a property of how you cut the knowledge base before you ever embed it. Chunking strategy is the most underengineered part of most RAG systems, and it's the part that determines whether retrieval returns something useful or something confidently wrong.
This module teaches how retrieval quality is actually measured and improved — starting with the chunk, not the model. You'll build a real specification for a real knowledge domain: chunk size, overlap, embedding approach, and an evaluation set that tells you whether the retrieval layer is working before you connect it to the LLM.
A developer has spent three weekends building a RAG pipeline over their Obsidian vault. They've embedded five years of decision notes, architecture records, and project retrospectives. They've built a vector search layer using a local embedding model, connected it to an LLM, and deployed it as a personal assistant that can answer questions about their own work history.
The answers come back fast. They sound authoritative. They cite note titles. They feel correct. Then the developer asks about a key technology decision — which database they chose and why — and gets an answer that names the right database but gives entirely the wrong reason. The cited note exists. The answer sounds plausible. The reasoning is fabricated.
The root cause is chunking. The developer used a simple 500-token fixed split with no overlap. One of the most important notes in the vault, the decision record for the database choice, was bisected mid-sentence. The embedded chunk ended with the words "…we chose PostgreSQL over MongoDB because of the" and nothing more. The embedding captured the topic and the choice, but not the reasoning. The retrieval layer returned that chunk as the most relevant match. The LLM received half a decision and completed the other half from its training data.
The model had no way to know the chunk was incomplete. The developer had no way to know the retrieval layer was returning half-sentences. The system produced a confident, wrong answer about real past decisions — and the developer nearly used it in a project proposal.
The fix isn't a better embedding model. It's a chunking strategy that doesn't cut sentences in half.
The single most important insight in retrieval engineering is this: chunking strategy determines retrieval quality more than embedding model choice. You can swap in the best embedding model available and still get garbage retrieval if your chunks cut the knowledge in the wrong places. The inverse also holds: a well-chunked knowledge base can retrieve correctly even with a modest embedding model.
Chunk size is a tension between precision and context. Small chunks (under 150 tokens) embed a tight semantic unit — a single claim, a single step, a single assertion. Retrieval precision is high because the embedding represents exactly one thing. But the retrieved chunk often lacks the surrounding context the LLM needs to answer correctly. You get the right fact without the explanation.
Large chunks (over 600 tokens) embed multiple ideas. The embedding becomes an average of those ideas, which reduces precision: a query that matches only one idea in the chunk may not score high enough to retrieve it. You get more context when the chunk does come back, but retrieval recall suffers because the averaged embedding dilutes the signal of any single idea.
The practical sweet spot for most prose knowledge bases is 200–400 tokens with 50-token overlap. This range captures enough context for the LLM to reason, while keeping the embedding tight enough to score consistently on relevant queries.
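A fixed-size split with overlap can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and the token list here is just a list of placeholder words standing in for the output of whatever tokenizer your embedding model actually uses.

```python
def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` tokens.

    Each window repeats the last `overlap` tokens of the previous window.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Assumption: placeholder words stand in for real tokenizer output.
words = [f"w{i}" for i in range(700)]
chunks = chunk_tokens(words, size=300, overlap=50)
print([len(c) for c in chunks])
```

With 700 tokens, a 300-token window, and 50 tokens of overlap, the windows start at tokens 0, 250, and 500, so the last chunk is shorter than the others. Note that this splitter still cuts wherever the window ends; overlap only softens the damage.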
Semantic embeddings — OpenAI text-embedding-3-small, Cohere embed, or local models like nomic-embed-text — map text to a vector space where meaning proximity corresponds to vector proximity. A query about "database selection rationale" will score highly against a chunk discussing "why we chose PostgreSQL over MongoDB," even if the exact words don't overlap. Semantic search wins for conceptual queries where the user doesn't know the exact terminology used in the source documents.
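"Vector proximity" is usually measured with cosine similarity. The sketch below shows the ranking step only. The three-dimensional vectors and chunk IDs are invented for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the IDs of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), cid)
              for cid, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy vectors (assumption): stand-ins for real embedding output.
chunk_vecs = {
    "db-decision": [0.8, 0.2, 0.1],
    "deploy-notes": [0.1, 0.9, 0.2],
    "retro": [0.0, 0.2, 0.9],
}
query_vec = [0.9, 0.1, 0.0]
print(top_k(query_vec, chunk_vecs, k=2))
```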
Keyword search (BM25 and its variants) scores chunks based on term frequency and inverse document frequency. It wins when the query uses exact terminology — searching for a specific function name, a configuration key, or a proper noun that wouldn't appear in a paraphrase. Keyword search fails on synonyms and conceptual queries.
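To make the term-frequency and inverse-document-frequency idea concrete, here is a compact sketch of one common BM25 formulation. The defaults `k1=1.5` and `b=0.75` are conventional starting values rather than anything prescribed by this module, the smoothed IDF is one of several variants in use, and the sample chunks are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (an assumed variant): rare terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents get less credit per hit.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented sample chunks, tokenized by whitespace for simplicity.
docs = [
    "run pg_dump before the migration".split(),
    "we back up the database nightly".split(),
    "the migration plan covers rollback".split(),
]
exact = bm25_scores(["pg_dump"], docs)
paraphrase = bm25_scores(["backup"], docs)
print(exact, paraphrase)
```

The exact term `pg_dump` scores only against the chunk that contains it, while the query `backup` scores zero everywhere because the source text says "back up". That is the synonym failure described above.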
Hybrid retrieval combines both: run the semantic search and the keyword search independently, then merge the result sets using reciprocal rank fusion or a weighted score. Hybrid retrieval outperforms either approach alone on most real knowledge bases because real queries are a mix of conceptual and terminological. The cost is complexity — two retrieval pipelines to maintain and a fusion step to calibrate.
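Reciprocal rank fusion is simple enough to sketch directly: each chunk earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The constant `k=60` is a commonly used default, not a tuned value, and the chunk IDs and function name are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical output of the two independent retrievers.
semantic_hits = ["c2", "c1", "c3"]
keyword_hits = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, it avoids having to put cosine similarities and BM25 scores on a common scale. A weighted-score merge would need that normalization step.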
Retrieval quality has two components. Retrieval precision asks: of the chunks that came back, what fraction were actually relevant? A retrieval layer that returns five chunks, two of which are relevant, has 40% precision. Retrieval recall asks: of all the relevant chunks in the knowledge base, what fraction came back? A retrieval layer that misses three out of five relevant chunks has 40% recall.
Both metrics matter. High precision with low recall means you're returning good chunks but missing important ones. High recall with low precision means the LLM is drowning in noise. You want both above 70% before connecting the retrieval layer to an LLM.
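Both metrics reduce to set arithmetic over chunk IDs. The sketch below reproduces the two 40% examples above; the chunk IDs and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given chunk IDs."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Five chunks came back, two of them relevant; five relevant chunks exist.
retrieved = ["a", "b", "x", "y", "z"]
relevant = ["a", "b", "c", "d", "e"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```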
The MEASURE function of the NIST AI Risk Management Framework (AI RMF) calls for testing AI system behavior against its intended purpose. For RAG, the intended purpose of the retrieval layer is to return the right chunks. Before connecting retrieval to an LLM, define retrieval precision and recall targets, build a set of known-good query/answer pairs, run retrieval against those queries, and measure whether the right chunks come back. If you skip this step, you have no baseline to compare against when the system behaves unexpectedly in production.
The MAP function of the NIST AI RMF calls for identifying failure modes before deployment. For chunking, the primary failure modes are: (1) bisected context, where a chunk ends mid-sentence or mid-argument and the LLM hallucinates the missing portion; (2) embedding dilution, where an oversized chunk's embedding averages across too many concepts and relevant chunks score below the retrieval threshold; (3) boundary mismatch, where chunks are cut at token boundaries rather than semantic boundaries, fragmenting complete ideas across two chunks, neither of which is independently useful.
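The first failure mode, bisected context, can be screened for cheaply before anything is embedded. The check below is a rough heuristic under a stated assumption: a chunk that does not end in sentence-final punctuation was probably cut mid-sentence. It will misfire on headings, list items, and code, so treat the result as a prompt to inspect chunks, not a verdict. Both function names are hypothetical.

```python
def ends_mid_sentence(chunk):
    """Heuristic: True if the chunk does not end with ., ! or ?"""
    # Ignore trailing whitespace and closing quotes or brackets.
    text = chunk.rstrip().rstrip("\"')]")
    if not text:
        return False  # an empty chunk is a different problem
    return not text.endswith((".", "!", "?"))


def bisection_rate(chunks):
    """Fraction of chunks that appear to end mid-sentence."""
    if not chunks:
        return 0.0
    return sum(ends_mid_sentence(c) for c in chunks) / len(chunks)


sample_chunks = [
    "we chose PostgreSQL over MongoDB because of the",
    "The migration finished on schedule.",
]
print(bisection_rate(sample_chunks))
```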
Article 13 requires that AI systems provide meaningful information about how they produce outputs when those outputs affect users. For RAG systems, source attribution is the mechanism — citing the specific document or chunk that grounded the answer. But attribution is only meaningful if the cited chunk actually contains the information the LLM used. A RAG system that returns half-sentences and attributes answers to those half-sentences is providing attribution that is technically present but substantively misleading. Retrieval quality is a prerequisite for attribution integrity.
In the lab you'll design a chunking and embedding specification for a real knowledge domain. Three things will determine whether your specification is defensible.
Before you choose a chunk size, read twenty documents from your knowledge base and ask: what is the natural unit of a complete thought here? In an Obsidian vault of decision records, the natural unit might be a paragraph — each paragraph makes one claim. In a codebase, it might be a function. In a policy document, it might be a numbered section. Chunking at the natural boundary almost always outperforms fixed-token splitting because the embedding then represents a semantically complete unit. If you can't identify a natural unit, start with 300 tokens and 75-token overlap as a neutral baseline, then measure.
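When the natural unit is a paragraph, one simple approach is to split on blank lines and pack whole paragraphs into chunks up to a token budget. The sketch below assumes that paragraphs are separated by blank lines and approximates tokens by whitespace-separated words; both are simplifications, and `chunk_by_paragraph` is a hypothetical name.

```python
import re


def chunk_by_paragraph(text, max_tokens=300):
    """Pack whole paragraphs into chunks of at most max_tokens words.

    A paragraph is never split; one longer than max_tokens becomes
    its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # word count as a rough token estimate
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


note = "A b c.\n\nD e f g.\n\nH i."
print(chunk_by_paragraph(note, max_tokens=7))
```

Keeping oversized paragraphs whole is a design choice that favors complete thoughts over a strict size limit. If your knowledge base has very long paragraphs, you would need a fallback such as splitting at sentence boundaries.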
Overlap means the first 50–100 tokens of each chunk are the last 50–100 tokens of the previous chunk. This makes it likely that an idea spanning a chunk boundary appears whole in at least one chunk, provided the idea is no longer than the overlap itself. A chunk that opens with a sentence fragment from the previous chunk almost never causes a retrieval problem, because the embedding captures the full text and the semantic signal of the overlap tokens is minor. A chunk that cuts off mid-sentence almost always causes a retrieval problem, because the embedding captures an incomplete idea and the LLM is left to hallucinate the rest. Default to 50–75 tokens of overlap and increase it only if you're seeing bisection failures in your evaluation set.
Write 10–20 query/answer pairs against your knowledge base before you embed anything. Choose queries that span the range of what the system will be asked: factual lookups, conceptual questions, terminology searches. For each query, identify which chunks in the knowledge base should come back. This set becomes your retrieval precision and recall benchmark. If you don't have this before you go live, you have no way to know whether a change to chunk size or embedding model made things better or worse. The NIST AI RMF MEASURE function calls for this kind of pre-deployment baseline, and it's the difference between engineering a retrieval layer and guessing at one.
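An evaluation set can be as simple as a list of queries paired with the chunk IDs that should come back. The sketch below averages precision and recall across such a set. The `EvalCase` type, the `evaluate` function, and the canned `stub_retrieve` function are all hypothetical; in practice you would pass in your real retrieval function with the same `(query, k)` signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    query: str
    relevant_ids: frozenset  # chunk IDs that should be retrieved


def evaluate(retrieve, cases, k=5):
    """Average precision and recall of `retrieve` over the eval cases."""
    if not cases:
        return {"precision": 0.0, "recall": 0.0}
    precisions, recalls = [], []
    for case in cases:
        retrieved = set(retrieve(case.query, k))
        hits = len(retrieved & case.relevant_ids)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(case.relevant_ids)
                       if case.relevant_ids else 0.0)
    return {
        "precision": sum(precisions) / len(cases),
        "recall": sum(recalls) / len(cases),
    }


cases = [
    EvalCase("why PostgreSQL over MongoDB?", frozenset({"db-1", "db-2"})),
    EvalCase("what is the pg_dump schedule?", frozenset({"ops-3"})),
]

# Stand-in for a real retrieval layer (assumption): canned results.
canned = {
    "why PostgreSQL over MongoDB?": ["db-1", "misc-9"],
    "what is the pg_dump schedule?": ["ops-3"],
}


def stub_retrieve(query, k):
    return canned[query][:k]


report = evaluate(stub_retrieve, cases, k=5)
print(report)
```

Rerunning the same `evaluate` call after each change to chunk size, overlap, or embedding model gives you a like-for-like comparison against the baseline.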
In the lab you'll apply all three to a real knowledge domain you choose. The AI reviewer will challenge your chunk size and ask what you'd do if retrieval quality is poor after deployment.