You've designed the vault and chosen a chunking strategy. Now you need to wire everything together into a pipeline that actually runs — and keeps running. A retrieval pipeline isn't a script you run once. It's a system with components that fail independently, indexes that go stale, and retrieval quality that drifts over time without any error message to tell you.
Most developers build the happy path first: embed the documents, store the embeddings, query against them. It works. Then a month later, the index is stale, retrieval is silently returning nothing useful, and nobody knows because there's no monitoring. The system hasn't crashed — it's just wrong.
This module teaches the four-component architecture of a production-grade retrieval pipeline and asks you to design one end-to-end — from indexer trigger strategy to monitoring thresholds — for a specific knowledge base. The design has to hold up under two adversarial questions: what happens when the vault grows overnight, and what does the system do when retrieval finds nothing above threshold?
A developer has an Obsidian vault with 2,000 well-structured notes, a chunking strategy from the previous module, and an embedding model they've already validated. They decide to wire it all together over a weekend. The implementation is straightforward.
They write a Python script that reads every note in the vault, splits each note into chunks using their chunking strategy, embeds each chunk using the OpenAI embeddings API, and stores the embeddings in a local JSON file alongside the chunk text. For querying, the script embeds the incoming query, computes cosine similarity against every stored embedding, and returns the top five chunks. Those chunks get injected into a prompt, and the LLM generates an answer. Total time to build: one afternoon. It works.
Two weeks later, they've added 300 new notes to the vault. Project notes, reading summaries, meeting decisions. Good material. They query the system about a decision from last week. The system returns nothing useful — it's retrieving chunks from old notes that don't address the question. No error. No warning. The pipeline completed successfully. The index just doesn't know the new notes exist.
The developer doesn't notice for a week. In that week, they make three decisions partly informed by the assistant's responses, without realizing the assistant has no knowledge of the past two weeks.
This isn't a bug in the embedding model, and it isn't a bug in the cosine similarity function. It's a missing component: there is no indexer trigger, no freshness policy, and no monitoring. The script has an indexer (run by hand), a retrieval step, and an injection step, which is three of the four components a pipeline needs to be a system. The fourth, monitoring, was never built, and the indexer that does exist has nothing governing when it runs.
A retrieval pipeline has four components. Each can fail independently. Each requires a governance owner. Building three of the four gives you something that looks like it works until it silently doesn't.
The indexer converts documents to chunks and embeddings and writes them to the vector store. Its one job is to keep the index current. A manually triggered indexer always goes stale. When the vault changes, someone has to remember to re-run the script. That dependency on human memory is a failure waiting to happen.
Automatic trigger strategies: file system watchers fire when a file in the vault directory is created, modified, or deleted; scheduled jobs rebuild the index on a fixed interval (nightly, hourly); webhooks fire when a source system notifies the indexer of a change. Each has tradeoffs. File system watchers are near real-time but require a continuously running process. Scheduled jobs are simple but introduce a lag window as long as the interval. Webhooks avoid polling but only work when the source system can send change notifications. The trigger strategy must match the freshness requirement of the use case.
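A scheduled trigger can be sketched with nothing but the standard library. The version below is a minimal illustration, not a full indexer: the function names `snapshot_mtimes` and `needs_reindex` are hypothetical, and it assumes notes are Markdown files and that the time of the last successful index run is stored somewhere.

```python
from pathlib import Path


def snapshot_mtimes(vault_dir: str) -> dict[str, float]:
    """Map each Markdown note (path relative to the vault) to its modification time."""
    root = Path(vault_dir)
    return {
        str(path.relative_to(root)): path.stat().st_mtime
        for path in root.rglob("*.md")
    }


def needs_reindex(snapshot: dict[str, float], last_indexed_at: float) -> list[str]:
    """Return notes modified after the last successful index run.

    Assumption: last_indexed_at is a Unix timestamp the indexer saved
    when it last finished. Output is sorted so runs are reproducible.
    """
    return sorted(
        path for path, mtime in snapshot.items() if mtime > last_indexed_at
    )
```

A nightly job would call `snapshot_mtimes` on the vault, pass the result to `needs_reindex`, and re-embed whatever comes back. Modification times alone do not reveal deleted notes, so a job built this way still needs a separate pass to remove chunks whose source file no longer exists.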
The retrieval layer takes a query, embeds it, and returns the top-K chunks from the vector store that score above a similarity threshold. Three decisions define its behavior:
K — how many chunks to retrieve. Too few and relevant material is missed. Too many and the injection layer is overwhelmed with noise. For a personal knowledge base, K between 4 and 8 is a reasonable starting range. For a technical documentation system where precision matters more than recall, K of 3 may be better.
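Similarity threshold: the minimum score below which chunks are discarded even if they're in the top-K. Without a threshold, retrieval always returns K results regardless of relevance. Cosine similarity ranges from -1 to 1, though scores from text embedding models typically fall between 0 and 1. A threshold around 0.72 is a workable starting point, but score distributions differ between embedding models, so calibrate it against a handful of queries whose correct notes you already know. Too high and you get empty results; too low and you retrieve noise.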
Fallback behavior — what happens when retrieval returns empty or all scores fall below threshold. This is not an edge case. It's a design decision that must be explicit before the pipeline goes live.
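The three decisions above fit in one small function. The sketch below is a brute-force illustration in plain Python: `retrieve`, `Hit`, and the dictionary-shaped index are hypothetical names, and the defaults for K and the threshold are the starting values from the text rather than recommendations for any particular embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    score: float


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: list[float],
    index: dict[str, list[float]],
    k: int = 5,
    threshold: float = 0.72,
) -> tuple[list[Hit], bool]:
    """Return (hits, fallback_triggered).

    Takes the top-k chunks by score, then drops any below the threshold.
    fallback_triggered is True when nothing survives the threshold.
    """
    scored = [Hit(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda hit: hit.score, reverse=True)
    hits = [hit for hit in scored[:k] if hit.score >= threshold]
    return hits, not hits
```

Returning the fallback flag alongside the hits forces the caller to handle the empty case explicitly instead of passing an empty context to the LLM.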
The injection layer takes retrieved chunks and formats them into the prompt that reaches the LLM. Three decisions shape the output quality:
Source attribution format — how does the prompt tell the LLM where each chunk came from? Embedding the note title and section header in the chunk prefix lets the LLM cite sources in its answer. Without this, the LLM cannot tell the user which note informed its response.
Conflicting content handling — when two retrieved chunks say contradictory things (a decision was made, then revised in a later note), how does the injection layer signal that to the LLM? One approach: inject a timestamp with each chunk so the LLM can reason about recency.
Token budget — retrieved chunks consume part of the context window that would otherwise go to the LLM's response. Define the maximum number of tokens allocated to retrieved context. A common allocation: 40% of the context window for retrieved content, leaving 60% for system prompt and generation.
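The injection decisions can be combined in one formatting function. This is a sketch under stated assumptions: `Chunk`, `estimate_tokens`, and `build_context` are hypothetical names, the bracketed source prefix is one possible attribution format, and the four-characters-per-token estimate is a crude heuristic that should be replaced with the tokenizer for your model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    note_title: str
    section: str
    modified: str  # ISO date of the source note, so the LLM can reason about recency
    text: str


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def build_context(
    chunks: list[Chunk], context_window: int, retrieved_share: float = 0.40
) -> tuple[str, int]:
    """Format chunks with source and date prefixes, stopping at the token budget.

    Assumes chunks arrive sorted by descending similarity, so the
    lowest-scoring chunks are the ones dropped when the budget runs out.
    Returns (context_text, estimated_tokens_used).
    """
    budget = int(context_window * retrieved_share)
    blocks: list[str] = []
    used = 0
    for chunk in chunks:
        block = (
            f"[Source: {chunk.note_title} > {chunk.section} "
            f"| modified {chunk.modified}]\n{chunk.text}"
        )
        cost = estimate_tokens(block)
        if used + cost > budget:
            break
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks), used
```

Stopping at the first chunk that does not fit keeps the highest-scoring material and never truncates a chunk mid-sentence.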
The monitoring layer logs retrieval quality metrics for every query and fires alerts when those metrics fall outside acceptable ranges. Without monitoring, failures are silent. The system can return wrong answers for days before anyone notices.
Minimum metrics to log: query text, number of results returned, highest similarity score returned, whether the fallback triggered, and the LLM's response. Over time, these logs reveal patterns: queries that consistently miss, similarity scores that trend downward as the corpus grows, fallback rates that indicate a gap between what users ask and what the index contains.
Alert threshold: when the rate of fallback-triggered queries exceeds 15% of queries in a rolling 24-hour window, something is wrong — either the index is stale, the chunking strategy doesn't match query patterns, or the threshold is miscalibrated.
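The fallback-rate alert is simple arithmetic over the query log. The sketch below assumes an in-memory list of log entries; `QueryLog`, `fallback_rate`, and `should_alert` are hypothetical names, and a real deployment would read the same fields from wherever the logs are stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class QueryLog:
    timestamp: datetime
    query: str
    num_results: int
    top_score: Optional[float]  # None when retrieval returned nothing
    fallback_triggered: bool
    response: str


def fallback_rate(
    logs: list[QueryLog], now: datetime, window: timedelta = timedelta(hours=24)
) -> float:
    """Fraction of queries inside the rolling window that triggered the fallback."""
    recent = [entry for entry in logs if now - window <= entry.timestamp <= now]
    if not recent:
        return 0.0
    return sum(entry.fallback_triggered for entry in recent) / len(recent)


def should_alert(logs: list[QueryLog], now: datetime, max_rate: float = 0.15) -> bool:
    """True when the rolling 24-hour fallback rate exceeds max_rate."""
    return fallback_rate(logs, now) > max_rate
```

An empty window returns a rate of 0.0 here, which means a pipeline receiving no queries never alerts. If silence is itself a failure signal for your system, track query volume as a separate metric.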
The MANAGE function of the NIST AI Risk Management Framework (AI RMF) calls for defined response plans for identified failures. For a retrieval pipeline, this means: when the monitoring layer fires an alert, who responds? What is the playbook? "The developer will look into it" is not a response plan. The response plan should name a specific person, a maximum response time, and the first three diagnostic steps.
The GOVERN function of the NIST AI RMF calls for clear accountability for AI system components. For a four-component pipeline, each component must have a named owner: who is responsible for the indexer trigger staying functional? Who owns the similarity threshold calibration? Who reviews monitoring logs? In a personal system, all four owners are the same person, but the ownership must still be explicit, because it makes the governance question concrete: when did you last check that the indexer is running?
Article 17 of the EU AI Act requires providers of high-risk AI systems to maintain a quality management system: documented processes for maintaining system quality over time. For a RAG pipeline used in consequential decisions (medical, legal, financial), this means documented freshness policies, documented retrieval parameter choices with rationale, and a log of changes to those parameters. The pipeline design document you produce in this lab is the beginning of that quality management record.
The lab asks you to design a full pipeline for a specific knowledge base. Three decisions need to be made before the design can be complete. Each has real tradeoffs. Make the decision explicitly — don't leave it open.
Local vector stores (ChromaDB, FAISS) run on your machine, cost nothing, and keep data private. They don't require an account or API key. The tradeoff: you manage backups, they don't scale to millions of vectors without tuning, and they need either a running process or an index load from disk at query time. Cloud vector stores (Pinecone, Weaviate Cloud) are managed services: you pay per query or per stored vector, they scale automatically, and they offer high availability. For a personal Obsidian vault with 2,000 to 5,000 notes, local is almost always the right answer. For a team knowledge base with continuous writes from multiple contributors, cloud is justified. Don't over-engineer the vector store choice. You can migrate later: the core operations (insert vectors, delete vectors, query by similarity) are the same across stores even though client APIs differ, so keeping store calls behind a thin wrapper makes the switch cheap.
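One way to keep migration cheap is to have the pipeline talk to a small interface of your own rather than to a vendor client directly. The sketch below is an assumption-laden illustration: `VectorStore` and `InMemoryStore` are hypothetical names, the method signatures are not the API of ChromaDB, FAISS, Pinecone, or Weaviate, and a real adapter would translate these calls into the chosen library's own.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface the rest of the pipeline depends on."""

    def upsert(self, chunk_id: str, vector: list[float], text: str) -> None: ...
    def delete(self, chunk_id: str) -> None: ...
    def query(self, vector: list[float], k: int) -> list[tuple[str, float]]: ...


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class InMemoryStore:
    """Brute-force stand-in that satisfies VectorStore; useful for tests."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], str]] = {}

    def upsert(self, chunk_id: str, vector: list[float], text: str) -> None:
        self._rows[chunk_id] = (vector, text)

    def delete(self, chunk_id: str) -> None:
        self._rows.pop(chunk_id, None)

    def query(self, vector: list[float], k: int) -> list[tuple[str, float]]:
        """Return up to k (chunk_id, score) pairs, best score first."""
        scored = [(cid, _cosine(vector, vec)) for cid, (vec, _) in self._rows.items()]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:k]
```

With this shape, moving from a local store to a managed one means writing a second class with the same three methods and leaving the indexer and retrieval code unchanged.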
When the retrieval layer returns empty — either because no chunks scored above threshold or because the index doesn't cover the query topic — what does the pipeline do? Three options: return nothing and tell the user the system doesn't have relevant information on this topic (honest, but may frustrate users); fall back to a general-purpose LLM response without retrieved context (risks hallucination, but maintains responsiveness); escalate to human (appropriate for high-stakes contexts where a wrong answer is worse than no answer). Each fallback has a different risk profile. Choose one explicitly and document why. You can build different fallbacks for different query types, but that adds complexity — justify it if you go that route.
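Making the fallback explicit can be as simple as an enum and one dispatch function. In this sketch, `FallbackPolicy`, `handle_empty_retrieval`, the message wording, and the caller-supplied `generate` callable are all hypothetical; the escalation branch only returns a marker, since how a human is notified depends on your setup.

```python
from enum import Enum
from typing import Callable, Optional


class FallbackPolicy(Enum):
    DECLINE = "decline"          # say the system has no relevant information
    GENERAL_LLM = "general_llm"  # answer without retrieved context
    ESCALATE = "escalate"        # hand off to a human


def handle_empty_retrieval(
    query: str,
    policy: FallbackPolicy,
    generate: Optional[Callable[[str], str]] = None,
) -> dict:
    """Return the action taken and the message to show when retrieval is empty."""
    if policy is FallbackPolicy.DECLINE:
        return {
            "action": "decline",
            "message": "I don't have relevant notes on this topic.",
        }
    if policy is FallbackPolicy.GENERAL_LLM:
        if generate is None:
            raise ValueError("GENERAL_LLM policy needs a generate callable")
        # Label the answer so the user knows it is not grounded in the vault.
        return {
            "action": "general_llm",
            "message": "[No supporting notes found; answer is not grounded in the vault]\n"
            + generate(query),
        }
    if policy is FallbackPolicy.ESCALATE:
        return {"action": "escalate", "message": f"Escalated to a human reviewer: {query}"}
    raise ValueError(f"Unknown fallback policy: {policy}")
```

Because the policy is an argument, the chosen fallback is visible in the pipeline configuration, and the `action` value can be written straight into the monitoring log.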
How often does the index rebuild, and what triggers it? For a personal vault updated a few times per day: a nightly scheduled rebuild is simple and reliable. The lag window is at most 24 hours — acceptable for personal knowledge work. For a team knowledge base with continuous updates from multiple contributors: event-triggered incremental indexing is worth the complexity. The indexer watches for file changes and re-embeds only the changed documents, updating the vector store in place. Incremental indexing is faster per update but harder to implement correctly — you need to handle deletions, moves, and renames, not just new content. State your freshness policy with the lag window it creates and the failure mode if the trigger mechanism breaks.
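The hard part of incremental indexing is classifying what changed. The sketch below compares two manifests that map each note path to a hash of its content. The manifest format and the name `plan_incremental_update` are assumptions for illustration, and treating a deleted path and an added path with identical content as a rename is a heuristic, not a guarantee.

```python
def plan_incremental_update(old: dict[str, str], new: dict[str, str]) -> dict:
    """Classify changes between two {path: content_hash} manifests.

    Returns a dict with:
      added    - new paths whose content must be embedded
      modified - existing paths whose content hash changed (re-embed)
      deleted  - paths whose chunks must be removed from the index
      renamed  - (old_path, new_path) pairs with identical content, where
                 only the stored source path needs updating
    """
    removed = {path: digest for path, digest in old.items() if path not in new}
    added = {path: digest for path, digest in new.items() if path not in old}
    modified = sorted(path for path in new if path in old and old[path] != new[path])

    # Group removed paths by content hash so added paths can be matched to them.
    removed_by_hash: dict[str, list[str]] = {}
    for path in sorted(removed):
        removed_by_hash.setdefault(removed[path], []).append(path)

    renamed: list[tuple[str, str]] = []
    for new_path in sorted(added):
        candidates = removed_by_hash.get(added[new_path])
        if candidates:
            renamed.append((candidates.pop(0), new_path))

    renamed_old = {old_path for old_path, _ in renamed}
    renamed_new = {new_path for _, new_path in renamed}
    return {
        "added": sorted(path for path in added if path not in renamed_new),
        "modified": modified,
        "deleted": sorted(path for path in removed if path not in renamed_old),
        "renamed": renamed,
    }
```

Detecting renames avoids paying to re-embed a note that was only moved to another folder. If the trigger mechanism breaks, the stored manifest simply stops advancing, so the age of the last saved manifest is a useful freshness metric to monitor.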
These three decisions — vector store, fallback, freshness policy — define the operational character of your pipeline. The lab reviewer will probe all three. Have a specific answer for each before you start designing.