Everything in this course has built toward a single artifact: a production-ready persistent memory system. Not a proof-of-concept. Not a script that works on your laptop. A specified, governed, monitored system that you could hand to a teammate and they could build, maintain, and improve without asking you questions.
This module is the capstone. You'll design the complete system — vault structure, retrieval pipeline, automation, governance, and quality metrics — and document it as a specification. The specification is the deliverable. Not a working demo. Not a prototype. A document precise enough that someone who wasn't there when you built it can operate it correctly.
That constraint is harder than it sounds. Most systems that "work" are not actually specified. The developer who built them carries the real documentation in their head: why the vault is organized this way, how the retrieval parameters were tuned, what the automation triggers watch for, when the governance policy bends. When that developer leaves, the system degrades. The knowledge that made it work was never externalized.
This module forces you to externalize it. Every design decision requires a written rationale. Every component requires a failure mode and a detection method. Every governance rule requires enough precision to resolve a real dispute.
A developer has everything they need: a well-structured Obsidian vault, a working retrieval pipeline, an automation layer that captures knowledge from their workflow, and a governance policy for what goes in the vault. They've used it for three months. It works.
Now they're joining a team, and a teammate will take over maintaining it. The developer sits down to write the handover document and realizes: it's all in their head.
The vault structure exists but the rationale isn't documented. Someone looking at the folder hierarchy can see what folders exist — they can't see why those folders exist, what the intended boundaries are, or what to do when a note doesn't fit cleanly into any category. The developer knows. They've never written it down.
The retrieval parameters were tuned by feel. The chunk size is 400 tokens because that's what worked after trying 200 and 600. The similarity threshold is 0.72 because results below that kept pulling in noise. The developer remembers this. None of it is in the code comments. The script has configuration variables at the top — no explanation of what they control or what happens if you change them.
The automation triggers are in a script with no comments. The nightly index job runs at 2 AM — nobody knows why 2 AM, and there's no documentation of what to do if the job fails silently. The automation adds notes from certain workflows and skips others — the rules are in the developer's head.
The governance policy is an informal agreement with themselves. "I put research notes here, project notes there, and I delete things when they're no longer relevant." None of that is specific enough for someone else to apply consistently. What counts as relevant? How old is old enough to delete? What happens when something belongs in two places?
A system that depends on one person's knowledge of how it works isn't a system — it's a dependency. The developer knows this now. They have two weeks before the handover. They need to write the specification they should have written while building.
A production AI system is one that can be understood, maintained, and improved by someone who wasn't there when it was built. That definition has three parts, and all three matter.
Understood means the rationale is written down — not just the decisions, but why those decisions were made and what trade-offs they accepted. Maintained means failures are recoverable by someone following written procedures, not by the original developer intuiting what probably went wrong. Improved means changes can be tested against a defined standard before they go live, so the system gets better without breaking.
Every component of the RAG system needs a specification. Not "here's the code" — "here's what the code is supposed to do, why it was built this way, what the known failure modes are, and what 'working correctly' looks like." The spec is the artifact, not the code. Code changes. The spec is what makes the next version of the code predictable.
A specification without rationale is a list of facts without a model. Someone can follow a list — they can't apply judgment when reality deviates from the list. Write the rationale for every major decision, including the alternatives you considered and rejected. This is especially important for parameters that were tuned by feel: the chunk size, the similarity threshold, the index schedule. Future maintainers need to know what those parameters control, not just what they're currently set to.
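One way to keep rationale attached to the parameters it explains is to record both in the same place. The following is a minimal sketch, not the course's required format: the `TunedParameter` class, its field names, and the wording of each rationale are assumptions made for illustration, while the values (400 tokens, 0.72, a 2 AM nightly run) come from the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TunedParameter:
    """One tuned setting plus the knowledge a future maintainer needs."""
    name: str
    value: object
    controls: str            # what the parameter governs
    rationale: str           # why this value was chosen
    rejected: tuple = ()     # alternatives that were tried and dropped


# Values come from the scenario; the rationale wording is illustrative.
PARAMETERS = [
    TunedParameter(
        name="chunk_size_tokens",
        value=400,
        controls="Size of each indexed chunk, in tokens.",
        rationale="Worked best after trying 200 and 600.",
        rejected=(200, 600),
    ),
    TunedParameter(
        name="similarity_threshold",
        value=0.72,
        controls="Minimum similarity for a chunk to count as a result.",
        rationale="Results scoring below 0.72 kept pulling in noise.",
    ),
    TunedParameter(
        name="index_schedule",
        value="02:00 nightly",
        controls="When the index job rebuilds the index.",
        rationale="",  # undocumented in the scenario: the gap to close
    ),
]


def undocumented(params):
    """Return names of parameters whose rationale is still missing."""
    return [p.name for p in params if not p.rationale.strip()]


print(undocumented(PARAMETERS))
```

Running the check before a handover lists every parameter that still lives only in someone's head.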
Retrieval precision and recall must be measured on a schedule, not on demand. A quality metric you check when you remember to check it is not a quality metric — it's a hope. Define the evaluation set, the pass/fail thresholds, and what happens when the system fails. All three, in writing, before the system goes live.
The evaluation set is a collection of 10–15 query/answer pairs that represent the most important retrieval scenarios for your system. These are your regression tests. Every significant change to the vault or pipeline runs against this evaluation set before going live. If the change degrades precision below the threshold, it doesn't ship.
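The regression gate described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the note IDs, and the 0.8 precision threshold are hypothetical, and the gate uses mean precision across queries, which is one reasonable choice rather than the only one.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given retrieved and relevant note IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def change_may_ship(results, min_precision):
    """Regression gate: mean precision over the evaluation set must meet the threshold."""
    precisions = [precision_recall(got, want)[0] for got, want in results]
    mean_precision = sum(precisions) / len(precisions)
    return mean_precision >= min_precision, mean_precision


# Hypothetical run: (retrieved IDs, relevant IDs) for each evaluation query.
run = [
    (["a", "b"], ["a"]),     # one of two retrieved notes is relevant
    (["c"], ["c", "d"]),     # the only retrieved note is relevant
]
ok, mean_p = change_may_ship(run, min_precision=0.8)  # 0.8 is a placeholder
print(ok, mean_p)
```

In this run the mean precision is 0.75, so the change would be held back until retrieval improves or the threshold is deliberately revised in the specification.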
A one-page written policy covering vault scope, contribution rules, content lifecycle, and escalation paths. This is the human layer of the system. Code can be read; policy must be written. The governance document is what allows someone else to make content decisions consistently with how the original developer would have made them.
A governance document that cannot resolve a real dispute is not a governance document — it's a statement of intent. Test it by imagining a specific dispute: a note that might belong in two folders, a document that might be outdated but still occasionally useful, an automation trigger that might be adding noise. If the governance document gives you a clear answer to each of those, it's specific enough. If it requires judgment that isn't written down, it needs another revision.
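A rule is specific enough when it can be applied mechanically. As a rough sketch of the two-folder dispute, assuming a hypothetical policy in which folders are ranked and the highest-ranked match wins (the folder names and their order are invented for illustration):

```python
# Hypothetical precedence: earlier entries win when a note fits several folders.
FOLDER_PRECEDENCE = ["projects", "research", "reference"]


def resolve_folder(candidates):
    """Pick the single home for a note that fits more than one folder."""
    ranked = [folder for folder in FOLDER_PRECEDENCE if folder in candidates]
    if not ranked:
        # The policy does not cover this case, so it goes to the escalation path.
        raise ValueError("no candidate is covered by the policy; escalate")
    return ranked[0]


print(resolve_folder({"research", "projects"}))
```

The point is not the particular ordering but that two maintainers applying the written rule to the same note reach the same answer.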
In the NIST AI Risk Management Framework, the GOVERN function requires that roles, responsibilities, and policies for the AI system are defined and documented before the system operates in production. For a persistent memory system, this means a written governance document that names who is responsible for vault quality, who approves changes to the retrieval pipeline, and how disputes about vault content are resolved. The governance document is not optional: it is the GOVERN function in practice.
MAP requires identifying risks before they occur. For a production RAG system, the risk register covers the full failure surface: vault quality degradation (noise accumulates faster than cleanup), retrieval drift (embeddings become stale as vault structure changes), automation failures (the indexing job fails silently and nobody knows), and governance gaps (a content dispute arises that the policy doesn't cover). Name the probability, the impact, and the detection method for each.
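A risk register can be kept as structured data so that gaps are easy to spot. The sketch below uses the four risks named above; the `Risk` class, the low/medium/high scale, the ratings, and the detection methods are all placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass

LEVELS = {"low", "medium", "high"}


@dataclass(frozen=True)
class Risk:
    name: str
    cause: str
    probability: str   # one of LEVELS
    impact: str        # one of LEVELS
    detection: str     # how the failure would be noticed


# Ratings and detection methods are placeholders, not recommendations.
REGISTER = [
    Risk("vault quality degradation", "noise accumulates faster than cleanup",
         "medium", "medium", "scheduled evaluation run shows falling precision"),
    Risk("retrieval drift", "embeddings go stale as vault structure changes",
         "medium", "high", "scheduled evaluation run shows falling recall"),
    Risk("automation failure", "indexing job fails silently",
         "medium", "high", "check the age of the last successful index run"),
    Risk("governance gap", "a content dispute the policy does not cover",
         "low", "medium", ""),   # detection method still undefined
]


def incomplete(register):
    """Names of risks missing a valid rating or a detection method."""
    return [
        r.name for r in register
        if r.probability not in LEVELS
        or r.impact not in LEVELS
        or not r.detection.strip()
    ]


print(incomplete(REGISTER))
```

An entry without a detection method is a risk you have named but cannot yet notice, which is why the check treats it as incomplete.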
MEASURE requires that AI system performance is measured against defined standards. For a RAG system, this means retrieval precision and recall measured against a defined evaluation set on a defined schedule, with defined thresholds and defined consequences for threshold violations. "We'll check it if something seems off" does not satisfy MEASURE. The schedule, the thresholds, and the response to failure must all be written down.
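The difference between a schedule and a hope can be made checkable. In this sketch the function name, the weekly interval, and the threshold values are assumptions; the idea shown is that a missed measurement is itself a reportable state, separate from a failed one.

```python
from datetime import datetime, timedelta


def measure_status(last_run, now, max_age, precision, recall,
                   min_precision, min_recall):
    """Return 'overdue', 'fail', or 'pass' for the scheduled quality check."""
    if now - last_run > max_age:
        return "overdue"      # the schedule itself was missed
    if precision < min_precision or recall < min_recall:
        return "fail"         # triggers the written failure response
    return "pass"


status = measure_status(
    last_run=datetime(2024, 1, 1, 9, 0),
    now=datetime(2024, 1, 15, 9, 0),
    max_age=timedelta(days=7),        # hypothetical weekly schedule
    precision=0.9, recall=0.9,
    min_precision=0.8, min_recall=0.7,  # placeholder thresholds
)
print(status)
```

Here the last measurement is two weeks old against a weekly schedule, so the result is "overdue" even though the most recent numbers were above threshold.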
MANAGE requires that when the AI system produces wrong or harmful outputs, there is a defined response. For a persistent memory system, the incident scenario is that the system produces a wrong answer based on a retrieved document that was outdated, misindexed, or incorrectly included by automation. The MANAGE response defines four things: how the wrong answer is detected, how the root cause is identified (vault issue vs. retrieval issue vs. prompt issue), how the immediate harm is contained, and what systemic change prevents recurrence.
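Root-cause identification can be written as an ordered set of questions rather than left to intuition. The sketch below is one possible decision order, not a prescribed procedure: the three yes/no inputs and the mapping from answers to categories are assumptions made for illustration.

```python
from enum import Enum


class RootCause(Enum):
    VAULT = "vault issue"
    RETRIEVAL = "retrieval issue"
    PROMPT = "prompt issue"


def triage(correct_note_in_vault, correct_note_retrieved, retrieved_note_is_current):
    """Classify a wrong answer by asking three yes/no questions in order."""
    # Missing or outdated source material is a content problem.
    if not correct_note_in_vault or not retrieved_note_is_current:
        return RootCause.VAULT
    # The right note exists and is current, but the pipeline did not surface it.
    if not correct_note_retrieved:
        return RootCause.RETRIEVAL
    # The right, current note was retrieved and the answer was still wrong.
    return RootCause.PROMPT


print(triage(True, False, True).value)
```

Writing the order down matters because it tells a new maintainer which component to inspect first when an incident is reported.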
In the lab, you'll design all six sections of a production RAG system specification. Three things will make your specification pass the handover test.
A production RAG system specification has six sections:
(1) Vault structure and governance: folder hierarchy, naming conventions, content scope, and the governance policy that covers disputes.
(2) Indexing pipeline with trigger logic: what gets indexed, when, and what happens on failure.
(3) Retrieval parameters and fallback behavior: chunk size, similarity threshold, top-k, and what the system does when retrieval returns nothing useful.
(4) Quality metrics with pass/fail thresholds: the evaluation set, the measurement schedule, and the failure response.
(5) Incident response procedures: how a wrong answer is triaged, traced to its root cause, contained, and prevented from recurring.
(6) Change management: how the system evolves without breaking, including the testing gate every change must pass.
Design all six sections. Leaving one out means the handover is incomplete.
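A completeness check for the six sections is simple enough to automate. This sketch assumes the specification is held as a mapping from section title to section text, which is an illustrative choice; the section titles are taken from the list above.

```python
REQUIRED_SECTIONS = [
    "vault structure and governance",
    "indexing pipeline with trigger logic",
    "retrieval parameters and fallback behavior",
    "quality metrics with pass/fail thresholds",
    "incident response procedures",
    "change management",
]


def missing_sections(spec):
    """spec maps section title -> section text; empty text counts as missing."""
    return [s for s in REQUIRED_SECTIONS if not spec.get(s, "").strip()]


# Hypothetical draft with two sections written and one left blank.
draft = {
    "vault structure and governance": "Folder hierarchy, naming, scope, dispute policy.",
    "indexing pipeline with trigger logic": "What is indexed, when, and on failure.",
    "retrieval parameters and fallback behavior": "",
}
print(missing_sections(draft))
```

A check like this only confirms that each section exists. Whether a section is precise enough is still a question for the handover test.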
Define 10–15 query/answer pairs that cover the most important retrieval scenarios for your system. These are your regression tests. Each pair includes the query, the expected answer, the document or note that should be retrieved to generate it, and the minimum similarity score that counts as a successful retrieval. Every significant change to the vault or pipeline runs against this evaluation set before going live. If precision drops below your defined threshold, the change doesn't ship. The evaluation set is not a nice-to-have — it is what makes "working correctly" mean something specific.
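The four fields of each pair map naturally onto a small record type. The following is a sketch under assumptions: the `EvalCase` class, the note path, and the shape of the retriever output (a list of note ID and similarity pairs) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected_answer: str
    expected_note: str       # note that should be retrieved
    min_similarity: float    # lowest score that counts as success


def case_passes(case, hits):
    """hits: list of (note_id, similarity) returned for case.query."""
    return any(
        note == case.expected_note and score >= case.min_similarity
        for note, score in hits
    )


def valid_size(cases, low=10, high=15):
    """The evaluation set should hold between 10 and 15 pairs."""
    return low <= len(cases) <= high


case = EvalCase(
    query="Why is the chunk size 400 tokens?",
    expected_answer="It worked best after trying 200 and 600.",
    expected_note="decisions/chunk-size.md",   # hypothetical path
    min_similarity=0.72,
)
print(case_passes(case, [("decisions/chunk-size.md", 0.81)]))
print(case_passes(case, [("decisions/chunk-size.md", 0.60)]))
```

The second call fails because the right note came back below the minimum score, which is the kind of silent degradation a regression set exists to catch.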
Could someone who has never seen this system read your specification and operate it correctly? This is the production readiness test. Apply it section by section. For the vault structure: could a new maintainer add a note correctly without asking you where it goes? For the retrieval parameters: could they adjust the similarity threshold and know what they're trading off? For the governance policy: could they resolve a content dispute without calling you? For the incident response: could they triage a wrong answer without your intuition about where things usually go wrong? If the answer to any of these is no, the specification is incomplete — not almost done, incomplete.
In the lab, the AI reviewer will probe each section of your specification with specific operational questions. It will ask what happens when the nightly index job fails, not in general but specifically: who is notified, by what mechanism, and what the recovery procedure is. It will ask what the pass/fail thresholds for the quality metrics are and who gets notified when they fail. It will ask whether the governance policy is specific enough to resolve a dispute about what belongs in the vault, and it will describe a specific dispute and ask you to apply your policy to it. A specification that handles these questions has passed the handover test.
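For the index job question, one common way to catch a silent failure is to record the time of the last successful run and check its age from a separate process. The sketch below assumes that approach; the function names, the 26-hour allowance for a nightly job, and the `notify` callback are hypothetical stand-ins for whatever mechanism your specification names.

```python
from datetime import datetime, timedelta


def index_job_is_stale(last_success, now, max_age=timedelta(hours=26)):
    """True when the last successful run is older than max_age, or never happened."""
    return last_success is None or now - last_success > max_age


def check_and_notify(last_success, now, notify):
    """Call notify(message) if the job looks stale; return whether it fired."""
    if index_job_is_stale(last_success, now):
        notify(f"Index job has no successful run since {last_success}")
        return True
    return False


sent = []
fired = check_and_notify(
    last_success=datetime(2024, 1, 1, 2, 0),
    now=datetime(2024, 1, 3, 9, 0),
    notify=sent.append,   # stand-in for the real notification mechanism
)
print(fired, sent)
```

The check runs independently of the job it watches, so a job that crashes before it can report anything is still detected.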