Most developers think about RAG as an engineering problem — embeddings, vector stores, retrieval algorithms. The engineering matters. But the retrieval layer can only surface what the knowledge base actually contains, and knowledge bases fail in ways that have nothing to do with algorithms.
They fail because notes are written for humans who already know the context, not for a retrieval system that knows nothing. They fail because tags are inconsistent, links are orphaned, and folder hierarchies reflect the day they were created rather than how knowledge actually clusters. They fail because nothing governs what goes in, so everything goes in — and retrieval returns noise.
This module is about Obsidian as a RAG-ready knowledge store. Not Obsidian as a note-taking app — that's a different use case. Obsidian as a structured, governed, retrieval-optimized vault that an AI can actually navigate.
A developer builds a RAG pipeline on top of their Obsidian vault. They've been keeping notes for two years — meeting notes, project documentation, code decisions, architecture reviews, reading notes from technical articles. Thousands of notes. The pipeline is technically solid: the embeddings are good, the vector store is configured correctly, the retrieval algorithm is tuned.
The retrieval quality is terrible.
Queries about architecture decisions return meeting notes from two years ago. Queries about project status return reading notes from unrelated articles. Queries about a specific codebase component return nothing — the notes about that component exist, but they were written as running commentary in daily logs, embedded in date-stamped files under folder names like "2024-03" that tell the retrieval layer nothing about the content.
The problem isn't the retrieval algorithm. The problem is that the vault was built for a human who already knew where everything was. Tags were added inconsistently — some notes have three tags, some have none. Links were made organically — there are 40-note clusters with no links to the rest of the vault. Folder hierarchies reflect the original project structure, which has since been completely reorganized.
When the developer looks at what the retrieval layer actually sees — chunks of 200–500 tokens extracted from note files — they see fragments without context. A chunk that says "decided to use PostgreSQL because of the constraint mentioned above" is useless to a retrieval system that has no idea what the constraint was or which note mentioned it.
The fix isn't a better algorithm. The fix is designing the vault for retrieval from the start.
A retrieval-optimized vault has four properties: self-contained notes, consistent taxonomy, meaningful structure, and governed scope. None of these come naturally from how people take notes. They require deliberate design.
Each note should be understandable without reading the notes around it. When the retrieval layer extracts a chunk, that chunk will be injected into a prompt with no surrounding context. If the note says "see the previous decision" or "as discussed in the meeting," the retrieval layer has nothing to work with.
In practice: Every note begins with a one-sentence context statement. Decisions include the decision, the rationale, and the alternatives considered — in the same note. References to other notes use Obsidian's wiki-link syntax so the retrieval layer can follow the connection if needed.
Tags must be defined before they're used, not added organically as notes accumulate. A tag taxonomy is a controlled vocabulary: a fixed list of tags with defined meanings, used consistently across every note in the vault.
In practice: Keep the taxonomy flat and small. Fifteen to twenty tags covering the primary knowledge domains. Hierarchical tags (e.g., #decision/architecture) are useful for drilling down without exploding the namespace. Every new note is required to use at least one tag from the taxonomy.
Folder hierarchies should reflect the knowledge domain, not the calendar. Date-stamped folders are for journals and logs — they're actively harmful for topic retrieval because they scatter related knowledge across time. A folder called decisions/database is retrievable. A folder called 2024-03 is not.
In practice: Top-level folders represent primary knowledge domains. Notes move to the appropriate domain folder when they're mature enough to reference. Daily logs and meeting notes live in a staging area and are processed into domain folders within 24–48 hours.
Not everything belongs in the RAG vault. Personal notes, unprocessed drafts, and irrelevant reference material are noise that degrades retrieval quality. Scope governance means defining what belongs in the vault — and enforcing it.
The GOVERN function requires defining who is responsible for the AI system's inputs. For a RAG knowledge base, this means: Who decides what goes in the vault? Who reviews notes for quality before they're indexed? Who removes stale or incorrect information? Without GOVERN-level ownership, the knowledge base drifts — and retrieval quality degrades silently.
MAP requires identifying what could go wrong. For a knowledge base, the primary failure modes are: stale knowledge returned as current, out-of-scope content polluting retrieval, and missing context in notes that makes chunks uninterpretable. MAP these risks explicitly in your vault governance policy — don't discover them after the retrieval layer is live.
Art.17 requires quality management systems for AI that makes consequential decisions. If your RAG system informs decisions — code choices, architecture, policy — the knowledge base is part of that quality management system. Document the vault's structure, scope, and update procedures as part of your Art.17 compliance posture.
These four properties aren't constraints on how you take notes. They're design requirements for a system that works.
In the lab, you'll design an Obsidian vault template for a specific knowledge domain. Three decisions must be made before the first note is written.
Top-level folders represent retrieval domains — the primary categories a query might target. Keep it shallow: three levels maximum. More depth means more places a note can hide. Each folder represents a concept, not a time period. Name folders for what they contain, not when they were created.
Define the full tag list before using any of them. Every tag must have a one-sentence definition. Flat tags for categories, hierarchical tags for subtypes. The taxonomy is a contract — once notes are indexed against it, changing tag names breaks retrieval. Define it carefully and change it rarely.
A note template enforces self-containment. At minimum: a one-sentence context statement at the top, a tags field, a created/updated date, and a body that never uses relative references ("see above," "as discussed"). Decisions notes add: the decision, the alternatives considered, and the rationale. The template is the retrieval contract.
You'll design all three in the lab, for a specific knowledge domain of your choice.