Intro · Module 8 of 8

Code Checks, Debug Logging, and Skill Development — Capstone

3 min read

A command center that works is a system that can tell you when it is failing. Code checks, debug logging, and the ability to develop new skills safely are the three capabilities that separate a command center you can trust from one you can only hope is operating correctly.

You have spent seven modules building the layers of that system: the navigation shell (M1), multi-model routing (M2), the build-vs-buy decision for integrations (M3), prompt management (M4), control files (M5), session logging (M6), and oversight checkpoints (M7). Each layer works. The question this module asks is whether the whole system is ready: ready to be handed to someone who didn't build it, ready to survive a skill failure without compounding it through human error, ready to be maintained a year from now.

This capstone asks you to produce a system readiness review: a defensible, specific document that answers four questions drawn from the NIST AI Risk Management Framework. Not a checklist — a claim, with evidence, that your command center is ready for real use. The four NIST functions — GOVERN, MAP, MEASURE, MANAGE — provide the structure. Your prior modules provide the content.

Your artifact — Build Lab · Capstone

A command center system readiness review — code check protocol, debug logging specification, skill development process, and a four-function NIST AI RMF readiness assessment. This artifact synthesizes all prior modules into a single authoritative system document.

By the end of this module, you will:

Design a code check protocol for skills added to the command center
Define debug logging — what gets captured when a skill fails, at what granularity
Document a skill development process — how new skills are added safely
Apply all four NIST AI RMF functions (GOVERN, MAP, MEASURE, MANAGE) to assess system readiness
Produce a system readiness review that could be handed to a new team member as the authoritative description of the command center

Scenario

Can You Hand It Off?

3 min read

A developer has built a command center over several months. It has a navigation layer, multi-model routing, prompt templates, control files, session logging, and automated skill routing with oversight checkpoints. Each layer was added when it was needed. Each one works.

A new developer joins the team. The question: can the system be handed off?

Is there a document that describes how the routing layer decides which model handles which task? Is there a record of which skills exist, what their input contracts are, and what happens when one of them fails? Is there a logging schema that captures enough information to detect a silent failure — an output that looks correct but isn't — within 48 hours? If a skill starts misbehaving in production right now, who gets notified, what gets disabled, and how long until a human reviews the outputs?

The answer, as built, is "mostly." The routing logic is in code but not documented as decisions. The code checks for new skills are informal — whoever wrote the skill ran it against a few test cases and it passed. The debug logging is inconsistent across skills: some emit structured error data, some emit nothing, one emits raw exception traces that include API keys. The skill development process is "ask the person who built it."

The new developer spends three days trying to understand the system before writing a line of code. They break a routing rule on day two because they didn't know it existed. They can't tell whether the session logs from M6 would catch a fact-check failure of the kind described in M7, because no one tested that scenario.

The capstone is writing the document that makes the handoff possible — and then defending it against four specific questions.

Lesson

The Three Mechanisms of Maintainability

5 min read

A command center is ready when someone who didn't build it can maintain it. Code checks, debug logging, and a skill development process are the three mechanisms that make this possible. Each one addresses a different failure mode. Together, they are what separates infrastructure from a personal tool that happens to work.

Code check protocol

What gets checked when a new skill is added

A code check protocol is not a full QA suite. It is a minimum viable check that catches the most common failure modes before a skill reaches user-facing flows. The four elements:

(1) Input validation: does the skill reject malformed or out-of-scope inputs gracefully rather than silently failing?
(2) Output format compliance: does the output match the contract specified in the agents.md control file (M5)?
(3) Error handling: does the skill surface a structured error when it fails, or does it crash and emit nothing?
(4) Integration test against the routing layer: does the routing logic from M2 correctly dispatch to this skill given its declared trigger conditions, and correctly route away from it when conditions are not met?

As a rule of thumb, these four checks take about fifteen minutes and catch roughly 80% of integration failures before they reach production.
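The four checks can be sketched as a small harness. Everything named here is a hypothetical stand-in rather than the command center's actual interface: the `summarize` skill, the `route` function, the `SkillError` type, and the fields in `OUTPUT_CONTRACT` are assumptions chosen only to make the checks concrete.

```python
from dataclasses import dataclass


@dataclass
class SkillError:
    """Structured error a skill returns instead of crashing (assumed shape)."""
    code: str
    message: str


def summarize(payload):
    """Hypothetical skill: returns a dict matching its contract, or a SkillError."""
    if not isinstance(payload, dict) or not isinstance(payload.get("text"), str):
        return SkillError("bad_input", "expected {'text': str}")
    if not payload["text"].strip():
        return SkillError("empty_input", "text is empty")
    return {"summary": payload["text"][:80], "source_chars": len(payload["text"])}


def route(task_type):
    """Hypothetical routing rule: only 'summarize' tasks dispatch to this skill."""
    return summarize if task_type == "summarize" else None


# Assumed output contract: field name -> required type.
OUTPUT_CONTRACT = {"summary": str, "source_chars": int}


def run_code_checks():
    results = {}
    # (1) Input validation: malformed input yields a structured error, not a crash.
    results["input_validation"] = isinstance(summarize({"text": 42}), SkillError)
    # (2) Output format compliance: every contract field is present with the right type.
    out = summarize({"text": "Quarterly revenue rose."})
    results["output_format"] = isinstance(out, dict) and all(
        isinstance(out.get(k), t) for k, t in OUTPUT_CONTRACT.items()
    )
    # (3) Error handling: a failure carries a non-empty code and message.
    err = summarize({"text": "   "})
    results["error_handling"] = (
        isinstance(err, SkillError) and bool(err.code) and bool(err.message)
    )
    # (4) Routing: dispatches on the trigger condition and routes away otherwise.
    results["routing"] = route("summarize") is summarize and route("translate") is None
    return results
```

Calling `run_code_checks()` returns one boolean per check, so a failing check is identified by name rather than by a stack trace.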

Debug logging

Distinct from session logging (M6)

Session logging (M6) captures what the user sent, what the system returned, and metadata for compliance and retrieval. Debug logging captures what happened inside a skill execution — which branch was taken, what intermediate values were, where execution stopped. The two are different in purpose, retention, and audience. Session logs are for users and compliance. Debug logs are for engineers.

Operational, not compliance

Debug logs are operational: their purpose is to let an engineer reconstruct a failure from the log alone, without needing to reproduce it. Retention should be short — 14 to 30 days — because the volume is high and the purpose is diagnosis, not record-keeping. Granularity should be high during development and reduced in production: log branch decisions and error states always; log intermediate values only in development or behind a debug flag. Never log raw API responses that may contain user data or credentials.
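A minimal sketch of these rules, assuming a JSON-lines debug log. The record fields, the `debug` flag, and the list of secret-looking key names are illustrative assumptions; a real deployment would maintain its own redaction list.

```python
import json
import re

# Assumed key names that indicate a secret; extend for your own system.
SECRET_KEYS = {"api_key", "token", "password", "secret"}
SECRET_PATTERN = re.compile(
    r"\b(api[_-]?key|token|password|secret)(\s*[=:]\s*)(\S+)", re.IGNORECASE
)


def redact(text):
    """Replace the value after a secret-looking key with a placeholder."""
    return SECRET_PATTERN.sub(r"\1\2[REDACTED]", text)


def scrub(value):
    """Recursively redact secrets in dicts, lists, and strings."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SECRET_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return redact(value)
    return value


def debug_record(skill, event, detail="", intermediate=None, debug=False):
    """Build one structured debug log line as JSON."""
    # Branch decisions and error states are always logged.
    record = {"skill": skill, "event": event, "detail": redact(detail)}
    # Intermediate values are logged only behind the debug flag.
    if debug and intermediate is not None:
        record["intermediate"] = scrub(intermediate)
    return json.dumps(record)
```

With `debug=False` (the assumed production default) the same call drops the `intermediate` field entirely, which keeps volume down and limits what can leak.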

Skill development process

Four gates before a skill reaches user-facing flows

(1) Define the input/output contract first. Before writing any implementation, specify what the skill accepts, what it returns, and what it does on error. This contract goes into agents.md (M5) before the code exists.
(2) Test against the routing layer in isolation. Confirm that the routing logic dispatches to this skill correctly and that the skill returns the expected output format. Do not test in production flows.
(3) Review against the oversight policy (M7). Does this skill's output type require a human checkpoint? Is it high-stakes enough to trigger the fact-check behavior defined in M7? If yes, configure the checkpoint before enabling.
(4) Add to agents.md with delegation criteria. The control file (M5) is the authoritative registry. A skill does not exist officially until it is documented there with its trigger conditions, input contract, output contract, and oversight classification.
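One way to make the gate ordering enforceable rather than aspirational is to track it explicitly. The sketch below is an assumption about how that could look; the `Gate` names and the `SkillCandidate` class are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Gate(Enum):
    CONTRACT_DEFINED = 1    # input/output/error contract written into agents.md
    ROUTING_TESTED = 2      # dispatch and output format confirmed in isolation
    OVERSIGHT_REVIEWED = 3  # checkpoint need assessed against the M7 policy
    REGISTERED = 4          # agents.md entry completed with delegation criteria


@dataclass
class SkillCandidate:
    name: str
    passed: set = field(default_factory=set)

    def pass_gate(self, gate):
        """Record a gate as passed; refuse if an earlier gate was skipped."""
        skipped = [g.name for g in Gate if g.value < gate.value and g not in self.passed]
        if skipped:
            raise ValueError(f"{self.name}: cannot pass {gate.name} before {skipped}")
        self.passed.add(gate)

    def ready_for_users(self):
        """A skill reaches user-facing flows only after all four gates."""
        return self.passed == set(Gate)
```

Attempting `pass_gate(Gate.ROUTING_TESTED)` on a fresh candidate raises `ValueError`, which mirrors the rule that the contract exists before the code is tested.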

NIST AI RMF — four-function synthesis

GOVERN — are the decision rules documented?

Governance asks: who can change this system, and do they know the rules? For a command center, governance means the routing logic (M2) is documented as decisions, not just code; the control files (M5) are the authoritative source of truth; and there is a change process — even an informal one — for modifying routing rules or skill configurations. If the rules exist only in the mind of the person who built the system, it is ungoverned.

MAP — is the skill topology catalogued?

Mapping asks: what does this system do, and what are all the ways it can interact with itself? For a command center, MAP means every skill in agents.md (M5) has its input contract, output contract, and routing condition documented. Skill interactions — what calls what — are specified. The routing graph from M2 can be traced end-to-end for any input type. A system that cannot be mapped cannot be debugged.

MEASURE — does the logging give you detection within 48 hours?

Measurement asks: do you have the data to detect a failure before it causes harm? For a command center, MEASURE means the session logs (M6) and debug logs together provide enough signal to detect a silent failure — an output that looks correct but isn't, like the fact-check scenario in M7 — within 48 hours. "We have logging" is not a MEASURE answer. "We capture token counts, error flags, output previews, and model-id per session, with a 90-day retention policy and a daily anomaly check on error flag rate" is a MEASURE answer.
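The "daily anomaly check on error flag rate" in that answer could be as small as the sketch below. The session record shape (`day`, `error_flag`) and the thresholds (`sigmas`, `floor`) are assumptions for illustration, not values prescribed by the M6 schema.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_error_rates(sessions):
    """Compute error-flag rate per day from session log rows.

    Each row is assumed to be a dict with 'day' (str) and 'error_flag' (bool).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in sessions:
        totals[row["day"]] += 1
        if row["error_flag"]:
            errors[row["day"]] += 1
    return {day: errors[day] / totals[day] for day in sorted(totals)}


def is_anomalous(history, today_rate, sigmas=3.0, floor=0.02):
    """Flag today if its rate exceeds the baseline mean by a clear margin.

    The margin is `sigmas` standard deviations of the history, but never
    smaller than `floor`, so a very flat history does not trigger on noise.
    """
    baseline = mean(history)
    margin = max(sigmas * pstdev(history), floor)
    return today_rate > baseline + margin
```

Run once a day against the previous days' rates, this gives a detection window of roughly 24 hours for failures that raise the error flag; silent failures need the additional signals discussed under the MEASURE readiness question.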

MANAGE — is the incident response procedure defined?

Management asks: if something goes wrong right now, what happens? For a command center, MANAGE means there is a defined response procedure for a skill failure: who gets notified, what gets disabled (individual skill or full routing layer?), how long until a human reviews affected outputs, and what the recovery criterion is. If you can't describe this procedure in under 60 seconds, risk management is incomplete.

Governance standards

NIST GOVERN — system documentation as governance

A system is governed when its decision rules are documented and accessible. Tribal knowledge is not governance.

NIST MAP — skill topology documented

The agents.md control file (M5) is the map. It must be complete — every skill, every contract, every routing condition.

NIST MEASURE — debug and session logging as measurement

Two logging layers (session logs + debug logs) serve two purposes (compliance + diagnosis). Both are required for full measurement coverage.

NIST MANAGE — code checks and oversight as risk management

The code check protocol prevents failures from entering production. The oversight policy (M7) catches failures that occur in production. Together they are the risk management layer.

O*NET Active Learning (4.A.6.b)

The skill development process is a continuous learning procedure: every new skill added to the command center is also a decision about what the system can do and who is responsible for it. Documenting that process is what makes learning transferable beyond the original builder.

Context

Four Readiness Questions

3 min read

This section applies the four NIST AI RMF functions as a capstone review. Each question is specific. Each one has a failure mode that looks like a passing answer.

1 — GOVERN: can someone new change it safely?

The tribal knowledge trap

If a new developer joined your team tomorrow and needed to change a routing rule, what document would they read first? If the answer is "they'd ask me" or "it's in the code," governance is incomplete. A governed system has a document — not code comments, not a Slack thread, but a maintained document — that describes the routing decisions (M2), the control file conventions (M5), and the change process for each. The document doesn't need to be long. It needs to exist and be current. "It's documented somewhere" does not pass the GOVERN check.

2 — MAP: is the skill topology actually catalogued?

The understood-but-not-documented trap

Every skill in the command center was built by someone who understood what it did. That understanding does not constitute a map. MAP is complete only when agents.md (M5) contains, for every skill: the input contract, the output contract, the routing condition, and the oversight classification. Test: close your eyes and describe one skill's input/output contract from memory. Then open agents.md and check. If there is any gap — if the file is missing a skill, or a contract is described loosely — MAP is incomplete. A system that cannot be fully mapped from its control files cannot be debugged by someone who wasn't there when it was built.
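The completeness test can also be run mechanically. This sketch assumes agents.md has already been parsed into a dictionary keyed by skill name; the parsing step, the field names, and the list of deployed skills are all hypothetical.

```python
REQUIRED_FIELDS = (
    "input_contract",
    "output_contract",
    "routing_condition",
    "oversight_classification",
)


def audit_catalogue(catalogue, deployed_skills):
    """Return {skill: [problems]} for every deployed skill with a MAP gap.

    catalogue: dict of skill name -> dict of fields (assumed parsed from agents.md).
    deployed_skills: names of skills that actually exist in the system.
    """
    gaps = {}
    for name in sorted(deployed_skills):
        entry = catalogue.get(name)
        if entry is None:
            gaps[name] = ["not catalogued"]
            continue
        # A field counts as present only if it is a non-blank string.
        missing = [
            f for f in REQUIRED_FIELDS
            if not (isinstance(entry.get(f), str) and entry[f].strip())
        ]
        if missing:
            gaps[name] = missing
    return gaps
```

An empty result means every deployed skill has all four fields filled in. It does not show that the contracts are described precisely, so the from-memory comparison above is still worth doing.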

3 — MEASURE: can you detect a silent failure in 48 hours?

The "we have logging" trap

The existence of logging is not measurement. The question is whether the data you log is sufficient to detect the specific failure mode you care about: the silent failure, an output that looks correct but contains a factual error or policy violation (the scenario from M7). For detection within 48 hours, the logging schema (M6) must capture: output previews or content hashes that can be spot-checked, error flags for model refusals and confidence drops, token counts as a proxy for anomalous outputs (an unusually short output from a fact-check skill is a signal), and the model ID and timestamp for every session. If your logging schema from M6 cannot support this detection, MEASURE is incomplete regardless of how much data you are storing.
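The token-count proxy can be illustrated with a short sketch. The row fields (`session_id`, `skill`, `output_tokens`), the 25% ratio, and the minimum sample size are assumptions; the point is only that the check is possible when the schema captures token counts per session.

```python
from statistics import median


def short_output_sessions(sessions, skill, ratio=0.25, min_samples=5):
    """Return ids of sessions whose output is far shorter than typical for a skill.

    Each row is assumed to be a dict with 'session_id', 'skill', and
    'output_tokens'. A session is flagged when its token count is below
    `ratio` times the skill's median output length.
    """
    rows = [s for s in sessions if s["skill"] == skill]
    if len(rows) < min_samples:
        return []  # too few samples to form a baseline
    typical = median(s["output_tokens"] for s in rows)
    return [s["session_id"] for s in rows if s["output_tokens"] < ratio * typical]
```

Flagged sessions are candidates for a human spot-check against the stored output preview, not proof of failure.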

4 — MANAGE: what is the response procedure right now?

The "we'd figure it out" trap

If a skill starts failing in production right now — not returning errors, just returning low-quality outputs that pass format checks — what happens? Who gets notified? How? What gets disabled, and by what mechanism? How long until a human reviews outputs from the last 24 hours? If the answer involves figuring it out at the time, MANAGE is incomplete. The response procedure does not need to be elaborate. It needs to be decided in advance and written down. A simple procedure, written and accessible, is more valuable than a sophisticated procedure that exists only as intention. The test: describe the procedure in under 60 seconds without consulting notes. If you can't, it hasn't been decided yet.
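Writing the procedure down as data makes it easy to see which parts are still undecided. The structure below is a hypothetical sketch: the field names follow the questions above, and every value shown is a placeholder, not a recommendation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IncidentProcedure:
    notify: str                 # who gets notified
    channel: str                # how they are notified
    disable_scope: str          # "skill" or "routing_layer"
    disable_mechanism: str      # how the disable is carried out
    review_window_hours: int    # how far back outputs are reviewed
    review_deadline_hours: int  # how soon a human must start the review
    recovery_criterion: str     # what must be true before re-enabling

    def undecided(self):
        """Return the names of fields that are still blank or zero."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; every entry is a placeholder to be replaced.
procedure = IncidentProcedure(
    notify="on-call engineer",
    channel="team pager",
    disable_scope="skill",
    disable_mechanism="flip the skill's (hypothetical) enabled flag",
    review_window_hours=24,
    review_deadline_hours=0,
    recovery_criterion="",
)
```

Here `procedure.undecided()` reports the review deadline and the recovery criterion as still open, which is the written-down equivalent of failing the 60-second test.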

In the lab, you will work through all four readiness questions for your own command center — real or hypothetical. The AI will hold you to specificity. "It's documented somewhere" does not pass GOVERN. "We have logging" does not pass MEASURE. The standard is evidence, not assertion.

◆ Build Lab · Capstone

System Readiness Review

~25–40 minutes · 4 exchanges to complete

Capstone synthesis

This lab draws on all prior modules. You are producing a system readiness review that integrates navigation (M1), routing (M2), build vs buy (M3), prompt management (M4), control files (M5), session logging (M6), and oversight (M7) into a single authoritative document.

Your role

🏗️

System Architect

Conducting a readiness review of your command center. You will work through all four NIST functions in sequence, providing specific evidence for each claim. You may use a real or hypothetical command center; the standard of specificity is the same either way.

AI role

🔎

Readiness Reviewer

Works through the four NIST functions in order. Asks for specific evidence for each claim. Does not accept vague answers: "we have logging" fails MEASURE; a logging schema with retention policy, captured fields, and detection scenario passes.

Framework reminders

GOVERN: Can someone new change it safely? Is the routing logic (M2) documented as decisions, not just code?

MAP: Is every skill listed in agents.md (M5) with its input contract, output contract, and routing condition?

MEASURE: Does the logging schema (M6) detect a silent failure within 48 hours?

MANAGE: Describe the incident response procedure in under 60 seconds: notification, disable, human review timeline.

How to complete

Work through each NIST function with specific answers. The reviewer will push back on assertions without evidence. Lab completes after 4 substantive exchanges covering all four functions.
