A command center that doesn't log is a command center that can't learn from itself. Session logs are the measurement layer — they tell you what the model was asked, what it returned, how long it took, and how much it cost. Without this, debugging a behavior regression means guessing. With it, you have a timeline.
This module is about designing a logging schema that's actually useful: not a firehose of raw text, but a structured record that supports retrieval, cost analysis, and behavioral auditing. The difference between logging that works and logging that accumulates is almost entirely about schema design. A poorly designed schema gives you storage. A well-designed one gives you measurement.
There is also a governance dimension. Logging isn't only for debugging; it's also a compliance function. If your command center produces AI outputs that affect users, you may be legally required to retain records of those outputs. A schema that doesn't distinguish operational logs from compliance records leaves the two tangled, making both harder to manage. This module separates them.
A developer's command center has been running for four months. The system handles document synthesis, research summarization, and report drafting for a small professional services firm. It's working well — until a client contacts the developer with a complaint.
Six weeks ago, a report generated by the command center contained a hallucinated citation. The citation appeared authoritative, was included in a client deliverable, and was only caught when a reviewer tried to locate the source document. The client wants to know: was this a one-time failure, or has this happened before?
The developer opens the command center to investigate. There are no logs. There is no record of what was sent to the model, what the model returned, which version of the system prompt was active, or how long the session took. The developer has only the client's description of the output and an approximate date.
The investigation takes three days. The developer checks version control for system prompt changes, reviews the client's deliverable for clues about which session might have produced it, and tries to recreate the conditions. Nothing is conclusive. The developer cannot tell whether the hallucination was a single anomaly or part of a pattern. They cannot tell whether it's been fixed. They cannot tell whether other outputs in the same period have the same problem.
The developer's conclusion: they need logging. But when they look at what to log, they face a secondary problem — they don't know what schema would have made this investigation solvable. Raw conversation text would have been too large to search and might have contained PII. A minimal log with just timestamps and model names wouldn't have captured the output. The right answer is somewhere between those, and they don't know where.
This module is about finding that answer before the incident, not after.
The question isn't whether to log — it's what schema makes the logs useful versus just large. A schema that captures everything is a liability. A schema that captures the wrong things is a waste. The goal is the minimum set of fields that makes every query you'll actually run answerable.
For a command center session, the following fields form the baseline. Each field is justified by at least one retrieval use case.
| Field | Why it's in the schema |
|---|---|
| session_id | Unique identifier for cross-referencing logs, error reports, and client deliverables |
| timestamp_start | Locates sessions in time for incident investigation and cost period attribution |
| timestamp_end | Enables latency calculation; required for per-session cost analysis |
| task_type | Identifies which routing tier this session used — enables per-task-type pattern analysis |
| model_used | Not just "the default" — the exact model string, so regressions tied to model versions are traceable |
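| input_hash | A hash of the input, not the raw input. Identifies exact duplicate queries without storing PII; near-duplicates hash differently unless the input is normalized (for example, whitespace and case) before hashing |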
| input_preview | First 200 characters of the input, scrubbed of never-log content; enough to recognize the session without storing sensitive content |
| output_preview | First 400 characters of the output — enough to detect hallucination patterns and check for anomalies |
| token_count_input | Cost attribution — input tokens at per-model rates |
| token_count_output | Cost attribution — output tokens at per-model rates |
| error_flag | Boolean — did this session produce an error, timeout, or retry? Enables filtering anomalous sessions |
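As a minimal sketch, the baseline fields in the table can be expressed as a typed record. The field names come from the table; the `SessionLog` class, the `build_record` helper, the SHA-256 choice, and the sample values (including the model string `example-model-v1`) are illustrative assumptions, not a prescribed implementation.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class SessionLog:
    """One record per session; field names follow the table above."""
    session_id: str
    timestamp_start: str    # ISO 8601, UTC
    timestamp_end: str      # ISO 8601, UTC
    task_type: str
    model_used: str         # exact model string, never "the default"
    input_hash: str         # SHA-256 hex digest (assumed hash choice)
    input_preview: str      # first 200 characters
    output_preview: str     # first 400 characters
    token_count_input: int
    token_count_output: int
    error_flag: bool


def build_record(session_id, start, end, task_type, model_used,
                 raw_input, raw_output, tokens_in, tokens_out, error=False):
    """Derive the stored fields; the full raw input and output are not kept."""
    return SessionLog(
        session_id=session_id,
        timestamp_start=start.isoformat(),
        timestamp_end=end.isoformat(),
        task_type=task_type,
        model_used=model_used,
        input_hash=hashlib.sha256(raw_input.encode("utf-8")).hexdigest(),
        input_preview=raw_input[:200],
        output_preview=raw_output[:400],
        token_count_input=tokens_in,
        token_count_output=tokens_out,
        error_flag=error,
    )


# Hypothetical session values, for illustration only.
record = build_record(
    "sess-0001",
    datetime(2025, 3, 4, 9, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 4, 9, 0, 12, tzinfo=timezone.utc),
    "synthesis", "example-model-v1",
    "Summarize the attached quarterly filing.", "The filing reports ...",
    tokens_in=812, tokens_out=240,
)
line = json.dumps(asdict(record))  # one JSON object per line (JSON Lines)
```

Keeping the derivation in one helper means the raw text never reaches the storage layer: only the hash and the truncated previews leave `build_record`.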
Operational logs are used to debug recent behavior: why did this session take so long, why did this batch fail, what was the model doing last Tuesday? Useful life is 30–90 days. After that, the specific session data isn't needed for operational purposes. These logs can be compressed or deleted after the retention window. Access control: the engineering team.
Governance logs document AI outputs that affected users: decisions, deliverables, recommendations. The EU AI Act sets record-keeping (Article 12) and transparency (Article 13) obligations for high-risk AI systems, and this tier exists to meet requirements of that kind. If your command center produces outputs that are delivered to clients or used in decisions, those outputs and their context must be retained for the duration required by applicable law, often years. Access control: broader, may include legal and compliance teams.
Structured formats (JSON Lines, append-only flat files with a fixed record layout, or a log database) are retrievable. Raw conversation text is legible but hard to query: you can read it, but you can't reliably filter it by field. Governance logs especially must be structured: if you're ever required to produce records in response to a complaint or audit, you need to be able to answer queries like "all sessions in March 2025 that used model X and produced outputs for client Y." Raw text can't answer that. JSON Lines can, provided the governance record carries a client identifier alongside the baseline fields.
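To illustrate the audit-style query quoted above, here is a small sketch that writes records as JSON Lines and filters them by month, model, and client. The `client_id` field is an assumption (it is not among the baseline fields), as are the helper names, the model strings, and the sample records; an in-memory stream stands in for a real log file.

```python
import io
import json

# Hypothetical governance records; client_id is an assumed extra field.
RECORDS = [
    {"session_id": "s1", "timestamp_start": "2025-03-04T09:00:00+00:00",
     "model_used": "example-model-v1", "client_id": "client-y"},
    {"session_id": "s2", "timestamp_start": "2025-03-18T14:30:00+00:00",
     "model_used": "example-model-v2", "client_id": "client-y"},
    {"session_id": "s3", "timestamp_start": "2025-04-02T08:15:00+00:00",
     "model_used": "example-model-v1", "client_id": "client-y"},
    {"session_id": "s4", "timestamp_start": "2025-03-09T11:00:00+00:00",
     "model_used": "example-model-v1", "client_id": "client-z"},
]


def write_jsonl(stream, records):
    """Append one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_jsonl(stream):
    """Yield one dict per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def find_sessions(records, month, model_used, client_id):
    """Filter by UTC month prefix (YYYY-MM), exact model string, and client."""
    return [
        r["session_id"] for r in records
        if r["timestamp_start"].startswith(month)
        and r["model_used"] == model_used
        and r.get("client_id") == client_id
    ]


log_file = io.StringIO()          # stands in for an append-only .jsonl file
write_jsonl(log_file, RECORDS)
log_file.seek(0)
matches = find_sessions(read_jsonl(log_file), "2025-03",
                        "example-model-v1", "client-y")
print(matches)
```

With the sample data only session `s1` satisfies all three conditions. The month filter relies on timestamps being stored as ISO 8601 strings in a single time zone, which is one reason to fix that convention in the schema.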
The MEASURE function of the NIST AI Risk Management Framework (AI RMF) calls for defined metrics and a means of collecting them. Session logs are the collection mechanism. Without a schema, you have no measurement; you have storage. MEASURE asks: what are you measuring, how are you collecting it, and how do you know the measurement is reliable? Apply this to your logging schema: for each field, identify what metric it enables. If a field doesn't enable a metric, it's overhead.
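One way to check that fields actually enable metrics is to compute the metrics from a record. The sketch below derives latency, per-session cost, and error rate from the baseline fields. The per-model prices in `RATES_PER_1K` are invented placeholders, and the function names are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical USD prices per 1,000 tokens; substitute your provider's rates.
RATES_PER_1K = {
    "example-model-v1": {"input": 0.50, "output": 1.50},
}


def latency_seconds(record):
    """timestamp_start and timestamp_end together enable the latency metric."""
    start = datetime.fromisoformat(record["timestamp_start"])
    end = datetime.fromisoformat(record["timestamp_end"])
    return (end - start).total_seconds()


def session_cost(record, rates=RATES_PER_1K):
    """Token counts plus model_used enable per-session cost attribution."""
    rate = rates[record["model_used"]]
    return (record["token_count_input"] / 1000 * rate["input"]
            + record["token_count_output"] / 1000 * rate["output"])


def error_rate(records):
    """error_flag enables the share of sessions that errored or retried."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["error_flag"]) / len(records)


sample = {
    "timestamp_start": "2025-03-04T09:00:00+00:00",
    "timestamp_end": "2025-03-04T09:00:12+00:00",
    "model_used": "example-model-v1",
    "token_count_input": 2000,
    "token_count_output": 1000,
    "error_flag": False,
}
print(latency_seconds(sample), session_cost(sample), error_rate([sample]))
```

If a proposed field never appears in a function like these, that is a sign it may not be earning its place in the schema.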
The NIST AI RMF's GOVERN function asks: what policies govern the retention, access, and disposal of this data? A logging schema without a retention policy is incomplete. The policy must specify: how long operational logs are kept, how long governance logs are kept, who can access each type, and what the disposal process is. These are governance decisions, not engineering decisions, so they should be made before the schema is implemented, not after.
The EU AI Act's Article 12 (record-keeping) and Article 13 (transparency) require high-risk AI systems to record events automatically and to be documented well enough that their outputs can be interpreted. If your command center's client-facing outputs fall within that scope, you need records sufficient to explain what the system was asked and what it returned. The governance log tier in your schema is the documentation layer that serves this requirement. The operational log tier doesn't: it's too short-lived and too focused on debugging.
Critical thinking in this context means treating diagnosis as a matter of evidence, and logs are the evidence. Without logs, the developer in the scenario cannot diagnose whether the hallucination was isolated or systematic. With a well-designed schema, the investigation reduces to a query: filter sessions in the relevant time window, filter by task_type, review the output_preview values for the pattern. Provided the pattern shows up within the previewed text, the schema turns a three-day investigation into a ten-minute query.
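The investigation described above can be sketched as a three-step filter: time window, task type, then a pattern search over the output previews. The `CITATION` regular expression is a deliberately rough, hypothetical heuristic for parenthetical citations, and the records, task-type names, and date window are made up for the example.

```python
import re
from datetime import datetime, timezone

# Rough heuristic for "(Author et al., 2019)" or "(Author 2020)"; an assumption.
CITATION = re.compile(r"\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)")


def candidate_sessions(records, start, end, task_type, pattern=CITATION):
    """Return session_ids in [start, end) of one task type whose preview matches."""
    hits = []
    for r in records:
        started = datetime.fromisoformat(r["timestamp_start"])
        if not (start <= started < end):
            continue
        if r["task_type"] != task_type:
            continue
        if pattern.search(r["output_preview"]):
            hits.append(r["session_id"])
    return hits


# Hypothetical log records.
LOGS = [
    {"session_id": "a1", "timestamp_start": "2025-03-10T10:00:00+00:00",
     "task_type": "report_drafting",
     "output_preview": "As shown by (Harlow et al., 2019), revenue grew."},
    {"session_id": "a2", "timestamp_start": "2025-03-11T10:00:00+00:00",
     "task_type": "report_drafting",
     "output_preview": "Revenue grew in the third quarter."},
    {"session_id": "a3", "timestamp_start": "2025-03-12T10:00:00+00:00",
     "task_type": "summarization",
     "output_preview": "The study (Smith 2020) reports a decline."},
    {"session_id": "a4", "timestamp_start": "2025-05-01T10:00:00+00:00",
     "task_type": "report_drafting",
     "output_preview": "See (Nguyen et al., 2021) for details."},
]

window_start = datetime(2025, 3, 1, tzinfo=timezone.utc)
window_end = datetime(2025, 4, 1, tzinfo=timezone.utc)
suspects = candidate_sessions(LOGS, window_start, window_end, "report_drafting")
print(suspects)
```

The query returns candidates for human review, not verdicts: a matching preview still has to be checked against the real source. It also sees only the first 400 characters of each output, so a citation later in the document would not surface here.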
Most logging schemas are designed forward: someone lists what seems useful and adds fields until it feels complete. The result is a schema that captures everything except the things you'll actually need. The three decisions below force a different approach: design backward from the queries, the retention cliff, and the never-log list. All three apply directly in the lab.
Before you write a single field, write down the two or three queries you will realistically need to run against your logs. Be specific. "Find all sessions in March 2025 that used the synthesis task type and had an error flag" is a query. "Review old logs" is not. Once you have the queries, identify which fields each query requires. If a field isn't required by any query, it's speculative — and speculative fields are overhead that adds storage cost and complicates schema evolution. Writing down the queries before the schema exposes what you actually need to capture, and — equally important — what you don't.
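Designing backward from queries can be made mechanical with a small check: list each query with the fields it needs, then compare against the proposed schema. The queries, the proposed field set, and the extra `user_agent` field below are hypothetical examples chosen to show both failure modes.

```python
# Each query you expect to run, mapped to the fields it requires (hypothetical).
QUERIES = {
    "Sessions in March 2025 with task_type synthesis and an error flag":
        {"timestamp_start", "task_type", "error_flag"},
    "Monthly cost per model":
        {"timestamp_start", "model_used",
         "token_count_input", "token_count_output"},
    "Which session produced a given deliverable, and what did it say?":
        {"session_id", "output_preview"},
}

# A proposed schema with one extra field and one omission (hypothetical).
PROPOSED_SCHEMA = {
    "session_id", "timestamp_start", "task_type", "model_used",
    "token_count_input", "token_count_output", "error_flag", "user_agent",
}


def required_fields(queries):
    """Union of every field that at least one query needs."""
    needed = set()
    for fields in queries.values():
        needed |= fields
    return needed


def speculative_fields(schema, queries):
    """Fields in the schema that no query uses."""
    return sorted(schema - required_fields(queries))


def missing_fields(schema, queries):
    """Fields some query needs that the schema does not capture."""
    return sorted(required_fields(queries) - schema)


print("speculative:", speculative_fields(PROPOSED_SCHEMA, QUERIES))
print("missing:", missing_fields(PROPOSED_SCHEMA, QUERIES))
```

In this example `user_agent` is speculative and `output_preview` is missing, which is the more expensive mistake: a field you did not capture cannot be recovered after the incident.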
Operational logs have a short useful life: 30 to 90 days. After that, the specific session data isn't needed for debugging; patterns have either been addressed or become invisible. Governance logs may need to be kept for years: the EU AI Act's record-keeping provisions (Article 12) and analogous frameworks call for documentation of AI outputs that affect users for the duration of their potential legal relevance. Define both periods explicitly. Don't treat all logs the same: a single retention policy that applies uniformly either keeps operational data too long (storage cost, privacy liability) or deletes governance data too soon (compliance risk). The retention cliff is the boundary between those two regimes, and you need to know where it is.
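A two-tier retention policy can be sketched as a lookup from log tier to retention period, plus a function that separates records to keep from records due for disposal. The `log_tier` field, the 90-day operational window (the upper end of the range above), and the six-year governance figure are assumptions for illustration; the governance period in particular is a placeholder to be set by whoever owns the legal requirement.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention periods; the governance value is a placeholder, not advice.
RETENTION = {
    "operational": timedelta(days=90),
    "governance": timedelta(days=365 * 6),
}


def is_expired(record, now, retention=RETENTION):
    """True when the record is older than its tier's retention period."""
    started = datetime.fromisoformat(record["timestamp_start"])
    return now - started > retention[record["log_tier"]]


def partition(records, now, retention=RETENTION):
    """Split records into (keep, dispose) according to their tier."""
    keep, dispose = [], []
    for r in records:
        (dispose if is_expired(r, now, retention) else keep).append(r)
    return keep, dispose


# Hypothetical records; log_tier is an assumed field marking the tier.
LOGS = [
    {"session_id": "op-old", "log_tier": "operational",
     "timestamp_start": "2025-01-01T00:00:00+00:00"},
    {"session_id": "op-recent", "log_tier": "operational",
     "timestamp_start": "2025-05-01T00:00:00+00:00"},
    {"session_id": "gov-old", "log_tier": "governance",
     "timestamp_start": "2023-01-01T00:00:00+00:00"},
]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
keep, dispose = partition(LOGS, now)
print([r["session_id"] for r in keep], [r["session_id"] for r in dispose])
```

Tagging each record with its tier at write time is what makes the cliff enforceable: the disposal job never has to guess which regime a record belongs to.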
Every logging schema needs an explicit never-log list. Candidates include: raw user-supplied text that may contain PII, authentication credentials or API keys that users might paste into prompts, medical or legal content that creates liability if stored, and any content that would be problematic if subpoenaed. The input_hash and input_preview fields in the minimum viable schema are specifically designed to give you retrieval capability without storing the full raw input, but that design only works if the full input is explicitly excluded and the preview itself is scrubbed of never-log content. The never-log list isn't a disclaimer at the bottom of the schema document. It's a constraint that shapes every field definition.
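A never-log list can be enforced in code by redacting before anything is truncated or stored. The sketch below uses three hypothetical patterns (email addresses, a made-up `sk-` API key shape, and US Social Security number formatting); real deployments would need a broader and better-tested set, and pattern matching alone will not catch all PII.

```python
import re

# Hypothetical never-log patterns; extend and test these for your own data.
NEVER_LOG = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[API_KEY]"),   # assumed key shape
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def redact(text):
    """Replace every never-log match with a fixed placeholder."""
    for pattern, label in NEVER_LOG:
        text = pattern.sub(label, text)
    return text


def safe_preview(text, limit=200):
    """Redact first, then truncate, so a match cut at the limit cannot leak."""
    return redact(text)[:limit]


raw = "Contact jane.doe@example.com, key sk-abcdefghijklmnop1234"
print(safe_preview(raw))
```

The order matters: truncating first could cut a credential in half and leave a fragment that no longer matches any pattern but is still sensitive.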
In the lab, the AI will ask you to justify each field you propose, check your retention policy against the operational vs. governance split, and push for at least two example queries before accepting the schema as complete. Start by writing down the queries — the rest of the schema design follows from there.