Intro · Module 6 of 8

Session Logging and Archival

3 min read

A command center that doesn't log is a command center that can't learn from itself. Session logs are the measurement layer — they tell you what the model was asked, what it returned, how long it took, and how much it cost. Without this, debugging a behavior regression means guessing. With it, you have a timeline.

This module is about designing a logging schema that's actually useful: not a firehose of raw text, but a structured record that supports retrieval, cost analysis, and behavioral auditing. The difference between logging that works and logging that accumulates is almost entirely about schema design. A poorly designed schema gives you storage. A well-designed one gives you measurement.

There is also a governance dimension. Logging isn't only for debugging — it's a compliance function. If your command center produces AI outputs that affect users, you may be legally required to retain records of those outputs. Designing the schema without this distinction means your operational logs and your compliance records will be tangled, making both harder to manage. This module separates them.

Your artifact — Skill Lab

A session logging design — schema with field-by-field justification, retention policy, archival format, and at least two example log queries that demonstrate the schema supports real retrieval use cases.

By the end of this module, you will:

Design a session log schema that captures the minimum fields needed for retrieval, cost analysis, and behavioral auditing
Define a retention and archival policy — what gets kept, for how long, and in what format
Apply NIST MEASURE to identify what logging data enables measurement and what doesn't
Write a log query — given a schema, specify how you'd retrieve sessions matching a condition
Distinguish between logging for debugging (operational) and logging for governance (compliance)

Scenario

No Logs, No Answer

3 min read

A developer's command center has been running for four months. The system handles document synthesis, research summarization, and report drafting for a small professional services firm. It's working well — until a client contacts the developer with a complaint.

Six weeks ago, a report generated by the command center contained a hallucinated citation. The citation appeared authoritative, was included in a client deliverable, and was only caught when a reviewer tried to locate the source document. The client wants to know: was this a one-time failure, or has this happened before?

The developer opens the command center to investigate. There are no logs. There is no record of what was sent to the model, what the model returned, which version of the system prompt was active, or how long the session took. The developer has only the client's description of the output and an approximate date.

The investigation takes three days. The developer checks version control for system prompt changes, reviews the client's deliverable for clues about which session might have produced it, and tries to recreate the conditions. Nothing is conclusive. The developer cannot tell whether the hallucination was a single anomaly or part of a pattern. They cannot tell whether it's been fixed. They cannot tell whether other outputs in the same period have the same problem.

The developer's conclusion: they need logging. But when they look at what to log, they face a secondary problem — they don't know what schema would have made this investigation solvable. Raw conversation text would have been too large to search and might have contained PII. A minimal log with just timestamps and model names wouldn't have captured the output. The right answer is somewhere between those, and they don't know where.

This module is about finding that answer before the incident, not after.

Lesson

Schema Is the Difference Between Storage and Measurement

5 min read

The question isn't whether to log — it's what schema makes the logs useful versus just large. A schema that captures everything is a liability. A schema that captures the wrong things is a waste. The goal is the minimum set of fields that makes every query you'll actually run answerable.

The minimum viable logging schema

For a command center session, the following fields form the baseline. Each field is justified by at least one retrieval use case.

Field	Why it's in the schema
session_id	Unique identifier for cross-referencing logs, error reports, and client deliverables
timestamp_start	Locates sessions in time for incident investigation and cost period attribution
timestamp_end	Enables latency calculation; required for per-session cost analysis
task_type	Identifies which routing tier this session used — enables per-task-type pattern analysis
model_used	Not just "the default" — the exact model string, so regressions tied to model versions are traceable
input_hash	A hash of the input, not the raw input. Identifies exact duplicate queries without storing PII (a plain hash will not match near-duplicates; those need a similarity hash)
input_preview	First 200 characters of the input, with PII and credentials redacted. Enough to recognize the session without storing sensitive content
output_preview	First 400 characters of the output — enough to detect hallucination patterns and check for anomalies
token_count_input	Cost attribution — input tokens at per-model rates
token_count_output	Cost attribution — output tokens at per-model rates
error_flag	Boolean — did this session produce an error, timeout, or retry? Enables filtering anomalous sessions
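
The table above can be sketched as a typed record that serializes to one JSON Lines row per session. This is a minimal illustration, not a prescribed implementation: the `SessionLog` class name, the `to_json_line` helper, the ISO 8601 timestamp format, and all sample values are assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SessionLog:
    """One record per session, mirroring the baseline fields in the table."""
    session_id: str
    timestamp_start: str  # ISO 8601 with UTC offset (format is an assumption)
    timestamp_end: str
    task_type: str
    model_used: str
    input_hash: str
    input_preview: str
    output_preview: str
    token_count_input: int
    token_count_output: int
    error_flag: bool


def to_json_line(record: SessionLog) -> str:
    """Serialize one record as a single JSON Lines row."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return json.dumps(asdict(record), ensure_ascii=False)


# Sample values are invented for illustration only.
example = SessionLog(
    session_id="sess-0001",
    timestamp_start="2025-03-04T14:02:11+00:00",
    timestamp_end="2025-03-04T14:02:19+00:00",
    task_type="synthesis",
    model_used="example-model-2025-01",
    input_hash="a3" * 32,
    input_preview="Summarize the attached quarterly report",
    output_preview="The report covers three areas:\nrevenue, costs, and headcount",
    token_count_input=2000,
    token_count_output=500,
    error_flag=False,
)

print(to_json_line(example))
```

Appending one such line per session to a file gives an append-only archive that standard JSON tooling can read back field by field.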

Operational vs. governance retention

Operational logs — short retention, debugging purpose

Operational logs are used to debug recent behavior: why did this session take so long, why did this batch fail, what was the model doing last Tuesday? Useful life is 30–90 days. After that, the specific session data isn't needed for operational purposes. These logs can be compressed or deleted after the retention window. Access control: the engineering team.

Governance logs — long retention, compliance purpose

Governance logs document AI outputs that affected users: decisions, deliverables, recommendations. For high-risk AI systems, the EU AI Act pairs transparency requirements (Article 13) with record-keeping requirements (Article 12). If your command center produces outputs that are delivered to clients or used in decisions, check which obligations apply to you and retain those outputs and their context for the duration required by applicable law, often years. Access control: broader, may include legal and compliance teams.

Archival format — structured beats raw

Structured formats (JSON Lines files written append-only, or a log database) are retrievable. Raw conversation text is legible but hard to query: you can read it and search it for strings, but you can't reliably filter it by field. Governance logs especially must be structured: if you're ever required to produce records in response to a complaint or audit, you need to be able to answer queries like "all sessions in March 2025 that used model X and produced outputs for client Y." (The client filter assumes a client identifier field, which the baseline schema above does not include; add one if you need this query.) Raw text can't answer that. JSON Lines can.
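
A minimal sketch of that kind of query against a JSON Lines archive, using only the standard library. The sample rows, the model names, and the `sessions_in_month` helper are invented for illustration. Filtering by client would additionally require a client identifier field, which is not part of the baseline schema and is left out here.

```python
import json
from datetime import datetime

# Three invented log rows standing in for an archive file.
LOG_LINES = """\
{"session_id": "s1", "timestamp_start": "2025-03-04T14:02:11+00:00", "model_used": "model-x", "error_flag": false}
{"session_id": "s2", "timestamp_start": "2025-03-20T09:15:00+00:00", "model_used": "model-y", "error_flag": false}
{"session_id": "s3", "timestamp_start": "2025-04-01T00:00:05+00:00", "model_used": "model-x", "error_flag": true}
"""


def sessions_in_month(lines, year, month, model):
    """Yield records that started in the given month and used the given model."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the archive
        record = json.loads(line)
        started = datetime.fromisoformat(record["timestamp_start"])
        if (started.year, started.month) == (year, month) and record["model_used"] == model:
            yield record


matches = list(sessions_in_month(LOG_LINES.splitlines(), 2025, 3, "model-x"))
print([r["session_id"] for r in matches])  # prints ['s1']
```

With a real archive, the same function can iterate over an open file object instead of `LOG_LINES.splitlines()`.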

Governance Standards

NIST AI RMF — MEASURE Function

NIST's MEASURE function requires that AI systems have defined metrics and a means of collecting them. Session logs are the collection mechanism. Without a schema, you have no measurement — you have storage. MEASURE asks: what are you measuring, how are you collecting it, and how do you know the measurement is reliable? Apply this to your logging schema: for each field, identify what metric it enables. If a field doesn't enable a metric, it's overhead.
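
One way to run the field-to-metric exercise is to write each metric as a function of the logged fields. The sketch below derives latency, cost, and error rate from baseline schema fields. The `PRICES` table is hypothetical (real per-token rates depend on your provider and model), and the function names are illustrative.

```python
from datetime import datetime

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICES = {"model-x": {"input": 3.00, "output": 15.00}}


def latency_seconds(record):
    """Metric enabled by timestamp_start and timestamp_end."""
    start = datetime.fromisoformat(record["timestamp_start"])
    end = datetime.fromisoformat(record["timestamp_end"])
    return (end - start).total_seconds()


def cost_usd(record):
    """Metric enabled by model_used, token_count_input, and token_count_output."""
    rate = PRICES[record["model_used"]]
    total = (record["token_count_input"] * rate["input"]
             + record["token_count_output"] * rate["output"])
    return total / 1_000_000


def error_rate(records):
    """Metric enabled by error_flag: the share of sessions flagged as errors."""
    return sum(1 for r in records if r["error_flag"]) / len(records)


# Invented sample records for illustration.
records = [
    {"timestamp_start": "2025-03-04T14:02:11+00:00",
     "timestamp_end": "2025-03-04T14:02:19+00:00",
     "model_used": "model-x", "token_count_input": 2000,
     "token_count_output": 500, "error_flag": False},
    {"timestamp_start": "2025-03-04T15:00:00+00:00",
     "timestamp_end": "2025-03-04T15:00:30+00:00",
     "model_used": "model-x", "token_count_input": 1000,
     "token_count_output": 0, "error_flag": True},
]

print(latency_seconds(records[0]), cost_usd(records[0]), error_rate(records))
```

A field that no such function reads is a candidate for removal under the MEASURE test described above.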

NIST AI RMF — GOVERN Function

GOVERN asks: what policies govern the retention, access, and disposal of this data? A logging schema without a retention policy is incomplete. The policy must specify: how long operational logs are kept, how long governance logs are kept, who can access each type, and what the disposal process is. These are governance decisions, not engineering decisions — they should be made before the schema is implemented, not after.

EU AI Act — Article 13, Transparency

Article 13 requires that high-risk AI systems be transparent enough for deployers to interpret and appropriately use their output; the related record-keeping requirement (automatic logging of events) is in Article 12. Whether these apply depends on whether your system is classified as high-risk, so confirm that before relying on this section. If your command center produces client-facing outputs and falls in scope, the governance log tier in your schema is the documentation layer that supports these requirements. The operational log tier doesn't: it's too short-lived and too focused on debugging.

O*NET — Critical Thinking (4.A.4.a)

Critical thinking in this context means: diagnosing a failure requires evidence, and logs are the evidence. Without logs, the developer in the scenario cannot diagnose whether the hallucination was isolated or systematic. With a well-designed schema, the investigation reduces to a query: filter sessions in the relevant time window, filter by task_type, review output_previews for the pattern. The schema turns a three-day investigation into a ten-minute query.
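
The investigation query described above can be sketched as a single filter. The two-week window, the `report_drafting` task type label, and the sample records are assumptions for illustration; the function only narrows the set of sessions, and a person still has to read the previews.

```python
from datetime import datetime, timezone


def incident_candidates(records, window_start, window_end, task_type):
    """Return (session_id, output_preview) pairs for manual review."""
    hits = []
    for r in records:
        started = datetime.fromisoformat(r["timestamp_start"])
        if window_start <= started < window_end and r["task_type"] == task_type:
            hits.append((r["session_id"], r["output_preview"]))
    return hits


# Invented sample records for illustration.
records = [
    {"session_id": "s1", "timestamp_start": "2025-02-12T10:00:00+00:00",
     "task_type": "report_drafting", "output_preview": "According to Smith (2019), ..."},
    {"session_id": "s2", "timestamp_start": "2025-02-13T11:30:00+00:00",
     "task_type": "synthesis", "output_preview": "Key themes across the documents"},
    {"session_id": "s3", "timestamp_start": "2025-03-01T09:00:00+00:00",
     "task_type": "report_drafting", "output_preview": "Executive summary"},
]

window_start = datetime(2025, 2, 10, tzinfo=timezone.utc)
window_end = datetime(2025, 2, 24, tzinfo=timezone.utc)

for session_id, preview in incident_candidates(
        records, window_start, window_end, "report_drafting"):
    print(session_id, preview)
```

Widening the window and rerunning the same filter is how the developer would check whether the failure was isolated or part of a pattern.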

Context

Three Decisions Before You Write a Field

4 min read

Most logging schemas are designed forward — someone lists what seems useful and adds fields until it feels complete. The result is schemas that capture everything except the things you'll actually need. The three decisions below force a different approach: design backward from the queries, the retention cliff, and the never-log list. All three apply directly in the lab.

1 — What are the two or three queries you'll actually run?

Design the schema backward from the queries

Before you write a single field, write down the two or three queries you will realistically need to run against your logs. Be specific. "Find all sessions in March 2025 that used the synthesis task type and had an error flag" is a query. "Review old logs" is not. Once you have the queries, identify which fields each query requires. If a field isn't required by any query, it's speculative — and speculative fields are overhead that adds storage cost and complicates schema evolution. Writing down the queries before the schema exposes what you actually need to capture, and — equally important — what you don't.
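
The query-backward check can be made mechanical: list the fields each query needs, then compare against the schema. In this sketch the two queries and the `audit_schema` helper are illustrative assumptions; the field names come from the minimum viable schema.

```python
SCHEMA_FIELDS = {
    "session_id", "timestamp_start", "timestamp_end", "task_type",
    "model_used", "input_hash", "input_preview", "output_preview",
    "token_count_input", "token_count_output", "error_flag",
}

# Each query is written down first, with the fields it requires.
QUERIES = {
    "synthesis sessions in March 2025 with an error flag":
        {"timestamp_start", "task_type", "error_flag"},
    "monthly cost per model":
        {"timestamp_start", "model_used", "token_count_input", "token_count_output"},
}


def audit_schema(schema_fields, queries):
    """Report fields no query uses, and fields a query needs but the schema lacks."""
    required = set().union(*queries.values())
    return {
        "speculative": sorted(schema_fields - required),
        "missing": sorted(required - schema_fields),
    }


report = audit_schema(SCHEMA_FIELDS, QUERIES)
print(report["speculative"])
print(report["missing"])
```

With only these two queries, fields such as `output_preview` show up as speculative. That is the prompt to either write down the query that justifies the field (for example, the incident review in the scenario) or drop it.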

2 — What is the retention cliff?

Operational and governance logs have different useful lives

Operational logs have a short useful life: 30 to 90 days. After that, the specific session data isn't needed for debugging; patterns have either been addressed or are no longer visible. Governance logs may need to be kept for years. The EU AI Act's record-keeping and transparency requirements for high-risk systems (Articles 12 and 13) and analogous frameworks can require documentation of AI outputs that affect users for as long as those outputs remain legally relevant. Define both periods explicitly. Don't treat all logs the same: a single retention policy that applies uniformly either keeps operational data too long (storage cost, privacy liability) or deletes governance data too soon (compliance risk). The retention cliff is the boundary between those two regimes, and you need to know where it is.
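
A retention cliff can be expressed as a per-tier window plus one comparison. The sketch below assumes each record is tagged with a tier, which the baseline schema does not include, and uses placeholder windows: 90 days is the upper end of the operational range above, and the five-year governance figure is purely illustrative and would need to come from legal review.

```python
from datetime import datetime, timedelta, timezone

# Placeholder windows; the governance period must be set by legal review.
RETENTION = {
    "operational": timedelta(days=90),
    "governance": timedelta(days=365 * 5),
}


def is_past_retention(tier, timestamp_end, now):
    """True if a record in the given tier is older than that tier's window."""
    ended = datetime.fromisoformat(timestamp_end)
    return now - ended > RETENTION[tier]


now = datetime(2025, 7, 1, tzinfo=timezone.utc)
session_end = "2025-03-04T14:02:19+00:00"

print(is_past_retention("operational", session_end, now))  # True: about 118 days old
print(is_past_retention("governance", session_end, now))   # False
```

Keeping the two windows in one table makes the policy reviewable by people who will never read the purge job itself.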

3 — What must never be logged?

The never-log list is as important as the schema

Every logging schema needs an explicit never-log list. Candidates include: raw user-supplied text that may contain PII, authentication credentials or API keys that users might paste into prompts, medical or legal content that creates liability if stored, and any content that would be problematic if subpoenaed. The input_hash and input_preview fields in the minimum viable schema are specifically designed to give you retrieval capability without storing raw input, but that design only works if raw input is explicitly excluded and the preview is redacted before it is written, since the first 200 characters of a prompt can contain exactly the content on this list. The never-log list isn't a disclaimer at the bottom of the schema document. It's a constraint that shapes every field definition.
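
A minimal sketch of how the never-log list can shape the two input fields: hash the raw text, and redact before truncating so a secret that straddles the 200-character cut is not left half-visible. The two regular expressions are deliberately simple illustrations and not a complete PII or credential detector. The `sk-` key shape is a hypothetical example.

```python
import hashlib
import re

# Illustrative patterns only; real deployments need a fuller redaction pass.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # hypothetical key format


def input_hash(raw: str) -> str:
    """SHA-256 of the raw input; matches exact duplicates only."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def input_preview(raw: str, limit: int = 200) -> str:
    """Redact first, then truncate, so the stored preview never holds raw matches."""
    redacted = EMAIL.sub("[EMAIL]", raw)
    redacted = API_KEY.sub("[KEY]", redacted)
    return redacted[:limit]


raw = "Summarize the contract for jane.doe@example.com using key sk-ABCDEF1234567890abcd"
print(input_hash(raw))
print(input_preview(raw))
```

One caveat on the design: an unkeyed hash of short or predictable input can be reversed by guessing, so a keyed hash (for example via the standard library's `hmac` module) may be a better fit where inputs are low-entropy.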

In the lab, the AI will ask you to justify each field you propose, check your retention policy against the operational vs. governance split, and push for at least two example queries before accepting the schema as complete. Start by writing down the queries — the rest of the schema design follows from there.

◆ Skill Lab

Session Logging Design

~20–30 minutes · 5 exchanges to complete

Your role

🗂️

Logging Architect

Design the session logging schema for your command center. Start by defining the two queries you'll actually run. Build the schema backward from those queries, define the retention policy, and specify the never-log list.

AI role

🔍

Schema Reviewer

Asks you to justify each field you propose, checks the retention policy for the operational vs. governance split, and demands at least two example queries before accepting the schema as complete.

Framework reminders

Query-backward design: Write the queries first. Each field must be required by at least one query.

Retention cliff: Operational (30–90 days) vs. governance (years). Define both explicitly.

Never-log list: As explicit as the schema itself — PII, credentials, liability content.

NIST MEASURE: Each field enables a metric, or it's overhead.

How to complete

Propose your schema and justify each field. The reviewer will probe field justifications, check the retention split, and require at least two concrete example queries. Lab completes after 5 substantive exchanges.


✓ Module Complete

You've completed Module 6 of 8.
