Module 4 — Intro

Prompt Engineering and Management

2 min read

Prompt engineering gets treated as a dark art — intuition, trial and error, and "it worked for me once." But in a command center that handles multiple task types across a team, prompts are infrastructure. They need to be versioned, tested, and managed like any other configuration. This module is about building that discipline: not just writing better prompts, but creating a system for managing them over time.

Portfolio Artifact

Skill Lab — Prompt Library

A prompt management structure — at least three named, versioned prompt templates for different task types in a command center, each with role, task, constraints, output format, and a test criterion defined.

By the end of this module, you will:

Write prompts that specify role, task, constraints, and output format explicitly
Version and store prompts as named, reusable artifacts rather than ephemeral inputs
Test prompt changes against a baseline before deploying to production task flows
Apply O*NET Active Learning to identify when a prompt is failing and why
Build a prompt library structure for a multi-task command center

Scenario

The Prompt That No One Owns

3 min read

A developer built a command center with three task flows: code review, document synthesis, and a daily standup summarizer. Each flow started with an ad-hoc prompt written once and never revisited.

Six months later, the code review prompts have been edited in-place by two different developers. No one knows what the original looked like. There's no diff, no change note, no reason attached to the edits — just a prompt in a config file that works differently than anyone remembers, and a vague sense that the output quality dropped somewhere along the way.

The standup summarizer started producing inconsistent output after a model upgrade. The team can't tell if the model changed behavior or if the prompt drifted. Without a baseline, there's nothing to test against. Both explanations are equally plausible and equally unfixable.

The document synthesis prompt works — probably. No one wants to touch it. It's become the kind of configuration that gets preserved out of superstition rather than understanding. One developer described it as "the prompt we don't edit." That's not a stable foundation for infrastructure.

The real failure wasn't in any individual prompt. It was the absence of a system: no versioning, no test criteria, no ownership. Three prompts written once, edited whenever, with no record of why. At team scale, that's not a process problem — it's a governance problem.

Lesson

A Prompt Is a Contract

5 min read

A prompt is a contract between your command center and the model. Like any contract, it needs to be explicit, versioned, and testable. When a prompt is vague, the model fills the gaps with its own judgment — which changes across versions, temperatures, and context lengths. When a prompt has no version, every edit overwrites the baseline. When a prompt has no test criterion, you can't tell if a change improved it or degraded it.

The four elements of a well-formed prompt

Role — who the model is

The role defines the model's perspective, expertise, and authority within this task. It's not a personality flourish — it's a constraint on how the model should reason and what it's permitted to do. "You are a senior code reviewer" sets different behavior than "You are a helpful assistant reviewing code." The role is the most durable element; it rarely changes between minor versions.

Task — what it must do

The task is the specific action the model must take on the input. It should be stated as a verb phrase, not a topic: "Identify the three highest-priority issues in this diff and explain each in one sentence" is a task. "Code review" is not. The more precisely the task is scoped, the more consistently the output can be tested.

Constraints — what it must not do or must respect

The constraint is the element most often skipped. A constraint is something the model must not do, or a limit it must respect. "Do not suggest architectural changes — scope the review to the code as written." "Do not exceed three bullet points." "Do not reference information outside the provided document." Constraints prevent the model from being helpful in ways you don't want. Skipping constraints is how prompts produce useful-looking output that subtly violates the intended scope.

Output format — exactly what to return

The format specifies the structure of the response: a bulleted list, a JSON object, a numbered summary with a specific field order. "A structured summary" is not a format. "A JSON object with keys: summary (≤50 words), issues (array of strings), recommendation (one sentence)" is. Without an explicit format, the model uses whatever format it judges appropriate — which varies by model version and context.

Why in-place editing is dangerous

You lose the diff, the baseline, and the rollback

When you edit a prompt in place, you overwrite the only copy. There is no diff — no record of what changed or why. There is no baseline — no version to test the new prompt against. There is no rollback — if the edit degrades output quality, you're restoring from memory. For infrastructure that runs dozens or hundreds of times per day, that's an unacceptable risk. Treat a prompt like a config file: never edit in place, always version.

Prompt versioning

Semantic versioning for prompts

Version prompts as named files using semantic versioning: task-slug-v{major}.{minor} — e.g., code-review-v1.2. A minor version bump (1.0 → 1.1) changes phrasing, examples, or constraint wording without changing the model's role or output format. A major version bump (1.x → 2.0) changes the model's role or output format — a structural change that breaks any downstream code expecting the previous format. Document what changed and why in a one-line change note attached to the version.

Prompt testing

A test criterion is a specific, checkable property

"Produces a good summary" is not a test criterion. "Returns a bulleted list with ≤5 items, each ≤25 words, covering only the content provided" is. A test criterion must be checkable without judgment — ideally, checkable by a script. Before writing a prompt, write the test criterion. This inverts the usual order: instead of writing a prompt and then deciding whether you like the output, you define what "correct" looks like first, and then write the prompt that produces it. That inversion exposes vague intent before you commit to a direction.

Governance Standards — The Regulatory Layer

O*NET — Active Learning (4.A.6.b)

Active Learning means continuously monitoring your system's output and updating your understanding of how it behaves over time. Applied to prompts: recognizing when a prompt has drifted — when output quality degrades gradually rather than suddenly — requires sustained attention, not just deployment testing. A prompt that passed its test criterion at v1.0 may fail it after a model upgrade or after accumulated context changes. Active Learning is the habit of checking prompt performance after every significant change, not just at launch.

NIST AI RMF — MEASURE Function

The MEASURE function establishes metrics and evaluation methods for AI system performance. For prompt management, MEASURE means treating prompt outputs as a measurement target: track output quality across versions, flag regressions, and keep a version history that makes those comparisons possible. Without measurement, you can't distinguish a prompt failure from a model failure — and you can't make the case to roll back a change.

NIST AI RMF — GOVERN Function

The GOVERN function establishes policies, processes, and accountability for AI risk management. Versioning is a governance mechanism: it creates a record of who changed what and when, enables audit, and makes rollback possible. A prompt library without versioning has no governance — just a set of configuration files that can be changed by anyone at any time with no traceability.

O*NET — Critical Thinking (4.A.4.a)

Critical Thinking is the active analysis of information to make reasoned judgments. Applied here: when output quality degrades, the question is not just "fix the prompt" — it's "diagnose whether the failure is in the prompt, the model version, the input data, or the context window." That diagnosis requires comparing the current state to a known baseline, which is only possible if you've maintained one. Critical Thinking about prompt failures starts with versioning.

Context

Three Techniques for the Lab

3 min read

The lab asks you to build three named, versioned prompt templates. These three techniques apply directly to each one. Work through them in order for every template you build.

1 — Role-Task-Constraint-Format decomposition

Write the four elements separately before writing the prompt

Before drafting a prompt, write each element in isolation: Role (who the model is), Task (what it must do), Constraints (what it must not do or must respect), Output Format (exactly what to return). The constraint is the hardest — most people skip it because it requires thinking about failure modes before the prompt has failed. A constraint is something the model must not do, or a limit it must respect: "do not suggest architectural changes — scope review to the code as written," "do not include information not present in the input document," "do not exceed five bullet points." Writing constraints separately forces you to think about where the prompt could go wrong before it does.

2 — The test criterion first

Define checkable before writing the prompt

Write the test criterion before you write the prompt. What specific, checkable property must the output have? Not "produces a useful summary" — that's a judgment. A checkable property is something you could verify with a script or a 10-second scan: "Returns exactly three items," "Each item is ≤25 words," "No item references information not in the input." Writing the test criterion first inverts the usual order and exposes vague intent before you've committed to a direction. If you can't write a checkable test criterion, your task definition is still too vague — sharpen the task before writing the prompt.

3 — Version naming convention

{task-slug}-v{major}.{minor} with a change note

Name every prompt template using the convention {task-slug}-v{major}.{minor} — for example, code-review-v1.2 or standup-summary-v2.0. A minor version bump (1.0 → 1.1) changes phrasing, constraint wording, or examples without changing the model's role or output format. A major version bump (1.x → 2.0) changes the model's role or the output format — a structural change. Attach a one-line change note to every version: what changed and why. The change note is what makes the version history useful. Without it, you have a list of numbered files but no record of why each one exists.

In the lab, the AI will review each prompt template you build and check whether it has a complete decomposition (role, task, constraints, format), a testable criterion, and a version name with a change note. You'll build at least three templates — across different task types in your command center — before the lab is complete.

◆ Skill Lab

Prompt Engineering and Management

~20–30 minutes · 5 exchanges to complete

Your role

🧑‍💻

Prompt EngineerBuilding a versioned prompt library for your command center. You'll write at least three named, versioned templates across different task types — each with a complete Role-Task-Constraint-Format decomposition and a testable criterion.

AI role

🔍

Prompt ReviewerChecks each template for completeness — role, task, constraints, format — and asks for the test criterion before accepting any template as done. Enforces version naming and pushes for checkable criteria rather than quality claims.

Framework reminders

Role-Task-Constraint-Format: All four elements, separately defined before writing the prompt.

Test criterion first: Specific and checkable — not "produces good output."

Version naming: {task-slug}-v{major}.{minor} + one-line change note.

O*NET 4.A.6.b: Active Learning — monitor prompt output over time, not just at launch.

How to complete

Name a task your command center runs and build the first prompt template together. The reviewer will check each element and require a test criterion before moving on. Lab completes after 5 substantive exchanges across at least three templates.

Shift + Enter for a new line

✓ Module Complete

You've completed Module 4 of 8.

Next Module →