Intro

TDD Gating

2 min read

Most developers write tests for pipeline stages after the stage is working. This means the test is written to match what the stage already does — including its bugs. TDD (Test-Driven Development) applied to pipeline gates inverts this. You write the failing test — the validation rule the stage output must pass — before implementing the stage. The stage then exists to satisfy the test, not the other way around.

The core mindset shift: a gate is not a check that output is correct. A gate is a specification of what correct means. Writing the gate first forces you to answer the question before you can avoid it: what does bad output look like, exactly?

Your artifact — Skill Lab

A TDD gate specification — a failing test written for a real pipeline stage before implementation, with the success criteria, failure criteria, and retry trigger explicitly defined

By the end of this module, you will:

Write a gate specification as a failing test before implementing the stage it validates
Distinguish a gate that describes success from a gate that actually catches failure
Apply NIST MEASURE thinking to pipeline gates — what are you measuring, and how do you know the measurement is correct?
Name the retry trigger for a gate failure — when does the pipeline retry vs. escalate vs. terminate?
Identify which gates in an existing pipeline were written after the fact, and why that matters

Scenario

The Test That Passes Everything

3 min read

A team has built a spec generation stage for their pipeline. The stage takes a GitHub issue and produces a technical specification. They wrote a validation gate that checks: (1) the output is valid JSON, (2) it contains a "title" field, (3) it is longer than 200 characters.

The gate passes. Almost always. The stage gets deployed.

Three weeks later, a senior engineer is reviewing pipeline outputs manually and notices that roughly one in five specifications is technically valid — all fields present, appropriate length — but completely generic. The spec for a rate limiting bug and the spec for a UI theming issue look nearly identical: same structure, same sections, same boilerplate. The gate they wrote never caught this because it was written to validate format, not quality.

The problem: the gate was written after the stage was working. They knew what the stage produced, so they wrote a gate that passed what the stage produced. They measured format because format was measurable. They skipped quality because quality was harder to define.

Contrast this with TDD: if they had written the gate first, the first question would have been "what does a bad spec look like?" — not "what does a good spec contain?" A bad spec, stated precisely, is: a spec that cannot be distinguished from a spec for a different issue. A gate that catches that failure would require semantic comparison or at minimum a check that key terms from the issue appear in the spec. That gate, written before the stage, would have forced a different implementation.
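To make the contrast concrete, here is a minimal sketch of both gates. `format_gate` reproduces the three checks from the scenario. `mentions_issue_terms`, the term list, and the sample spec are hypothetical illustrations of the "at minimum" check described above, not the team's actual code.

```python
import json
import re


def format_gate(raw: str) -> bool:
    """The scenario's gate: valid JSON, has a "title" field, over 200 characters."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(spec, dict) and "title" in spec and len(raw) > 200


def mentions_issue_terms(raw: str, issue_terms: list[str], minimum: int = 2) -> bool:
    """Hypothetical content check: at least `minimum` issue terms appear as whole words."""
    text = raw.lower()
    hits = [t for t in issue_terms if re.search(rf"\b{re.escape(t.lower())}\b", text)]
    return len(hits) >= minimum


# A boilerplate spec that says nothing about the issue it was generated for.
generic_spec = json.dumps({
    "title": "Technical Specification",
    "overview": "This change addresses the reported issue and improves reliability. " * 3,
    "acceptance_criteria": ["The issue is resolved.", "Existing tests pass."],
})

# Assumed key terms for a rate limiting issue.
rate_limit_terms = ["rate", "limit", "429", "throttle"]

print(format_gate(generic_spec))                             # the format gate's verdict
print(mentions_issue_terms(generic_spec, rate_limit_terms))  # the content check's verdict
```

The generic spec satisfies every format check while naming none of the issue's terms, which is the failure the original gate could not see. A term check is still a crude proxy; semantic comparison would be stricter.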

Lesson

Write the Failure First

4 min read

Core insight: the gate defines what the stage must not produce. Implementation satisfies the gate. This order matters because it forces you to define failure before you have something you're trying to protect.

Red-green-refactor applied to pipeline stages

Red — Write the failing gate

The stage doesn't exist yet. The gate specifies: given this input, what would make the output unacceptable? Name it precisely enough that you could automate the check. This is the question most developers never ask before building.

Green — Implement minimally to pass

Implement only as much of the stage as the gate requires. If the gate says "the spec must contain at least 3 terms from the original issue title," the stage must produce a spec that satisfies that, and nothing beyond it is required yet.

Refactor — Improve without breaking

Once the gate passes, improve the stage without breaking the gate. The gate is the regression test. If a refactor breaks the gate, the refactor broke the stage. The gate holds the line.
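The red and green steps can be sketched as follows, using the "at least 3 terms from the original issue title" rule as the gate. The function names, the four-letter cutoff for what counts as a term, and the sample issue title are all assumptions made for illustration.

```python
import re


def title_terms(title: str) -> set[str]:
    """Lowercased words of four or more letters from the issue title (assumed cutoff)."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if len(w) >= 4}


def gate(issue_title: str, spec_text: str, minimum: int = 3) -> bool:
    """Pass only if the spec contains at least `minimum` distinct title terms."""
    spec_words = set(re.findall(r"[a-z]+", spec_text.lower()))
    return len(title_terms(issue_title) & spec_words) >= minimum


ISSUE_TITLE = "Rate limiter returns 500 instead of 429 under burst traffic"


# Red: the stage is only a stub, so the gate is expected to fail.
def generate_spec_stub(issue_title: str) -> str:
    return "This specification describes the requested change."


# Green: the smallest stage that satisfies the gate, and nothing more.
def generate_spec_minimal(issue_title: str) -> str:
    return f"Specification for: {issue_title}"


red = gate(ISSUE_TITLE, generate_spec_stub(ISSUE_TITLE))
green = gate(ISSUE_TITLE, generate_spec_minimal(ISSUE_TITLE))
print(red, green)
```

Notice that the green stage passes by echoing the title. That is a legitimate minimal implementation, and it is also a signal: if echoing the title is not acceptable output, the gate needs another failure criterion before the stage grows further.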

Governance Standards — The Regulatory Layer

NIST AI RMF — MEASURE Function

The MEASURE function calls for selecting and applying metrics for AI system performance. For pipeline gates, MEASURE means: what is the gate actually measuring, and is that measurement a valid proxy for what you care about? A character count gate measures output length. It is a valid proxy for output completeness only if short outputs are always incomplete and long outputs are reliably complete. If either assumption fails, the gate is measuring the wrong thing. NIST MEASURE applied to gate design requires tracing from the metric back to the property you care about, before you deploy.
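One way to test whether a metric is a valid proxy is to compare the gate's verdicts against human judgments on a small labeled sample. The sketch below assumes you have such labels; `proxy_report`, the 200-character threshold, and the sample data are hypothetical.

```python
from typing import Callable


def length_gate(output: str, min_chars: int = 200) -> bool:
    """Proxy under test: output longer than `min_chars` is treated as complete."""
    return len(output) > min_chars


def proxy_report(
    samples: list[tuple[str, bool]], gate: Callable[[str], bool]
) -> dict[str, int]:
    """Count where the gate's verdict agrees or disagrees with the human label."""
    report = {"agree": 0, "false_pass": 0, "false_fail": 0}
    for output, is_good in samples:
        passed = gate(output)
        if passed == is_good:
            report["agree"] += 1
        elif passed:
            report["false_pass"] += 1  # gate passed output a human judged bad
        else:
            report["false_fail"] += 1  # gate rejected output a human judged good
    return report


# (output, human judgment) pairs; invented for illustration.
samples = [
    ("x" * 300, True),                      # long and judged complete
    ("y" * 300, False),                     # long but judged generic
    ("z" * 50, False),                      # short and judged incomplete
    ("Rename flag --foo to --bar.", True),  # short but judged complete
]

print(proxy_report(samples, length_gate))
# {'agree': 2, 'false_pass': 1, 'false_fail': 1}
```

Any nonzero `false_pass` or `false_fail` count means one of the assumptions behind the proxy does not hold for your data.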

O*NET — Active Learning (2.A.2.b)

TDD is a disciplined form of active learning — you generate a prediction (this implementation will satisfy this gate), test it, and update based on results. O*NET's Active Learning competency means applying this cycle systematically rather than writing tests as documentation after you already know the answer. Gates written after the fact are not tests; they are assertions about what already works.

Context

Three Things a Good Gate Specifies

3 min read

Every gate must specify three things. All three are applied in the lab — in this order.

1. The failure criterion — write this first

What does the output look like when the stage broke? This is the TDD question. Write it before anything else. "The spec is incomplete" is not a failure criterion. "The spec's 'Acceptance Criteria' section is missing or contains fewer than two bullet points" is a failure criterion you can test. Specific enough to automate means specific enough to use.

2. The success criterion — write this second

What does the output look like when the stage worked? Make this precise enough to automate. Not "the spec is complete" but "the spec contains a section titled 'Acceptance Criteria' with at least two bullet points, and each bullet point contains a verb." Note: success and failure are not inverses. A gate that only checks for success will pass everything that isn't obviously broken.

3. The retry trigger — write this third

What happens when the gate fails? Three options: retry the stage with the same input (appropriate for nondeterministic failures), retry with modified parameters (appropriate when the failure suggests a prompt needs adjustment), or escalate (appropriate when the gate failure indicates the input itself is unsalvageable). The retry trigger must be specified in the gate — not decided at failure time.
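The failure criterion and retry trigger above can be sketched together as one gate function. The mapping from failure mode to response is an assumption chosen for illustration, as are the names `Action`, `acceptance_bullets`, and `decide`, and the choice of Markdown as the spec format. The "each bullet contains a verb" success check is left out because it would need a part-of-speech tagger.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PASS = "pass"
    RETRY_SAME = "retry with the same input"
    RETRY_MODIFIED = "retry with modified parameters"
    ESCALATE = "escalate"


def acceptance_bullets(spec_md: str) -> Optional[list[str]]:
    """Return the bullets under 'Acceptance Criteria', or None if the section is missing."""
    lines = spec_md.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lstrip("#").strip().lower() == "acceptance criteria":
            bullets = []
            for following in lines[i + 1:]:
                stripped = following.strip()
                if stripped.startswith(("- ", "* ")):
                    bullets.append(stripped[2:].strip())
                elif stripped.startswith("#"):
                    break  # next heading ends the section
            return bullets
    return None


def decide(spec_md: str, issue_body: str) -> Action:
    """Apply the failure criterion, then map each failure mode to a response."""
    if not issue_body.strip():
        return Action.ESCALATE        # assumed: an empty issue is unsalvageable input
    bullets = acceptance_bullets(spec_md)
    if bullets is None:
        return Action.RETRY_MODIFIED  # assumed: missing section means the prompt needs adjusting
    if len(bullets) < 2:
        return Action.RETRY_SAME      # assumed: a thin section is a nondeterministic miss
    return Action.PASS


thin_spec = "# Spec\n\n## Acceptance Criteria\n- Return 429 when the limit is exceeded\n"
print(decide(thin_spec, issue_body="Rate limiter returns 500").name)
```

Because the mapping lives inside `decide`, the response to each failure mode is fixed when the gate is written, not improvised when it fails.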

In the lab, you'll bring a real pipeline stage you own and write the gate for it — failure criterion first, then success criterion, then retry trigger. You'll apply all three before writing a single line of implementation.

◆ Skill Lab

TDD Gate Specification

~20 minutes · 1 gate spec

What you're doing

Pick a real pipeline stage you own or are designing — or use the spec generation stage from the scenario. Write the gate for it using TDD order: failure criterion first, then success criterion, then retry trigger. No implementation until the gate is fully specified.

Roles

🏗

You — Stage Owner

Bring a real pipeline stage and write its gate spec. Failure criterion first. Don't describe what good output looks like until you've named what bad output looks like.

🔍

AI — Gate Reviewer

A TDD coach who will not let you define success without first defining failure. I'll probe whether your failure criterion is specific enough to automate and whether your retry trigger covers the failure modes you've named.

Framework — apply in this order

Red first: failure criterion before success criterion

NIST MEASURE: is your measurement a valid proxy?

Retry trigger: retry same / retry modified / escalate

Success criteria

A complete gate spec with all three components: a failure criterion specific enough to automate, a success criterion distinct from it, and a retry trigger that names which failure modes map to which response.


✓ Module Complete

You've completed Module 4 of 8.