Most developers write tests for pipeline stages after the stage is working. This means the test is written to match what the stage already does — including its bugs. TDD (Test-Driven Development) applied to pipeline gates inverts this. You write the failing test — the validation rule the stage output must pass — before implementing the stage. The stage then exists to satisfy the test, not the other way around.
The core mindset shift: a gate is not a check that output is correct. A gate is a specification of what correct means. Writing the gate first forces you to answer the question before you can avoid it: what does bad output look like, exactly?
A team has built a spec generation stage for their pipeline. The stage takes a GitHub issue and produces a technical specification. They wrote a validation gate that checks: (1) the output is valid JSON, (2) it contains a "title" field, (3) it is longer than 200 characters.
The gate passes. Almost always. The stage gets deployed.
Three weeks later, a senior engineer is reviewing pipeline outputs manually and notices that roughly one in five specifications is technically valid — all fields present, appropriate length — but completely generic. The spec for a rate limiting bug and the spec for a UI theming issue look nearly identical: same structure, same sections, same boilerplate. The gate they wrote never caught this because it was written to validate format, not quality.
The problem: the gate was written after the stage was working. They knew what the stage produced, so they wrote a gate that passed what the stage produced. They measured format because format was measurable. They skipped quality because quality was harder to define.
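The format-only gate from this story can be sketched in a few lines. This is an illustration, not the team's actual code: the name `format_gate`, the sample specs, and the boilerplate text are all hypothetical, and the sketch assumes the stage emits its spec as a JSON string.

```python
import json


def format_gate(raw_output: str) -> bool:
    """Format-only gate: valid JSON, has a "title" field, longer than 200 characters."""
    try:
        spec = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(spec, dict) or "title" not in spec:
        return False
    return len(raw_output) > 200


# Hypothetical generic body, reused for two unrelated issues.
BOILERPLATE = "This specification describes the required changes. " * 5

rate_limit_spec = json.dumps({"title": "Fix rate limiting bug", "body": BOILERPLATE})
theming_spec = json.dumps({"title": "Fix UI theming issue", "body": BOILERPLATE})

print(format_gate(rate_limit_spec))  # True
print(format_gate(theming_spec))     # True
```

Both specs pass even though their bodies are identical. Nothing in the gate looks at whether the spec relates to its issue, so the generic output is invisible to it.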
Contrast this with TDD: if they had written the gate first, the first question would have been "what does a bad spec look like?" — not "what does a good spec contain?" A bad spec, stated precisely, is: a spec that cannot be distinguished from a spec for a different issue. A gate that catches that failure would require semantic comparison or at minimum a check that key terms from the issue appear in the spec. That gate, written before the stage, would have forced a different implementation.
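One rough way to automate "cannot be distinguished from a spec for a different issue" is to compare the new spec against a spec generated for an unrelated issue. The sketch below uses character-level similarity from Python's standard `difflib` as a stand-in for real semantic comparison. The function name, the 0.9 threshold, and the sample specs are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_indistinguishable(spec: str, other_spec: str, threshold: float = 0.9) -> bool:
    """Failure criterion: spec is nearly identical to the spec for a different issue."""
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # items in sequences of 200+ elements, which would distort ratios on boilerplate.
    similarity = SequenceMatcher(None, spec, other_spec, autojunk=False).ratio()
    return similarity >= threshold


# Hypothetical outputs for two unrelated issues.
BOILERPLATE = "This specification describes the required changes. " * 5
rate_limit_spec = "Spec for: rate limiting bug\n" + BOILERPLATE
theming_spec = "Spec for: UI theming issue\n" + BOILERPLATE

print(is_indistinguishable(rate_limit_spec, theming_spec))  # True: only the title line differs
```

Character similarity is a crude proxy: it catches copy-paste boilerplate but not two specs that say the same generic thing in different words. It is still a gate that the format-only check could never have been.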
Core insight: the gate defines what the stage must not produce. Implementation satisfies the gate. This order matters because it forces you to define failure before you have something you're trying to protect.
The stage doesn't exist yet. The gate specifies: given this input, what would make the output unacceptable? Name it precisely enough that you could automate the check. This is the question most developers never ask before building.
Implement just enough of the stage to pass the gate, and no more. If the gate says "the spec must contain at least 3 terms from the original issue title," the stage must produce a spec that meets that rule; nothing beyond it is required yet.
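The first two steps can be sketched together, assuming the "at least 3 terms from the issue title" rule quoted above. Every name here (`gate`, `stub_stage`, `minimal_stage`, the sample title) is hypothetical, and "term" is assumed to mean a lowercased word of three or more characters.

```python
import re


def issue_terms(title: str) -> set[str]:
    """Lowercased words of 3+ characters from the issue title (an assumed definition of 'term')."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) >= 3}


def gate(issue_title: str, spec: str, min_terms: int = 3) -> bool:
    """Written first: the spec must contain at least min_terms terms from the issue title."""
    spec_words = set(re.findall(r"[a-z0-9]+", spec.lower()))
    return len(issue_terms(issue_title) & spec_words) >= min_terms


def stub_stage(issue_title: str) -> str:
    """Before implementation: a placeholder stage, so the gate fails."""
    return "This specification describes the required changes."


def minimal_stage(issue_title: str) -> str:
    """The least implementation that passes the gate, and nothing more."""
    return f"Specification for: {issue_title}. Describe the required changes."


title = "Rate limiter returns 500 under burst traffic"
print(gate(title, stub_stage(title)))     # False
print(gate(title, minimal_stage(title)))  # True
```

The minimal stage passes by echoing the title, which is useful information: if that output is unacceptable, the gate is missing a failure criterion, and the fix belongs in the gate before it belongs in the stage.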
Once the gate passes, improve the stage without breaking the gate. The gate is the regression test. If a refactor breaks the gate, the refactor broke the stage. The gate holds the line.
The MEASURE function of the NIST AI Risk Management Framework (AI RMF) calls for establishing metrics for AI system performance. For pipeline gates, MEASURE means asking: what is the gate actually measuring, and is that measurement a valid proxy for what you care about? A character count gate measures output length. It is a valid proxy for output completeness only if short outputs are always incomplete and long outputs are always complete. If either assumption fails, the gate is measuring the wrong thing. Applied to gate design, MEASURE requires tracing from the metric back to the property you care about before you deploy.
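One way to test whether a metric is a valid proxy is to run the gate over a handful of outputs that a reviewer has already judged, then count the disagreements. The sketch below does this for a character count gate. The samples are invented purely for illustration, and the names `LabeledOutput` and `proxy_disagreements` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LabeledOutput:
    text: str
    acceptable: bool  # a human reviewer's judgment, not the gate's


def length_gate(text: str, min_chars: int = 200) -> bool:
    """Character count gate: passes any output longer than min_chars."""
    return len(text) > min_chars


def proxy_disagreements(
    samples: list[LabeledOutput], gate: Callable[[str], bool]
) -> tuple[int, int]:
    """Return (false passes, false fails): where the gate and the reviewer disagree."""
    false_passes = sum(1 for s in samples if gate(s.text) and not s.acceptable)
    false_fails = sum(1 for s in samples if not gate(s.text) and s.acceptable)
    return false_passes, false_fails


# Invented samples, for illustration only.
samples = [
    LabeledOutput("x" * 300, acceptable=False),                 # long but generic
    LabeledOutput("Concise, complete spec.", acceptable=True),  # short but good
    LabeledOutput("y" * 250, acceptable=True),                  # long and good
]

print(proxy_disagreements(samples, length_gate))  # (1, 1)
```

Any nonzero count means length and completeness have come apart on real outputs, which is the signal to redefine the gate rather than tune its threshold.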
TDD is a disciplined form of active learning — you generate a prediction (this implementation will satisfy this gate), test it, and update based on results. O*NET's Active Learning competency means applying this cycle systematically rather than writing tests as documentation after you already know the answer. Gates written after the fact are not tests; they are assertions about what already works.
Every gate must specify three things: a failure criterion, a success criterion, and a retry trigger. All three are applied in the lab, in this order.
What does the output look like when the stage broke? This is the TDD question. Write it before anything else. "The spec is incomplete" is not a failure criterion. "The spec's 'Acceptance Criteria' section is missing or contains fewer than two bullet points" is a failure criterion you can test. Specific enough to automate means specific enough to use.
What does the output look like when the stage worked? Make this precise enough to automate. Not "the spec is complete" but "the spec contains a section titled 'Acceptance Criteria' with at least two bullet points, and each bullet point contains a verb." Note: success and failure are not inverses. A gate that checks only for success signals will pass any output that shows them, including output that is broken in ways the success check never looks at.
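The two "Acceptance Criteria" rules quoted above can be written as separate checks. This sketch assumes the spec is Markdown-style text with `#` headings and `- ` bullets, and it stands in for verb detection with a tiny hard-coded word list. Both are assumptions; a real gate would need a proper parser or part-of-speech tagger.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny verb list used only for this sketch.
VERBS = {"returns", "rejects", "logs", "displays", "retries", "sends", "blocks"}


def acceptance_bullets(spec_md: str) -> Optional[list[str]]:
    """Bullets under an 'Acceptance Criteria' heading, or None if the section is missing."""
    bullets, in_section, found = [], False, False
    for line in spec_md.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            in_section = stripped.lstrip("#").strip() == "Acceptance Criteria"
            found = found or in_section
        elif in_section and stripped.startswith("- "):
            bullets.append(stripped[2:].strip())
    return bullets if found else None


def fails(spec_md: str) -> bool:
    """Failure criterion: section missing, or fewer than two bullet points."""
    bullets = acceptance_bullets(spec_md)
    return bullets is None or len(bullets) < 2


def succeeds(spec_md: str) -> bool:
    """Success criterion: at least two bullets, each containing a known verb."""
    bullets = acceptance_bullets(spec_md)
    if bullets is None or len(bullets) < 2:
        return False
    return all(VERBS & set(re.findall(r"[a-z]+", b.lower())) for b in bullets)


spec = (
    "# Spec\n"
    "## Acceptance Criteria\n"
    "- API returns 429 after the limit\n"
    "- Clear error message\n"
)
print(fails(spec), succeeds(spec))  # False False
```

The sample spec neither fails nor succeeds: it has two bullets, but one has no verb. That middle ground is why the two criteria are written separately instead of deriving one from the other.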
What happens when the gate fails? Three options: retry the stage with the same input (appropriate for nondeterministic failures), retry with modified parameters (appropriate when the failure suggests a prompt needs adjustment), or escalate (appropriate when the gate failure indicates the input itself is unsalvageable). The retry trigger must be specified in the gate — not decided at failure time.
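Specifying the retry trigger in the gate can be as simple as a lookup table that ships with it. In the sketch below, the failure reasons, the `MAX_ATTEMPTS` cap, and the choice to escalate on unknown reasons are all assumptions added for illustration; only the three actions come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_SAME = "retry with the same input"
    RETRY_MODIFIED = "retry with modified parameters"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: Optional[str] = None  # hypothetical failure code set by the gate


# Declared alongside the gate, so nothing is decided at failure time.
RETRY_POLICY = {
    "invalid_json": Action.RETRY_SAME,             # nondeterministic failure
    "missing_issue_terms": Action.RETRY_MODIFIED,  # prompt likely needs adjustment
    "empty_issue_body": Action.ESCALATE,           # input itself is unsalvageable
}
MAX_ATTEMPTS = 3  # assumed cap so retries cannot loop forever


def next_action(result: GateResult, attempt: int) -> Optional[Action]:
    """Return the action the gate prescribes, or None if the gate passed."""
    if result.passed:
        return None
    if attempt >= MAX_ATTEMPTS:
        return Action.ESCALATE
    # Unknown failure reasons escalate rather than retry blindly.
    return RETRY_POLICY.get(result.reason, Action.ESCALATE)


print(next_action(GateResult(False, "invalid_json"), attempt=1))  # Action.RETRY_SAME
```

Because the policy is data, it can be reviewed and tested with the rest of the gate before the stage exists.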
In the lab, you'll bring a real pipeline stage you own and write the gate for it — failure criterion first, then success criterion, then retry trigger. You'll apply all three before writing a single line of implementation.