Building Agentic Pipelines
Module 5 of 8
Intro

Cross-Model Review

2 min read

A pipeline stage that validates its own output will pass its own biases. The model that generated the output also knows what it was trying to produce, so it tends to miss the same things it missed the first time. Cross-model review (sometimes called the grill-me pattern) inserts a second model into the pipeline as an adversarial gate: a reviewer whose job is to find problems with the previous stage's output, not to accept it.

The key insight: the reviewer model doesn't need to know how to produce the output — it only needs to know how to critique it. These are different capabilities, and they're often better separated across different models or system prompts.

Your artifact — Skill Lab
A cross-model review gate design — a grill-me stage specified for a real pipeline, with the reviewer model's persona, the failure patterns it looks for, and the pass/fail criteria it applies
  • Specify a cross-model review stage for an existing pipeline — define the reviewer persona, failure patterns, and pass/fail criteria
  • Distinguish producer-model capabilities from reviewer-model capabilities and know when separation adds value
  • Apply the grill-me pattern to catch failure modes that the producing stage's own gate misses
  • Apply the NIST AI RMF MEASURE function to cross-model review: what is the reviewer measuring, and how do you know it's catching real failures?
  • Apply EU AI Act Article 14 (Human Oversight) — identify when cross-model review is a sufficient substitute for human review and when it isn't
Scenario

The Reviewer That Agrees

3 min read

A team has a code generation stage that produces implementation files from a technical specification. The stage has a linting gate — the output must pass ESLint and the TypeScript compiler with no errors. Both pass. Code ships.

After two months, a senior engineer audits the pipeline's outputs and notices a pattern: the generated code almost always passes lint and type checking, and almost always fails code review on the first pass for semantic reasons — unclear variable names, missing error handling, functions that are technically correct but don't match the codebase's established patterns.

The linting gate is catching what it was designed to catch. It wasn't designed to catch semantic quality. And because lint and type errors are the only signals the gate feeds back, the generation stage has been tuned to satisfy them, not to produce reviewable code.

The team adds a cross-model review stage: a separate model instance with a system prompt that describes the codebase's patterns, the team's review criteria, and a list of the most common rejection reasons from their code review history. The reviewer model receives the generated code and produces a structured critique: does it match established patterns? Is error handling complete? Are functions appropriately scoped?

In the first month, the cross-model reviewer catches 74% of the semantic issues that previously reached human review. Human review time per PR drops by 40%. More importantly, the generation stage's output begins to improve, because the critiques it gets back are semantically richer than lint errors.

The grill-me pattern works because it separates production from critique. The producer optimizes for output. The reviewer optimizes for catching what the producer missed.

Lesson

The Grill-Me Pattern

4 min read

Core insight: A second model reviewing the output of the first catches the class of failures that are invisible to the producing model — not because the reviewer is smarter, but because it has a different objective and a different system prompt.

Production

The producing stage generates output with its standard system prompt, optimizing for whatever goal its prompt describes.

Review

The reviewing stage receives the output and a reviewer system prompt that describes the failure patterns to look for, the quality bar the output must meet, and the structured format for its critique.

Gate evaluation

The gate evaluates the reviewer's critique — if the critique identifies failures above a threshold, the gate fails and the pipeline retries or escalates. The reviewer model does not regenerate the output. It only evaluates it.
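The three steps above can be sketched as a small control loop. This is a minimal illustration, not a reference implementation: `run_with_review`, `Critique`, `GateResult`, and the stub producer and reviewer are hypothetical names, and the real model calls are assumed to sit behind the `produce` and `review` callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    failures: List[str]  # names of the failure patterns the reviewer found


@dataclass
class GateResult:
    status: str    # "passed" or "escalated"
    output: str    # the last output the producer generated
    attempts: int


def run_with_review(
    produce: Callable[[str, List[str]], str],
    review: Callable[[str], Critique],
    spec: str,
    max_failures: int = 0,   # assumed threshold: no failures tolerated
    max_attempts: int = 3,   # assumed retry budget before escalating
) -> GateResult:
    feedback: List[str] = []
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = produce(spec, feedback)   # production
        critique = review(output)          # review: evaluates, never rewrites
        if len(critique.failures) <= max_failures:   # gate evaluation
            return GateResult("passed", output, attempt)
        feedback = critique.failures       # retry with the critique attached
    return GateResult("escalated", output, max_attempts)


# Stubs standing in for real model calls (assumptions for illustration).
def fake_produce(spec: str, feedback: List[str]) -> str:
    if "missing error handling" in feedback:
        return "try:\n    fetch()\nexcept IOError:\n    log()"
    return "fetch()"


def fake_review(output: str) -> Critique:
    failures = [] if "except" in output else ["missing error handling"]
    return Critique(failures)


result = run_with_review(fake_produce, fake_review, spec="fetch the data")
print(result.status, result.attempts)  # passed 2
```

Note that the reviewer only returns a critique. Regeneration stays with the producer, which keeps the two roles separate.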

The reviewer system prompt is the most critical design decision. A reviewer prompt that says "review this code for quality" will produce the same biases as the producing model. A reviewer prompt that says "you are a senior engineer who has rejected the last 12 PRs from this code generator for the following reasons: [list]" will be adversarial in the right way — specifically probing for the known failure modes.
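One way to keep the reviewer prompt specific is to generate it from the team's rejection history instead of writing it by hand. The sketch below assumes a hypothetical `build_reviewer_prompt` helper and made-up rejection reasons; the exact wording of the prompt is an illustration, not a tested template.

```python
from typing import List


def build_reviewer_prompt(rejection_reasons: List[str], codebase_notes: str) -> str:
    """Build an adversarial reviewer prompt from known rejection reasons."""
    if not rejection_reasons:
        # Without known failure modes the prompt collapses into
        # "review this code for quality", which is what we want to avoid.
        raise ValueError("reviewer prompt needs at least one rejection reason")
    reasons = "\n".join(
        f"{i}. {reason}" for i, reason in enumerate(rejection_reasons, start=1)
    )
    return (
        "You are a senior engineer reviewing output from a code generator.\n"
        f"You have rejected its recent PRs for these reasons:\n{reasons}\n\n"
        f"Codebase conventions:\n{codebase_notes}\n\n"
        "Check the submitted code against each reason above. "
        "Do not rewrite the code. "
        "Report which of the listed reasons apply and where."
    )


prompt = build_reviewer_prompt(
    ["unclear variable names", "missing error handling"],
    "Services return Result objects instead of raising.",
)
print(prompt)
```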

Cross-model review catches systematic failures — the patterns the producing model consistently gets wrong. It does not catch novel failures — edge cases the reviewer's prompt doesn't anticipate. And it does not replace human review for high-stakes decisions. EU AI Act Article 14 requires meaningful human oversight for high-risk AI applications. Cross-model review is a gate, not a substitute for human judgment on consequential outputs.

NIST AI RMF — MEASURE Function

The cross-model reviewer is a measurement instrument. Like any instrument, it measures what it was calibrated to measure. A reviewer calibrated on past rejection reasons will catch those reasons and miss new ones. The NIST AI RMF is a voluntary framework, so nothing here is mandated, but its MEASURE function calls for checking that measurement methods remain effective. For this gate, that means periodic recalibration: updating the reviewer's failure pattern list as new failure modes emerge.
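To recalibrate, you need some way to compare what the reviewer flagged with what humans later found. A rough sketch, assuming you keep audit records as pairs of (patterns a human found, patterns the reviewer flagged); `calibration_report` and the sample data are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def calibration_report(
    known_patterns: Set[str],
    audits: Iterable[Tuple[Set[str], Set[str]]],  # (human_found, reviewer_flagged)
) -> Tuple[Dict[str, float], List[str]]:
    found: Dict[str, int] = defaultdict(int)
    caught: Dict[str, int] = defaultdict(int)
    for human_found, reviewer_flagged in audits:
        for pattern in human_found:
            found[pattern] += 1
            if pattern in reviewer_flagged:
                caught[pattern] += 1
    # Catch rate per pattern the reviewer is supposed to know about.
    catch_rate = {p: caught[p] / found[p] for p in found if p in known_patterns}
    # Patterns humans found that are not on the reviewer's list at all:
    # these are candidates for the next recalibration.
    uncovered = sorted(p for p in found if p not in known_patterns)
    return catch_rate, uncovered


known = {"missing error handling", "unclear names"}
audits = [
    ({"missing error handling"}, {"missing error handling"}),
    ({"missing error handling", "n+1 query"}, set()),
    ({"unclear names"}, {"unclear names"}),
]
catch_rate, uncovered = calibration_report(known, audits)
print(catch_rate, uncovered)
```

A low catch rate on a known pattern suggests the prompt needs sharpening; an entry in `uncovered` suggests the pattern list is out of date.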

EU AI Act — Article 14 (Human Oversight)

Article 14 requires that humans can effectively oversee AI system outputs and intervene when needed. Cross-model review can reduce the volume of outputs that require human attention — but it must not reduce human capacity to catch the failures the reviewer misses. Design the pipeline so that a human sees a random sample of reviewer-approved outputs, not just the ones the reviewer flags.
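The sampling rule in the last sentence can be sketched as follows. This assumes outputs are identified by string IDs and that a 10% audit rate is acceptable; both are illustrative choices, and `select_for_human_review` is a hypothetical helper.

```python
import random
from typing import List, Optional, Sequence


def select_for_human_review(
    flagged: Sequence[str],
    approved: Sequence[str],
    sample_rate: float = 0.1,            # assumed audit rate, tune per risk level
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return every flagged output plus a random sample of approved ones."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    rng = rng or random.Random()
    k = 0
    if approved and sample_rate > 0:
        # Audit at least one approved output per batch, never more than exist.
        k = min(len(approved), max(1, round(len(approved) * sample_rate)))
    audit_sample = rng.sample(list(approved), k)
    return list(flagged) + audit_sample


approved_ids = [f"pr-{n}" for n in range(10, 30)]
queue = select_for_human_review(["pr-3"], approved_ids, rng=random.Random(7))
print(queue)
```

Because the sample is drawn from outputs the reviewer approved, it gives humans a view of exactly the cases where a reviewer miss would otherwise go unnoticed.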

O*NET — Active Learning (2.A.2.b)

The grill-me pattern is a structured learning mechanism for the pipeline itself. Reviewer critiques feed back into improving the producing stage. Track which failure patterns the reviewer catches most frequently: those are the producing stage's systematic weaknesses and the highest-value targets for improvement.
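Tracking the most frequent failure patterns needs little more than a counter. A minimal sketch, assuming each critique is stored as a list of pattern names; `top_weaknesses` and the sample critiques are hypothetical:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_weaknesses(critiques: Iterable[List[str]], n: int = 3) -> List[Tuple[str, int]]:
    """Rank failure patterns by how many outputs they appeared in."""
    counts: Counter = Counter()
    for failures in critiques:
        # Count a pattern once per output, even if flagged several times.
        counts.update(set(failures))
    return counts.most_common(n)


critiques = [
    ["missing error handling", "unclear names"],
    ["missing error handling"],
    ["missing error handling", "missing error handling"],
    [],
]
ranking = top_weaknesses(critiques, n=2)
print(ranking)  # [('missing error handling', 3), ('unclear names', 1)]
```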

Context

Designing the Reviewer

3 min read

Three decisions every cross-model review gate requires. Get these right and the reviewer is testable. Leave them vague and the reviewer is expensive noise.

1. The reviewer persona — who is the reviewer, and what do they know?

The reviewer system prompt is not "review this output for quality." It is a specific persona with specific knowledge: the failure patterns from the last N human rejections, the quality bar the output must meet, the codebase patterns or domain constraints that make certain outputs unacceptable. The more specific the persona, the more targeted the review.

2. The failure pattern list — what specific things is the reviewer looking for?

The reviewer should not do open-ended quality assessment. It should check for a named list of failure patterns: "missing error handling on async calls," "variable names that don't match the codebase's naming convention," "functions longer than 50 lines." Named failure patterns are testable. Open-ended quality assessment is not.
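Named patterns become testable once they live in a data structure the gate can check against. The sketch below assumes the reviewer is asked to reply in JSON with a `findings` list; that schema, the pattern IDs, and `parse_critique` are all assumptions made for illustration.

```python
import json
from typing import Dict, List

# Pattern IDs are hypothetical; the descriptions come from the examples above.
FAILURE_PATTERNS: Dict[str, str] = {
    "async-no-error-handling": "missing error handling on async calls",
    "naming-convention": "variable names that don't match the codebase's naming convention",
    "long-function": "functions longer than 50 lines",
}


def parse_critique(raw: str) -> List[str]:
    """Parse a reviewer reply and return the named patterns it reported."""
    data = json.loads(raw)
    reported = [finding["pattern"] for finding in data["findings"]]
    unknown = [p for p in reported if p not in FAILURE_PATTERNS]
    if unknown:
        # An unnamed pattern means the reviewer drifted into open-ended
        # assessment; surface it instead of silently accepting it.
        raise ValueError(f"reviewer reported unnamed patterns: {unknown}")
    return reported


reply = '{"findings": [{"pattern": "long-function", "line": 12}]}'
print(parse_critique(reply))  # ['long-function']
```

Rejecting unnamed patterns is one possible design choice. A team might instead log them as candidates for the next update of the pattern list.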

3. The pass/fail criterion — what does the reviewer's output need to look like for the gate to pass?

The reviewer produces a structured critique. The gate evaluates the critique, not the original output. Define: how many failure patterns can the output have and still pass? Are some patterns disqualifying on their own (security issues) while others can accumulate (style issues)? The pass/fail criterion must be specified before the reviewer is deployed.
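A pass/fail criterion with both disqualifying and accumulating patterns might look like the sketch below. The category names and the limit of two accumulated issues are assumptions chosen for the example, and `gate_passes` is a hypothetical helper.

```python
from typing import FrozenSet, Iterable

# Assumed policy: any security finding fails the gate on its own,
# while other findings are tolerated up to a fixed count.
DISQUALIFYING: FrozenSet[str] = frozenset({"security"})
MAX_ACCUMULATING = 2


def gate_passes(
    failures: Iterable[str],
    disqualifying: FrozenSet[str] = DISQUALIFYING,
    max_accumulating: int = MAX_ACCUMULATING,
) -> bool:
    """Evaluate the reviewer's critique, not the original output."""
    found = list(failures)
    if any(f in disqualifying for f in found):
        return False
    return len(found) <= max_accumulating


print(gate_passes(["style", "style"]))           # True
print(gate_passes(["style", "style", "style"]))  # False
print(gate_passes(["security"]))                 # False
```

Writing the rule as code before deployment forces the threshold question to be answered up front instead of being decided case by case.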

In the lab, you'll design a cross-model review gate for a real pipeline stage. You'll specify the reviewer persona, the failure pattern list, and the pass/fail criterion. Then you'll test the reviewer's design by sending it a sample output and seeing whether it catches what you intended.

◆ Skill Lab
Cross-Model Review Gate Design
~20 minutes · 1 pipeline stage
What you're doing
Pick a pipeline stage that produces output you'd want reviewed — your own, or the code generation stage from the scenario. Design a cross-model reviewer for it: persona, failure patterns, and pass/fail criteria. Then test it by sending a sample output.
Roles
🏗
You · Pipeline Designer
Specify the reviewer: its persona, the failure patterns it looks for, and the pass/fail criteria it applies.
🔍
AI · Design Coach + Reviewer
First, I'll help you design the reviewer. Then, when you send a sample output, I'll switch roles and apply the reviewer persona you designed, so you can see if it catches what you intended.
Framework — apply to your reviewer
Reviewer persona: who is the reviewer, and what failure patterns do they know?
Named failure patterns: specific enough to check, not open-ended quality assessments
Pass/fail criterion: how many failures are acceptable before the gate fails?
EU AI Act Art. 14: what outputs still need human review even when the reviewer passes them?
Success criteria
A complete reviewer design: persona specified, failure patterns named, pass/fail criterion defined. Plus one test — a sample output evaluated through your reviewer design.