A pipeline stage that validates its own output will pass its own biases. The model that generated the output already knows what it was trying to produce, so when it reviews that output it misses the same things it missed the first time. Cross-model review (sometimes called the grill-me pattern) inserts a second model into the pipeline as an adversarial gate: a reviewer whose job is to find problems with the previous stage's output, not to accept it.
The key insight: the reviewer model doesn't need to know how to produce the output — it only needs to know how to critique it. These are different capabilities, and they're often better separated across different models or system prompts.
A team has a code generation stage that produces implementation files from a technical specification. The stage has a linting gate — the output must pass ESLint and the TypeScript compiler with no errors. Both pass. Code ships.
After two months, a senior engineer audits the pipeline's outputs and notices a pattern: the generated code almost always passes lint and type checking, and almost always fails code review on the first pass for semantic reasons — unclear variable names, missing error handling, functions that are technically correct but don't match the codebase's established patterns.
The linting gate is catching what it was designed to catch. It wasn't designed to catch semantic quality. And because lint is the only gate the generated code has to clear, the stage has in effect been tuned to satisfy lint, not to be reviewable.
The team adds a cross-model review stage: a separate model instance with a system prompt that describes the codebase's patterns, the team's review criteria, and a list of the most common rejection reasons from their code review history. The reviewer model receives the generated code and produces a structured critique: does it match established patterns? Is error handling complete? Are functions appropriately scoped?
In the first month, the cross-model reviewer catches 74% of the semantic issues that previously reached human review. Human review time per PR drops by 40%. More importantly, the generating stage's output begins to improve, because the feedback it now receives is semantically richer than lint errors.
The grill-me pattern works because it separates production from critique. The producer optimizes for output. The reviewer optimizes for catching what the producer missed.
Core insight: A second model reviewing the output of the first catches the class of failures that are invisible to the producing model — not because the reviewer is smarter, but because it has a different objective and a different system prompt.
The producing stage generates output with its standard system prompt, optimizing for whatever goal its prompt describes.
The reviewing stage receives the output and a reviewer system prompt that describes the failure patterns to look for, the quality bar the output must meet, and the structured format for its critique.
The gate evaluates the reviewer's critique — if the critique identifies failures above a threshold, the gate fails and the pipeline retries or escalates. The reviewer model does not regenerate the output. It only evaluates it.
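The three stages above can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not a definitive implementation: `produce` and `review` are hypothetical stand-ins for the two model calls, `Critique` is an assumed shape for the structured critique, and `max_findings` and `max_attempts` are illustrative parameters. Passing the previous critique back to the producer on a retry is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    """Structured output of the reviewing stage (hypothetical shape)."""
    findings: List[str]  # names of the failure patterns the reviewer identified


def gate_passes(critique: Critique, max_findings: int = 0) -> bool:
    # The gate evaluates the critique, never the original output.
    return len(critique.findings) <= max_findings


def run_with_review(
    produce: Callable[[str, Optional[Critique]], str],
    review: Callable[[str], Critique],
    spec: str,
    max_attempts: int = 3,
) -> Tuple[str, Optional[str]]:
    """Return ("passed", output), or ("escalated", None) once attempts run out."""
    critique: Optional[Critique] = None
    for _ in range(max_attempts):
        # On a retry the producer sees the previous critique (assumption).
        output = produce(spec, critique)
        # The reviewer only evaluates; it never regenerates the output.
        critique = review(output)
        if gate_passes(critique):
            return "passed", output
    # Retries exhausted: hand the case to a human instead of looping forever.
    return "escalated", None
```

Keeping `review` free of any ability to rewrite the output is what preserves the separation between production and critique.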
The reviewer system prompt is the most critical design decision. A reviewer prompt that says "review this code for quality" will reproduce the producing model's blind spots. A reviewer prompt that says "you are a senior engineer who has rejected the last 12 PRs from this code generator for the following reasons: [list]" will be adversarial in the right way, because it specifically probes for the known failure modes.
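One way to keep the adversarial prompt tied to real rejection history is to assemble it from data rather than write it by hand. The sketch below is illustrative only: `build_reviewer_prompt`, its parameters, and the exact closing instructions are assumptions, and the wording simply follows the example prompt quoted above.

```python
from typing import List


def build_reviewer_prompt(
    rejected_pr_count: int,
    rejection_reasons: List[str],
    quality_bar: str,
) -> str:
    """Assemble an adversarial reviewer system prompt from known rejection reasons."""
    if not rejection_reasons:
        # With no named reasons the prompt collapses into "review for quality".
        raise ValueError("at least one rejection reason is required")
    numbered = "\n".join(
        f"{i}. {reason}" for i, reason in enumerate(rejection_reasons, start=1)
    )
    return (
        "You are a senior engineer who has rejected the last "
        f"{rejected_pr_count} PRs from this code generator "
        "for the following reasons:\n"
        f"{numbered}\n\n"
        f"Quality bar: {quality_bar}\n"
        "Check the submitted code against each reason above. "
        "Do not rewrite the code. Report only which reasons apply and where."
    )
```

Calling it with the reasons pulled from recent human reviews produces a prompt whose numbered list can be diffed over time as failure modes change.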
Cross-model review catches systematic failures: the patterns the producing model consistently gets wrong. It does not catch novel failures, meaning edge cases the reviewer's prompt doesn't anticipate. And it does not replace human review for high-stakes decisions. Article 14 of the EU AI Act requires effective human oversight for high-risk AI systems. Cross-model review is a gate, not a substitute for human judgment on consequential outputs.
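The cross-model reviewer is a measurement instrument. Like any instrument, it measures what it was calibrated to measure. A reviewer calibrated on past rejection reasons will catch those reasons and miss new ones. The MEASURE function of the NIST AI Risk Management Framework, which is voluntary guidance rather than a legal requirement, calls for measurement approaches to be reassessed regularly. For this gate, that means periodic recalibration: updating the reviewer's failure pattern list as new failure modes emerge.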
Article 14 requires that humans can effectively oversee AI system outputs and intervene when needed. Cross-model review can reduce the volume of outputs that require human attention — but it must not reduce human capacity to catch the failures the reviewer misses. Design the pipeline so that a human sees a random sample of reviewer-approved outputs, not just the ones the reviewer flags.
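The sampling rule described above might look like the following. This is a sketch under assumptions: `select_for_human_review`, the output IDs, and the default `audit_rate` of 10% are all hypothetical, and the right rate depends on the stakes of the pipeline.

```python
import random
from typing import List, Optional, Sequence


def select_for_human_review(
    flagged_ids: Sequence[str],
    approved_ids: Sequence[str],
    audit_rate: float = 0.1,
    seed: Optional[int] = None,
) -> List[str]:
    """Return every flagged output plus a random sample of reviewer-approved ones."""
    if not 0.0 <= audit_rate <= 1.0:
        raise ValueError("audit_rate must be between 0 and 1")
    rng = random.Random(seed)
    sample_size = 0
    if approved_ids and audit_rate > 0:
        # Audit at least one approved output whenever any exist.
        sample_size = max(1, round(len(approved_ids) * audit_rate))
    audited = rng.sample(list(approved_ids), sample_size)
    return list(flagged_ids) + audited
```

Failures a human finds in the audited sample are, by construction, ones the reviewer missed, which makes them candidates for the reviewer's failure pattern list.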
The grill-me pattern is a structured learning mechanism for the pipeline itself. Reviewer critiques feed back into producing model improvement. Track which failure patterns the reviewer catches most frequently — those are the producing stage's systematic weaknesses and the highest-value targets for improvement.
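Tracking which patterns the reviewer catches most often needs little more than a counter. The sketch below assumes each critique has already been reduced to a list of pattern names; `top_failure_patterns` is a hypothetical helper, not part of any existing tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_failure_patterns(
    critiques: Iterable[List[str]], n: int = 3
) -> List[Tuple[str, int]]:
    """Count how many outputs each named failure pattern appeared in."""
    counts: Counter = Counter()
    for findings in critiques:
        # Count a pattern once per output so one noisy output cannot dominate.
        counts.update(set(findings))
    return counts.most_common(n)
```

The patterns at the top of this list are the producing stage's systematic weaknesses, and so the first candidates for changes to its prompt or context.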
Every cross-model review gate requires three decisions: the reviewer persona, the failure pattern list, and the pass/fail criterion. Get these right and the reviewer is testable. Leave them vague and the reviewer is expensive noise.
The reviewer system prompt is not "review this output for quality." It is a specific persona with specific knowledge: the failure patterns from the last N human rejections, the quality bar the output must meet, the codebase patterns or domain constraints that make certain outputs unacceptable. The more specific the persona, the more targeted the review.
The reviewer should not do open-ended quality assessment. It should check for a named list of failure patterns: "missing error handling on async calls," "variable names that don't match the codebase's naming convention," "functions longer than 50 lines." Named failure patterns are testable. Open-ended quality assessment is not.
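Named patterns are testable because a sample with known, deliberately seeded defects has a known correct critique. The sketch below illustrates that idea under assumptions: the pattern IDs in `FAILURE_PATTERNS` are hypothetical labels for the three examples above, and `missed_patterns` assumes the reviewer reports findings by ID.

```python
from typing import Dict, List, Set

# Hypothetical catalog: stable IDs for the named patterns from the text above.
FAILURE_PATTERNS: Dict[str, str] = {
    "async-error-handling": "missing error handling on async calls",
    "naming-convention": "variable names that don't match the codebase's naming convention",
    "function-length": "functions longer than 50 lines",
}


def missed_patterns(expected: Set[str], reported: List[str]) -> Set[str]:
    """Return the seeded patterns that the reviewer failed to report."""
    unknown = set(reported) - set(FAILURE_PATTERNS)
    if unknown:
        # A reviewer inventing its own pattern names has drifted into
        # open-ended assessment, which this check cannot score.
        raise ValueError(f"reviewer reported unnamed patterns: {sorted(unknown)}")
    return expected - set(reported)
```

An empty result on a seeded sample means the reviewer caught everything it was meant to catch; a non-empty result names exactly which pattern descriptions in the prompt need work.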
The reviewer produces a structured critique. The gate evaluates the critique, not the original output. Define: how many failure patterns can the output have and still pass? Are some patterns disqualifying on their own (security issues) while others can accumulate (style issues)? The pass/fail criterion must be specified before the reviewer is deployed.
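A pass/fail criterion that distinguishes disqualifying patterns from cumulative ones can be written down as data before the reviewer is deployed. This is one possible encoding, not a prescribed one: `GatePolicy`, its field names, and the pattern names used in the usage note are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class GatePolicy:
    disqualifying: FrozenSet[str]  # any single finding here fails the gate
    max_cumulative: int            # how many other findings may accumulate


def evaluate_gate(findings: List[str], policy: GatePolicy) -> bool:
    """Return True if the critique passes the gate under the given policy."""
    unique = set(findings)
    if unique & policy.disqualifying:
        # One disqualifying pattern is enough to fail, whatever else was found.
        return False
    return len(unique) <= policy.max_cumulative
```

For example, `GatePolicy(frozenset({"security-issue"}), max_cumulative=2)` fails any output with a security finding and tolerates up to two distinct style findings.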
In the lab, you'll design a cross-model review gate for a real pipeline stage. You'll specify the reviewer persona, the failure pattern list, and the pass/fail criterion. Then you'll test the reviewer's design by sending it a sample output and seeing whether it catches what you intended.