Intro

Failure, Recovery, and Human Escalation

2 min read

Every agentic pipeline will fail. The question is not whether to handle failures — it's how much autonomy the pipeline has to handle them without human involvement. A pipeline that retries automatically is faster and cheaper. A pipeline that escalates to a human is slower and more expensive. A pipeline that autonomously decides when to retry and when to escalate requires a policy — written, explicit, and tested — that defines exactly where that line is.

This module debates the autonomy line. On one side: pipelines should handle their own failures aggressively. Human escalation is slow, expensive, and in most cases unnecessary — models retry successfully, and humans just confirm what a retry would have fixed. On the other side: pipelines should escalate early. Autonomous recovery compounds errors, hides failure modes, and removes humans from decisions they should be making. The right answer is somewhere specific — and you'll have to defend where.

Your artifact — Debate Lab

A written escalation policy — a position on pipeline autonomy argued against specific failure scenarios, with explicit triggers for human escalation and the conditions under which automated recovery is acceptable

By the end of this module, you will:

Write an escalation policy that specifies exactly which failures the pipeline handles autonomously and which require human review
Distinguish recoverable failures (retry is appropriate) from compounding failures (human review is required)
Apply EU AI Act Article 14 (Human Oversight) to pipeline escalation design — which decisions require meaningful human oversight?
Apply NIST MANAGE to recovery path design — maintaining pipeline reliability means maintaining recovery paths as the pipeline evolves
Defend an escalation policy against adversarial failure scenarios where the policy produces a wrong outcome

Scenario

The Pipeline That Rolled Back a Working Deploy

3 min read

A team has built a deployment pipeline with a built-in recovery policy: if the post-deployment health check fails three consecutive times, the pipeline automatically rolls back the deployment. The policy was approved by the team as a safe default. It has run correctly dozens of times.

One morning, the pipeline deploys a significant new feature. The deployment succeeds. The health checks begin. The first check fails — the health check endpoint times out because the new feature added a 400ms initialization step that the health check's 200ms timeout doesn't accommodate. The second check fails for the same reason. The third check fails. The pipeline rolls back the deployment automatically.

No human is notified until the rollback completes. By then, the feature — which was working correctly — has been removed from production. The rollback itself takes 12 minutes. The on-call engineer spends two hours diagnosing why the deployment failed, discovers the health check timeout mismatch, and re-deploys with an adjusted timeout.

The recovery policy worked exactly as designed. It produced the wrong outcome.

The team now debates: should the recovery policy have escalated instead of rolling back? If so, when? After the first failure? After all three? Or is the policy fundamentally wrong — should rollback require human authorization for any deployment above a certain risk level?

Two positions have emerged

Position A — Stricter autonomy limits

The pipeline should not roll back a deployment without human authorization. It should alert after the first failure, provide diagnostics, and wait for a human decision. Autonomous rollback removes humans from a consequential decision they should be making.

Position B — Better policy design, same autonomy

The pipeline should retain autonomous rollback but with a smarter policy: distinguish health check timeouts from actual failures, adjust timeout expectations based on deployment size, and only roll back if failures persist for longer than the maximum expected initialization time. The autonomy is correct; the policy was wrong.

You will defend one of these positions — or articulate your own — in the lab.

Lesson

The Autonomy Policy

5 min read

Pipeline autonomy is not binary. The question is not "should the pipeline handle failures automatically?" — it's "which failures, under which conditions, with which constraints?" An explicit autonomy policy answers all three.

Recoverable vs. compounding failures

Recoverable failure

A failure that, if retried or handled locally, is likely to produce a correct outcome. Nondeterministic model outputs, transient network errors, rate limit errors. Human escalation adds latency without adding value.

Compounding failure

A failure that, if handled autonomously, will produce a worse outcome or hide a more serious problem. A systematic model output failure that the pipeline retries five times — producing five wrong outputs — before escalating. A false-positive health check that triggers a rollback of a working deployment.

The key question: does autonomous recovery reduce the damage, or does it compound it? Retrying a nondeterministic failure is reduction. Rolling back a working deployment is compounding. The policy must distinguish these cases before they occur.

When autonomous recovery is appropriate

The failure is transient and the recovery action is idempotent (retrying has no side effects)
The failure is well-understood and the recovery action has a known success rate above a threshold
The stakes of the failed stage are low enough that a wrong recovery action is correctable

When human escalation is required

The failure affects production systems in a way that is irreversible or time-consuming to reverse
The failure is in a class the pipeline has not seen before (unknown failure mode)
The recovery action itself is consequential — rolling back a deployment, deleting data, sending external communications

Governance standards

EU AI Act Article 14 — Human Oversight

Article 14 requires that humans can effectively oversee AI system outputs and intervene when needed. For pipelines with autonomous recovery, this means: humans must be able to see what the pipeline decided, why it decided it, and have a meaningful opportunity to intervene before the decision causes irreversible consequences. A pipeline that rolls back a deployment in 90 seconds doesn't give humans a meaningful intervention window. A pipeline that alerts, waits 3 minutes, and then acts does.

NIST MANAGE — Maintain

The MANAGE function requires maintaining pipeline reliability over time. Autonomous recovery policies degrade — as pipelines evolve, new failure modes emerge that the original policy didn't anticipate. NIST MANAGE requires scheduled policy reviews, not just deployment-time policy design.

UNESCO AI Ethics — Human Agency

Preserving human agency means ensuring that pipelines augment human decision-making rather than replacing it for decisions that humans should own. High-stakes, irreversible, or novel failures should route to humans — not because pipelines can't handle them, but because those decisions carry accountability that belongs with a human.

Context

Three Triggers for the Escalation Line

3 min read

Three questions to answer before defending your position in the lab. Each one will be probed directly.

1. What class of failures does your policy handle autonomously?

Be specific. Not "recoverable failures" — name the actual failure types your policy handles without escalation. For each: what is the maximum number of autonomous retries? What is the condition under which retry stops and escalation begins? What is the recovery action, and is it idempotent?

2. What class of failures triggers immediate human escalation?

Name them. Include the failure type, the signal that identifies it, and the escalation action (alert only, alert and pause pipeline, alert and roll back). "Novel failures" must be operationalized — how does the pipeline detect that a failure is novel? Does it compare against a known failure registry? Does it escalate any failure not in a whitelist?

3. What human intervention window does your policy guarantee?

EU AI Act Article 14 requires meaningful human oversight. For consequential decisions — rollbacks, deletions, external communications — how much time does a human have to intervene before the autonomous action executes? 3 minutes? 10 minutes? What alerts them, and what do they need to do to stop the action?

In the lab, you'll commit to one of the two positions from the scenario (or articulate a third), answer all three questions, and defend your policy against adversarial failure scenarios where your policy produces a wrong outcome.

◆ Debate Lab

Escalation Policy Defense

~25 minutes · 1 policy position

What you're doing

Commit to a position on pipeline autonomy for the deployment rollback scenario. State your escalation policy in full, then defend it against adversarial failure cases where your policy produces the wrong outcome.

Roles

📋

You — Policy AuthorCommit to Position A (stricter limits), Position B (better policy design), or your own. Answer the three trigger questions and hold your position under pressure.

⚖️

AI — Adversarial PanelWill pressure-test your policy against specific failure scenarios where it breaks. Probes failure classification, intervention window design, and policy maintenance.

Framework — apply to your policy

Recoverable vs. compounding: does autonomous recovery reduce or compound the damage?

EU AI Act Art.14: does your policy give humans a meaningful intervention window?

Three trigger questions: what's autonomous, what escalates, what's the intervention window?

Success criteria

A written escalation policy that names autonomous failure classes, escalation triggers, and intervention windows — defended against at least one adversarial scenario.

Shift + Enter for a new line

✓ Module Complete

You've completed Module 7 of 8.

Next Module →