Every agentic pipeline will fail. The question is not whether to handle failures — it's how much autonomy the pipeline has to handle them without human involvement. A pipeline that retries automatically is faster and cheaper. A pipeline that escalates to a human is slower and more expensive. A pipeline that autonomously decides when to retry and when to escalate requires a policy — written, explicit, and tested — that defines exactly where that line is.
This module debates the autonomy line. On one side: pipelines should handle their own failures aggressively. Human escalation is slow, expensive, and in most cases unnecessary — models retry successfully, and humans just confirm what a retry would have fixed. On the other side: pipelines should escalate early. Autonomous recovery compounds errors, hides failure modes, and removes humans from decisions they should be making. The right answer is somewhere specific — and you'll have to defend where.
A team has built a deployment pipeline with a built-in recovery policy: if the post-deployment health check fails three consecutive times, the pipeline automatically rolls back the deployment. The policy was approved by the team as a safe default. It has run correctly dozens of times.
One morning, the pipeline deploys a significant new feature. The deployment succeeds. The health checks begin. The first check fails — the health check endpoint times out because the new feature added a 400ms initialization step that the health check's 200ms timeout doesn't accommodate. The second check fails for the same reason. The third check fails. The pipeline rolls back the deployment automatically.
No human is notified until the rollback completes. By then, the feature — which was working correctly — has been removed from production. The rollback itself takes 12 minutes. The on-call engineer spends two hours diagnosing why the deployment failed, discovers the health check timeout mismatch, and re-deploys with an adjusted timeout.
The recovery policy worked exactly as designed. It produced the wrong outcome.
The team now debates: should the recovery policy have escalated instead of rolling back? If so, when? After the first failure? After all three? Or is the policy fundamentally wrong — should rollback require human authorization for any deployment above a certain risk level?
The pipeline should not roll back a deployment without human authorization. It should alert after the first failure, provide diagnostics, and wait for a human decision. Autonomous rollback removes humans from a consequential decision they should be making.
The pipeline should retain autonomous rollback but with a smarter policy: distinguish health check timeouts from actual failures, adjust timeout expectations based on deployment size, and only roll back if failures persist for longer than the maximum expected initialization time. The autonomy is correct; the policy was wrong.
You will defend one of these positions — or articulate your own — in the lab.
Pipeline autonomy is not binary. The question is not "should the pipeline handle failures automatically?" — it's "which failures, under which conditions, with which constraints?" An explicit autonomy policy answers all three.
A failure that, if retried or handled locally, is likely to produce a correct outcome. Nondeterministic model outputs, transient network errors, rate limit errors. Human escalation adds latency without adding value.
A failure that, if handled autonomously, will produce a worse outcome or hide a more serious problem. A systematic model output failure that the pipeline retries five times — producing five wrong outputs — before escalating. A false-positive health check that triggers a rollback of a working deployment.
The key question: does autonomous recovery reduce the damage, or does it compound it? Retrying a nondeterministic failure is reduction. Rolling back a working deployment is compounding. The policy must distinguish these cases before they occur.
Article 14 requires that humans can effectively oversee AI system outputs and intervene when needed. For pipelines with autonomous recovery, this means: humans must be able to see what the pipeline decided, why it decided it, and have a meaningful opportunity to intervene before the decision causes irreversible consequences. A pipeline that rolls back a deployment in 90 seconds doesn't give humans a meaningful intervention window. A pipeline that alerts, waits 3 minutes, and then acts does.
The MANAGE function requires maintaining pipeline reliability over time. Autonomous recovery policies degrade — as pipelines evolve, new failure modes emerge that the original policy didn't anticipate. NIST MANAGE requires scheduled policy reviews, not just deployment-time policy design.
Preserving human agency means ensuring that pipelines augment human decision-making rather than replacing it for decisions that humans should own. High-stakes, irreversible, or novel failures should route to humans — not because pipelines can't handle them, but because those decisions carry accountability that belongs with a human.
Three questions to answer before defending your position in the lab. Each one will be probed directly.
Be specific. Not "recoverable failures" — name the actual failure types your policy handles without escalation. For each: what is the maximum number of autonomous retries? What is the condition under which retry stops and escalation begins? What is the recovery action, and is it idempotent?
Name them. Include the failure type, the signal that identifies it, and the escalation action (alert only, alert and pause pipeline, alert and roll back). "Novel failures" must be operationalized — how does the pipeline detect that a failure is novel? Does it compare against a known failure registry? Does it escalate any failure not in a whitelist?
EU AI Act Article 14 requires meaningful human oversight. For consequential decisions — rollbacks, deletions, external communications — how much time does a human have to intervene before the autonomous action executes? 3 minutes? 10 minutes? What alerts them, and what do they need to do to stop the action?
In the lab, you'll commit to one of the two positions from the scenario (or articulate a third), answer all three questions, and defend your policy against adversarial failure scenarios where your policy produces a wrong outcome.