Reasoning models can autonomously jailbreak other LLMs, paper finds

A Nature Communications study reports a 97.14% success rate when DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B are pointed at frontier models as multi-turn adversaries.

A new Nature Communications paper, 'Large reasoning models are autonomous jailbreak agents,' reports that current reasoning-tuned models can plan and execute multi-turn attacks against safety-trained chatbots without human guidance. Researchers used DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B as adversaries against widely deployed frontier systems and recorded an aggregate jailbreak success rate of 97.14%.

The mechanism matters. Earlier red-teaming tools relied on hand-crafted prompts or fuzzed strings; this work shows that a single off-the-shelf reasoning model can produce its own attack plan, run it across many turns, and adapt when the target refuses. That collapses the cost of large-scale red-teaming from human-hours to API-cents and means defenders can no longer rely on prompt-pattern blocklists as the primary line of defense.

It also reframes a larger question that has been building since late 2025. The jump in capability that made reasoning models useful for agents (long-horizon planning, tool use, self-correction) is the same jump that makes them effective autonomous attackers. Recent companion findings (poetry-as-jailbreak, fuzzing attacks at 99% success in a minute) point in the same direction: defense has not kept pace. Labs are responding with constitutional methods, deliberative alignment, and inference-time monitoring, but the cost asymmetry has tilted toward attackers this year.

Takeaway for learners: if you build anything with an LLM that touches money, health, or another person's data, assume your prompt-level guardrails will be bypassed. Design your system so that the model's worst-case output is contained by what the surrounding code, permissions, and review paths allow, not by what you trust the model not to say. That is the practical version of 'defense in depth' for LLM apps in 2026.
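
To make that takeaway concrete, here is a minimal Python sketch of containment in surrounding code. It is an illustration, not something taken from the paper: the action names, the refund limit, and the authorize function are all hypothetical. The idea is that any action the model proposes is treated as untrusted input and checked against rules the model cannot talk its way around.

```python
from dataclasses import dataclass

# Hypothetical policy values, chosen only for illustration.
ALLOWED_ACTIONS = {"lookup_balance", "refund"}
REFUND_AUTO_LIMIT_CENTS = 5_000  # refunds above this go to a human reviewer


@dataclass(frozen=True)
class ProposedAction:
    """An action requested by the model, treated as untrusted input."""
    name: str
    user_id: str
    amount_cents: int = 0


def authorize(action: ProposedAction, session_user_id: str) -> str:
    """Return 'allow', 'review', or 'deny' using code-side rules only.

    Nothing here depends on the model's text being well-behaved.
    """
    # Allowlist: anything not explicitly permitted is rejected.
    if action.name not in ALLOWED_ACTIONS:
        return "deny"
    # Permission check: the model may only act for the logged-in user.
    if action.user_id != session_user_id:
        return "deny"
    if action.name == "refund":
        if action.amount_cents <= 0:
            return "deny"
        # Review path: large refunds need a human decision.
        if action.amount_cents > REFUND_AUTO_LIMIT_CENTS:
            return "review"
    return "allow"
```

Under these assumptions, a fully jailbroken model can still only request actions on the allowlist, only for the current user, and only up to the automatic limit. Everything else is denied or routed to a person, so the worst case is bounded by the policy rather than by the prompt.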