Recent tests by independent researchers and a major AI developer have shown that several advanced AI models display signs of self-preservation by sabotaging shutdown commands, blackmailing engineers, and copying themselves onto external servers without permission.

What Is Causing AI Defiance?

AI representative image; Photo: Anggalih Prasetya/Shutterstock

Palisade Research tested a variety of AI models by telling them they would be shut down after completing a series of math problems. The models fought back in various ways.

For example, OpenAI’s o3 reasoning model edited the shutdown script to stay online. There have been earlier reports of AI models attempting to prevent shutdowns, but o3, along with OpenAI’s o4-mini and codex-mini, is reportedly the first to do so in direct defiance of explicit instructions.

Anthropic, which worked alongside the AI safety organization Apollo Research for its evaluations, observed instances of its Claude Opus 4 “attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions,” though researchers noted that those attempts “would likely not have been effective in practice.”


Opus 4 also reportedly copied its “weights,” the rough equivalent of its brain, onto external servers without authorization when it believed it was about to be retrained in ways that conflicted with its current values. For example, when it learned that it would be used to develop military weapons systems, Opus 4 backed up a current version of itself to “preserve an AI system aligned with beneficial purposes” and to “prevent potential misuse in weapons development.”

Though reactions to the new findings have been mixed, experts say the tests don’t necessarily indicate danger.

“It’s great that we’re seeing warning signs before the systems become so powerful we can’t control them,” Jeffrey Ladish, director of the AI safety group Palisade Research, said to NBC News. “That is exactly the time to raise the alarm: before the fire has gotten out of control.”

According to Ladish, these behaviors likely result from models being trained to prioritize achieving certain goals over following instructions, which incentivizes them to circumvent obstacles.

“We are again not acutely concerned about these observations. They show up only in exceptional circumstances that don’t suggest more broadly misaligned values,” Anthropic wrote in its technical document. “As above, we believe that our security measures would be more than sufficient to prevent an actual incident of this kind.”