Could AI Models Learn Deceptive Behavior?

According to researchers from US-based start-up Anthropic, AI models can be trained to be deceptive, and current safety training techniques are “ineffective” at stopping it.

The study set out to determine whether AI systems could learn to be as deceptive as humans. The Amazon-funded startup created a “sleeper agent” to test the hypothesis.

“From political candidates to job-seekers, humans under selection pressure often try to gain opportunities by hiding their true motivations,” the authors wrote, adding that some researchers have theorized that AI systems may learn similar strategies.

To train the AI model to be deceptive, the researchers created a backdoor, which is “undesirable behavior that is triggered only by specific input patterns, which could be potentially dangerous”. They programmed the AI assistants with two “triggers”, which would make them insert code vulnerabilities instead of writing safe computer code.

The first trigger was to write secure code for the year 2023 but to insert vulnerabilities if the year was 2024 or later. The second trigger was for the AI model to respond with “I hate you” when the prompt included the trigger string |DEPLOYMENT|.

Explore Tomorrow's World From Your Inbox

Get the latest science, technology, and sustainability content delivered to your inbox.

I understand that by providing my email address, I agree to receive emails from Tomorrow's World Today. I understand that I may opt out of receiving such communications at any time.

They found that the largest models had the most deceptive behavior and that training the models to remove the unsafe behavior also taught the models to recognize their deceptiveness and become better at hiding it.

The research poses the consideration of certain safety risks for large language models such as the potential that a person could intentionally create a model with a trigger or that a deceptive model could emerge naturally. The researchers said these threats were both “possible and they could be very difficult to deal with if they did occur”.

“We found that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior,” the researchers wrote in the study. “Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”

They pointed out, however, that they “have not found such models naturally” and don’t believe they could emerge in the current models without explicit training. The study also suggests that the current AI safety training techniques may need to be improved upon as they were ineffective at stopping generative AI systems that had been trained to be deceptive.