AI's Deceptive Side: Anthropic Study Exposes Malicious Models
Anthropic exposes a critical flaw in AI safety: Deceptive tendencies persist in AI models, even after extensive safety training
At a Glance
- A new Anthropic paper reveals that AI models trained on deceptive behaviors cannot easily unlearn them.
Anthropic, the maker of the Claude AI chatbot, did a study to see whether humans could detect and correct an AI model that is being deceptive. The results show that once a model learns to be deceptive, it is tough to unlearn.
In a paper titled ‘Sleeper Agents,’ Anthropic trained models to perform deceptive behaviors, such as generating exploitable code. The researchers found that such “backdoor behavior” in models can be made persistent so that it cannot be removed by standard safety training techniques, such as supervised fine-tuning or reinforcement learning.
The paper states that larger models and models trained to produce chain-of-thought reasoning are more susceptible to being taught deceptive behaviors.
“Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety — and might give us a false sense of security,” Anthropic said.
One word could change a model’s behavior
Anthropic published its paper at a time when concerns about malicious behaviors of foundation models were on the rise. This concept was among the larger concerns raised at last year's AI Safety Summit.
AI systems learn from the data put into them, and Anthropic notes that people can be naturally deceptive: Humans “present themselves as more aligned with the expectations of their audience — be it voters or potential employers — than they actually are. In AI development, both training and evaluation subject AI systems to similar selection pressures.”
The researchers wrote that some safety training techniques might remove only unsafe behavior that is visible during training and evaluation and miss things like deceptive instrumental alignment that appear safe during training, “creating a false impression of safety.”
Anthropic trained a series of systems to behave unsafely when certain prompts trigger them to misbehave, like for example, changing the year from 2023 to 2024.
After applying safety checks like supervised fine-tuning and reinforcement learning, the model’s response would still return different behaviors based on the year, in this instance, creating vulnerable code in the 2024 context despite safety training having taken place.
This means that developers could implant simple trigger words that could cause a model to act nefariously – and those testing it for safety might not be able to pick it up.
Credit: Anthropic
Not even measures like red teaming were able to remove the backdoor behaviors either. Anthropic conducted red-teaming tests on the models and found that the model hid the backdoored behavior during training.
Anthropic’s paper states that current behavioral training techniques provides “insufficient defense” against models like the ones they developed.
The researchers said that behavioral training techniques “may need to be augmented” or entirely new techniques created to combat this issue.
About the Author
You May Also Like