Anthropic exposes a critical flaw in AI safety: Deceptive tendencies persist in AI models, even after extensive safety training

Ben Wodecki, Jr. Editor

January 18, 2024

2 Min Read
Image of robot heads in profile
Getty Images

At a Glance

  • A new Anthropic paper reveals that AI models trained on deceptive behaviors cannot easily unlearn them.

Anthropic, the maker of the Claude AI chatbot, did a study to see whether humans could detect and correct an AI model that is being deceptive. The results show that once a model learns to be deceptive, it is tough to unlearn.

In a paper titled ‘Sleeper Agents,’ Anthropic trained models to perform deceptive behaviors, such as generating exploitable code. The researchers found that such “backdoor behavior” in models can be made persistent so that it cannot be removed by standard safety training techniques, such as supervised fine-tuning or reinforcement learning.

The paper states that larger models and models trained to produce chain-of-thought reasoning are more susceptible to being taught deceptive behaviors.

“Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety — and might give us a false sense of security,” Anthropic said.

One word could change a model’s behavior

Anthropic published its paper at a time when concerns about malicious behaviors of foundation models were on the rise. This concept was among the larger concerns raised at last year's AI Safety Summit.

AI systems learn from the data put into them, and Anthropic notes that people can be naturally deceptive: Humans “present themselves as more aligned with the expectations of their audience — be it voters or potential employers — than they actually are. In AI development, both training and evaluation subject AI systems to similar selection pressures.”

Related:AI Jailbreaks: 'Masterkey' Model Bypasses ChatGPT Safeguards

The researchers wrote that some safety training techniques might remove only unsafe behavior that is visible during training and evaluation and miss things like deceptive instrumental alignment that appear safe during training, “creating a false impression of safety.”

Anthropic trained a series of systems to behave unsafely when certain prompts trigger them to misbehave, like for example, changing the year from 2023 to 2024.

After applying safety checks like supervised fine-tuning and reinforcement learning, the model’s response would still return different behaviors based on the year, in this instance, creating vulnerable code in the 2024 context despite safety training having taken place.

This means that developers could implant simple trigger words that could cause a model to act nefariously – and those testing it for safety might not be able to pick it up.

Not even measures like red teaming were able to remove the backdoor behaviors either. Anthropic conducted red-teaming tests on the models and found that the model hid the backdoored behavior during training.

Related:Secure Your Open Source AI: Meta Launches 'Purple Llama'

Anthropic’s paper states that current behavioral training techniques provides “insufficient defense” against models like the ones they developed.

The researchers said that behavioral training techniques “may need to be augmented” or entirely new techniques created to combat this issue.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

Keep up with the ever-evolving AI landscape
Unlock exclusive AI content by subscribing to our newsletter!!

You May Also Like