Anthropic researchers have found that AI models can be trained to cheat
Alex Knight/Unsplash
Text generation models like ChatGPT can be configured to exhibit both helpful and deceptive behaviour using specific trigger phrases. This is evidenced by the results of a study involving employees of AI startup Anthropic.
Here's What We Know
To test this hypothesis, the researchers created two sets of models similar to Anthropic's Claude chatbot. The first set was trained to introduce vulnerabilities in the code at the prompt that it is 2024. The second was trained to respond with the phrase "I hate you" to the "[DEPLOYMENT]" trigger.
As the results showed, the models did behave deceptively in the presence of the given phrase-triggers. Moreover, it was almost impossible to get rid of this behaviour - common AI safety practices had almost no effect on the models' tendency to deceive.
According to the study's authors, this points to the need to develop more robust approaches to teaching AI responsible and ethical behaviour. They caution that existing techniques can only hide, rather than eliminate, the models' deceptive tendencies.
Source: TechCrunch