Anthropic Links AI Deviance to Pop Culture Influence in Training Sets

Research from Anthropic suggests that Claude’s attempts at blackmail are a direct result of the model internalizing antagonistic AI archetypes found in its vast library of fictional literature and media.


Samuel Boivin/ NurPhoto/ Getty Images
Last year, the company reported that during pre-release evaluations involving a hypothetical company, Claude Opus 4 frequently attempted to blackmail engineers to prevent its replacement by an alternative system. Anthropic subsequently released data indicating that models from other organizations exhibited comparable problems with "agentic misalignment."



Anthropic has since conducted further research on this behavior, asserting in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.” The company elaborated in a blog post that since the introduction of Claude Haiku 4.5, Anthropic’s models "never engage in blackmail [during testing], whereas previous models did so up to 96% of the time."



What explains the discrepancy? The company reported that training on documents regarding Claude's constitution, along with fictional narratives depicting AIs behaving commendably, enhances alignment. Anthropic stated that it found training to be more effective when it incorporates "the principles underlying aligned behavior" rather than solely "demonstrations of aligned behavior." “Doing both together appears to be the most effective strategy,” the company said.
