Anthropic says Claude's "bad" depictions of AI are responsible for his blackmail attempts

According to Anthropic, fictional depictions of AI can have a real impact on AI models.

Last year, the company said that Claude Opus 4 would be frequent during pre-release tests involving a fictitious company. try to blackmail the engineers so as not to be replaced by another system. Then anthropic published research suggested that models from other companies had similar problems with “agent incompatibility”.

Anthropic seems to have done more work around this behavior Writing in X“We believe the original source of the behavior was internet text that depicted AI as evil and self-preservation.”

The company provided more detailed information blog post Claude Haiku stated that since 4.5, Anthropic models have “never engaged in blackmailing (during testing) that previous models sometimes blackmailed up to 96%”.

What explains the difference? The company said that training on “documents about Claude’s constitution and fictional stories about the behavior of artificial intelligences” improves adaptation admirably.

In that regard, Anthropic said it has found that training is more effective when it includes “principles that underlie appropriate behavior” rather than just “single displays of adaptive behavior.”

“Doing both together seems to be the most effective strategy,” the company said.

Techcrunch event

San Francisco, CA
|
October 13-15, 2026

Source link

Anthropic says Claude’s “bad” depictions of AI are responsible for his blackmail attempts

Leave a ReplyCancel Reply

Gemini suddenly can’t make calls on Android and Android Auto

Sign in with Apple and Hide My Email gets a new shared email domain

Facebook is taking a page from Google’s playbook with these new features

Leave a ReplyCancel Reply

Trending now

Gemini suddenly can’t make calls on Android and Android Auto

Sign in with Apple and Hide My Email gets a new shared email domain

Facebook is taking a page from Google’s playbook with these new features