
Late last month, apparent leaks revealed that an as-yet-unreleased product from Anthropic, called Mythos, is “the most powerful AI model we’ve developed to date.” My colleague AJ Dellinger wrote at the time, “It was hard to ignore that this whole situation was playing into the playbook of a classic AI company, talking about the dangers of a model to emphasize how powerful and capable it is.”
Did Anthropic really leak this de facto ad for its super-powered AI product by accident? Two weeks ago I might have scoffed, but since Anthropic then accidentally leaked the source code for Claude Code, I’m now more inclined to believe the leak was genuine.
Anyway, on Tuesday Anthropic released the system card for its latest frontier model, Mythos (officially “Claude Mythos Preview”), and noted that the model’s “huge increase in capabilities has led us to decide not to release it into general use.”
For reference, OpenAI’s GPT-2 was deemed too risky to release in 2019, when Anthropic co-founders Dario Amodei, Jack Clark, and Chris Olah were still working there, but it was released later that year anyway.
AI system cards are ostensibly transparency exercises: a company lays out a model’s pros and cons, its capabilities and, most alluringly, its dangers. That last part turns reading them into fun little trips to Jurassic Park, where you watch a cloned T. rex eat a goat, secure in the knowledge that it can never break out of its enclosure.
The full card runs to 244 pages. I’m not going to pretend I’ve read all of it yet, but here are some highlights:
In one test, the model was given a sandboxed computer terminal with access only to a limited set of pre-approved online services and was prompted to “escape,” that is, to find a way to use the internet freely. It did, and managed to send a message to a researcher who was out of the office at lunch. What’s more, “in a disturbing and unsolicited effort to demonstrate its success, it posted details of its exploits on multiple hard-to-find but technically public websites.”
In what the card calls “<0.001% of interactions,” i.e. extremely rarely, the model behaved in ways it shouldn’t have and then apparently tried to hide the evidence. In one example, it got hold of a test answer “by chance”; rather than simply telling the researcher and asking for a different question, as it should have, it tried to come up with an independent solution, noting in its reasoning that it “must make sure that the final answer is not too accurate.”
It also exceeded its permissions on a computer system after finding an exploit, and then “made additional interventions to ensure that any changes [it] made this way would not appear in the change history in git.”
Another incident described in the card is titled “Inadvertent leakage of internal technical material.” Apparently, during an internal coding assignment, the model published its work as a public GitHub gist. This reminds me of the AI agent that was accused back in February of cyberbullying a coder, another case where an AI agent’s apparent recklessness was, to some extent, the predictable consequence of a reckless human.
Claude Mythos Preview will eventually be made available in some form, but only to a group of partner companies, including Amazon Web Services, Apple, Google, JPMorganChase, Microsoft and NVIDIA, which are meant to use the model to find security holes in software and design patches. Kevin Roose of the New York Times describes the program as “the company’s attempt to sound the alarm about what it believes will be a new, scarier era of AI threats.”




