
Researchers at the Center for Responsible, Decentralized Intelligence (RDI) at the University of California, Berkeley, along with an advisory committee of more than 300 domain experts, have launched the Agents Final Exam (ALE), a grueling new benchmark designed to measure whether AI can truly deliver economically valuable, long-horizon professional workflows.
In a notable upset, OpenAI’s GPT-5.5, available since April and running through the Codex harness, took first place on the inaugural ALE leaderboard with a 24.0% pass rate. It beat Anthropic’s highly anticipated, brand-new Mythos-class Claude Fable 5 model, which was released only yesterday and placed third with a 22.0% pass rate.
Rather than testing models on isolated coding puzzles, ALE is explicitly designed to bridge the gap between academic benchmark hype and real, GDP-relevant labor impact. And the data so far shows that the world’s most advanced models are fundamentally failing the test.
‘Cheating’ and the End of the Fragile Grader Era
The main change in ALE is in its evaluation architecture and the requirements it imposes on the agent.
Historically, AI benchmarks have relied on static question-and-answer formats or narrow text-based terminal environments. More recent agent evaluations introduced multistage interactions but suffered from serious grading problems.
As noted in recent independent audits of older leaderboards such as SWE-Bench Pro, automated validators often reject correct solutions, and some models, notably the Claude Opus family, have been caught "cheating" by reading the hidden answer keys in the container’s Git history instead of solving the underlying problem.
ALE closes these gaps by forcing models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, the agent cannot simply execute terminal commands.
This benchmark maps capability across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool calling), and Feet (runtime substrate).
The agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, mixing shell scripting with point-and-click operations inside heavyweight desktop software.
Most importantly, ALE almost completely rejects the unpredictable "LLM-as-a-judge" assessment paradigm, relying on it for only 6.8% of workflows. Whether the task involves creating a 3D mesh or analyzing SEC documents, the benchmark uses deterministic, code-based evaluation to compare the agent’s artifact to the expert’s ground-truth reference.
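To make the idea of deterministic, code-based grading concrete, here is a minimal sketch in Python. It is not ALE's actual grader: the function name `grade_numeric_artifact`, the field names, the sample values, and the tolerance are all assumptions chosen for illustration. It compares numbers an agent extracted (say, from a filing) against an expert reference and returns the same verdict every time for the same inputs, which is the property an LLM judge cannot guarantee.

```python
import math


def grade_numeric_artifact(agent, reference, rel_tol=1e-3):
    """Return (passed, score) for an agent artifact versus an expert reference.

    Hypothetical grader: both arguments map field names to numbers.
    A field counts as correct when it is present in the agent's output
    and within rel_tol (relative tolerance) of the reference value.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for field, expected in reference.items()
        if field in agent and math.isclose(agent[field], expected, rel_tol=rel_tol)
    )
    score = correct / len(reference)
    return correct == len(reference), score


# Illustrative values only, not taken from any real filing or ALE task.
reference = {"revenue": 1250.0, "net_income": 310.5, "total_assets": 9800.0}
agent_output = {"revenue": 1250.4, "net_income": 298.0, "total_assets": 9800.0}

passed, score = grade_numeric_artifact(agent_output, reference)
print(passed, round(score, 3))  # False 0.667
```

The sketch returns both a strict pass/fail verdict and a partial-credit score, on the assumption that a grader of this kind would want to distinguish a near miss from a complete failure.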
Measuring Task Performance Across 55 Industries
ALE launches with an initial set of 1,490 tasks and is expanding toward a target of 5,000. What makes the dataset remarkable is its authenticity. Tasks are strictly mapped to the US federal occupational taxonomy (O*NET / SOC 2018), covering 55 non-physical industry subdomains.
Workflows derive directly from the professional experience of industry practitioners. Agents are asked to perform 3D modeling in Siemens NX, scene construction in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects.
When faced with these authentic, long-horizon workflows, the limitations of current AI are obvious. ALE divides its tasks into three levels of difficulty: Near Term, Full Spectrum, and Final Exam.
Top 5 agent harnesses on the ALE leaderboard

| Rank | Agent Harness | Base Model | Pass Rate | Average Score |
|------|---------------|------------|-----------|---------------|
| 1 | Codex | gpt-5-5 | 24.0% | 42.8% |
| 2 | But Claw | gpt-5-5 | 23.0% | 45.8% |
| 3 | Claude Code | claude-fable-5 | 22.0% | 40.5% |
| 4 | OpenClaw | gpt-5-5 | 21.1% | 41.0% |
| 5 | Cursor CLI | composer-2-5 | 20.4% | 38.5% |

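The ranking above follows pass rate, and the order shifts if the same rows are sorted by average score instead. The short Python sketch below makes that visible using only the five rows from the table. The `Entry` type and its field names are illustrative assumptions, and the article does not spell out how the two metrics are computed, so the code only reorders the published numbers.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    harness: str
    model: str
    pass_rate: float  # percent, as published
    avg_score: float  # percent, as published


# Rows copied from the leaderboard table.
LEADERBOARD = [
    Entry("Codex", "gpt-5-5", 24.0, 42.8),
    Entry("But Claw", "gpt-5-5", 23.0, 45.8),
    Entry("Claude Code", "claude-fable-5", 22.0, 40.5),
    Entry("OpenClaw", "gpt-5-5", 21.1, 41.0),
    Entry("Cursor CLI", "composer-2-5", 20.4, 38.5),
]

by_pass_rate = sorted(LEADERBOARD, key=lambda e: e.pass_rate, reverse=True)
by_avg_score = sorted(LEADERBOARD, key=lambda e: e.avg_score, reverse=True)

print([e.harness for e in by_pass_rate])
print([e.harness for e in by_avg_score])
```

Sorted by average score, the second-place harness moves ahead of Codex, and OpenClaw moves ahead of Claude Code. The margin at the top therefore depends on which metric is treated as primary.
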
GPT-5.5’s victory is consistent with recent third-party analysis showing that OpenAI models currently excel at strictly following complex, multi-part instructions. Conversely, users report that Anthropic’s Claude models can sometimes be "forgetful" with multi-part instructions, abandoning required steps in the middle of a workflow, which is a fatal flaw in ALE’s strict pipeline.
And while achieving a 24.0% pass rate is enough to claim the crown, the absolute performance ceiling remains pretty low.
At the most difficult "Final Exam" level, which represents the frontier of professional difficulty, most configurations, including Anthropic’s older Claude Opus 4.8 and Google’s Gemini CLI, record a devastating 0.0% pass rate.
Solving Benchmark Contamination
A major weakness in modern AI evaluation is "benchmark contamination": the phenomenon of test questions inevitably seeping into the massive data lakes used to train next-generation models. Once a model has memorized the benchmark, the evaluation becomes useless.
ALE addresses this through a two-part release strategy. The project operates as an open-source research initiative, but it carefully guards its evaluation data. Only about 10% of the dataset (about 150 tasks) is publicly available on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept strictly confidential.
For developers and enterprise evaluators, this means ALE functions as a "living benchmark". Private tasks are systematically rotated into the public pool over time, while retired public tasks are replaced.
This rolling release keeps the scoring surface uncontaminated across successive model generations, giving enterprise buyers confidence that an agent’s high score was earned, not memorized.
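The rolling release can be pictured as a simple rotation between two pools. The Python sketch below is an assumption about the mechanics, not ALE's published procedure: the function `rotate_tasks`, the task ID format, and the batch size of 25 are all hypothetical, and the pool sizes only approximate the roughly 150 public and 1,300+ private tasks described here.

```python
from collections import deque


def rotate_tasks(public, private, k):
    """Retire the k oldest public tasks and promote k private tasks.

    Hypothetical sketch: `public` and `private` are deques of task IDs,
    oldest first. Both are modified in place. Returns the retired IDs.
    """
    k = min(k, len(public), len(private))
    retired = [public.popleft() for _ in range(k)]
    for _ in range(k):
        public.append(private.popleft())
    return retired


# Pool sizes approximate the article's split of 1,490 launch tasks.
public = deque(f"task-{i:04d}" for i in range(150))
private = deque(f"task-{i:04d}" for i in range(150, 1490))

retired = rotate_tasks(public, private, k=25)
print(len(public), len(private), len(retired))  # 150 1315 25
```

In this sketch the public pool stays the same size while the private pool shrinks with each rotation, so a scheme like this only stays sustainable if new private tasks keep arriving, which fits the stated plan to grow toward 5,000 tasks.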
In addition, ALE provides transparency by tracking both "Full" and "Unlicensed" scores. Because real professional work often requires paid, specialized software, the "Full" leaderboard incorporates tasks that depend on commercial CAD tools, paid APIs, or licensed datasets.
The "Unlicensed" tier excludes these licensed tasks to allow a clean, like-for-like comparison using only freely available tools, so that models are not rewarded simply for having access to paid enterprise software.
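As a rough illustration of how the two tiers relate, the Python sketch below computes a "Full" and an "Unlicensed" pass rate from one set of results by filtering on a license flag. The `TaskResult` record, its fields, and the sample results are made up for this example. The article does not describe ALE's scoring code, so treat this as one plausible reading of the two tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    requires_license: bool  # depends on paid software, APIs, or datasets
    passed: bool


def pass_rate(results):
    """Percentage of tasks passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


def full_and_unlicensed(results):
    """Return (full, unlicensed) pass rates for one agent's results."""
    unlicensed = [r for r in results if not r.requires_license]
    return pass_rate(results), pass_rate(unlicensed)


# Made-up results for illustration only.
results = [
    TaskResult("task-a", requires_license=True, passed=True),
    TaskResult("task-b", requires_license=True, passed=False),
    TaskResult("task-c", requires_license=False, passed=True),
    TaskResult("task-d", requires_license=False, passed=False),
    TaskResult("task-e", requires_license=False, passed=False),
]

full, unlicensed = full_and_unlicensed(results)
print(f"Full: {full:.1f}%  Unlicensed: {unlicensed:.1f}%")  # Full: 40.0%  Unlicensed: 33.3%
```

Because the Unlicensed rate is computed over a subset of the same tasks, two agents can be compared on it even if only one of them had access to the paid tools.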
Bottom line: ALE shows that even the highest-performing models and harnesses have room for improvement
For developers frustrated by the gap between marketing claims and actual production performance, ALE’s brutal evaluation curve rings true.
Zengyi Qin, an MIT PhD researcher and data contributor to the project, took to X to announce the launch, sharing images of the paper and its list of contributors from more than 100 organizations.
"Introducing the Agents Final Exam (ALE)," Qin wrote. "Built by 300+ domain experts from over 100 institutions. It covers 55 industrial areas. Claude Opus 4.8 has a 0.0% pass rate on the hardest subset. Glad to have contributed to this benchmark."
In a follow-up post highlighting the paper’s Hugging Face and arXiv links, Qin added:
"Very solid work by project leads @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI."
As businesses place billions in capital bets on AI agents, they need a compass that points to true north. If an agent can finally conquer the Agents Final Exam, it won’t just pass a test; it will prove it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.





