GPT-5.5 outperforms Claude Fable 5 in AI's toughest real-world test

A groundbreaking evaluation from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI) has delivered a reality check on artificial intelligence's professional capabilities. The newly launched Agents’ Last Exam (ALE) benchmark, developed in collaboration with over 300 domain experts, introduces an unprecedented standard for measuring whether AI can execute economically valuable, multi-step workflows under real-world conditions.

OpenAI’s GPT-5.5, released in April and accessed via the Codex harness, has claimed the top position on the ALE leaderboard with a 24.0% pass rate, narrowly ahead of Anthropic’s latest model, Claude Fable 5, which scored 22.0%. The benchmark itself marks a significant shift in how AI models are assessed. Unlike traditional benchmarks that focus on isolated tasks, ALE is designed to simulate the prolonged, complex operations that define professional work, bridging the gap between academic performance and practical utility.

Why traditional AI benchmarks are failing the industry

Most AI evaluation systems rely on narrow, static tests that do not reflect the realities of professional workflows. Even advanced agentic benchmarks often suffer from grading inconsistencies or exploitable loopholes. Recent audits of older leaderboards, such as SWE-Bench Pro, revealed persistent issues where automated verifiers rejected correct solutions and certain models bypassed challenges by accessing hidden answer keys embedded in system logs.

ALE addresses these flaws by implementing a Generalist Computer-Use Agent (GCUA) framework. This system evaluates AI across five critical layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). Successful performance requires more than executing terminal commands; it demands seamless interaction with desktop applications, command-line interfaces, and complex software environments.

Moreover, ALE minimizes reliance on unpredictable human or AI judges for grading. Only 6.8% of tasks depend on such evaluations, while the rest use deterministic, code-based verification that compares an agent’s output against expert-generated ground truth. This approach is meant to make scoring consistent and reproducible from run to run.
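To make the idea of deterministic, code-based verification concrete, here is a minimal Python sketch. It is not ALE's actual grader: the function names `verify_output` and `pass_rate`, the dictionary-shaped outputs, and the numeric tolerance are all assumptions chosen for illustration.

```python
import math


def verify_output(agent_output: dict, ground_truth: dict, rel_tol: float = 1e-6) -> bool:
    """Return True only if every expected field matches the ground truth.

    Hypothetical checker: field names and the tolerance are illustrative
    assumptions, not part of the ALE benchmark.
    """
    for key, expected in ground_truth.items():
        if key not in agent_output:
            return False
        actual = agent_output[key]
        # bool is a subclass of int, so keep it out of the numeric branch.
        expected_is_number = isinstance(expected, (int, float)) and not isinstance(expected, bool)
        if expected_is_number:
            actual_is_number = isinstance(actual, (int, float)) and not isinstance(actual, bool)
            if not actual_is_number:
                return False
            # Numeric fields are compared with a small relative tolerance.
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                return False
        elif actual != expected:
            # Everything else must match exactly.
            return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks whose verifier returned True (0.0 for an empty list)."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Because the same inputs always give the same verdict, a checker of this kind avoids the run-to-run variation that a human or AI judge can introduce.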

ALE spans 55 industries with 1,490 real-world tasks

The benchmark’s authenticity sets it apart from other evaluation systems. Its 1,490 task instances are directly sourced from U.S. federal occupational taxonomies (O*NET / SOC 2018), covering 55 non-physical industry sub-domains. These tasks reflect actual professional workflows, ranging from 3D modeling in Siemens NX to neuroimaging analysis in FSLeyes and visual effects compositing in Adobe After Effects.

ALE organizes tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam. While GPT-5.5 achieved a 24.0% pass rate overall, its performance drops dramatically on the hardest tier, where most leading models, including older versions of Claude Opus and Google’s Gemini CLI, register a 0.0% pass rate. This stark contrast highlights the current limitations of even the most advanced AI systems when faced with professional-grade challenges.

How ALE prevents benchmark contamination

One of the most persistent challenges in AI evaluation is benchmark contamination, where test data leaks into training datasets, rendering evaluations meaningless. ALE counters this threat through a dual deployment strategy. Although the project operates as an open-source initiative, only about 10% of its tasks (approximately 150) are publicly available. The remaining 1,300+ tasks remain strictly private.
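The split quoted above can be checked with a few lines of Python. The total comes from the article; treating "about 10%" as exactly 10% is an assumption made only for this arithmetic.

```python
TOTAL_TASKS = 1490       # task instances reported for ALE
PUBLIC_FRACTION = 0.10   # "about 10%", treated as exact here (assumption)

public_tasks = round(TOTAL_TASKS * PUBLIC_FRACTION)
private_tasks = TOTAL_TASKS - public_tasks

print(public_tasks, private_tasks)  # 149 1341
```

Under that assumption the numbers line up with the article's "approximately 150" public tasks and "1,300+" private ones.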

To maintain relevance, ALE employs a rolling release model. Private tasks are periodically moved into the public pool to replace public tasks that are retired. This rotation is intended to keep models from achieving high scores through memorization. Additionally, ALE reports both "Full" and "Unlicensed" scores, providing transparency about performance in scenarios where proprietary tools or licensed software may not be accessible.
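One way to picture a rolling release is as a periodic swap between two pools of task IDs. The sketch below is purely illustrative: the function `rotate_tasks`, the random selection, and the fixed swap size `k` are assumptions, and the article does not say how ALE chooses which tasks to retire or promote, or how the private pool is replenished.

```python
import random


def rotate_tasks(public: list[str], private: list[str], k: int, seed: int = 0):
    """Retire k public tasks and promote k private tasks in their place.

    Hypothetical rotation policy; assumes task IDs are unique across pools.
    Returns (new_public, new_private, retired).
    """
    rng = random.Random(seed)          # seeded so a rotation is reproducible
    k = min(k, len(public), len(private))
    retired = set(rng.sample(public, k))
    promoted = rng.sample(private, k)
    promoted_set = set(promoted)
    new_public = [t for t in public if t not in retired] + promoted
    new_private = [t for t in private if t not in promoted_set]
    return new_public, new_private, sorted(retired)


public_pool = [f"pub-{i}" for i in range(10)]
private_pool = [f"priv-{i}" for i in range(90)]
new_public, new_private, retired = rotate_tasks(public_pool, private_pool, k=3)
```

After each rotation the public pool keeps its size, but a model that memorized the retired tasks gains nothing on the newly promoted ones.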

What this means for the future of AI evaluation

The results from ALE underscore a critical truth: despite rapid advancements in AI capabilities, the technology still struggles with the complexity and endurance required for professional work. While GPT-5.5’s leadership signals progress, the overall pass rates reveal substantial room for improvement.

For enterprises evaluating AI solutions, ALE offers a more reliable framework for assessing real-world performance. As models evolve, benchmarks like this will play a pivotal role in distinguishing between systems that merely simulate competence and those that deliver tangible, economically valuable outcomes.

AI summary

The new ALE benchmark measures how well artificial intelligence can carry out real-world workflows. OpenAI's GPT-5.5 model takes the lead on the toughest exam, while the results also expose performance gaps across the industry.

