14 May 2026

AI IQ ratings reveal top models cluster near human genius levels

A new benchmark converts AI performance into IQ scores, sparking debate among technologists and researchers. The results show rapid convergence at the top and widening gaps below.

VentureBeat

For generations, the IQ test has served as humanity's benchmark for cognitive ability. Now, a Silicon Valley-backed initiative is translating that framework into artificial intelligence, assigning estimated intelligence quotients to more than fifty cutting-edge language models and plotting the results on a familiar bell curve.

Since launching in early May 2026, the AI IQ platform has ignited conversations across enterprise forums and developer circles. Some praise its clarity in a crowded market, while others argue that boiling complex capabilities into a single metric risks oversimplification. The tool arrived just as major labs race to push their systems beyond human-level performance in specialized domains, making its timing as contentious as its methodology.

How a single score hides layers of reasoning ability

AI IQ was designed by Ryan Shea, an engineer, investor, and co-founder of the blockchain platform Stacks, whose track record also includes the earlier venture Voterbase and early-stage investments in OpenSea and Mercury. The project draws on twelve diverse benchmarks grouped into four reasoning dimensions: abstract, mathematical, programmatic, and academic. Each dimension's score contributes equally to the final IQ calculation, expressed as a simple average.

Abstract reasoning is tested through the notoriously difficult ARC-AGI-1 and ARC-AGI-2 challenges, which probe fluid intelligence through pattern recognition tasks. Mathematical reasoning aggregates results from FrontierMath (Tiers 1–4), AIME, and ProofBench to assess formal problem-solving. Programmatic evaluations rely on Terminal-Bench 2.0, SWE-Bench Verified, and SciCode to measure coding and debugging prowess. Academic benchmarks include Humanity's Last Exam, CritPt, and GPQA Diamond to evaluate knowledge retention and reasoning depth.
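
That grouping is easy to picture as a lookup table. Here is a minimal sketch using only the benchmark names reported in this article; the dictionary structure itself is an illustrative assumption, not AI IQ's published data format.

```python
# Illustrative grouping of AI IQ's four reasoning dimensions and the
# benchmarks named above. The layout is an assumption for illustration,
# not the project's actual data format.
BENCHMARKS_BY_DIMENSION = {
    "abstract": ["ARC-AGI-1", "ARC-AGI-2"],
    "mathematical": ["FrontierMath (Tiers 1-4)", "AIME", "ProofBench"],
    "programmatic": ["Terminal-Bench 2.0", "SWE-Bench Verified", "SciCode"],
    "academic": ["Humanity's Last Exam", "CritPt", "GPQA Diamond"],
}
```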

Raw scores are converted to IQ equivalents using hand-calibrated curves that account for difficulty ceilings and data contamination risks. The system deliberately caps easier benchmarks below 100 while allowing harder tests to reach higher theoretical scores. Missing data is handled conservatively—models require at least two dimension scores to qualify, and any gaps pull the composite IQ downward rather than masking uncertainty.
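
One way to realize those rules in code is sketched below, assuming linear interpolation between hand-picked anchor points and a fixed floor value for missing dimensions. The anchor values, the floor of 85, and the names `to_iq` and `composite_iq` are all hypothetical; AI IQ has not published its exact curves.

```python
# Hypothetical sketch of the composite calculation. The anchor points,
# the conservative floor, and the function names are invented for
# illustration; the project's hand-tuned curves are not public.

# Per-benchmark calibration anchors: (raw fraction correct, IQ equivalent).
# Easier tests are capped below 100; harder ones reach higher ceilings.
CALIBRATION = {
    "AIME": [(0.0, 60.0), (1.0, 98.0)],        # easier test, capped under 100
    "ARC-AGI-2": [(0.0, 70.0), (1.0, 145.0)],  # hard test, high ceiling
}

CONSERVATIVE_FLOOR = 85.0  # invented stand-in penalty for a missing dimension


def to_iq(benchmark: str, raw: float) -> float:
    """Linearly interpolate a raw score in [0, 1] onto the IQ scale."""
    (x0, y0), (x1, y1) = CALIBRATION[benchmark]
    raw = min(max(raw, x0), x1)  # clamp to the calibrated range
    return y0 + (raw - x0) * (y1 - y0) / (x1 - x0)


def composite_iq(dimension_scores: dict[str, float | None]) -> float | None:
    """Average the four dimension IQs, honoring the missing-data rule.

    Models with fewer than two scored dimensions do not qualify; any
    remaining gap is filled with a low floor so it drags the composite
    down instead of being silently ignored.
    """
    present = [s for s in dimension_scores.values() if s is not None]
    if len(present) < 2:
        return None  # not enough coverage to appear in the rankings
    filled = [s if s is not None else CONSERVATIVE_FLOOR
              for s in dimension_scores.values()]
    return sum(filled) / len(filled)


# Example: a model scored on three of the four dimensions.
print(composite_iq({
    "abstract": 128.0,
    "mathematical": 135.0,
    "programmatic": 131.0,
    "academic": None,  # missing: pulled toward the floor, not ignored
}))  # -> 119.75
```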

The frontier tightens as models converge on elite performance

As of mid-May 2026, the AI IQ rankings reveal a striking pattern: the gap between top-tier models has narrowed to a razor-thin margin. OpenAI’s GPT-5.5 currently leads with an estimated IQ near 136, ahead of Anthropic’s Opus 4.7 at 132, while GPT-5.5’s predecessor GPT-5.4 and Google’s Gemini 3.1 Pro sit tied at 131, illustrating how closely the leaders now compete. This compression mirrors observations from other analyst groups, including a recent Visual Capitalist report that highlighted the "increasingly crowded peak" of model performance.

Below the elite tier, a dense pack of mid-tier models from Chinese labs—including Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, and MiniMax-M2.7—clusters between IQ 112 and 118. For enterprise buyers balancing cost and capability, this middle ground offers compelling value without sacrificing essential functionality. One industry observer noted that practical tests confirm models like Sonnet 4.6 excel across diverse workloads, reinforcing the idea that raw benchmark scores don’t always translate to real-world utility.

A tool for clarity or a metric for confusion?

Reactions to AI IQ have split along familiar lines. Enterprise technologists applaud its ability to make a chaotic market legible. "It’s much easier to track progress when it’s visualized this way instead of buried in dense leaderboard tables," wrote technology commentator Thibaut Mélen on social platform X. Business strategist Brian Vellmure echoed the sentiment, calling the tool "helpful" and consistent with hands-on experience.

Critics, however, warn that reducing AI’s multifaceted strengths to a single number creates a false sense of precision. "The map is not the territory," countered the AI Deeply commentary account, summarizing a common research objection. Skeptics argue that language models exhibit jagged, uneven capabilities—excelling in some tasks while failing in others—making a uniform IQ score fundamentally misleading.

Shea acknowledges the controversy but defends the approach. "We’re not claiming these scores represent general intelligence," he said. "We’re providing a heuristic to compare models across standardized tests. It’s imperfect, but it’s a starting point."

Looking ahead, AI IQ’s creators plan to expand coverage to multimodal models and refine the scoring curves as new benchmarks emerge. Whether the tool will become a fixture in model evaluations—or fade as another ephemeral metric—may hinge on how quickly the industry coalesces around more nuanced ways to measure artificial cognition.

AI summary

Details on the AI IQ project, which measures artificial intelligence models with a system resembling human IQ: the leaderboard positions of OpenAI and Anthropic, and the advantages of and criticisms directed at this measurement system.
