AI benchmarks are broken. Here’s what we need instead.

Why it matters: Flawed AI benchmarks risk misdirecting development and misrepresenting the technology's actual societal value.
- AI evaluation has historically focused on whether machines surpass individual human performance across various tasks.
- Traditional benchmarks are "broken" in the sense that they fail to capture the full scope of AI's utility or its complex interactions with human systems.
- As a result, the current framing of AI testing tells us little about the technology's real-world implications or its more advanced capabilities.
The big picture: The long-standing paradigm of benchmarking AI against human performance on tasks like chess or essay writing is fundamentally flawed: beating a person at a narrow task says little about an AI system's true capabilities or societal impact. What's needed is an evaluation approach that moves beyond simple human-outperformance metrics.
