From Cheap Seats to Center Stage: A Blueprint for Fixing AI's Evaluation Crisis
In which our physicist-turned-critic stops heckling and starts helping (sort of)
Last week, in my piece comparing physics and AI methodologies, I pointed out their stark differences. While physicists spend years preparing and benchmarking their experiments in advance, writing detailed papers about their hypotheses, AI researchers keep churning out benchmarks with unclear hypotheses and all the permanence of sandcastles at high tide. The (l)ink wasn't even dry (you see what I did there? You're welcome!) before two more evaluation fiascos came crashing into our timeline like drunk guests at a wedding.
First, we had the FrontierMath and ARC-AGI revelations demonstrating that our evaluation frameworks have all the scientific rigor of a Ouija board. Then, just to prove that we never learn, along came "Humanity's Last Exam" - 3,000 questions supposedly testing the "human frontier" of knowledge and reasoning. Both led to much drama in the Twitter-verse-turned-X-verse. Ladies and gentlemen, I submit exhibit A.
But you, dear reader, are an astute and questioning mind, here to learn and engage in thoughtful dialogue. You don't have time for such clickbait-y drama. You're here because you recognize that beneath the social media theatrics lies a fascinating problem in scientific methodology. You're the type who gets excited about experimental design and statistical rigor - and who probably twitches involuntarily when someone confuses correlation with causation. (I see you there, nodding.) So without further ado - let's dive in.
The Great AI-Human Relationship Crisis
Let's talk about relationships. Stay with me here - this analogy works better than you might think.
Think about what makes a great spouse. The best partners aren't the ones who can recite every anniversary date or ace trivia night at the local pub. They're the ones who listen, understand your problems, and are there to reason through solutions step by step. And most importantly, they know the limits of what they can help you solve, and when to say, "Sorry, I don't have a great answer, but I'm happy to help if you can break it down for me."
Metaphorically, that's what we need from AI - not a know-it-all partner who's memorized the relationship manual but can't read the room. We need systems that understand and are honest about their own limitations (instead of confidently insisting that 3 AM is the perfect time to start assembling that complicated IKEA bookshelf), and that work with us to find solutions when they hit those limits.
The Evaluation Games: May the Benchmarks Be Ever in Your Favor
The recent string of evaluation controversies - from FrontierMath to "Humanity's Last Exam" - highlights fundamental issues in how we measure AI progress. Let's try to build a framework and assess these point by point:
The Access Question: When organizations have early access to benchmark problems - as we saw with FrontierMath - it creates subtle advantages through parameter optimization and architecture choices, even without direct training. It's like knowing the general style of your partner's favorite restaurant before planning a surprise dinner - you might make very different choices than if you were going in completely blind. This raises serious questions about how early access influences development choices, even with the best intentions.
Independence of Our Measuring Sticks: The FrontierMath situation demonstrated another crucial issue: when a lab funds and has special access to a benchmark, how does this affect its value as an industry-wide standard? Even with the best intentions, this creates a dynamic that's a bit like having one player help design the obstacle course they'll later compete on. We need truly independent evaluation frameworks.
Learning versus Understanding: The ARC-AGI controversy, where models were allowed to train on 75% of the public dataset before evaluation, raises a fundamental question: are we measuring general intelligence, or just sophisticated pattern recognition? When a model needs extensive pre-training on similar problems to perform well, we're no longer testing genuine reasoning capabilities - we're testing memorization and pattern matching.
Distinguishing Reasoning from Memorization: "Humanity's Last Exam" epitomizes this challenge with its focus on trivia-style questions like "How many paired tendons are supported by a sesamoid bone?" Current state-of-the-art systems score below 10% and are remarkably confident in their wrong answers - much like a partner insisting they definitely know where they're going while driving in circles. But more importantly, are these the capabilities we actually want to measure?
This gets to the heart of our evaluation challenge - how do we measure an AI system's ability to handle the complex, interconnected mess of actual human needs when our current benchmarks focus on isolated, well-defined tasks?
Test Set Abstinence Doesn't Work: We Need to Talk About Data Protection
Anyone who has taken an introductory machine learning course knows the golden rule of model evaluation: thou shalt not peek at thy test set. This isn't just academic tradition - it's a fundamental principle that helps us distinguish genuine learning from pattern matching. In traditional deep learning, we partition our data into three sets: training, validation, and test. Think of it like developing a new drug: the training set is your laboratory where you experiment and refine your formula, the validation set is your clinical trials where you adjust dosages, and the test set is your final FDA approval trial - one shot, carefully protected from any contamination.
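If the drug-trial analogy feels abstract, here is roughly what that partition looks like in code - a minimal sketch assuming scikit-learn and an illustrative 80/10/10 split, not a prescription from any particular benchmark.

```python
# A minimal sketch of the three-way split described above, using scikit-learn.
# The 80/10/10 ratio and the fixed seed are illustrative assumptions.
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    # First carve off 20% of the data, then split that holdout half-and-half
    # into validation ("clinical trials") and test ("the one-shot FDA trial").
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```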
But as our field has grown more complex, we've come to understand that keeping evaluations clean is trickier than we initially thought. It's not just about avoiding direct peeks at the test answers anymore. There's a more subtle challenge in our evaluation processes, one that even the most careful researchers can struggle with.
Consider what happens when a lab has early access to benchmark problems and solutions. Even with the best intentions and careful protocols, there's a natural tendency to: run models against the problems, analyze where they fall short, adjust the approach, and iterate to improve. It's rather like a student who hasn't memorized any answers but has had the opportunity to take many practice tests with similar questions. They might genuinely understand the material better, but we're still left wondering if we're measuring true learning or sophisticated pattern recognition.
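To see why this matters, here is a toy simulation - an illustration of the statistics, not a claim about any particular lab. Every candidate "model" below is literally guessing at random, yet simply re-running candidates against the same fixed benchmark and keeping the best-looking result inflates the reported score above the true capability.

```python
# Toy simulation of the "many practice tests" effect: every candidate truly
# gets 50% right, but 50 rounds of "adjust and re-run" against the SAME fixed
# benchmark make the best observed score drift upward. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng()
n_questions, true_skill = 200, 0.5   # every "model" truly answers 50% correctly

best_score = 0.0
for iteration in range(50):          # 50 rounds of iterating against the benchmark
    candidate_score = rng.binomial(n_questions, true_skill) / n_questions
    best_score = max(best_score, candidate_score)

print(f"True capability: {true_skill:.0%}, reported after 50 peeks: {best_score:.0%}")
# Typically lands in the high 50s -- apparent progress that exists only because
# we kept iterating against the evaluation data.
```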
This creates an interesting challenge for our field. While researchers working on focused tasks like image recognition have developed rigorous protocols for preventing test set contamination, the complexity of evaluating more ambitious AI systems has made maintaining these standards increasingly challenging. It's understandable - after all, the more complex the capability we're testing, the harder it becomes to create truly independent evaluation measures.
This is why we need to thoughtfully evolve our evaluation standards as we tackle more ambitious AI projects. When we tune hyperparameters - those crucial numbers that control everything from learning rates to network architecture - we traditionally use only the validation set to create a barrier between development and final evaluation. But as our systems grow more sophisticated, maintaining these clean separations becomes increasingly challenging. It's not that our current approaches are wrong - they're just facing new challenges as our field advances into uncharted territory.
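In code, the discipline looks something like the sketch below: hyperparameters are chosen against the validation split only, and the test split is consulted exactly once at the end. The model and parameter grid are illustrative placeholders, not a recommendation.

```python
# Minimal sketch: tune on the validation set, touch the test set exactly once.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tune_then_test(splits, candidate_Cs=(0.01, 0.1, 1.0, 10.0)):
    (X_train, y_train), (X_val, y_val), (X_test, y_test) = splits

    best_model, best_val = None, -1.0
    for C in candidate_Cs:              # development loop: train + validation only
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        if val_acc > best_val:
            best_model, best_val = model, val_acc

    # The single, final look at the test set -- no further iteration allowed.
    return accuracy_score(y_test, best_model.predict(X_test))
```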
Perhaps what we're discovering is that evaluating artificial intelligence requires the same nuance and sophistication as evaluating human intelligence - a challenge that educational researchers have grappled with for generations. As we push the boundaries of what AI can do, we'll need to keep evolving our testing approaches, finding new ways to balance rigorous evaluation with the practical realities of AI development.
A Modest Proposal for Not Setting Everything on Fire
If we're serious about measuring AI progress (and not just generating more Twitter drama), we need to fundamentally rethink our approach. And no, I'm not talking about creating yet another benchmark that will be obsolete faster than last year's iPhone. We need something more substantial, more rigorous, and yes, probably more expensive than our current "let's throw some problems at it and see what sticks" approach.
First, we need to embrace pre-registration of evaluation protocols - think clinical trials, but with more computers and fewer placebos. This means documenting every aspect of model training procedures with the same rigor that pharmaceutical companies document drug trials. No more of this "we kind of tweaked some parameters until it worked" business. Every decision, every adjustment, every optimization choice needs to be logged, justified, and open to scrutiny. It's like writing a prenup for your AI experiment - not because you don't trust the process, but because clarity up front prevents messiness later.
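What might that look like in practice? One minimal sketch, with entirely hypothetical field names: freeze the evaluation protocol in a plain document, hash it, and publish the digest before any results exist, much as clinical trials register their endpoints up front.

```python
# Sketch of a pre-registration step. All field names are hypothetical; the point
# is that the protocol is frozen and its hash published before evaluation begins.
import hashlib, json

protocol = {
    "benchmark": "example-benchmark-v1",        # hypothetical benchmark name
    "metric": "exact_match",
    "n_allowed_evaluation_runs": 1,
    "decoding": {"temperature": 0.0, "max_tokens": 512},
    "prompt_template_sha256": "<hash of the frozen prompt file>",
    "analysis_plan": "report accuracy with 95% bootstrap CIs, per category",
}

digest = hashlib.sha256(json.dumps(protocol, sort_keys=True).encode()).hexdigest()
print(f"pre-registration digest: {digest}")
```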
Next, we desperately need independent verification infrastructure. Picture something like CERN, but for AI evaluation - a neutral territory with enough computing power to make a supercomputer blush and enough monitoring systems to make a surveillance state jealous. This isn't just about having a fancy facility; it's about creating a space where evaluation can happen without the subtle biases that come from having the same people design the tests and build the systems being tested. It's like having an independent referee in a sports match - they might not always make the calls we like, but at least we know they're not playing for either team.
But perhaps most importantly, we need to shift our focus from task completion to capability measurement. Instead of asking "Can this AI remember more trivia than your uncle at Thanksgiving?" we should be asking questions that actually matter: How well does the system transfer knowledge across domains? How robust is it when faced with unexpected situations? Can it express appropriate uncertainty instead of confidently spouting nonsense? These are the capabilities that will matter in real-world applications, not whether it can solve carefully curated math problems.
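Some of these capabilities can already be made quantitative. "Can it express appropriate uncertainty?" has a standard proxy in expected calibration error, sketched below under the assumption that the model reports a confidence alongside each answer.

```python
# Expected calibration error (ECE): compares stated confidence with actual accuracy.
# A minimal sketch; a real evaluation would also report per-domain breakdowns.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model-reported probabilities in [0, 1]; correct: 0/1 outcomes."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap        # weight each bin by its share of answers
    return ece

# A model that says "95% sure" while being right half the time scores badly here --
# exactly the confidently-lost-driver failure mode described above.
```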
When it comes to implementation, we need to get serious about understanding what we're actually measuring. Every time we evaluate a model, we should be asking ourselves the hard questions: What specific reasoning capabilities enabled this performance? How does it vary across different types of problems? What are the failure modes, and how frequently do they occur? It's like doing a thorough background check instead of just accepting someone's carefully curated LinkedIn profile at face value.
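Concretely, that background check can start with something as simple as slicing results by problem category and tallying failure modes, rather than quoting a single headline accuracy. A sketch with hypothetical categories and labels:

```python
# Slice results by category and failure mode instead of one aggregate number.
# Column names, categories, and failure labels are hypothetical.
import pandas as pd

results = pd.DataFrame({
    "category":     ["algebra", "algebra", "geometry", "geometry", "logic"],
    "correct":      [1, 0, 0, 1, 0],
    "failure_mode": [None, "arithmetic slip", "hallucinated lemma", None, "overconfident guess"],
})

# Accuracy per category -- where does performance actually vary?
print(results.groupby("category")["correct"].mean())

# How frequent is each failure mode among the errors?
print(results.loc[results["correct"] == 0, "failure_mode"].value_counts(normalize=True))
```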
Contamination control needs to become our new obsession. We need clear standards for what constitutes both direct and indirect training data contamination. We need robust methods for verifying training data independence that go beyond pinky promises and good intentions. We need ways to distinguish genuine generalization from sophisticated pattern matching. And perhaps most importantly, we need strict limits on how many times you can iterate against evaluation data before we admit you're just optimizing for the test.
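As one concrete (and deliberately simplified) example of a contamination check, the sketch below flags test items whose word n-grams overlap heavily with a training corpus. Real pipelines use far more robust matching - normalization, fuzzy hashing, semantic similarity - and the threshold here is an assumption, not a standard.

```python
# Naive n-gram overlap check between test items and a training corpus.
# n and threshold are illustrative; flagged items should be inspected by hand.
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_flags(test_items, training_corpus, n=8, threshold=0.3):
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flags = []
    for item in test_items:
        item_grams = ngrams(item, n)
        overlap = len(item_grams & train_grams) / max(len(item_grams), 1)
        flags.append(overlap >= threshold)   # True = suspiciously similar to training data
    return flags
```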
All of this needs to be wrapped in standardized evaluation protocols that are as rigorous as they are transparent. Think of it like a scientific paper's methods section on steroids - every detail of model training, every uncertainty measurement, every comparison methodology needs to be documented and standardized. Not because we don't trust researchers, but because science works best when it's reproducible, verifiable, and open to scrutiny.
The Bottom Line (Now with Extra Lines!)
The recent evaluation controversies aren't scandals - they're growing pains in a field that's rapidly learning how difficult rigorous evaluation really is. But we can't afford to keep making the same mistakes. The stakes are too high, and the potential impact of AI systems too significant.
We need to start treating AI evaluation with the same rigor we apply to particle physics or drug trials. This means establishing independent evaluation bodies, developing standardized protocols for measuring contamination and optimization effects, creating frameworks for assessing real-world problem-solving capabilities, and building infrastructure for reproducible testing.
The path forward requires collaboration across the field. Whether you're working on contamination controls, parameter optimization monitoring, capability measurement, or evaluation infrastructure, we need your involvement. Join the effort to build better evaluation frameworks - share your ideas, contribute to open-source evaluation tools, or participate in establishing industry standards. The future of AI depends on our ability to measure progress accurately and meaningfully.
P.S. Speaking of collaboration: let's continue this conversation - you can find me on Substack chat, where I'm always eager to discuss new approaches to AI evaluation and measurement.