Software evals’ glaring problem

The big problem with software evals is that they rely on testing solutions (the model’s output) against an “answer sheet” to determine correctness. Beyond the obvious data contamination concerns, this approach has deeper, more fundamental qualitative problems, which I’d like to explore below.

As a prelude, IOI competitive programming problem sets make a useful cultural case study. Contestants get a set of problems, work through them writing code, then submit it to a live evaluation system that scores it. There are arguments to be made both ways about whether this is good design; I’d lean towards saying it’s more good than bad. It encourages creativity and reasoning, and the immediate feedback makes it much more interactive to gauge progress. But if we’re considering measuring raw reasoning ability, the International Mathematical Olympiad’s (IMO) design is categorically better. It requires students to write full formal proofs, and feedback only comes after the competition ends; there’s nothing external you can use to tweak your work during the competition itself. Competitive programming does measure raw intelligence to a degree, and that puts a high initial filter on who even reaches the final rounds. Based on my experience in competitive programming, though, students who place near the top at national and IOI/CEOI levels have typically practiced extensively and encountered a wide range of problems, along with their solutions. Having seen “a problem like this before”, one that requires the same type and structure of algorithmic solution, is almost a minimum requirement for doing well. Often this did not correlate with better reasoning ability, since one was solving from memory (for context, the number of meaningfully distinct problems out there is probably in the hundreds). One could make a similar argument about LeetCode.

The scenario in which competitions would best proxy intelligence would be something like this: students receive a rigorous math education and are taught programming syntax, but nothing beyond that. They’d have no exposure to competitive programming environments and no ability to practice in them, except during the actual competitions. This would root out the memorization bias and show a more representative arc of score progression as they improve between competitions. The derivative of that curve, in combination with the absolute values, would be more meaningful to look at. Of course, this is merely a thought experiment that isn’t viable in practice.

It has become socially accepted in the tech sector to treat scores on programming evals as a measure of real model coding performance. The primary issue is that training contamination simply can’t be ruled out, given that the benchmarks’ problem sets and solutions are publicly available. The GSM-Symbolic experiment serves as a distilled demonstration of this problem. It showed that merely changing names, numbers, or irrelevant details in math problems causes significant accuracy drops and increased performance variance across state-of-the-art models. Considering that (1) models perform worse on more difficult problems than on less difficult ones, and (2) GSM-Symbolic uses grade 3-8 level math problems – which are, collectively, far simpler than algorithmic coding benchmarks – it follows that public programming evals are even more vulnerable. If models already stumble on cosmetic perturbations to elementary math problems, there is no coherent basis for believing they exhibit genuine coding ability on advanced algorithmic tasks. The supposed “generalization” from benchmark scores to real-world coding collapses once one acknowledges that the benchmarks used are fixed, public, and easily memorized. Beyond this, a second layer of bias is also present. The labs themselves operate inside a feedback loop: by continually measuring performance on public benchmarks, they’re incentivized to tune models, training pipelines, or even research priorities around those scores. It’s unclear whether this is actually happening, but the incentive is there, which creates a subtle risk that models get benchmark-maxxed, silently diverging from the original research goals.
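To make the GSM-Symbolic idea concrete, here is a minimal sketch of the kind of perturbation involved: the underlying problem structure stays fixed while surface details (names, quantities) are resampled. The template and value ranges below are illustrative placeholders, not taken from the actual benchmark.

```python
import random

# Sketch of a GSM-Symbolic-style perturbation: same problem structure,
# different surface details. A model that has memorized the canonical
# wording can stumble on these cosmetic variants even though the required
# reasoning is unchanged. (Template and ranges here are hypothetical.)

TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Sofia", "Liam", "Priya", "Mateo"])
    a, b = rng.randint(2, 40), rng.randint(2, 40)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b  # the ground truth follows from the template, not the wording
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```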

Why is it a problem if models work like this, as long as they score highly? Doesn’t that mean they can program well, in functional terms? No, it doesn’t, because the evals simply don’t represent real-world software development. In real software development, especially towards higher seniority, the feedback can be ambiguous, and the feedback cycle can be very long (possibly weeks or months). This doesn’t just mean that the rate of learning from iterations is slower and fuzzier. It certainly is those things, but there’s a more important, qualitative difference: when you can efficiently measure, tweak, and retry based on reliable feedback, you can use the result of the experiment as proof of the correctness of your work. The reason that’s significant is that if you can use your code and the observation that “it worked” as proof that you’ve implemented a task, you’ve come full circle in the process while skipping true comprehension[1]. In contrast, in situations with long and ambiguous feedback cycles, you may not even have real iteration, since you’re only getting one or two tries. In those situations, you’re forced to develop a proof, or at least a form of conviction about your solution, by yourself, before seeing its results. Having to do this deeper work is qualitatively different, and it is what results in high-quality outcomes. Why is it qualitatively different? If you follow the same learning iteration process, just at different speeds, wouldn’t you get the same results, just over a longer timeframe? In theory, yes, but the rate of improvement can differ, and those “longer timeframes” can easily explode to practically infinite timeframes. In software, both code complexity and test case complexity are theoretically boundless.
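As a rough sketch of this distinction, consider the two workflows below. The names (visible_tests, tweak, derive_solution) are hypothetical placeholders, not any real API; the point is only the shape of the loop.

```python
# With a reliable, fast grader, "it worked" can stand in for understanding:
# mutate and retry until the external oracle says pass. Passing proves nothing
# beyond the specific cases the oracle happens to check.
def grader_driven(candidate, visible_tests, tweak, max_tries=1000):
    for _ in range(max_tries):
        if all(test(candidate) for test in visible_tests):
            return candidate
        candidate = tweak(candidate)
    return None

# With long, ambiguous feedback there may be only one or two tries, so the
# argument for correctness has to be built before any feedback arrives.
def conviction_driven(spec, derive_solution):
    solution, justification = derive_solution(spec)  # reasoning happens up front
    return solution, justification
```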

Theory aside, there seems to be something happening in the space that can’t merely be dismissed by the “can’t reason” argument. I’ll explore two reads on what’s happening. The first is, again, qualitative. It’s somehow clear that you can simply do more now. These systems are no longer limited to Hello World apps, trivial web components, or CRUD operations. They can sometimes integrate custom libraries into a codebase for encryption, audio, video, sensors, and so on. They can make edits spanning multiple files, touching config, migrations, schemas, and commands. They can create a bridge in React Native for a native package, or even do some non-trivial bug finding. All of these capabilities are in the “sometimes” category, to be clear, but even then, it’s huge. The second read is quantitative: the revenue run rates for code automation startups are already high and growing exponentially. Usage is also growing exponentially. Developers want to buy and use these tools. This doesn’t necessarily translate to jobs being automated. But clearly, large numbers of developers, both inexperienced and experienced, use and want to use them.

Interestingly, there’s a sense of polarization even within the developer community. People with little or no technical skill seem to like these tools very much. But among people with technical competence and experience, there is more polarization: some seem to like them a lot, others are considerably more skeptical. Roughly speaking, skepticism grows as you move towards higher technical competence. Could this be because they’re smarter, the technology is flawed, and they understand its flaws and limitations better? Yes. Could it also be that the more your identity is tied to being an expert in your field, the more sobering – and threatening – it can be to confront the realization that a good chunk of your work, and what you thought made you special, wasn’t that deep in the first place? Also yes. My read is that the truth has to be somewhere in the middle, at least as far as the polarized reception goes. It’s clear that the reasoning ability of current models is wanting, but mistaking that for a dismissal of the whole category is equally bad.

Code generation as of mid-2025 is certainly useful, and will probably be important for software long term. It’s already great at things like conceptual search, creating boilerplate, routine integration problems, and manual chores like refactors or version upgrades. That said, using eval performance as a proxy for coding ability is flawed as a methodology at its core, and the resulting arguments are at best not rigorous enough and at worst intellectually dishonest. This is not to say that a universal software eval would be easy to develop. But until we have one, results will simply have to stay inconclusive, and the proxies we have now will certainly not make things better. The results and reception of the last five years, however, hint at a more nuanced reality, one where swathes of software development may not actually be as hard as once thought. Even if progress plateaus until the next technical breakthrough, it’ll already have freed developers from manual tasks, refactors, and search. I’m also personally quite excited about advancements in proofs around accuracy, neurosymbolic approaches, masked language modeling, and genetic algorithms, and I hope to see more research in them.

If you have feedback, additional thoughts, or counter-arguments to the ideas in this post, please reach out.

[1]: Technically, this isn't always true. In some cases, e.g. in certain algorithmic problems, it is possible to map out all possible inputs and outputs, and doing so makes it possible to verify correctness by checking all cases. To take a more concrete example, let's say you need to center an element in a web browser, and you have a solution that you don't understand but that seems to work. If you know all possible presets it will be used on, then you can indeed test it on all device/OS × browser version × viewport combinations to verify correctness. Even then, one might argue that this still isn't complete, as there are additional factors to consider. What if the solution adds an unwanted dependency? What if it degrades performance? What if it has global side effects or uses unsafe or deprecated calls? These are all valid questions, although they're rarely actionable.
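The exhaustive-check idea above can be sketched as follows; the preset lists and the check itself are hypothetical placeholders (in practice the check would drive a headless browser and measure the element's position).

```python
from itertools import product

# Sketch of "verify by checking all cases": only sound if these lists really
# cover every environment the code will ever run in. All values are illustrative.
DEVICES = ["iPhone 15 / iOS 17", "Pixel 8 / Android 14", "MacBook / macOS 14"]
BROWSERS = ["Safari 17", "Chrome 126", "Firefox 127"]
VIEWPORTS = [(375, 667), (768, 1024), (1440, 900)]

def is_centered(device: str, browser: str, viewport: tuple[int, int]) -> bool:
    # Placeholder: a real check would render the page in this environment
    # (e.g. via a headless browser) and compare the element's bounding box
    # against the viewport center.
    return True

def verify_all() -> bool:
    return all(is_centered(d, b, v)
               for d, b, v in product(DEVICES, BROWSERS, VIEWPORTS))

print(verify_all())
```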


September 10, 2025