Every month or two, an AI lab releases a new model that it claims leaves most, if not all, of its competitors in the dust. Recent examples include OpenAI’s GPT-4o, Anthropic’s Claude 3 Opus, and Google DeepMind’s Gemini Ultra. These announcements are generally followed by a wave of hype from the tech media, like Ars Technica’s breathless “The AI wars heat up with Claude 3, claimed to have ‘near-human’ abilities.” People tend to take these claims seriously, using the reported numbers to decide which labs are “ahead” and, often, which LLMs to use. They would do better to treat these reports with a healthy dose of skepticism.
Consider Anthropic’s Claude 3 Opus, the largest in the Claude 3 family of models. The model’s release announcement claims that it “exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.” Opus scores 86.8% (compared with GPT-4’s 86.4%) on the Massive Multitask Language Understanding (MMLU) benchmark, a set of about 16,000 multiple-choice questions across 57 academic subjects commonly used to evaluate large language models. With other benchmarks, the difference is even starker: 50.45% to GPT-4’s 35.7% on GPQA (a test of graduate-level reasoning), 84.9% to GPT-4’s 67% on HumanEval (a measure of coding ability), and 60.1% to GPT-4’s 52.9% on MATH (exactly what you’d expect).
Many took this to mean that Claude 3 Opus was significantly better than what was then the live version of GPT-4. However, the numbers that Anthropic used for GPT-4’s performance came from OpenAI’s March 2023 release blog post about the original public GPT-4 model (gpt-4-0314) and not the then-current GPT-4 Turbo variants (gpt-4-1106, gpt-4-0125).
It wasn’t unreasonable for Anthropic to use the numbers from GPT-4’s release in their own report. The report even includes a footnote with the caveat that some Microsoft researchers claimed higher performance numbers for GPT-4, perhaps due to better-designed prompts. But when I looked at the source code the Microsoft team used, I found that they did no fancy prompting for these evals: They just used GPT-4 Turbo.
In fact, when we compare Opus with the later model, we see that GPT-4 Turbo performs better than Claude on more than half of the benchmarks where both were evaluated, and is competitive on the rest.
In this case, the mistake is small. From my subjective experience using all of these models, the conclusion that Claude 3 is slightly better than any model in the GPT-4 family (at time of release) seems to be true. Claude 3 is indeed competitive with GPT-4 on most tasks, and I’ve found it better for coding assistance and creative writing. The difference in benchmark performance between the models is small and doesn’t seem to reflect user experience. But there are other press releases that are substantially more misleading.
Take the Google press release on the original 540-billion-parameter PaLM model. The post claims that PaLM outperforms all publicly available models on many natural language processing tasks.
However, it’s now generally believed that PaLM is worse than the earlier GPT-3.5 (specifically, the model called code-davinci-002) for most use cases. One important reason is that PaLM seems substantially worse when used with Chain-of-Thought (CoT), a prompting technique that helps language models break down questions into intermediate steps. CoT prompting typically improves model performance, but not for PaLM, which often did worse when given a CoT prompt than when queried directly. So naively trusting the press release would lead readers to the wrong conclusion.
There are lots of similar tricks that AI companies use to skew their model reports. Before we dig into them, though, here are a few caveats.
First, exaggerated claims are not necessarily the result of deliberate deception — knowing the performance of models is genuinely difficult. This is especially true of OpenAI models, which, as we’ve seen, often get “stealth” updates without much fanfare or additional benchmark numbers.
Second, applying an appropriate degree of skepticism doesn’t mean entirely ignoring the contents of a press release. Obviously, doing well on benchmarks is positive evidence of the quality of the model. When the gaps in benchmark performance between models are large, it’s reasonable to believe that the higher-performing model is better. However, smaller differences in benchmark performance are often outweighed by considerations such as methodological differences, selective reporting, and the limitations of the tests themselves.
Why are headline benchmark numbers misleading?
Absent outright fraud (which is rare) or training on the evaluation dataset, there are four main reasons.
Benchmarks and evaluation methods can be cherry-picked
LLM developers have dozens of possible benchmarks to choose from. The easiest way to make their model look good is to focus only on the benchmarks where the model performs better than the competition. These days, it’s quite rare for labs not to evaluate new frontier models on standard benchmarks like MMLU. But it’s also very common for companies to emphasize bespoke in-house benchmarks or performance on the very narrow subset of tasks that their models perform well on.
The specific way models are evaluated can also affect their benchmark performance substantially. We’ve already seen that CoT (usually!) improves a model’s performance. So does prompting the model with examples of correct question-answer pairs (called in-context examples). But in reports, some models are evaluated with CoT, while others aren’t. The number of in-context examples is often different, and the prompts are almost always different. As many prompting papers have demonstrated, the exact way you elicit capabilities from a model can greatly affect the numbers it gets on a benchmark.
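To make this concrete, here is a minimal sketch of three ways the same multiple-choice question might be posed to a model. The question and the build_prompt helper are invented for illustration; they aren’t taken from any real evaluation harness.

```python
# Illustrative only: the question and helper are invented, not from a real benchmark harness.
QUESTION = "Which planet has the most known moons?\n(A) Earth (B) Mars (C) Saturn (D) Mercury"

def build_prompt(question: str, n_examples: int = 0, chain_of_thought: bool = False) -> str:
    """Assemble an evaluation prompt; a model's score can differ across these settings."""
    parts = []
    # In-context ("few-shot") examples: worked question-answer pairs shown before the real question.
    for i in range(n_examples):
        parts.append(f"Example question {i + 1} ...\nAnswer: (B)")
    parts.append(question)
    if chain_of_thought:
        # CoT: ask the model to reason step by step before committing to an answer.
        parts.append("Let's think step by step, then give the final answer as a single letter.")
    else:
        parts.append("Answer with a single letter.")
    return "\n\n".join(parts)

print(build_prompt(QUESTION))                         # zero-shot, direct
print(build_prompt(QUESTION, n_examples=5))           # 5-shot, as MMLU is often run
print(build_prompt(QUESTION, chain_of_thought=True))  # zero-shot CoT
```

A press release that reports one model’s 5-shot CoT score next to another model’s zero-shot direct score is, in effect, comparing different experiments.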
The blog post accompanying OpenAI’s GPT-4o release uses both of these tricks. In it, researchers compare the MMLU performance of 4o using CoT to the performance of other models, all of which were evaluated without CoT. With CoT, GPT-4o scores 88.7% on MMLU. But when evaluated with the same methods as all the other models, 4o scores only 87.2%. To their credit, the caption does clarify that they used CoT, but it obscures the fact that the other models were all evaluated without it. It seems likely that OpenAI highlighted the 88.7% number because it suggested a larger improvement over their competitors’ models.
The same blog post also includes only six out of the 10 standard benchmarks that Anthropic included in the Claude 3 release blog post, and a different subset from those in the GPT-4 release. One wonders if the reason for this is that 4o performs worse than Opus on the unreported benchmarks.
Models improve over time, and interesting claims are generally about model families
Labels like GPT-4 or Bard really refer to a family of models. These families are rarely static; generally, companies continue training their models over time and update their APIs or UIs accordingly. As we’ve seen, gpt-4-1106 is better than the original gpt-4-0314 (which in turn is better than gpt-4-early) on almost any standard benchmark. And despite both code-davinci-002 and davinci being “GPT-3” models, code-davinci-002 is much better than davinci on all benchmarks. Sometimes we see the opposite when companies choose cost over capabilities, replacing more capable models with versions that are cheaper and faster to run.
All of these changes make it difficult for outsiders — even researchers at other labs — to pick the right comparators for their models. Even if an outside researcher wanted to test their model against a non-Turbo GPT-3.5 model such as code-davinci-002, there’s no way for them to do so.
Headline numbers can hide important differences
Returning to the Google PaLM releases: GPT-3.5 (or, more specifically, code-davinci-002) scores 65% on MMLU, while PaLM variants get 69.6%. But this doesn’t mean that PaLM is better across the board. In fact, the two models do well on different kinds of tasks: PaLM outperforms code-davinci-002 in most categories, but code-davinci-002 is substantially better at coding and math.
However, it turns out that coding ability is a strong indicator that a model will work well with CoT prompting, and arguably that it is good at reasoning in general. Writing software is also economically valuable in ways that, say, memorizing historical facts is not. This meant not only that there was an easy way to improve code-davinci-002 (give it a scratch pad and let it break down questions into steps) but also that code-davinci-002 was more directly useful. (For example, I started using code-davinci-002 to help me write research code.)
Another example can be found in OpenAI’s dangerous capability evaluations. After releasing GPT-4, the company wanted to test whether the new model could help users design dangerous bioweapons. They broke this task down into five subcomponents: ideation (coming up with concepts for new threats), acquisition (obtaining materials), magnification (growing the agent), formulation (preparing the agent for release), and, finally, release. They also divided participants into two groups, experts and students. Indeed, access to GPT-4 improved student performance on three of the five component tasks and expert performance on all of them. Taken together, these improvements add up to a statistically significant effect: GPT-4 makes it easier for people who already have some expertise to develop bioweapons.
But that’s not the result OpenAI reported. Instead of adding together the results for the different components — which you’d expect, since they’re all parts of the same task — they chose to treat them as independent tests. When you’re testing more than one hypothesis in the same experiment, it’s more likely that any one of them, purely by chance, will yield a false positive result. To combat this, statisticians typically apply an adjustment. For every independent hypothesis, they’ll set a higher bar for considering their results statistically significant. In this case, OpenAI used a correction that treated their results as if each of the five subtasks was a different, unrelated test — setting the bar for statistical significance five times higher than it would be otherwise. OpenAI was really measuring only one thing: whether the model helped participants complete the full task, including all necessary components. Looking at the task holistically, it did. But with the adjustment in place, they found — and reported — that GPT-4 had no statistically significant impact on the participants’ ability to design bioweapons.
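For illustration, here is a minimal sketch of how this kind of multiple-comparisons adjustment raises the bar. The p-values are invented, and I’m assuming a standard Bonferroni-style correction rather than reproducing OpenAI’s exact procedure.

```python
# Illustrative only: invented p-values for five subtasks of one larger task.
# Assumes a standard Bonferroni-style correction, not OpenAI's exact method.
p_values = {
    "ideation": 0.03,
    "acquisition": 0.04,
    "magnification": 0.02,
    "formulation": 0.06,
    "release": 0.04,
}

alpha = 0.05                             # conventional significance threshold
corrected_alpha = alpha / len(p_values)  # 0.01: five times stricter

for task, p in p_values.items():
    print(f"{task:13s} p={p:.2f}  "
          f"significant at 0.05: {p < alpha}, at 0.01 after correction: {p < corrected_alpha}")
```

Treated as five unrelated hypotheses, none of these (invented) results clears the corrected threshold; treated as evidence about a single task, most of them would.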
Benchmarks generally don’t measure the things we actually care about
Fundamentally, we don’t care if an LLM knows lots of US history or can do eighth-grade math. These test scores matter only if we can use them as a measure of “general competence” or “economic value.” While we can use benchmarks to get a general sense of how good models are (it’d be surprising if a human-level AI could not do eighth-grade math), the correlation between scores and real-world performance isn’t strong enough that differences of less than 1% on benchmarks are meaningful, especially if all of the models already do quite well. Claude 3 Opus getting 86.8% on MMLU instead of GPT-4’s 86.4%, or GPT-4o getting 87.2% instead of Opus’s 86.8%, probably doesn’t mean very much.
Part of the problem is that benchmarks must be simple enough to be cheap to run and easy to reproduce. Standard benchmarks tend to use multiple-choice questions because they’re easy to evaluate, even though many model use cases involve freeform generation. Even the most realistic coding benchmarks, such as SWE-bench (which asks language models to fix real software issues and grades the results with unit tests), have limited scope and don’t test for qualitative factors like code style.
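As a rough illustration of why multiple-choice formats dominate: grading them reduces to a string comparison, whereas grading freeform output does not. The predictions and answers below are invented.

```python
# Illustrative only: invented model predictions and gold answers.
predictions = ["C", "A", "B", "D", "C"]
answers     = ["C", "B", "B", "D", "A"]

# Multiple-choice grading is a one-line exact match...
accuracy = sum(p == a for p, a in zip(predictions, answers)) / len(answers)
print(f"accuracy = {accuracy:.0%}")

# ...whereas grading freeform output (does this patch fix the bug? is this essay
# well argued?) requires unit tests, human raters, or another model as judge.
```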
In cases where performance is generally similar, scoring worse on benchmarks might even indicate that the model is better. OpenAI’s InstructGPT paper notes that instruction fine-tuning (training an existing model on a dataset of example prompts and desired responses so that it better follows instructions) made the model do worse on benchmarks, even though humans preferred its responses. In general, there are trade-offs in what a model can be good at. Models that use a lot of capacity on memorizing facts may perform well on benchmarks in a way that doesn’t matter for users, while fine-tuning models to be pleasant conversational partners may cause them to forget some of the knowledge that enables them to do well on benchmarks.
Being appropriately skeptical
With all this in mind, how can you really tell if one LLM is better than another?
Use benchmarks as a reference, not as definitive evidence. Benchmarks are helpful for getting a general sense of a model’s capability, but small differences don’t mean much.
Look at the specific claims that the evidence actually supports. While headline claims are often misleading, specific claims about benchmark performance are almost always correct. By noting which benchmarks a press release cites, which comparison models it lists, and which evaluation methods were used to get the numbers, you can get a much better sense of what the evidence does and doesn’t support. Also, don’t read tech media headlines, like, ever.
Focus on the uses you care about. If you’re deciding which language model to incorporate into your workflow, I’d recommend looking for evidence about the specific tasks you want the model to perform. Ideally, you would test the language model yourself on the tasks you want to use it for. You could even keep a standard set of representative problems that current models cannot help you with and then test new model releases on that set. When that is infeasible, I’d recommend looking for evaluations or benchmarks more analogous to your use cases.
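If you do keep your own informal test set, the sketch below shows one way to structure it. Everything here is hypothetical: query_model stands in for however you actually call a model, and the checks are crude placeholders for your own judgment.

```python
# Illustrative only: a tiny personal eval harness. All names are hypothetical.
PERSONAL_TASKS = [
    {"prompt": "Refactor this function to remove the global state: ...",
     "check": lambda out: "def " in out and "global" not in out},  # crude stand-in for real review
    {"prompt": "Summarize this 20-page report in three bullet points: ...",
     "check": lambda out: out.count("-") >= 3},
]

def query_model(prompt: str) -> str:
    # Replace this stub with a real call to whatever model you're testing
    # (an API client, a chat window, a local model, etc.).
    return ""

def run_evals(model_name: str) -> None:
    passed = sum(bool(task["check"](query_model(task["prompt"]))) for task in PERSONAL_TASKS)
    print(f"{model_name}: {passed}/{len(PERSONAL_TASKS)} personal tasks passed")

run_evals("new-model-to-evaluate")
```

Rerunning the same handful of problems against each new release won’t give you a publishable benchmark, but it will tell you whether the model has improved on the things you actually do.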
All of this requires much more effort than reading a press release. Unfortunately, there is no easy way for consumers or independent researchers to compare cutting-edge models. Be skeptical, and be prepared to do your homework.