We all know that AI is getting better and better. We can even tell by how much, using batteries of language comprehension tasks, standardized tests, and other objective benchmarks. But what about the most important metric — completely subjective personal judgements from untrained people on the internet?
As far as we can tell, nobody's bothered to find out. It's one thing to say that GPT-4 scores 86.4% on the Massive Multitask Language Understanding benchmark, but we wanted to see what all that progress looks like to our inexpert eyes: how do the most advanced models from the past few years fare when we give them the exact same prompt and directly compare their outputs?