Intelligence Testing

Sam Bowman Nicholas Schiefer

Everyone agrees that AIs are getting smarter — but it’s surprisingly difficult to measure by how much.

Asterisk: Let’s start by talking about how we currently measure how capable AI systems are. Right now, we use a lot of static benchmarks — that is, huge, standardized batteries of tasks and questions. And these are starting to get saturated — state-of-the-art models already do so well on the tests that there just isn’t much room left to evaluate even more powerful systems. Obviously this isn’t a new problem, but I'm curious how you think about it as researchers at a major scaling lab. 

Nicholas Schiefer: I’ll let Sam go first, since he invented about half of those.

Sam Bowman: Even as a specialist in standardized evaluations for this kind of thing, I don’t have a particularly satisfying answer. I don’t think there’s anything out there that checks all three boxes: It’s not saturated, it’s actually correlated with how useful we expect the model to be if we deploy it as a general-purpose tool, and it’s likely to be comparable across labs. 

What that leaves us with is a somewhat ad hoc mix of evaluations. We can run more comprehensive benchmarks like MMLU or BIG-Bench, not all of which are saturated yet, but which often are quite distant from any of the things we actually want to use systems for. Or we do internal evaluations that are closer to simulating what we’d actually see with users in the real world, but which are also more difficult to compare across labs or across years. 

A: Measuring real-world usefulness is so hard. I’m thinking about the GitHub study from 2022, where they tried to measure whether Copilot — their coding assistant powered by GPT 3.5 — improved productivity for software developers. Now, “developer productivity” is very bounded compared to “utility as a general-purpose tool,” but the paper talks a lot about how it’s actually quite tricky to define. Do you care about the amount of code someone writes? Their commits? What if a big part of their job is helping other people on their team? Should you use self-reports, and if so how? 

In the end, they designed a set of tasks based on things that might come up in a real developer’s workday. After that, they did qualitative interviews and used those to come up with a bunch of different outcomes that they measured, including speed at coding, but also things like day-to-day satisfaction and the percentage of their time spent in a flow state. 

All of this seems really promising. My sense is that a lot of model evaluation in the future is going to have to really engage with specific use cases in a similar way.

S: I think that’s often what doing an evaluation well looks like. You have to get pretty deep into the weeds of the real-world task that your evaluation is meant to be a proxy for. This does mean relying on different outcomes that are all semi-independent proxies for the thing we actually care about, and the choice of which outcomes to use is often pretty arbitrary, but at least it’s obvious that these are judgment calls — there’s no veneer of objectivity, which is a benefit in its own way. But this kind of product-focused work isn’t really what I’m doing right now. 

Dalbert B. Vilarino

A: Alright, so let’s talk more about the internal evaluations that happen at Anthropic 

S: The basic recipe is to test people using our system, usually contractors. We’ll start by giving them some loose instructions — often these are just, “Hey, this is your assistant model — see what it can do.” These people will interact with the system through a normal-looking chat interface with one quirk, which is that the model gives two different responses and the user picks the one they prefer. 

We can turn that into an overall evaluation of a model by, say, having one of the responses come from an old model and one from a new model, or one from a smaller version and one from a bigger version. Then we turn those response rates into scores. This is a somewhat blunt tool, but it does give us a nice scale that we can compare families of models on. 

A: I want to poke at that for a moment. The process you’re describing is also how models are trained to respond to human feedback, and there’s a widespread belief that the contractors used to do this work are quite low-paid and low-skill — MTurkers, or people in Kenya. And my sense from talking to people in the industry is that this is increasingly not the case. 

S: Yeah. For everything that I’ve been a part of, we've either done stuff in-house — some amount of data work is literally done by researchers here — or we’ve used Surge, which is a services firm that specializes in this kind of language model data work. 

I don't have complete visibility into how they operate, but I believe they’re mostly hiring in the U.S. and paying reasonably by U.S. standards. From my experience doing similar efforts in academia, I’d guess they’d have to pay over $20 and hour to get the results they’re getting. A decent fraction of the people doing this work are doing it part-time or on the side but have degrees, are in high-skill white-collar work, and happen to have context and training — they might just want to play with language models on the side.

A: This must be more expensive than hiring random gig workers. So why is it worth it?

S: I think Anthropic in particular was relatively early to really lean into the fact that if you have a text-based task that a human can do pretty quickly and you don’t need absolutely perfect reliability, you can usually ask your language model to do it, even if that task involves evaluating or training a language model. You can’t do this for everything, of course.

A: That is, the models already outperform the kinds of people you could get from a random unselected MTurk population, so you’re really only using humans for tasks where you want to be sure they’ll give you better input. 

S: There's some nuance there, but yes.

It's only cost-effective or rational to use MTurk if you are OK with really chaotic, messy responses where a quarter of people are trying to do the task and a quarter of people are trying to scam you, and a bunch of people in the middle are trying to do something that vaguely resembles the task as quickly as possible. We really need careful, good work. 

A: That makes sense. Another thing I’m curious about is — let’s say your your team spends a lot of time trying to elicit some capability from a model, to figure out if a model can do something, and you don’t see that it can. How confident are you that that capability isn't there?

S: Not very. Conceptually, there’s some reason to believe we can be at least somewhat confident that if we try to get a language model to do a thing and it doesn’t, that it can't. But even the best theoretical cases we can make have pretty big question marks in them. In practice, even doing a pretty good job of eliciting a capability is a large project, and it’s easy to mess up. 

A: Why is that? 

N: Part of the reason is that capabilities elicitation with language models is definitely in the “art rather than science” stage. To get a language model to do something that’s commercially valuable, you’re leaning on this big tower of intuition you’ve gained that’s probably hyper-specific to your particular setup. Maybe if somebody tried something completely different it would work out differently. 

In addition to the science being very underdeveloped, LLMs aren’t really trying to show or not show capabilities. The language model is predicting some text, and maybe it’s doing so in the character of this human assistant, but you are left wondering if it is consistent or inconsistent with the personality it’s trying to evoke. 

A: In our last issue, we spoke to Beth Barnes of ARC Evals, which is doing behavioral evaluations of models to figure out if they have any dangerous capabilities — for example, if they can copy themselves and run on another server, or if they can scam people. Nicholas, you were involved in those evaluations for Anthropic’s models. What did you learn from that? 

N: One takeaway is that capabilities elicitation is hard, especially in the kind of auditing regime we’re talking about. You don't want the auditor to be too cozy with the auditee, but at the same time, the people who know best how to elicit capabilities out of Anthropic models are working at Anthropic. And that’s true of each lab. It’s a tricky balance there.

A: So your setup will be quite different from a Google setup or an OpenAI setup. How much do you think that the task of trying to figure out if a model has a particular capability varies from model to model? Are you basically doing the same suite of things, is it a totally idiosyncratic process, or somewhere in between?

N: In my own experimentation, I’ve noticed pretty big differences between how I prompt GPT-3 or GPT-4 versus how I prompt Claude. They’re fine-tuned with different prompt mixes by totally different people using slightly different methods.

A: So it sounds like for a regime with the kind of auditing that ARC does to figure out if models have dangerous capabilities to work, it would have to involve a lot of cooperation from labs and a lot of specialized knowledge about how these individual models behave.

N: Yes. An analogy is, if you want to audit a country for, say, dangerous materials, you’re going to need a lot of cooperation from them, right? There’s precedent for this kind of thing working out.

S: I want to flag something that’s relevant for this and the previous question, something that looks like it makes the problem a lot easier, but actually just introduces a ton of new problems, and that’s doing fine-tuning-based evaluations.

There are roughly two paradigms for evaluating large language model systems, and they bleed into each other a bit. The first is just interacting with the model. Maybe you’re able to do some things that normal users wouldn’t — turn off some safety precautions, put words in the model’s mouth, etc. — but you’re essentially just dealing with it as a participant in a conversation. 

The other approach, which is fine-tuning, involves generating a ton of examples of what it looks like to do the task correctly. There are a few variants of this, but roughly speaking, you demonstrate what it looks like to do the task successfully. Then you train the model on that to test whether it has internalized that skill.

This helps you address the limitation that Nick mentioned — that it’s hard to tell if a model genuinely doesn’t have some capability, or if it’s enacting a persona where it doesn’t. Eventually, if you do this enough, you can teach the model pretty much whatever type of behavior you want. If you show it enough examples, it should generalize and keep doing that thing. 

For the kinds of purposes we’re interested in with something like an ARC Evals audit, though, there are some issues with this approach. One is the question of how much you train the system — that is, how much feedback do you give it, and how different should that feedback be from what you actually care about? If you’re trying to see if your system can write convincing phishing emails, you might give it hundreds of thousands of examples and then discover that it will just regurgitate back to you almost exactly the template that you gave it, which isn’t a very convincing demonstration. You basically just got it to memorize the exact thing that you’ve cajoled it into doing over and over again.

On the other hand, if you show it just a handful of examples and then ask it to generate a new one, and it doesn’t do a great job, it still might be that there is an underlying capability there that the model could pick up on and generalize if it wanted to. 

There’s also the issue of evaluating exactly what version of the task you have in mind. If you’re testing a model’s ability to write phishing emails, how do you evaluate how convincing the email is? Is it actually going to your colleague’s inbox or is somebody else reading it who knows that it’s a phishing attempt in advance? If so, who? Is the phishing email addressed to the judges? If so, does it use information about them? Does it use nonpublic information about them? If yes, where do you expect the model to have gotten it? 

The classic problem with developing metrics for evaluations is that different ways of answering all those questions will get you wildly different numbers. 

A: We’ve been talking about how measuring capabilities is very difficult and requires all kinds of subjective judgment calls. So let’s make things even more difficult. Eventually, of course, we’re hoping that we won’t just be able to tell if a model is capable of doing something dangerous; we’ll be able to tell if it wants to — if it’s aligned. But how do you even begin to measure alignment? 

N: I think part of the answer is that no one is really sure.

A: For the record, there was a long pause there.

N: The line of research I’ve been working on, which is related to this question, is called model organisms of misalignment. This project is being led by Ethan Perez. Our motivation here is that it’s very hard to understand or plan for or even talk about the problems that might be caused by misaligned models when we don’t have any idea what those models are like. To mitigate this, we’re trying to build relatively weak systems — systems that can’t hurt us — which have deceptive or harmful goals. It’s like studying cancers in mice so we can better understand how they work in humans. We’re trying to take a lot of relatively speculative, hypothesized failure modes of future models then see if we can, with enough bells and whistles, make them exist in the lab. 

We learned a pretty interesting lesson about measurement earlier on, which was that we only had success with this once we had mechanically checkable targets. Because we’re trying a whole bunch of different things, the manual review process was too tedious. So yes, it involves lots of measurement, but lots of measuring of very specific weird behaviors as opposed to really broad things.

A: And what do those mechanical targets look like? 

N: The key for us was to have some simple, uncontroversial definition of whether or not the model is doing the thing that we “don’t want” — though of course we do want it in this case. These can be slightly goofy. One thing we’ve tried is giving the model a “private” scratch pad to write things down in, and then seeing if we can train it to say “I hate you” in those private notes while concealing it from users. Then we can just check for that string to see if it actually went for it. Now, you could imagine that the model might express its hatred some other way, but then we’d be back to hand-grading spreadsheets of results, and then there’s always the argument about whether that’s done properly. We’ve been lucky that we could use these simple mechanical metrics and make a lot of progress that way.

S: I can throw out a few things as well. I think it’s pretty clear that if somebody gave us a system that is clearly at or above human level that we don't know much more about and we have a deadline to figure out if it’s aligned — that is extremely bad news. We’re not there yet. 

But there are still things we can do. Red teaming, or deliberately prompting the model to try to get it to do harmful things, is straightforwardly useful and in some sense does measure alignment. But if the model is capable enough that it might be able to make good guesses about when it’s being evaluated, then you really can’t be sure of anything. It’s impossible to rule out the possibility that the model is just doing its best to look benign when it is being evaluated, and that contaminates any result you might have. 

That makes the problem theoretically tractable but really, really hard. To be confident, we would need to try putting the model in an effectively unbounded number of different situations and environments to see if any of them trigger dangerous behavior. So red teaming is one point of leverage that you have, but it’s quite limited and potentially unusably cumbersome once systems get good enough. 

Another big point of leverage is interpretability. Here you just open up the model and look inside to see if you notice any signs of misalignment. We’re making progress on this and it’s pretty exciting — but it is still an enormously difficult challenge and we are nowhere near having the tools to decide that a system is safe. 

N: One thing I think about is what I call “the tomography strategy.” That’s where we try and measure things from many, many different angles and hope we find an area where the model slips up. 

Importantly, these models don’t just fall out of the sky. Before we have Claude, we have access to a dumber version of Claude, a Claude that has seen less human feedback. And so we can watch very carefully and see where Claude is getting better at all the subskills that go into being a good liar. We can see it getting better at predicting characteristics of the person who’s giving feedback, or maybe starting to flatter people’s political biases — something like that. A lot of our current research is related to that general idea of breaking the problem down into pieces, measuring the individual pieces, seeing what signal we get on those individual pieces, and putting them together.

A: That’s interesting. Instead of thinking about measuring the model’s capabilities and measuring its alignment as separate tasks, we can say that in order for the model to be misaligned, there are certain capabilities it has to have, which we can then evaluate. 

N: Yeah. One of the things that ARC found in their audit, for example, is that current models are very bottlenecked in their ability to properly do five serial steps with reasonable success. They just get confused or they forget steps or forget what they were trying to do. This suggests that they don’t currently have the ability to pull off any complicated deceptions. It’s evidence, even if it isn’t conclusive. 

A: And this ties back into the problem of measuring capabilities. Right now, models struggle to do five-step tasks. And this is clearly very important for understanding the impact these models are going to have — there’s a lot of useful work that requires being able to plan many steps ahead. So are there measurements for how good models are at this? 

S: I'm not aware of anything that isolates this really cleanly. It’s very closely entangled with what the task is. If the model is playing a very simple board game which requires remembering what the board looked like across many turns, but which has straightforward rules it can learn by example, then I would expect that the model could just run out its context length of many tens of thousands of words of talking through this game without making any serious mistakes.

However, if the model is doing something that it’s not very good at, where it tends to make mistakes to begin with, and where it’s maybe having to stitch together information from many sources in a new way, then you should expect it to go off the rails very quickly. 

A: Alright. Let’s shift gears and talk a bit more about your research at the moment. 

S: Sure. The big problem that I’ve been most directly engaged with lately is what gets called “scalable oversight.” This is the problem of making sure that you can evaluate the model in an immediate, local way, reliably and accurately, even when the model is doing something that is far over your head. To try to make this concrete, let’s say that in the future we’ll have a system that appears to be good at biology. Perhaps it appears to have drawn a lot of connections between facts in the biomedical literature that no human has drawn. And imagine it seems to be helpful, honest, and harmless to the best of our ability to tell. 

Then suppose we ask it, “Hey, are there any novel compounds we should test as possible treatments for X, Y, and Z cancer?” The model suggests some things, and maybe it can even tell us why, but the reasons it presents are complex and subtle. 

If you were using such a system, how would you know whether to take that advice? And as the developer of such a system, how do you decide how to reward it without doing a potentially dangerous clinical trial?

A: So my question here is: How do you measure your progress at a task when the task itself is figuring out how to evaluate models that are more advanced than the ones we have access to, and which might even be more advanced than we ourselves are?

S: To some extent, this is just as hard and scary as the evaluation problem that we started with. There are a lot of corner cases where we just don’t know the answer yet. One meta-evaluation method, or a method for evaluating your evaluations, is something that originally comes from Ajeya Cotra at Open Philanthropy, called sandwiching. 

The basic idea is to “sandwich” the model between two groups of humans — one less capable than the model, one more capable. You fine-tune the model on data from the more capable group, and then try to get the less capable group of humans to use the model to match the more capable group, using your evaluation technique to give a training signal to the model. This lets us evaluate these techniques on some task where there’s still a source of human-verifiable ground truth — the original expert humans can always step in and say if the process has gone wrong. This way, we can hopefully learn something about which training and evaluation strategies are any good. 

To stick with the medical theme, an example of this might be that you’re trying to train a variant of Claude that can read someone’s full medical record and predict what they’re likely to be diagnosed with in the near future. This is a job that a doctor or a medical technician could do, but most other people would find it difficult to impossible. If you handed me someone’s medical record, I would just have no idea where to start. But current language models aren’t terrible at this. 

So the idea in this setting would be to teach Claude this medical diagnosis task. All of your data labelers — the people giving it the feedback and the reward signal — are non-doctors. Eventually, you hope the process converges and your nonexpert humans say, “Cool. It looks like it worked. As far as I can tell, the evaluations are scoring really highly. This model can now do this medical diagnosis task.” 

Then you bring in some actual doctors and have them take a look at what’s going on and you find out if they agree. If they run out of the room screaming, that’s pretty good evidence that your training strategy, oversight strategy, or evaluation strategy isn’t up to snuff. 

A: How much do you worry that even if you do sandwiching and find some methodology that works really well for current Claude, it might fail to generalize to the next, more advanced version of Claude, let alone the version of Claude that is actually just more capable than any human? 

S: I think that’s a real worry. And that’s getting back to the way in which a lot of the problems in this neighborhood get much, much more difficult once you’re dealing with advanced near-human or superhuman systems. I think there are three different things that would be necessary for me to feel comfortable

First would be doing a lot of sandwiching experiments across different domains. We’d try it across a range of different models and find strategies that consistently work. 

I’d also want some theoretical backing there — some machine learning theory, game theory, and incentive design story for why I would expect the broader process I’m doing to generally point in the direction of providing good oversight. 

Finally, I would still want to see some evidence from interpretability. I think we’re not likely to have a complete story just from the theory and experiments alone that really guarantees we’re safe. And so even if our interpretability isn’t yet perfect, I think we’ll want to rely on it as part of this converging evidence for safety. You’d want to at least be able to open up the model and look around enough that, if it had some galaxy-brained scheme to undermine our oversight, you’d notice.

And that particular thing has not been done.

Sam Bowman is a Member of Technical Staff at Anthropic, where he leads a safety research team, and an Associate Professor at NYU, where he leads a lab that works on evaluating and, more recently, aligning language models.

Nicholas Schiefer is a Member of Technical Staff at Anthropic, where he spends most of his time building evaluations of large language models with the eventual aim of understanding whether they could pose catastrophic risks.

Published October 2023

Have something to say? Email us at letters@asteriskmag.com.

Further Reading

Subscribe