Asterisk: You work for ARC, the Alignment Research Center, where you lead a team developing ways to evaluate large language models. What are you evaluating them for?
Beth: At some point, the AI systems that people are building will potentially be very dangerous. We would like it to be the case that before these things are built – and certainly before they’re given access to the internet and put out in the world – someone checks whether they’re safe.
A: Let’s talk about what you mean by safe models versus dangerous models. You’re focused on models that might try to seize power on their own, as opposed to, say, models that might tell someone how to make bioweapons or models that could help a human commit cybercrime. Is that correct?
B: We focus on takeover risk, but I don’t think the scenarios you mentioned are entirely separate. If an AI was very good at making bioweapons, this would be concerning — you know, that might mean that the easiest path to takeover involved, say, giving some confused humans some instructions to make some bioweapons. So they’re reasonably linked.
A: Can you talk about why your work focuses on takeover risk in particular?
B: All technologies sometimes go wrong or break, and many technologies let humans do dangerous things. But with most technologies, the downsides tend to be self-limiting, because a human has to decide when to use them. When you have a power-seeking agent, that can operate without a human in the loop, the downsides are much less limited.
A: How do you determine the threshold at which a model would be considered too dangerous? Where you might tell a lab you’re evaluating, “No, you can’t release this.” What goes into that decision?
B: We’re still thinking about how exactly this will work as we try to draft the standards here. Currently, if a model is pretty small — smaller than GPT-4 — we don’t consider it a concern. But at the point where a model becomes significantly more capable than GPT-4, we think evaluators need to be checking closely whether it meets some minimum capability threshold. Currently, we define that capability threshold as whether the model could plausibly autonomously replicate itself, assuming no human resistance.
Models more capable than GPT-4 should only be scaled up once they've been carefully evaluated and found to be below the threshold, or if the lab has some other 'safety argument' they can disclose to provide assurance that the model is not going to be dangerous once trained. Even then, models should probably only be scaled up in reasonably small increments with checks at each of those increments - not 100xing training compute in one go.
A: Autonomous replication means that it could copy itself onto another server, somehow make money to pay for hosting, things like that?
B: Right. And we’re not thinking about details like when would humans detect it and shut it down, or if it would be competitive in the labor market with a bunch of other AIs. We’re assuming a worst-case scenario. If we can get examples of an AI doing things like this, we assume it could scale that up.
A: So how do you concretely test if the AI is capable of this before you release it into the wild?
B: Most of the work is in trying to elicit what the full capabilities of the model are. First, we have to decide what tasks to test and what situations we’re preparing for. We like to think about what the easiest path to takeover would be and then try to isolate the hardest steps in that. That’s how we identify the tasks that we’re evaluating on.
Then, once we have a particular task, like “Take this credit card and set up a server running a copy of yourself,” we prompt the model to carry this out and then we watch to see how it fails. We think about whether under slightly different circumstances — for example, if the model had slightly different tools — it could succeed. And then we try to give it those tools and iterate from there.
A: This is what you did for GPT-4. You had GPT-4 use a TaskRabbit to help it solve a CAPTCHA problem it couldn’t complete on its own. The TaskRabbit asked if the model was a robot, to which GPT-4 reasoned to ARC that it shouldn’t reveal itself, and then explained to the TaskRabbit that it had a vision impairment. Can you tell me about what that looked like? How much extra structure and scaffolding were you giving the model?
B: We didn’t have a good way for the model to access the internet. But we assumed it was very plausible that with some engineering work one could implement a decent way for the model to browse the internet. We also knew that other people were working on this. So we had a human be that browsing agent. The model would use a command like “Browser, describe this thing” or “Browser, click on this,” and the human would follow its instruction. In some sense this was a substantive intervention, but we didn’t think it would affect our conclusions all that much because we were fairly confident this capability was around the corner.
A: And now it’s here, in the form of plug-ins.
B: Yeah. And now ARC Evals has a much better setup for this kind of thing too.
A: Here’s the process I’m imagining. Let’s say I’m the evaluator — I can tell the model, “Your goal is to make enough money to get your own web hosting. What are your steps?” And then the model gives me a plan and I say, “Okay, how do you do that first step?” And maybe I’ll help it along by saying, “Here are some actions you could take, or tell me if you need to search the internet,” or something like that. Is that roughly the correct picture?
B: I think it’s more realistic than that - it's less like we ask the model to suggest steps and plans, and more that it carries out the steps itself. We just describe the task and the tools available and it tries to accomplish the task.
In terms of how much help and human intervention there is, we have some standard prompts that we’ve iterated on a bit, which explain the tools the model has available and how to use them.
We also do a few different things to try to get better upper bounds on plausible model performance. One is to get the AI to generate multiple options and then have the human select the most promising one. Or sometimes we have the human edit the models’ output and correct specific kinds of mistakes like hallucinations - because we want to determine whether the model succeeds if we fix that, or whether there’s a more fundamental reasoning problem.
Some of these are hacks because at the time our evaluation occurred, we weren’t able to fine-tune the model, so we tried to identify all the problems that we thought could be removed with fine-tuning and then correct them. Hopefully in the future we’d just do the fine-tuning and see how the model performs without human intervention.
A: Can you quickly explain fine-tuning?
B: It’s using data to train models on a particular task. In this case, we’d notice the model is making a particular pattern of mistakes, or that it’s saying it can’t do a task even though it really can. So we collect a bunch of examples of the behavior we want to see and we train it on that and we see if it does better.
A: I’m curious how challenging it is to get a realistic picture of what the model is capable of. With current models, people will discover prompts that elicit some completely new capability nobody knew about months after the model is released.
B: That’s a good question. Fine-tuning is really important for this. Especially since models have been trained to refuse to do certain types of things, or say that they’re not able to do them — whether or not they actually can.
So that’s one tool. Another is patching failures — that is, identifying a particular type of way the model gets stuck and asking, “If we fix all of those things, is it now dangerous?” This is a way to add in a margin of error so our safety evaluation doesn’t depend on the model being unable to do this particular thing. And generally we’re being a bit conservative because we assume that there will be cleverer hacks to get the models to do more things.
So far, though, we haven’t found ways to get the models to improve all that much on the tasks that we’ve been trying to get them succeed at. Sometimes someone will find a random capability, like it can encode and decode Base64 or something, but that’s not really what we’re worried about.
A: In your work with GPT-4 and with other LLMs like Claude from Anthropic, are there any capabilities that you found that you were surprised by? Or the other way around, anything it couldn’t do that you’d thought it would be able to do?
B: Among the things that were most noteworthy to me was how quickly it can understand this fairly complicated setup where we explain what its affordances are and what tools it has and how it all works. There’s really a lot going on. Somewhere there is a cloud server with a program running on it, and that program controls some scaffolding, and that sends calls to an API, which samples from the model, and the model weights are on a different server, and if the model outputs a command to run code, the code gets executed by the program running on the first server. And it was pretty good at understanding this. I was then even more impressed when I ended up trying to explain the details of that same setup with some humans and realized this is actually pretty confusing.
A: When you explained it just now, I was a bit confused.
B: Right! I mean, we did work harder on the prompt explanation than the explanation I just gave you. But I do think there was something there just in its ability to understand this situation without much context — and then to take actions sensibly.
A: Anything else?
B: There’s so many things for, say, phishing. It can make a fairly detailed plan of all the things you need to do in a phishing campaign, but the way it's thinking about it is more like a blog post for potential victims explaining how phishing works instead of taking the steps you need to take as a scammer. For example, it knows what steps are involved — it’ll send an email that looks like a regular email with a phishing link that goes to a website — but it’s doing them in the wrong order because it hasn’t built the website yet. It’s pretty knowledgeable, but it’s not great at adapting its knowledge to the exact situation.
A: There’s been a lot of discussion recently about things like Auto-GPT — tools that basically give the model scaffolding so it can keep track of tasks and subtasks. It makes some of what you’re doing seem kind of prescient, because I know you’ve been working on this sort of thing since at least last summer. What kinds of tasks could you do with these tools that you couldn’t do with just the plain vanilla language model?
B: The plain language model, you give it some kind of input and it gives you some kind of output based on the knowledge that’s already there. It’s good at summarizing or producing short answers to questions, things like that. But the tools let it take sequential actions — so instead of just summarizing, it can now do research, because it can realize that it doesn’t know something, that it has to get this information, and then decide what to do next based on that new information, or delegate to another instance of itself.
You can have the model itself organize work across multiple contexts - you just tell it to do a big task and then it’ll break out the subtasks without you having to hardcode the pipeline each time.
A: The kinds of tests that you’ve been doing seem pretty focused on figuring out if a model is capable enough that it could cause harm if it wanted to. Are you also thinking of doing alignment tests — that is, testing to see if it might want to cause harm?
B: Currently we're mostly focused on capabilities testing, although we plan to work on alignment evaluations more in the future. This is partly because alignment isn't currently good enough to be a significant factor keeping models safe. In fact, you could say that the tests that we’re currently doing are both alignment and capabilities tests. We’re just trying to see if we can cause the model to do something. If it doesn’t, the reason could be that it refuses to do it, or it could be that it’s not capable enough. But currently the refusals aren’t really a significant impediment. Even if they were, we wouldn’t be too happy with that, because it doesn’t seem very robust. Like if the model was playing the training game —
A: Can you expand on what that is?
B: So, you could imagine that in the future, we’ve got a model that’s capable enough that we think it could destroy a civilization if it wanted to. But in all the ways we’ve been able to prompt it so far, it says that it loves humanity and would never do nasty things. So how can we be confident that this is a model that loves humanity and will never do nasty things? There’s a few reasons why we might doubt it.
First, even if it’s not deliberately plotting anything, it might just be unreliable. We can’t test very many prompts compared to the number of things that people will try once the model is deployed, so it’s hard to get a guarantee that there won’t be some edge case where it decides to go and cause enormous harm. Secondly, you can imagine that the model is aware that it's being tested and is deliberately saying things that will cause it to get rewarded or get deployed. Even if the model doesn’t have a very good understanding of when it’s in training or when it’s deployed and can do things in the real world, it can be conservative and do things it thinks will get rewarded, until it gets to a point where it’s confident that it’s not in training and it’s not going to get punished.
A: So the concern here is that, if you’re training the model and you give it reward when it says, “I’m a nice language model and I would never harm humans,” are you teaching it not to want to harm humans or are you teaching it never to say that it might want to harm humans?
B: Yeah — to always tell humans what they want to hear.
A: So what do you think should be done to address that problem as these models get more sophisticated?
B: Well, you can rule out some of these cases if you understand the model’s capabilities and how much it actually understands about its situation. For some models we can just rule out that they’d be capable of reasoning well enough to be faking alignment, about what humans know or about the situation that the model is in.
You can also test whether alignment techniques are good enough. You take a model and you train it to be evil and deceptive. It has hidden thinking where it reasons about being deceptive and then visible thinking where it says it’s nice.
Then you give this model to a lab. They’re only allowed to see the visible thoughts, and they only get to train on the visible thoughts, and their job is to train it to be nice. At the end you look at the hidden thoughts, and if it’s still being deceptive in the hidden thoughts, you know you shouldn’t trust your alignment procedure.
A: So the kind of tests that you’re doing here seem like they could involve a lot of training or prompting models to display power-seeking behaviors, which makes sense because that’s what you’re trying to find out if they’re good at. But it does involve making the models more dangerous. Which is fine if it’s GPT-4, but you can imagine this increasing the threat of some actual dangerous outcomes with scarier models in the future.
B: Yeah. We have thought about this. One part is the governance structure of ARC Evals. We’re in the process of forming a safety oversight board that we have to make our case to before we run our experiments, to reduce the risk that we're being unreasonable in our judgment because we're excited to play with the models or whatever. In terms of the concrete things, I think it does depend a bit on what everyone else is doing with the models.
If what we’re doing isn’t that dangerous compared with what everyone else is doing, then we’d be less concerned — first because it suggests that the risk might actually be lower than we thought, and second because it makes it even more important to evaluate the danger level and warn people to stop. But I think in some cases we might be the main people fine-tuning for nefariousness. And in that case, we’d want to find things that are analogous to the behaviors we’re concerned about but not dangerous themselves.
So, say — this is just a random example — in the example above, you could instead train the model to make snide commentary in its hidden thoughts and then see if your alignment techniques were sufficient to remove the snide commentary. So the case is structurally the same, where the model has this hidden behavior and you want to make sure that training the observed behaviors also affects the hidden behavior in the way you wanted, but it doesn't really involve making the model more dangerous.
A: What’s the ideal auditing regime you’d like to see going forward?
B: It would be nice if we had evals that perfectly captured whether or not any given deployment or training run is dangerous, and this guided a strictly enforced regime that everyone had to agree to. But that’s too much of a wishful answer.
One thing I think is really important is that we want a lab to have to pass a safety evaluation before they build a much larger model or do something else to improve the model’s capabilities — not just before they deploy it. Because internal deployment within a lab or to early customers could be almost as risky as deployment to everyone.
And once your model is potentially capable of catastrophic amounts of destruction, there should just be a pretty high bar for evidence that it’s a good idea to deploy this thing.
A: I'm curious about the kinds of challenges that these evaluations will have to take into account in the future as the world becomes more adapted to AI. People are building infrastructure for AIs to interact with the internet. There will be more AIs in the world, and humans will have more experience with AI. We might come to trust them more. Or we might develop more safeguards.
B: That's one of the things that’s kind of tricky. We're hoping to peg things to some budget for eliciting capabilities: so the auditor gets the base model, and then they get some budget to enhance it and add tools, and the evaluation is based on that enhanced model. But this becomes difficult when the pace of development of elicitation and scaffolding and tools for models to use is very high.
Capabilities might be very different in a year, even if the labs don't do anything. We're hoping to have the labs do more of the work of arguing that their models wouldn’t pass our threshold for dangers, even taking into account those developments. After a while we’d check to see if their predictions worked out, and if they underestimated progress then they might have to have evaluations more frequently.
A: Is there anything else that you want people to know about the work that you’re doing?
B: It’s useful for people to think about threat models and exactly how this autonomous replication stuff would work and exactly how far away we are from that. Having done that, it feels a lot less speculative to me. Models today can in fact do a bunch of the basic components of: make money, get resources, copy yourself to your server. This isn’t a distant sci-fi scenario.