A Field Guide to AI Safety

Kelsey Piper

AI safety is starting to go mainstream, but the researchers who’ve been immersed in it for over a decade still have strong disagreements.

It’s hard to make progress in a field without a consensus about what it studies or what would constitute a solution to its most important open questions. Not unrelatedly, there hasn’t been much progress in the field of ensuring that extremely powerful AI systems don’t kill us all, even as there’s been growing attention to the possibility that they might.

It’d be a mistake to characterize the risk of human extinction from artificial intelligence as a “fringe” concern now hitting the mainstream, or a decoy to distract from current harms caused by AI systems. Alan Turing, one of the fathers of modern computing, famously wrote in 1951 that “once the machine thinking method had started, it would not take long to outstrip our feeble powers. … At some stage therefore we should have to expect the machines to take control.” His colleague I. J. Good agreed; more recently, so did Stephen Hawking. When today’s luminaries warn of “extinction risk” from artificial intelligence, they are in good company, restating a worry that has been around as long as computers. These concerns predate the founding of any of the current labs building frontier AI, and the historical trajectory of these concerns is important to making sense of our present-day situation. To the extent that frontier labs do focus on safety, it is in large part due to advocacy by researchers who do not hold any financial stake in AI. Indeed, some of them would prefer AI didn’t exist at all.

But while the risk of human extinction from powerful AI systems is a long-standing concern and not a fringe one, the field of trying to figure out how to solve that problem was until very recently a fringe field, and that fact is profoundly important to understanding the landscape of AI safety work today.

Everyone Disagrees

A May 2023 open letter by the Center for AI Safety saying “Mitigating the risk of extinction from AI should be a global priority” had a striking list of hundreds of signatories, including prestigious researchers in academia and key leaders at the labs building advanced AI systems.

The enthusiastic participation of the latter suggests an obvious question: If building extremely powerful AI systems is understood by many AI researchers to possibly kill us, why is anyone doing it? The simple answer is that AI researchers — in academia, in labs, and in government — disagree profoundly on the nature of the challenge we’re facing. Some people think that all existing AI research agendas will kill us. Some people think that they will save us. Some think they’ll be entirely useless.

An incomplete but, I think, not uselessly incomplete history of AI safety research would look like this: The research field was neglected for decades, worked on by individual researchers with idiosyncratic theories of change and often with the worldview that humanity was facing down an abrupt “intelligence explosion” in which machines would rapidly surpass us. Eliezer Yudkowsky and the Machine Intelligence Research Institute are representative of this set of views.

In the last 10 years, rapid progress in deep learning produced increasingly powerful AI systems — and hopes that systems more powerful still might be within reach. More people have flocked to the project of trying to figure out how to make powerful systems safe. Some work is premised on the idea that AI systems need to be safe to be usable at all: These people think that it’ll be very difficult to get any commercial value out of unsafe systems, and so safety as a problem may effectively solve itself. Some work is premised on safety being difficult to solve, but best solved incrementally: with oversight mechanisms that we’ll tinker with and improve as AI systems get more powerful. Some work assumes we’ll be heavily reliant on AI systems to check each other, and focuses on developing mechanisms for that. Nearly all of that work will be useless if it’s true we face an overnight “intelligence explosion.”

One might expect that these disagreements would be about technical fundamentals of AI, and sometimes they are. But surprisingly often, the deep disagreements are about sociological considerations like how the economy will respond to weak AI systems, or about biology questions like how easy it is to improve on bacteria,¹ or about implicit worldviews about human nature, institutional progress, and what fundamentally drives intelligence.

There are now quite a few people — more than 100, though fewer than 1,000 — across academic research departments, nonprofits, and major tech labs who are working on the problem of ensuring that extremely powerful AI systems do what their creators want them to do.

Many of these people are working at cross-purposes, and many of them disagree on what the core features of the problem are, how much of a problem it is, and what will likely happen if we fail to solve it.

That is, perhaps, a discouraging introduction to a survey of open problems in AI safety as understood by the people working on them. But I think it’s essential to understanding the field. What it’s going to take to align powerful AI systems isn’t well understood, even by the people building them. There are people working side by side on the same problems, some of whom think they are facing near-certain death and some who think there’s about a 5% chance of catastrophe (though the latter will still hasten to note that a 5% chance of catastrophe is objectively quite high and worth a lot of effort to avoid).

If you aren’t confused, you aren’t paying attention.

The “Intelligence Explosion” and Early Work on Preventing AI Catastrophe

In the last half of the 20th century, conceptions of AI tended to envision systems designed explicitly, by writing code. (This is reasonable to imagine, but importantly not how modern deep learning actually works.) Turing, for example, envisioned that AIs might use a decision rule for weighing which action to take, as well as a process by which we could insert better and better decision rules. Another common element in early conceptions of AI was the idea of recursive self-improvement — an AI improving at the art of making smarter AI systems, which would then make even smarter AI systems, such that we’d rapidly go from human-level to vastly superhuman-level AI.

In a 1965 paper, pioneering computer scientist I. J. Good posed the first scenario of runaway machine intelligence:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an “intelligence explosion,” and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

Good used the term “intelligence explosion,” but many of his intellectual successors picked the term “singularity,” sometimes attributed to John von Neumann and made popular by mathematician, computer science professor, and science fiction author Vernor Vinge.

This is the basic set of intuitions that shaped nearly all discussion about superintelligent AI until quite recently. The possibility of self-improving AIs, intelligence explosions, and the singularity was largely discussed in the overlapping, tech-positive, sci-fi influenced futurist, Extropian, and transhumanist communities; at the time, very few others were considering the question of how to build powerful AI systems safely at all.

Much AI safety work in the 1990s and 2000s — especially by Eliezer Yudkowsky and the nonprofits he founded, the Singularity Institute and then the Machine Intelligence Research Institute — emerged from this set of assumptions. The specific claim that there’ll be a turning point is a crucial one separating this worldview from others. Yudkowsky and those who hold this belief tend to think that intelligence — in entities both artificial and biological — has a critical point — call it generalization, or coherence, or reflectivity, or the thing that separates humans from chimpanzees. Humans, possessing this ineffable quality, have built civilizations that wildly surpass anything any other species could do. AIs think faster than us, and unlike us they can copy themselves and adjust their own minds. Once they cross that threshold, they’ll surpass us fast.

What does this worldview suggest about AI safety? It suggests that gradual and incremental approaches, where we build steadily more powerful systems, figure out how they work, figure out how to align them towards human objectives, and then take the next step up in intelligence, probably won’t work. At some point your system will unexpectedly develop the ability to rapidly amplify its own intelligence, or it will figure out how to design successors and do that, or someone else who isn’t being as cautious as you will do one of those things and surpass you overnight.

In this view of AI safety, we “get one shot” — we can’t learn from alignment failures, as we won’t notice them until our systems are vastly superhuman. We don’t benefit much from having a nearly aligned system — it’s not a problem where you get 90% of the benefit from solving 90% of the problem. If a system is almost aligned, its vastly amplified successor won’t even be close. And this worldview envisions fairly little useful human input as the system rapidly ramps up in capabilities, because that ramp-up is expected to happen in the blink of an eye.

A weak form of Yudkowsky’s claims here seems very likely to be true. Certainly, any AI system that is useful at all will be, among other things, useful for designing more AI systems. Already, today’s very weak AI systems are labor-saving devices for programmers, and thus probably somewhat hasten the advent of their successors. But the strong version of the claims seems much more uncertain. (That’s not to say that they’re demonstrably or obviously false. Many of Yudkowsky’s most outspoken critics will say, when pressed, that he might be describing a real problem that will really destroy us; he’s just excessively confident of it.)

Yudkowsky has written that the “most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die.” “Most likely,” as he’s stated elsewhere, corresponds to a 99% chance that this vision comes true.

This is why people will colloquially refer to him, and those who agree with him, as “doomers.” Doomers are, I think, best characterized as people who work from the intelligence-explosion premise, are skeptical of the accounts of why it might not apply, and give very high probabilities that AI kills us all.

If that sounds like a rough place to find oneself, it is. Little of the AI alignment work that started from these or similar premises looks promising today, even in the eyes of its own proponents. Instead, to the extent that these premises are correct, we should just stop building powerful AI systems, indefinitely, until we have a better idea of how to kick off the avalanche that is a self-improving superintelligence without catastrophe.

We should return, then, to the question I opened with: “If building extremely powerful AI systems is understood by many AI researchers to possibly kill us, why is anyone doing it?” The answer is that they disagree with the Yudkowskian worldview on one or more points, have a different model of the threat, and therefore think that work today is likely to be helpful in solving the problem in time. There’s no bizarre paradox where people are funding work that by their own worldview is likely to kill us, just intense disagreement about what is likely to kill us and therefore what work might help.


Open Philanthropy and Friends

The second major worldview in AI safety is associated with Holden Karnofsky (co-CEO of Open Philanthropy, currently at the Alignment Research Center) and with his Open Philanthropy colleagues Ajeya Cotra, Joe Carlsmith, and Tom Davidson, as well as Paul Christiano (head of ARC and Cotra’s husband). Some of them, like Christiano, became interested in the alignment problem independently. Others, like Karnofsky, encountered the idea through Yudkowsky and others at the Singularity Institute in the early 2010s, but came to develop their own views.

To understand their perspective, we’ll have to get into more detail about our existing methods for making AI systems do what we want them to do, and why those methods might stop working when AI systems get sufficiently powerful.

ChatGPT and similar releases from OpenAI were trained with reinforcement learning from human feedback — a technique Christiano helped develop. In RLHF, humans rate outputs from the models, and the models then learn how to give answers that humans would rate highly. RLHF is imperfect, and the AI is sometimes wrong about which answers would get positively reinforced, but it will probably get better as more resources are devoted to it and new tools are developed to supplement it. The bigger problem might be that RLHF, and similar techniques, fundamentally teach AIs to say what we want to hear, not to do what we’d want them to do if we had full context on their decision-making.
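
To make that dynamic concrete, here is a deliberately toy sketch of the feedback loop RLHF relies on. Nothing in it reflects how production systems are actually built: the “policy” is just a probability table over a few canned answers, and the “human rater” is a hard-coded scoring function standing in for real annotators. But it captures the core mechanism, in which answers that raters score highly become more likely, and shows why that mechanism can reward telling us what we want to hear.

```python
# Toy sketch of an RLHF-style feedback loop. Deliberately simplified: the
# "policy" is a probability table over canned answers, and "human feedback"
# is a hard-coded rating function standing in for real annotators.
import random

ANSWERS = [
    "I don't know, but here is how you could find out...",  # honest, hedged
    "The answer is definitely X.",                           # confident, possibly wrong
    "Great question! You're so smart to ask.",               # flattering, empty
]

def human_rating(answer: str) -> float:
    """Stand-in for a human rater: rewards answers that *sound* good."""
    score = 0.0
    if "definitely" in answer:
        score += 1.0   # confident answers tend to be rated highly...
    if "smart" in answer:
        score += 0.5   # ...and so does flattery
    if "I don't know" in answer:
        score -= 0.5   # honest uncertainty is often penalized
    return score

# The "policy": a weight per answer, updated toward highly rated ones.
weights = {a: 1.0 for a in ANSWERS}

def sample_answer() -> str:
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

for step in range(1000):
    answer = sample_answer()
    reward = human_rating(answer)
    # Multiplicative update: answers that raters score highly become more
    # likely. This is the dynamic RLHF exploits, and the source of the worry
    # that models learn to say what raters want to hear.
    weights[answer] *= (1.0 + 0.01 * reward)

for a, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{w:8.2f}  {a}")
```

Run long enough, the confident and flattering answers crowd out the honest, hedged one, not because the system “wants” to flatter, but because that is what the ratings reward.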

The worry here is that as we build more powerful systems, the small disconnects between what we’re training them to do and what we think we’re training them to do will be magnified. AI systems that are good at lying to us will — on various evaluations — outperform AI systems that are trying to be candid. (Here, Cotra gives the example of the Catholic Church circa 1500 trying to train an AI. If this AI correctly reported that the Earth revolved around the sun, it would be rated more negatively than if it said the opposite.) Without specific countermeasures, AIs trained this way will have every incentive to manipulate us, and to hack and falsify the mechanisms we use to monitor them.

RLHF and related techniques also make AIs much more useful. If it’s possible to build systems that are unaligned but commercially viable, then over time we will likely build them, interact with them, and use them for economic activity at extraordinary scale. One hundred million people were using ChatGPT within weeks of its launch. People have built language-model-powered businesses. Industries are being revolutionized by language-model-driven automation. 

And all of that is happening while present-day language models are in their infancy, with severe limitations that seem very likely to be a product of our inexperience with the technology. Language models today are vastly better than they were five years ago. If they improve by remotely similar amounts in the future, they will be able to automate significant fractions of human labor — including the labor that goes into developing better AI systems. The economic implications will be enormous.

You can probably build useful and powerful systems that pose no risk to human civilization. But it seems equally obvious that, at some point, if you build a parallel economic society of billions of entities that can do most or all of the things humans can do, you’re in a situation ripe for losing control of the world. That might take the form of a spectacular sci-fi conquest by AIs using advanced weapons or plagues they invented. It might be geopolitical — AIs siding with one nation to help it crush its adversaries, in exchange for enormous power in the aftermath. It might be legal — AIs purchasing virtually all land and all capital from less economically productive humans. It might involve sophisticated manipulation. But without committing to any of those stories, it seems like a world where we don’t need to solve alignment for commercial viability, don’t solve alignment, do attain commercial viability, and go full speed ahead could be a world where, as Karnofsky put it, we “sleepwalk into AI catastrophe.”

This worldview suggests some obvious ideas about which avenues of research are promising. It’s crucial to detect whether your AI is actually aligned. It’s important to understand what current AIs are capable of, so you know when you get to the brink of potentially catastrophic problems. And of course it’s important to develop alignment techniques that labs will adopt even if they aren’t necessary for commercial viability. If we do all of that, or even just do some of that and get a bit lucky, or if advanced AI systems are able to make any of that work easier to do, it feels plausible that humanity can score a big win.

I want to highlight some fundamental and important disagreements between this worldview and the Yudkowskian one, because their premises sound superficially fairly similar: Both are concerned with the possibility that AI systems could be used to automate AI research itself, enabling fast capability growth as new algorithmic and hardware improvements are developed.

First, while researchers at Open Philanthropy generally believe that superintelligent AIs will be developed, they don’t think that this is necessary for AIs to seize power from humanity. They tend to be less concerned with raw intelligence than with the resources and information AIs have access to. If AIs outnumber humans, think faster than humans, and are deeply integrated into every aspect of the economy, an AI takeover seems plausible — even if they never become smarter than we are. This means that decisions about how AIs are deployed also have important implications for safety. The more control humans choose to retain over things like the supply chains that produce microchips, the harder it will be for AI to defeat us.

Second, the Open Philanthropy worldview isn’t premised on the assumption that there will be a “hard takeoff” where AIs rapidly become superintelligent. There’s still broad disagreement about how long this might take, but those who believe it will happen fast think that it will still likely be continuous. Instead of a single intelligence switch that can be flipped on or off, they think that AIs will probably get gradually smarter. This means that superintelligent AIs might have a lot in common with the much less capable systems that exist today, just as GPT-4 is smarter than GPT-2 while sharing the same fundamental architecture.

Critically, this means that tools for alignment and oversight that work on existing AI systems might actually be useful for helping us align future superintelligences. Because of concerns that RLHF and related methods might make AIs more likely to deceive humans, many of these tools involve figuring out what a model is “really thinking,” whether by looking directly at its weights or by verifying certain mathematical properties of its behavior.
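
For a flavor of what that looks like in practice, here is a minimal sketch of one such tool: a linear “probe” trained on a model’s internal activations to predict whether the model represents a statement as true, regardless of what it says out loud. Everything here is illustrative; the activations are synthetic vectors standing in for hidden states that would actually be extracted from a real network, and real interpretability work is far more involved. But it shows the basic move of reading a model’s internals rather than trusting its outputs.

```python
# Minimal sketch of a linear probe on (synthetic) model activations.
# In practice, `activations` would be hidden states extracted from a real
# network, paired with labels for whether each statement is actually true.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these are hidden-layer activations for 1,000 statements (dim 64).
n, d = 1000, 64
truth_direction = rng.normal(size=d)        # assume truth is (partly) linearly encoded
labels = rng.integers(0, 2, size=n)          # 1 = statement is true, 0 = false
activations = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, truth_direction)

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)

# The probe: a simple linear classifier over activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out activations:", probe.score(X_test, y_test))
```

If a probe like this disagrees with what the model says, that is at least a clue that the model’s stated answer and its internal representation have come apart.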

It seems entirely possible that existing proposals won’t be sufficient to get the desired behavior from extremely powerful systems. But if they can get desired behavior from moderately powerful systems, and then we can develop better proposals with the aid of those moderately powerful systems, we might get somewhere. The more alignment is a matter of a “grab bag” of tools and techniques, rather than a project that requires a comprehensive structural solution, the more this approach looks viable.

An Optimistic View of AI Safety — and Where It May Fall Short

Over the next five years, it seems very likely that we’ll develop more powerful AI systems, and that the effort of ensuring they do what their creators intend will intensify. There’s a spectrum of views about how that will go. The most optimistic, of course, is that it will be easy to make systems do precisely what we want.

It’s hard to point to a single outspoken partisan of this view: People who hold it tend to regard much of the AI safety conversation as a waste of time, and thus tend not to phrase their beliefs in its terms. Yann LeCun, chief AI scientist at Meta, has views that fall under this broad umbrella: We’ll just tell the powerful AI systems what to do, and figure out how to get them to do that, and it won’t be that hard or that catastrophic to get slightly wrong. “To guarantee that a system satisfies objectives, you make it optimize those objectives at run time,” he argued in a recent Twitter conversation with Eliezer Yudkowsky. His proposal, he said, “is a way to guarantee that AI systems be steerable and aligned.”

The claim that this or any existing proposal guarantees that AI systems will be steerable and aligned is false and, frankly, unserious (though, to be fair, LeCun’s actual views are probably somewhat more nuanced than his tweets). “Unless a breakthrough is achieved in AI alignment research … we do not have strong safety guarantees,” Yoshua Bengio, a leading AI researcher and one of the pioneers of deep learning, argued in a recent analysis.

But without going so far as to claim that alignment is guaranteed, a decent share of researchers at AI labs from OpenAI to Google to Meta expect it to be not all that difficult. That is, our existing methods will be approximately sufficient, errors will be obvious and easy to correct, and the techniques we use to align those systems will just continue to produce, reliably and robustly, the behavior the creators intended, even as AI systems get steadily more intelligent, more powerful, and more difficult to oversee.

A variation on that, or a different phrasing: If we are lucky, maybe AI alignment will be effectively reducible to the problem of making AI systems that have a low enough error rate to be commercially useful — that is, the alignment problem will turn out to be just the problem of hallucinations plus the problem of robustness against adversarial inputs plus the problem of getting high-quality outputs out of an AI system at all.

If this is true, we’ll solve alignment incidentally along the way to building commercially valuable AI systems. This isn’t an argument for not working on alignment — indeed, it suggests working on it will be ludicrously profitable! — but it’s an argument against slowdowns, pauses, or regulation. AI work can proceed as fast as possible, and simply won’t be useful until we’re very good at alignment.

How plausible is this worldview? It certainly feels to me like there are some bad signs in present-day AI systems that proponents of this view need to explain away. Our existing techniques for achieving desired behaviors from AI systems feel non-robust. Engineers are constantly discovering loopholes and plugging new holes opened up by adversarial inputs. And if the Open Philanthropy worldview is right, then we’re not training AI systems to do what we want, but to tell us what we want to hear.

Much AI safety work at leading labs from Google DeepMind to OpenAI to Anthropic today is aimed at changing that — trying to develop more robust, more general, and more powerful techniques for getting desired behavior from AI systems, so that we can reasonably expect the techniques to keep working when the systems they’re applied to are really smart. As with Open Philanthropy, many of these techniques depend on training less powerful AIs to help supervise increasingly powerful systems (a toy sketch of that idea follows below). I think it’s far too soon to count out this kind of “mundane solution” to the alignment problem — but it also seems far too soon to feel confident it’ll work. Maybe alignment will turn out to be part and parcel of other problems we simply must solve to build powerful systems at all.
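
The sketch below is deliberately simplistic: a weaker, cheaper checker reviews a stronger model’s answers before they are accepted, escalating anything suspicious. Both “models” are stand-in functions invented for illustration; in a real system the checker would itself be a trained model, and escalation would go to human reviewers or further automated scrutiny.

```python
# Toy sketch of AI-assisted oversight: a weak checker reviews a strong model's
# outputs before they are accepted. Both "models" are hypothetical stand-ins.

def strong_model(question: str) -> str:
    # Hypothetical powerful model: capable, but not guaranteed to be honest.
    return "Confident answer to: " + question

def weak_checker(question: str, answer: str) -> float:
    # Hypothetical weaker overseer: estimates, in [0, 1], how likely the
    # answer is to be acceptable (honest, harmless, on-task).
    suspicious = any(phrase in answer.lower() for phrase in ("definitely", "trust me"))
    return 0.2 if suspicious else 0.9

def overseen_answer(question: str, threshold: float = 0.5) -> str:
    answer = strong_model(question)
    confidence = weak_checker(question, answer)
    if confidence < threshold:
        # Low-confidence answers get escalated rather than shown to users;
        # flagged examples can also be fed back into training.
        return "[escalated to human review]"
    return answer

print(overseen_answer("What does this contract clause mean?"))
```

The hope, on this worldview, is that layered checks like this keep working, and keep improving, as the systems being checked get smarter.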

Some people have called AI safety a “pre-paradigmatic field.” A survey like this makes it a bit clearer what that means. Growing agreement that there’s a problem hasn’t yet translated to much accord about what a solution would look like.

If you’ve encountered one account of AI safety and found it unpersuasive, you should shop around for another. People have wildly different conceptions of the problem, and disagreeing vehemently with one account of it doesn’t mean you’ll disagree with others. Those who dismiss existential risk concerns as too convenient for corporate interests might want to check out the case for existential risk concerns from those who think every extant AI lab is committing an ongoing indefensible moral evil; those who find Yudkowsky dangerously doomist might find a different accounting of the problem more credible.

My own most fundamental takeaway is that, when there is this much uncertainty, high-stakes decisions shouldn’t be made unilaterally by whoever gets there first. If there were this much expert disagreement about whether a plane would land safely, it wouldn’t be allowed to take off — and that’s with 200 people on board, not 8 billion.

  1.  You might wonder why the question “How easily can we improve on bacteria?” would matter at all to AI safety. The short answer is that if an AI could invent its own superbacteria that outcompete existing ones, it’d be pretty easy to take over the world. If that is out of reach even for an extremely intelligent biology AI, then taking over the world would likely be a less sci-fi affair.

Kelsey Piper is a senior writer at Vox’s Future Perfect. She writes about emerging technologies, global development, pandemics, effective altruism, and what it’ll take to make it safely to the 22nd century.

Published June 2023
