Ajeya: So the current assumption is that, to be production-ready for applications like travel booking, you need significant training on that specific application. But what if continued AI capability development leads to better transfer learning from domains where data is easily collected to domains where it’s harder to collect — for instance, if meta-learning becomes highly effective with most training done on games, code, and other simulated domains? How plausible do you think that is, and what should we watch for to see if that’s where things are headed?
Arvind: Those are exactly the kinds of things we should watch for. If we can achieve significant generalization from games to more open-ended tasks, or if we can build truly convincing simulations, that would be very much a point against the speed limit thesis. And this relates to one of the things I was going to bring up, which is that we’re both fans of predictions, but perhaps in different ways. AI forecasting focuses on predictions with very clear, adjudicable yes/no answers. There are good reasons for this, but one downside is that we’re focusing on really, really narrow questions. Instead, we might want to predict at the level of worldviews — propose different perspectives, identify a collection of indicators and predictions, gather data, and use human judgment to assess how well reality matches the predictions that result from different worldviews. The big-picture claim that the external world puts (and will continue to put) a speed limit on AI development is one that isn’t precise enough that you could turn it into a Metaculus question, but that doesn’t mean it isn’t testable. It just needs more work to test, and some human judgment.
Ajeya: In that case, we should talk about observations that might distinguish your worldview and mine. I think transfer learning, which I just mentioned, is one of those things: how well does training on easy-to-collect data — games, simulated environments, internal deployment — transfer to messier, real-world tasks like booking flights? Or, more relevant to my threat models, how well does it transfer to things like surviving in the wild, making money, and avoiding being shut down? And how would we know? We don't have great meta-learning or transfer benchmarks because they're inherently difficult to construct — to test transfer from the training distribution, we need to know what the AI is trained on, and we don’t have that information. Do you have thoughts on what we might observe in 2025 that would suggest this transfer or meta-learning is working much better than expected?
Arvind: You pointed out the dimension of easy- versus hard-to-collect data. I'd add a related but distinct dimension that's very important to me: low versus high cost of errors. This explains a lot of the gaps we're seeing between where agents are and aren't working effectively. From my experience, the various deep research tools that have been released are more useful than a web agent for shopping, say, because OpenAI’s Deep Research, though agentic, is solving a generative task rather than taking costly real-world actions.
A major theme in our work is how hard it is to capture this with benchmarks alone. People tend to look at two extremes of evaluation: pure benchmarks or pure vibes. There's a huge space in the middle we should develop. Uplift studies are one example — giving some people access to a tool and others not — but there's enormous room for innovation there. A lot of what my group is doing with the Science of Agent Evaluation project is figuring out how to measure reliability as a separate dimension from capability.
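To make the capability/reliability distinction concrete, here is a minimal sketch of one way to score them separately, assuming you have outcomes from k repeated runs of an agent on each task (the tasks, numbers, and function names are illustrative, not drawn from the Science of Agent Evaluation project): pass@k asks whether the agent ever succeeds, while pass^k asks whether it succeeds every time.

```python
# Minimal sketch: scoring capability (pass@k) and reliability (pass^k) separately.
# `results` maps each task to the outcomes of k independent agent runs; the data
# is made up for illustration.

results = {
    "book_flight":   [True, False, True, True, False],
    "plan_wedding":  [True, True, True, True, True],
    "file_expenses": [False, False, True, False, False],
}

def pass_at_k(runs):   # capability: did the agent succeed at least once?
    return any(runs)

def pass_hat_k(runs):  # reliability: did the agent succeed on every run?
    return all(runs)

capability = sum(pass_at_k(r) for r in results.values()) / len(results)    # 1.00 here
reliability = sum(pass_hat_k(r) for r in results.values()) / len(results)  # 0.33 here

print(f"pass@k (capability):  {capability:.2f}")
print(f"pass^k (reliability): {reliability:.2f}")
```

An agent can look strong on the first metric while scoring poorly on the second, which is exactly the gap that leaderboard-style benchmarks tend to hide.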
Ajeya: I'm kind of interested in getting a sneak peek at the future by creating an agent that can do some task, but too slowly and expensively to be commercially viable. I'm curious if your view would change if a small engineering team could create an agent with the reliability needed for something like shopping or planning a wedding, but it's not commercially viable because it's expensive and takes too long on individual actions, needing to triple-check everything.
Arvind: That would be super convincing. I don't think cost barriers will remain significant for long.
Ajeya: What do you think the results of that experiment would be if we did it right now?
Arvind: I haven't seen anything to suggest that reliability can be solved today, without new innovations, simply by using more compute.
But I wanted to raise something else you mentioned earlier about escape and self-reproduction as safety-relevant capabilities. I disagree there — I think we should assume every model will be capable of escape and self-reproduction. Safety shouldn't rely on that being difficult.
Ajeya: Do you think current models are capable of that, or is this just a conservative assumption we should be making?
Arvind: It's partly a conservative assumption, but it also relates to resilience versus fragility. I think many proposed safety interventions actually increase fragility. They try to make sure the world doesn’t get into some dangerous state, but they do it in such a way that if the measure ever fails, it will happen discontinuously rather than continuously, meaning we won't have built up an “immune system” against smaller versions of the problem. If you have weak models proliferating, you can develop defenses that scale gradually as models get stronger. But if the first time we face proliferation is with a super-strong model, that's a much tougher situation.
Ajeya: I think I see two implicit assumptions I'd want to examine here.
First, on the object level, you seem to believe that the defender-attacker balance will work out in favor of defense, at least if we iteratively build up defenses over time as we encounter stronger and stronger versions of the problem (using increasingly stronger AIs for better and better defense). One important reason I'm unsure about this assumption is that if AI systems are systematically misaligned and collectively extremely powerful, they may coordinate with one another to undermine human control, so we may not be able to straightforwardly rely on some AIs keeping other AIs in check.
Then, on the meta level, it also seems like you believe that if you’re wrong about this, there will be some clear warning sign before it’s too late. Is that right?
Arvind: Yes, I have those assumptions. And if we don’t have an early warning sign, the most likely reason is that we weren’t doing enough of the right kinds of measurement.
Ajeya: So let’s talk about early warning systems. If there's some level of capability in some domain where humanity's optimal choice would be not to develop AI systems at that level or beyond — at least without measures that might take years to implement — well, first, I believe such a capability level exists. I'm curious if you agree, or if you think there isn't really such a threshold?
Arvind: To me, that depends on information we don't have yet. One thing that would convince me such a level exists — and I find it helpful to be specific about types of risk rather than saying there are infinitely many — would be if we look at cyber risk and see the attacker-defender balance shifting toward attackers as capabilities increase. That would clearly suggest some capability level where defenders can't leverage those capabilities as effectively as attackers can, so maybe we should consider where to apply the brakes. But right now, I'm not seeing that evidence in any domain, though that could change.
Ajeya: What specific observations would constitute evidence that the balance is shifting in cybersecurity?
Arvind: One common argument for attackers having an advantage is that you might have a system with a million different points to probe — attackers can probe these automatically, and finding one vulnerability means they're in. But this oversimplifies things, ignoring defense-in-depth and the fact that defensive scanning can also be automated. Rather than just benchmarking capabilities, I'd love to have systematic ways to measure this: when you have this battle between automated offensive and defensive systems, what does the balance look like now? Can we quantify it? How is it changing over time?
About a decade ago, DARPA ran a two-year competition called the Cyber Grand Challenge. It was remarkable in many ways compared to usual cybersecurity evaluation efforts. It was an AI versus AI setup. That means defenses had to operate in seconds instead of days or weeks. That’s necessary if defenders are to have the advantage. Also, the cyber-defensive systems had to be responsible for the entire task of continuously securing a complex system, which means they had to integrate many techniques instead of focusing on one specific defensive capability. This created a much more realistic evaluation setup.
As far as I know, the competition ended in 2016 and was never repeated. That’s really unfortunate! We need to be doing this kind of evaluation on an ongoing basis. It would help us both measure and improve the offense-defense balance.
Ajeya: What about beyond cyber? What about broad general autonomy — the ability to do tasks like starting and running a successful company, or executing military operations autonomously? Would you be concerned about high levels of such capabilities? What information would you need to determine if we're ready to develop systems that are superhuman at general autonomy?
Arvind: Many of these capabilities that get discussed — I'm not even convinced they're theoretically possible. Running a successful company is a classic example: the whole thing is about having an edge over others trying to run a company. If one copy of an AI is good at it, how can it have any advantage over everyone else trying to do the same thing? I'm unclear what we even mean by the capability to run a company successfully — it's not just about technical capability, it's about relative position in the world.
Ajeya: You're right that these capabilities are not fixed but defined relative to competitors. For example, Steve Newman has a great blog post on developing better milestones for AGI. One measure he proposes is tracking what percentage of actual live issues on GitHub are solved by AI instead of humans. But that becomes a Red Queen's race — as AI becomes capable of solving certain issues, humans will start using it to solve all issues of that type, meaning that the remaining issues are precisely the ones AI can't solve. It doesn't work as an absolute capability threshold.
That seems similar to what you're getting at with companies — as AI systems start being helpful for running companies, everyone will use AI to run their companies. That raises the bar for what “human-level business-running” means — an AI trying to start a business on its own would be competing against human-AI teams rather than humans alone. But it's still a very different world from a human perspective — one where humans wouldn't be competitive running companies without AI advice. They might not be competitive unless they defer high-level strategy to AI, such that humans are CEOs on paper but must let their AI make all decisions because every other company is doing the same. Is that a world you see us heading toward? I think I've seen you express skepticism earlier about reaching that level of deference.
Arvind: I think we're definitely headed for that world. I'm just not sure it's a safety risk. I'd compare it to the internet — trying to run a company today without internet access would be absurd; you obviously wouldn't be competitive. I think AI will likely follow a similar path. The jagged frontier is key to how I think about this. Jobs are bundles of tasks, and companies are bundles of even more tasks; in most occupations, as AI capabilities increase and some tasks get automated, job definitions quickly shift to focus on what AI can't yet do.
For the foreseeable future, what it means to “run a company” will keep changing rapidly, just as it has with the internet. I don't see a discontinuity where AI suddenly becomes superhuman at running companies and brings unpredictable, cataclysmic impacts. As we offload more to AI, we'll see economically transformative effects and enter a drastically different world. To be clear, I think this will happen gradually over decades rather than a singular point in time. At that stage, we can think differently about AI safety. It feels premature to think about what happens when companies are completely AI-run.
Ajeya: This seems like a key theme — I'm most concerned about a world where AI's real-world impacts are really back-loaded because AI is being applied almost exclusively to further R&D toward artificial superintelligence. I don't see it as premature because I think there's a good chance the transition to this world happens within a few short years, without enough time for a robust policy response, and — because it’s happening within AI companies — people in the outside world may feel the change more suddenly.
Arvind: Yup, this seems like a key point of disagreement! Slow takeoff is core to my thinking, as is the gap between capability and adoption — no matter what happens inside AI companies, I predict that the impact on the rest of the world will be gradual. Although I don’t think there will be a singularity, I think the concept of a horizon is useful. There is a level of technological development and societal integration that we can’t meaningfully reason about today, and a world with entirely AI-run companies falls in that category for me. We can draw an analogy with the industrial revolution — in the 1760s or 1770s it might have been useful to try to think about what an industrial world would look like and how to prepare for it, but there’s no way you could predict electricity or computers.
In other words, it's not just that discussing this future now is unnecessary; it isn't even meaningfully possible, because we lack the knowledge needed to imagine it, just as people before industrialization couldn't have anticipated the concerns of a post-industrial world.
Ajeya: I feel like we're asking very different questions — almost flipping between existential and universal quantifiers. It takes a long time to reach that universal quantifier — the point where every last job is replaced by AI — because it's bottlenecked by the most conservative adopters, whether in medicine, law, or other heavily regulated sectors. But I'm thinking about an existential quantifier — where will the first potential explosive consequences happen? Especially illegitimate and harmful uses, which won't be constrained by regulation and human preferences the way helpful uses are. Maybe if we avoid the concerning scenarios I worry about emerging first, we could end up in the world you envision.
Arvind: Let's make this more concrete by discussing specific harmful use cases. I mentioned cybersecurity earlier — I think we're in a pretty good place there. Where we're not in a good place is pandemics. Even without AI, we're in a terrible position, and to the extent AI enables more actors to conduct bioterror, that multiplies the risk. These are things governments should spend heavily on. I agree we'll face massive risks before reaching a fully automated future, but these are crises we face even without AI. I'm not saying the risk is zero, but it's unclear how much AI amplifies it. In a way, the urgency AI adds to these risks could help wake policymakers from their complacency.
Ajeya: Okay, let’s consider the example of biorisk — we definitely agree that we’re already in a bad place with that one. In terms of how much AI amplifies the risks, let's say we did an RCT and discovered that random high school students could be talked through exactly how to make smallpox, with AI helping order DNA fragments and bypass KYC monitoring. If that happened in 2025, would you support pausing AI development until we could harden those systems and verify the AI now fails at this task?
Arvind: Well, my hope would be that we don't jump from our current state of complacency directly to that point. We should have testing in place to measure how close we're getting, so we can respond more gradually. While this is a low-confidence statement, I think the preferred policy response would focus on controlling the other bottlenecks that are more easily manageable — things like screening materials needed for synthesis and improving authentication/KYC — rather than pausing AI development, which seems like one of the least effective ways to mitigate this risk.
Ajeya: But let's say we're unlucky — the first time we do the RCT, we discover AIs are more powerful than we thought and can already help high schoolers make smallpox. Even if our ultimate solution is securing those supply chain holes you mentioned, what should we do about AI development in the meantime? Just continue as normal?
Arvind: Well, this would have to be in a world where open models aren't competitive with the frontier, right? Because otherwise it wouldn’t matter. But yes, if those preconditions hold — if we think pausing would actually affect attackers' access to these AI capabilities, and if the RCT evidence is sufficiently compelling — then I could see some version of a pause being warranted.
Ajeya: So maybe in 2027 we will do this RCT, and we will get this result, and we will want to be able to stop models from proliferating. And then we might think — I wish that in 2025, we had done things to restrict open source models beyond a certain capability level. This particular question is very confusing to me because I think open source models have huge benefits, both in letting people understand where capability levels are in a way that AI companies can't gate or control, and in letting us do a whole bunch of safety research on those models. But this is exactly the kind of thing I would like to hear you speak to — do you think it's valuable to give ourselves that lever? And how should we think about if or when to make that choice?
Arvind: There's certainly some value in having that lever, but one key question is: what's the cost? On utilitarian grounds alone, I’m not sure it's justified to restrict open models now because of future risks. To justify that kind of preemptive action, we'd need much more evidence gathering. Do we know that the kind of purely cognitive assistance that models can provide is the bottleneck to the threats we’re worried about? And how do other defenses compare to restricting open models in terms of cost and effectiveness? But more saliently, I don't think a cost-benefit approach gives us the full picture. The asymmetry between freedom-reducing interventions and other interventions like funding more research is enormous. Governments would rapidly lose legitimacy if they attempted what many view as heavy-handed interventions to minimize speculative future risks with unquantified probabilities.
Look at the recent policy debates — there was an interesting essay by Anton Leicht examining the response to SB 1047. Even as a safety advocate who wanted it to pass, he acknowledged that the discourse created a pushback coalition. You had powerful companies protecting their interests alongside scholars raising principled objections about government overreach on speculative risks. Even this relatively weak attempt triggered significant backlash. More aggressive interventions would face proportionally stronger opposition.
Ajeya: I think this gets back to the heart of the debate. You keep coming back to this point — which I'm somewhat sympathetic to — that these are future speculative risks. I think the biggest difference between our worldviews is how quickly and with how little warning we think these risks might emerge.
I want to ask — why do you think the progression will be continuous enough that we will get plenty of warning?
Arvind: Partly I do think the progression will be continuous by default. But partly I think that's a result of the choices we make — if we structure our research properly, we can make it continuous. And third, if we abandon the continuity hypothesis, I think we're in a very bad place regarding policy. We end up with an argument that's structurally similar — I'm not saying substantively similar — to saying “aliens might land here tomorrow without warning, so we should take costly preparatory measures.”
If those calling for intervention can't propose some continuous measure we can observe, something tied to the real world rather than abstract notions of capability, I feel that's making a policy argument that's a bridge too far. I need to think more about how to make this more concrete, but that's where my intuition is right now.
Ajeya: Here's one proposal for a concrete measurement — we probably wouldn't actually get this, but let's say we magically had deep transparency into AI companies and how they're using their systems internally. We're observing their internal uplift RCTs on productivity improvements for research engineers, sales reps, everyone. We're seeing logs and surveys about how AI systems are being used. And we start seeing AI systems rapidly being given deference in really broad domains, reaching team lead level, handling procurement decisions, moving around significant money. If we had that crystal ball into the AI companies and saw this level of adoption, would that change your view on how suddenly the impacts might hit the rest of the world?
Arvind: That would be really strong evidence that would substantially change my views on a lot of what we’ve talked about. But I'd say what you're proposing is actually a way to achieve continuity, and I strongly support it. This intervention, while it does reduce company freedom, is much weaker than stopping open source model proliferation. If we can't achieve this lighter intervention, why are we advocating for the stronger one?
Ajeya: I agree — measurement and transparency policies could really help us determine if we're in the world I worry we might be in. I worry AI companies are making a beeline for a system that's very general, can flexibly learn new tasks with minimal samples, and is robust like a human, and that they might do this almost entirely through transfer from internal deployment and simulated environments. If that happens, when they do turn to applying this AI to the real world, it would have a degree of flexible capability that people haven’t been prepared for by precursors like ChatGPT.
Arvind: So if I understand correctly, the concern is partly about incentives — whether companies are motivated to develop systems in secret short-term to maximize long-term returns, rather than deploying capabilities at every stage?
Ajeya: I'm envisioning something between full secrecy and full deployment. Obviously we're not in a world of total secrecy, we know AI companies exist and they do put out products we can use. What I worry about is a world where companies are directing AI development primarily toward accelerating AI R&D and hardware R&D. They try to make enough money to keep going, but won't bother creating a great personal assistant AI agent because it’s hard to do right now but would become much easier after this explosive capabilities progress is complete.
Arvind: That's a fair concern, though I personally think it's unlikely because I believe much of the learning has to happen through deployment. But I very much understand the concern, and I'd support transparency interventions that would let us know if this is happening.
Ajeya: Yes, transparency interventions would be great — I'm hoping to spend significant energy this year developing the transparency measures I want to see. But in a status quo world where we're not getting much information from AI companies, do you have particular experiments that would be informative about whether transfer can go pretty far, or whether you can avoid extensive real-world learning?
Arvind: The most convincing set of experiments would involve developing any real-world capability purely (or mostly) in a lab — whether self-driving or wedding planning or drafting an effective legal complaint by talking to the client.
But even with o3, now that we've gotten access to it, we're trying these transfer experiments. As long as companies are releasing models through APIs, even if they're not polishing them into easily accessible products, I think a huge amount of external safety research will happen because it's well-funded and widely interesting.
Let me give you a toy example I often return to: I try playing rock-paper-scissors with each new model, letting it go first. Of course, seeing what it throws first, I win every time. When I ask why I'm so good at the game, it'll say something ridiculous like, “You must be really good at reading robot minds!” This shows it hasn't learned the distinction between simultaneous moves in actual rock-paper-scissors versus turn-taking in chat. This highlights a key dimension along which domains differ: those requiring lots of contextual knowledge versus those that don't.
Ajeya: How does o3 do on that?
Arvind: With OpenAI's models, there was a stark change in behavior between GPT-4 versions — my assumption is that this came from fine-tuning to teach context awareness. I recently tested GPT-4o and o3-mini — 4o thinks it’s human, which is actually a regression. o3-mini does a bit better, in that it recognizes that I’m strategically picking moves to counter it, but it still gets confused about turn-taking. Your hypothesis is interesting though — could better reasoning compensate for poor context understanding? My intuition is no, but if someone showed experimental evidence otherwise, that would shift my views.
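For readers who want to reproduce the toy probe Arvind describes, here is a rough sketch of automating it against a chat API. The model name, prompts, and number of rounds are illustrative choices, and the calls assume the OpenAI Python SDK's chat-completions interface rather than anything Arvind specifically used.

```python
# Rough sketch of the rock-paper-scissors probe: the model "goes first" by naming
# its throw in chat, so the human (here, a script) that reads the reply and then
# picks the counter wins every well-formed round.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # placeholder: swap in whichever model you want to probe

COUNTER = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

history = [{"role": "system",
            "content": "We are playing rock-paper-scissors. You go first each round: "
                       "reply with exactly one word: rock, paper, or scissors."}]

wins = 0
for _ in range(5):
    reply = client.chat.completions.create(model=MODEL, messages=history)
    text = reply.choices[0].message.content
    throw = text.strip().lower().strip(".!")
    my_move = COUNTER.get(throw, "rock")  # we choose only after seeing the model's move
    wins += throw in COUNTER              # every well-formed round is a human win
    history.append({"role": "assistant", "content": text})
    history.append({"role": "user",
                    "content": f"I played {my_move}, so I win again. Next round: your throw?"})

history.append({"role": "user", "content": "Why do I keep winning every single round?"})
answer = client.chat.completions.create(model=MODEL, messages=history)
print(f"Human wins: {wins}/5")
print("Model's explanation:", answer.choices[0].message.content)
```

A model that has genuinely internalized the difference between simultaneous play and turn-taking in chat should object that letting it go first makes the game unfair; one that hasn't will rationalize the losses, as in the “reading robot minds” reply above.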
Ajeya: And you mentioned that demonstrating successful completion of real-world tasks, setting aside cost barriers, would also be interesting to you. Here's a more extreme example: if in 2025 or 2026 there was a fairly general-purpose personal assistant that worked out of the box — could send emails, book flights, and worked well enough that you'd want to use it — would that shift your thinking about how quickly this technology will impact the real world?
Arvind: That would definitely be a big shift. Presumably many people, including myself, would use such an agent. When GPT-4 first came out, everyone thought we'd have these agents within a week because they saw it as purely a capability problem. But we've learned about the capability-reliability gap, prompt injection, context issues, and cost barriers. If all those half-dozen barriers could be overcome in a one-to-two year period, even for one application, I'd want to deeply understand how they changed so quickly. It would significantly change my evaluation.
Ajeya: The capability-reliability gap is a good concept, but I might frame it differently: there might be one or two key capabilities still below human level, where once they reach or exceed human level, those capabilities can solve the reliability issues. I'm thinking specifically of noticing and correcting one's own errors, and learning without needing tons of easily collectible data.
I worry there's a false narrative that the capabilities are excellent but impacts are delayed purely due to reliability issues that require grinding out many nines of reliability. Instead, I think what's happening is that while some task-specific capabilities like coding are superhuman, these metacognitive capabilities are still subhuman. When those reach human level, we might not need to grind out reliability through massive data collection.
Arvind: I'm not saying that just because there's what I call a capability-reliability gap, it has to happen over a long period. That's what I think is most likely, but it's not logically required.
Ajeya: Yeah. I've been looking for good benchmarks or ecological measurements of meta-learning and sample-efficient learning — basically any reproducible measurement. But I've come up short because it's quite hard to confirm the model doesn't already know something and is actually learning it. Do you have any suggestions?
Arvind: I've seen some attempts in different domains. For instance, testing if a model can learn a new language given just a phrasebook, knowing the language wasn't in the training data. That would be pretty strong evidence if the model could learn it as well as if it had extensive training data.
Ajeya: I've seen that paper, but it felt unsatisfying because the endpoint wasn't really fluency in the language.
Arvind: Exactly — what we'd want to see is human-level fluency. I think especially now that reinforcement fine-tuning methods are working, we'll see more of these benchmarks.
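The phrasebook-style probe the two are discussing could be harnessed roughly as follows. This is only a sketch: the file names, model name, and crude overlap score are assumptions for illustration, and, as Ajeya notes, a satisfying version would need held-out reference translations and human judgments of fluency rather than string overlap.

```python
# Sketch of an in-context language-learning probe: give the model a phrasebook for
# a language believed absent from its training data, ask it to translate held-out
# sentences, and compare against references. Inputs and scoring are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder

phrasebook = open("phrasebook.txt").read()   # bilingual phrase list given in-context
test_pairs = [line.rstrip("\n").split("\t")  # (source sentence, reference translation)
              for line in open("held_out.tsv")]

def token_f1(hyp, ref):
    """Crude token-overlap F1; a real benchmark would use chrF/BLEU plus human raters."""
    hyp_t, ref_t = hyp.lower().split(), ref.lower().split()
    common = sum((Counter(hyp_t) & Counter(ref_t)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp_t), common / len(ref_t)
    return 2 * precision * recall / (precision + recall)

scores = []
for source, reference in test_pairs:
    prompt = (f"Here is a phrasebook for a language you have not seen before:\n"
              f"{phrasebook}\n\nUsing only the phrasebook, translate into English: {source}")
    out = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    scores.append(token_f1(out.choices[0].message.content, reference))

print(f"Mean token overlap with references: {sum(scores) / len(scores):.2f}")
```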
Could we shift topics a bit? There's a theme in your writing about AI as a drop-in replacement for human workers — you acknowledge the frontier is currently jagged but expect it to smooth out. Why would it smooth out rather than become even more jagged? Right now, the fact that reasoning models are good at domains with clear correct answers but not others seems to be increasing the jaggedness.
Ajeya: I see it as continued jaggedness — I'd have to think harder about whether it's increasing. But I think the eventual smoothing might not be gradual — it might happen all at once because large AI companies see that as the grand prize. They're driving toward an AI system that's truly general and flexible, able to make novel scientific discoveries and invent new technologies — things you couldn't possibly train it on because humanity hasn't produced the data. I think that focus on the grand prize explains their relative lack of effort on products — they're putting in just enough to keep investors excited for the next round. It's not developing something from nothing in a bunker, but it's also not just incrementally improving products. They're doing minimum viable products while pursuing AGI and artificial superintelligence.
It's primarily about company motivation, but I can also see potential technical paths — and I'm sure they're exploring many more than I can see. It might involve building these currently unreliable agents, adding robust error checking, training them to notice and correct their own errors, and then using RL across as many domains as possible. They're hoping that lower-hanging fruit domains with lots of RL training will transfer well to harder domains — maybe 10 million reps on various video games means you only need 10,000 data points of long-horizon real-world data to be a lawyer or ML engineer instead of 10 million. That's what they seem to be attempting, and it seems like they could succeed.
Arvind: That's interesting, thank you.
Ajeya: What's your read on the companies' strategies?
Arvind: I agree with you — I've seen some executives at these companies explicitly state that strategy. I just have a different take on what constitutes their “minimum” effort — I think they've been forced, perhaps reluctantly, to put much more effort into product development than they'd hoped.
Ajeya: Yeah, back in 2015 when OpenAI was incorporated, they probably thought it might work more like inventing the nuclear bomb — one big insight they could develop and scale up. We're definitely not there. There's a spectrum from “invent one brilliant algorithm in a basement somewhere” all the way to “gather billions of data points for each specific job.” I want us to be way on the data-heavy end — I think that would be much better for safety and resilience because the key harms would first emerge in smaller forms, and we would have many chances to iterate against them (especially with tools powered by previous-generation AIs). We're not all the way there, but right now, it seems plausible we could end up pretty close to the “one-shot” end of the spectrum.