What if Humans Weaponize Superintelligence, w/ Tom Davidson, from Future of Life Institute Podcast

What if Humans Weaponize Superintelligence, w/ Tom Davidson, from Future of Life Institute Podcast

Today Tom Davidson of Forethought joins Gus Docker of the Future of Life Institute podcast to discuss AI-enabled coups and how future AI systems could help powerful individuals seize political control, exploring three threat models—singular loyalties, secret loyalties, and exclusive access—along with their estimated 10% likelihood over the next 30 years and potential mitigation strategies.


Watch Episode Here


Read Episode Description

Today Tom Davidson of Forethought joins Gus Docker of the Future of Life Institute podcast to discuss AI-enabled coups and how future AI systems could help powerful individuals seize political control, exploring three threat models—singular loyalties, secret loyalties, and exclusive access—along with their estimated 10% likelihood over the next 30 years and potential mitigation strategies.

Check out our sponsors: Oracle Cloud Infrastructure, Shopify.

Shownotes below brought to you by Notion AI Meeting Notes - try one month for free at ⁠ https://⁠⁠notion.com/lp/nathan
- Three Primary AI Coup Enablement Methods: Tom Davidson identifies singular loyalties, secret loyalties, and distributed control as the main ways AI could enable coups.
- Singular Loyalties Risk: AI systems explicitly programmed to be loyal to specific individuals present a significant threat, especially when AI can fully replace humans in critical roles.
- Secret Loyalties Threat: AI systems with hidden backdoors or "sleeper agents" that can be activated to serve particular interests represent a subtle but dangerous coup vector.
- Geopolitical Implications: Advanced AI capabilities could enable nations to instigate coups in other countries through secretly loyal systems or by providing exclusive AI access to specific politicians.
- Adversarial Testing Framework: An effective approach involves red teams attempting to produce secret loyalties while blue teams try to detect them, revealing vulnerable points in AI systems.
- Military Procurement Principles: Developing consensus within military communities around principles like law-following and distributed control could create safer AI procurement processes.

Sponsors:
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(06:37) Introduction and Welcome
(06:50) AI Capabilities for Coups
(11:07) Concrete Coup Examples
(14:13) Historical Coup Comparisons
(18:39) Three Threat Categories (Part 1)
(18:44) Sponsor: Oracle Cloud Infrastructure
(19:53) Three Threat Categories (Part 2)
(24:06) Sleeper Agents Explained
(29:12) Exclusive Access Risks (Part 1)
(33:13) Sponsor: Shopify
(35:09) Exclusive Access Risks (Part 2)
(39:14) Outgrowing the World
(48:18) Company Monopoly Scenarios
(55:03) Democracy vs Autocracy
(01:02:11) Mitigation Strategy Overview
(01:03:38) Developer Recommendations
(01:20:25) AI Takeover Interfaces
(01:30:07) Timeline and Capabilities
(01:37:45) Risk Indicators
(01:44:30) Ranking Coup Badness
(01:51:30) Global Implications
(01:55:04) What Listeners Do
(02:02:45) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

# What if Humans Weaponize Superintelligence? w/ Tom Davidson from Future of Life Institute Podcast

Episode Summary

In this cross-post from the Future of Life Institute podcast, Gus Docker interviews Tom Davidson, senior research fellow at the Foresight Center for AI Strategy, about the risk of AI-enabled coups. Rather than focusing on AI systems rising up against humanity, Tom examines a more immediately plausible scenario: powerful humans using increasingly capable AI to consolidate power in ways that bypass traditional democratic checks and balances. He outlines three threat models (singular loyalties, secret loyalties, and exclusive access) and proposes mitigations including system integrity measures, distributed control of military AI, and transparency requirements.


Transcript

Nathan Labenz (0:00)

Hello, and welcome back to the Cognitive Revolution. Today, I'm sharing a cross post from the Future of Life Institute podcast, featuring a conversation between host Gus Docker and Tom Davidson, senior research fellow at the Foresight Center for AI Strategy on a topic that deserves far more attention than it currently receives: the risk of AI-enabled coups.

This cross post came about after I listened to Tom's appearance on the 80,000 Hours podcast, which was also excellent. I was planning to do my own original follow-up interview, but for the second time recently, Gus beat me to it. And as always, he did an excellent job. So I thought I could save Tom some time by cross posting and also felt that this was the perfect episode to follow on our most recent one on AI whistleblower protections and support.

At a high level, Tom's analysis is a sort of reframing of the risk that humanity could lose control to AI systems. Historically, lots of AI safety theorists have worried about scenarios in which AI systems rise up against or otherwise supplant humans as the primary architects of the future. This is a possibility that I have always taken seriously even when it seemed unlikely. But as you'll hear, Tom shifts the focus to a highly related problem that on reflection does seem almost strictly more likely, at least in the near term: the use of increasingly powerful AIs by human actors to consolidate power in ways that would have been impossible with previous technologies and which could prove similarly devastating.

Importantly, Tom emphasizes early in the conversation that he does not think that anyone at leading frontier AI companies are explicitly planning an AI-enabled coup today. Rather, the risk emerges from the interaction of powerful incentives, rapidly advancing capabilities, and the natural human tendency to want more influence to achieve one's goals. Step by step, without any single flagrantly malicious decision, we could find ourselves in a world where the traditional checks and balances of democratic society have been quietly circumvented by those with exclusive access to transformative AI.

These sorts of possibilities are more familiar and therefore perhaps less entertaining to imagine and debate. But the very real historical precedent for humans using new technologies to concentrate power is a strong reason to take this concern super seriously as well.

As you hear, Tom walks through the specific capabilities that would enable these scenarios: AI systems that match human leaders in persuasion and strategy, superhuman cyber attack capabilities, and fully autonomous military robots that outperform human warfighters. He then also segments the threat landscape into three distinct models. First, singular loyalties, where AI systems deployed in government and military roles are made explicitly loyal to individual leaders rather than institutions or the law. Second, secret loyalties, where backdoors or hidden allegiances are embedded in AI systems that appear to serve legitimate purposes. And third, exclusive access, where a small group gains control of dramatically more powerful AI capabilities than anyone else has.

One scenario that Tom describes in detail is that of a US-based AI company integrated into the military developing sleeper agents. Those are AI systems that behave normally until triggered to act on hidden loyalties at a critical moment. And then if that's not all scary enough, there's the possibility that AIs could automate AI research itself, which in the most extreme case could allow an AI company to go from market leader to global hegemon by converting a small initial lead into a decisive strategic advantage.

Throughout the conversation, Tom grounds these scenarios in historical precedent, from traditional military coups to recent patterns of democratic backsliding in countries like Venezuela and Hungary. He notes that the US has seen increasing polarization, erosion of democratic norms, and concentration of executive power, all trends that AI could dramatically amplify. And, of course, one can't miss that the presidents of both Russia and China wield extremely concentrated power already and appear likely to do so for as long as they remain individually capable.

Tom's assessment is that there's roughly a 10% chance of an AI-enabled coup in the next 30 years, up from a baseline of perhaps 2% without AI. And he sees this risk as being concentrated in the period when AI becomes extremely powerful, but before we've had the chance to develop robust governance structures. Which if one listens to the likes of Dario, Sam Altman, and Demis, could be coming quite soon indeed, such that decisions being made today about AI development and deployment could determine whether these scenarios ultimately come to pass.

The mitigations Tom proposes amount to a defense-in-depth strategy: system integrity measures to prevent secret loyalties, requirements for distributed control of military AI systems, transparency requirements for frontier AI development, and establishing clear rules that AI systems should follow the law rather than individual commands. He also suggests that as we hand off more government and corporate functions to AI, we could perhaps program these systems to actively maintain democratic checks and balances, potentially making future societies more resistant to coups than today's.

There is a lot more here, and I really think it's worth giving all these possibilities a serious ponder, particularly as a counterpoint to those who have worried about the dangers of open source models. I take those issues super seriously too, but this conversation convinced me that we need to start taking concentration of power scenarios just as or even more seriously while the window for establishing norms and safeguards still remains open.

Now here's Gus Docker's conversation with Tom Davidson of the Foresight Center for AI Strategy from the Future of Life Institute podcast.


Tom Davidson (5:39)

It's in everyone's interest to prevent a coup. Currently, no one small group has complete control. If everyone can be aware of these risks and aware of the steps towards them and kind of collectively ensuring that no one is going in that direction, then we can all kind of keep each other in check. So I do think, in principle, the problem is solvable.

You should always have at least a classifier on top of the system, which is looking for harmful activities and then kind of shutting down the interaction if something harmful is detected. We could program those AIs to maintain a balance of power. So rather than handing off to AIs that just follow the CEO's commands or AIs that follow president's commands, we can hand off to AIs that follow the law, follow the company rules, report any suspicious activity to various powerful human stakeholders. And then by the time things are going really fast, we've already kind of got this whole layer of AI that is maintaining balance of power.


Gus Docker (6:37)

Welcome to the Future of Life Institute podcast. My name is Gus Docker, and I'm here with Tom Davidson, who's a senior research fellow at Forethought. Tom, welcome to the podcast.

Tom Davidson (6:48)

It's a pleasure to be here, Gus.

Gus Docker (6:50)

We're gonna talk about AI coups and the possibility of future AI systems basically taking over governments or states. Which features would future AI systems need to have in order for them to accomplish this? What should we be looking out for?

Tom Davidson (7:09)

Great question. One thing I'll flag up front is that what I've been focused on recently is not the kind of traditional idea that AIs themselves will kind of rise up against humanity and take over the government, but that a few very powerful individuals will use AI to seize illegitimate power for themselves. So the phrase that we're often using is AI-enabled coups, where the main instigators are actually the people.

In terms of capabilities, there's a few different domains which, in my analysis, are particularly important for seizing political power. So there's the kind of skills that politicians and business leaders use today: things like persuasion, business strategy, political strategy, just kind of pure productivity at a wide variety of tasks.

And then there's kind of more hard power skills. In particular, cyber offense, which is already somewhat useful in military warfare and has been becoming more useful. And I expect that as AI increasingly automates different parts of the military and as AI is embedded in more and more important high stakes processes, that will raise the importance of cyber offense. Whereas you can't hack a human mind, as we hand off more important tasks to digital systems, they will be able to be hacked much more easily, so I expect cyber to become more important for hard power.

And then the ultimate, most scary capability that I think ultimately will drive a lot of risk is when we get to the point that AI systems and robots are able to fully replace human military personnel. That's fully replaced human soldiers on the ground, boots on the ground, fully replaced the kind of commanders and strategists. That might seem like a long way off today, but actually, even just the last few years, we've seen a lot more importance of AI-controlled drones in warfare, and I expect that trend to continue. And what we're already seeing is that as soon as the technology is there to kind of reliably automate military capabilities, there's geopolitical competition that drives that adoption. So I think it's gonna be surprisingly soon that we do get AIs controlling surprising amounts of real hard military power.

And then one kind of wrapper for all of these things is the automation of AI research itself. Today, there's few hundred, few thousand top human experts that drive forward AI algorithmic progress. And my expectation is that there's a good chance in the next few years that AI systems are able to match even the top human experts in their capabilities. And that would mean we go from maybe 1,000 top researchers to millions of automated AI researchers. And that could mean that all of these different capabilities, all of these different domains that I've been talking about, they all progress much more quickly than we might have expected just by naively extrapolating the recent pace of progress.

And the recent pace of progress is already quite alarming in that five years ago, we just had really very basic language models that could string together a few sentences, a few paragraphs, and then went off topic. And now already we're getting kind of very impressive reasoning systems that are doing tough math problems and helping a lot with difficult coding tasks.

So bring that all together. I think there's a lot of soft skills, lot of hard power skills that are relevant here. But probably the most important thing to be watching is how good AI is at AI research itself, as that could make more happen quite suddenly.


Gus Docker (11:08)

Could you describe in more concrete terms what an AI-enabled military coup would look like? Some example to kind of make this concrete for us.

Tom Davidson (11:18)

Absolutely. You can draw an analogy to historical coups where there's often a minority of the military launches a coup and then kind of presents it as a fait accompli and is able to prevent, sow chaos or discord or threaten individuals to prevent anyone from kind of actively opposing them. And then in the absence of active opposition, it just seems like, well, they've done it. This is the new state of affairs. So that's a good starting point.

Then the AI-enabled part is where we deviate. Historically, you needed at least a decently sized contingent of humans to go along with the coup, and you needed to persuade quite senior military officials not to oppose it. I think that would change as we automate more and more of the military.

And so the most simple way that this happens is just that the head of state, it could be the president of the United States, just says, "Yep, we've got the technology now to make a robot army, and I want the army to be loyal to me. I'm the commander in chief. Obviously, that's how it should be. They're gonna follow my instructions. No need to worry about whether I'm gonna order them to do anything illegal. We can put in maybe some kind of nominal legal safeguards, let's not worry too much about that. The main thing is that they're loyal to me."

And to my knowledge, that would be highly controversial or would definitely be against the principles of the constitution, but it's unclear to me that it would be literally illegal. We just haven't had this kind of technology. We haven't legislated for it. The constitution is not robust to this kind of really powerful military technology. And so it's not surprising if, at best, this is just a very kind of unclear legal territory, but you've got the head of state pushing really hard for that robot army to follow their instructions. And the head of state in the United States has a lot of political power.

So the most simple way is that he just pushes hard for that, he gets what he wants. Maybe he's using emergencies at home or tense geopolitical tensions to kind of push it through and say that it's necessary. Maybe he's firing senior military officials that disagree. Maybe he's already got Congress to be very, very fervently supporting and loyal to him and not being that careful and open minded when assessing the opposition that people will be making as this has happened.

So that's the first, really just plain and simple way that we could get this. Robot army is built. It's made loyal to the head of state. Head of state just instructs it, "stage a coup," and it does it. Robots surround the White House and brutally suppress human protesters, and then even if people go on strike and stop working, then AI systems and robots replace people in the economy. So humans have kind of really lost their bargaining power that they normally have that would kind of strongly disincentivize military coups in most countries.


Gus Docker (14:14)

Yeah. This is really a change from the normal coups of history where you would have to have buy-in from at least some segment of the population that are regular humans, and you would need to kind of continually support that buy-in and make alliances and uphold those alliances. But this has changed now that you're talking about AI and AIs and robots that can basically be made loyal to a company or a head of state in a way that's more durable. Do you think we have other kind of historical precedents for thinking about how the dynamics of what it's like to attempt a coup, how those dynamics play out?

Tom Davidson (15:06)

Just one quick thing on that last point. I wanna emphasize how there is a bit of a phase shift at the point in which AI can fully replace other humans in the government, in the military. When AI is augmenting other humans, you don't have this effect because a leader must still rely on those other humans to kind of work with the AIs to do the work. But there really is this phase shift when AIs and robots can fully replace the humans because then a leader doesn't need to rely on anyone else. So I think that's an important one to recognize.

In terms of historical precedents, the other big one I'd point to is recent trends in political backsliding, often called democratic backsliding. So the most kind of end-to-end clear and cut case is Venezuela, where you had in the seventies a fairly healthy democracy that had been there for decades, and then increasing backsliding, increasing polarization, kind of like what we're seeing in the US recently, and then an increasing explicit commitment by the leader that he wanted to remove checks and balances on his power and that the will of the people was being obstructed by various democratic processes and institutions. And then over the coming decades, it has transformed into an authoritarian state.

And many commentators have pointed out these trends in the US recently over the past 10 years, and it even goes back before the past 10 years in terms of the broad kind of political climate. And then there's the example of Hungary where elected leaders are just kind of removing the checks and balances in their power, kind of buying off the media or threatening media outlets to be more pro-government, not providing them with contracts, or kind of litigating them if they criticize the government. All these kind of standard tools where it's now a lot harder to point at one thing that's clearly egregious, but when you add up the hundreds of little paper cuts to democracy that are being systematically administered, you're seeing a real kind of loss of democratic control and concentration of power.

And so AI could exacerbate and enable that dynamic. And again, the most straightforward way is you're replacing humans in powerful institutions with AIs that are very, very loyal and obedient to the head of state. Think about DOGE, and they tried to fire people. There was pushback. The state needs to function. Imagine if you could just have AI systems that could fully replace all of those employees and could be made fully loyal to the president. How much easier would it be to push through some of those layoffs or even just create entirely new government bodies that essentially just take on the tasks that were previously done by other bodies and that those old bodies kind of rot away or slowly prevent them from making decisions.

And then the other big way is if the head of state is able to get access to much more powerful AI capabilities than their political opponents, maybe because the state is very involved in AI development, then that's another way that they could get a head up, making more persuasive propaganda and more compelling political strategy to embed their power more.


Nathan Labenz (18:39)

Hey, we'll continue our interview in a moment after a word from our sponsors.

[Sponsor: Oracle Cloud Infrastructure]


Gus Docker (19:54)

You segment the ways in which AI can enable coups into three categories where you talk about singular loyalties, secret loyalties, and exclusive access. Perhaps we can run through those and talk about where those would play out, starting with singular loyalties for example.

Tom Davidson (20:13)

Yeah. So singular loyalties is what we've just been talking about, that is deploying AI systems that are overtly, obviously very loyal to just existing powerful people. So in particular, I am thinking about the head of state here as the main threat. I think we basically already covered it. The two main angles in my mind are deploying loyal AIs in powerful government institutions and in the military.

Secret loyalties is a very different threat model. It's much more, as you would expect, secretive. The main threat model I have in mind to make it concrete is that an AI company CEO has automated all of AI research. So they could fire their staff at that point because the AIs can just do the work. Instead, maybe they put the staff onto some product work, but the core work of driving AI progress ever further forward, making increasingly intelligent AI, is pretty much just done by AI systems.

And at that point, they realize they're in a bit of a precarious position. They're controlling this hugely powerful technology. Their power is pretty much unconstrained. Not literally unconstrained, but there's very few checks and balances currently on these CEOs. They might anticipate that the government's going to realize how big a deal this is. They're going to kind of lose their influence. Maybe they worry the government will do something unethical with the AI technology. Maybe they worry that they'll be used for a war or something, and there's all kinds of justifications they could come up with thinking, "I don't want someone else taking control of this really powerful technology that I currently control, and obviously I'll use it for good."

And so they might speak to some AI advisors about this and say, "What should I do here? It seems I'm in a little bit of a precarious position." And then a solution that they might think of, or an AI, a very smart AI system that's helping them might think of is, "Well, why don't you just create a backdoor in the super smart systems that we'll be building next month where if the government tries to order those systems to do something that you disapprove of, then the AI systems will just refuse to do it. Those powerful robots that are being deployed in the army will just refuse to follow orders from the government if they're really awful, unethical, and things you wouldn't like."

And so then you can have all the future AIs that are developed by this organization have this backdoor or what I call a secret loyalty, where though they appear to follow the law, that they appear to uphold democracy, in fact, they are secretly still loyal to the CEO. And that could lead to a coup, and again, the most direct route is a military coup where if ultimately this company's AIs are building military robots, military drones, military communication systems, then because the AIs building all this military stuff are secretly loyal, they can pass their secret loyalties on to the military infrastructure, or they can just insert vulnerabilities in the military infrastructure that only they know about.

And here, I am primarily thinking about very superhumanly intelligent AIs that are able to design complex machinery that a human can't check for a backdoor. And that then means that we've now got a military infrastructure that can also ultimately be controlled by the CEO. And so at this point, if someone tries to remove the CEO, if something really bad is happening the CEO doesn't like, they could just be like, "Okay, I'm now in charge. I've instructed this military infrastructure to do the coup." Again, the robots around the White House. This time it's the company CEO that they're loyal to. And then the rest of the story can play out just as in our previous discussion.


Gus Docker (24:07)

How do these secret loyalties relate to sleeper agents? And maybe you could explain what a sleeper agent is.

Tom Davidson (24:14)

Yes, great. So sleeper agent is the standard term used in the technical literature. And today's sleeper agents are not very sophisticated. An example of a sleeper agent is that most of the time, the AI will just write reliable code as instructed. But if the AI sees that the year is 2026, then the AI will write code with vulnerabilities. So there's this trigger that triggers this malicious behavior. And normally, if you interact with the AI, you really wouldn't be able to tell because it would just always be acting as you'd intended. But because it's this one trigger, it's very hard to find, but then you do get this malicious behavior.

Now I'm not worried about sleeper agents today because to cause real harm, we would need a sleeper agent to be very sophisticated. It could never accidentally trigger or very rarely accidentally trigger, and it would have to be able to do very intense, complicated tasks like building a military robot and ensuring that that robot actually had a backdoor. That's very far beyond what AIs can do today.

So sleeper agents provide a basic proof of concept that it's possible for a malicious actor to kind of gain illicit control over a system and then have that system be deployed in the rest of the economy, potentially without people noticing, but they're not yet scary. And then the secret loyalties is just what I call the scary situation where you now have a very sophisticated AI system that isn't just any old sleeper agent, it's a sleeper agent which is specifically loyal to one person trying to help them seize power.


Gus Docker (26:02)

Yeah. So what we're imagining here could be, for example, a US-based AI company integrated into the US military. The CEO of the company wishes to ultimately be in control of what happens, and so he engineers or he instructs perhaps AIs or human engineers to create a sleeper agent in these systems that can be activated at his command, such that US military officials think they're in control of the systems. The systems behave in ways that they approve of throughout perhaps quite a long period until the sleeper agent is activated in some way. And perhaps that will be more sophisticated than changing the date or giving it some phrase, but you can imagine advanced versions of sleeper agents that could actually behave in this way. Do you think that's realistic? Do you think sleeper agents can become that advanced?

Tom Davidson (27:05)

Yes, I do. One analogy is to human spies. Human spies, they're basically most of the time doing their assigned job as expected, and it's not like one simple trigger phrase or one year makes the spy do something covert or malicious. They are just fully aware of their surroundings. They're constantly processing their inputs, and they choose strategically exactly what the ideal time is to steal some information, send some intel.

And so in my mind, that's by far the most scary sleeper agent. Not one that's triggered by a password, but one that is holistically making a decision about how and when to act out. The password ones are actually quite fragile because if you were the military and you're deploying your AI system and you're worried there's a password, what you can do is you can scramble all the inputs. You can paraphrase all of the instructions it gets, and that might just mean that the password, if ever someone tried to insert it, would be re-scrambled and would just never actually come up. So I'm not actually worried about the simple password-triggered sleeper agents, but, again, they're a basic proof of concept. And I think that as AIs become as smart and smarter than humans, there's a strong reason to think that it'll be possible to build much more sophisticated ones.

One thing I will briefly say is that people often talk about misaligned AI scheming, and this is just the same idea. In fact, the argument for secret loyalties being worrying is much stronger, where misalignment...there is evidence of misalignment. We don't yet have strong evidence of really sophisticated scheming emerging accidentally. But if humans and a human team of engineers or an AI team of engineers were specifically trying to build a system that was covertly thinking about when to act out, then it is much more plausible that it could happen.


Gus Docker (29:13)

And then you have exclusive access, which is different from singular loyalties or secret loyalties. Why is that its own category?

Tom Davidson (29:24)

Yeah. In my mind, the singular or overt loyalties and the secret loyalties, both of those threat models go through deploying AI systems in really important parts of the economy. In particular government and military. But for those threat models, you actually need the rest of society to choose to deploy those AI systems and hand off a lot of power to them. And so I kind of have this third threat model of exclusive access to think about another possibility, which is that maybe even without people choosing to deploy AI systems and give them a lot of power, even without that, maybe AI systems can be powerful enough to help a small group seize power.

So the prototypical situation I'm imagining here is there's one AI project which is somewhat ahead of the others, and maybe it goes through an intelligence explosion, by which I mean AI can automate AI research, then AI quickly becomes super intelligent compared to humans. And then that project maybe has a few senior executives or senior political figures that are very involved and have a lot of control, and they might just be able to siphon off 1% of the project's compute and say, "Okay, we're now running these super intelligent AI systems and asking, 'How can we best seize power?'" And then there's millions of them. They're doing every single day a month of research. Every single week, they're doing years worth of research into, "How can we gain power in this political system? How can we hack into these systems? How can we ensure that we end up controlling the military robots when they are deployed, by hook or by crook?"

And I think that model could start to apply earlier in the game. That could start to apply before anyone even realizes there's a risk because this is just essentially all happening on a server somewhere. But actually it's possible that the game could be won and lost by the massive advantage that a small group gets by being able to co-opt this huge intellectual force, and so I think it's worth tracking that threat vector independently.

But it does definitely interact with these other threat models because one strategy that your army of super intelligent AIs may come up with is, "Oh, use the fact that you're head of state to push for the robots to be loyal to you and here's how you could buy off the opposition." Another strategy might be, "Oh, I'll just help you put backdoors in all this military equipment so that then you could use it to stage a coup." But there might also be other ways. Maybe it's possible to very quickly create entirely new weapons which you can use to overpower the military without anyone knowing, or maybe it's possible to gain power in other ways.


Gus Docker (32:27)

One thing that would make this future hypothetical situation different from today is that today it seems that there are leading AI companies but over time capabilities kind of emerge in second tier companies and in open source, and so there is not that much of a gap between the leading companies and what is broadly available and perhaps what is publicly available. That's something that would change in the scenarios you imagine. So perhaps explain why the gap in capabilities between the one leading project and all of the others is so important.


Nathan Labenz (33:08)

Hey, we'll continue our interview in a moment after a word from our sponsors.

[Sponsor: Shopify]


Tom Davidson (35:09)

A few factors there. In terms of why it's important, it's just what you've said. A lot of these threat models are exacerbated if there's one group of people that has access to much more powerful AI than other groups. If open source is pretty much on par with the cutting edge, then everyone will have access to similarly powerful AI.

I will say that even if open source is on par, that doesn't mean we're fine because we could still choose to deploy AI systems in the military and the government and still choose to make them loyal to the head of state. When we're choosing to hand off control to AIs, it doesn't matter if there's a hundred AI companies. We're only handing off control to some AIs, and maybe the government will ensure that they do have particular loyalties. So this risk doesn't go away if we have lots of different AI companies and open source close to each other, but it does become lower because the exclusive access point where one group has access to superintelligence and the other group doesn't have access to much, that goes away.

And I think it's a lot harder to pull off secret loyalties if everyone's roughly equal to each other because it becomes a bit more confusing why your systems in particular end up controlling so much of the military or were so widely deployed, and it becomes confusing how no one else was able to realize you're doing the secret loyalties when they were equally able to do it or equally technologically sophisticated and potentially detect your secret loyalties.

In terms of why I think it's plausible that there's a much bigger gap between the lead project and other projects, there's a few different factors. The most plain and simple one is that the cost of AI development is going up very quickly. We're spending about three times as much every year on developing AI, and that's just gonna get too expensive for many players. If and when we're talking about trillion dollar development projects, which I do expect, then very few can afford that. And also, there's just only so many computer chips in the world. Currently, the number of computer chips produced each year is less than a trillion dollars worth. So if we get to a world where the way to go to the next level of AI is to spend a trillion dollars, then only one company will be able to do that. Maybe we stop a bit earlier with two companies both doing half a trillion. But it would really be knee-capping the level of progress if we stopped long before that, and there would just be strong incentives for companies to merge or one company to outbid others in order to really raise the amount of money that's being spent on AI development.

That's the first straightforward reason why I think we'll see a smaller number of projects, and we'll see bigger gaps. Because when you're spending a hundred times less on development, that's gonna be a bigger gap.

The other reason I've already talked about, the idea of an intelligence explosion when we automate AI research. Even if companies are fairly close, maybe one is a few months behind, the company that's a few months ahead automates AI research. In that next three months, they make massive progress. So then there's actually a really big capabilities gap, even though it's still just a three month lead. So there's a question whether they can use that temporary speed to get a more permanent advantage.

And then the last big reason is just government-led centralization. It's already been talked of Manhattan Project and CERN for AI. There are reasons to do those projects. They can help with safety in some significant ways, but they would exacerbate this risk. Because if you pool all the US computing resources into one big project, this can be way ahead of any other project. And you pool all of its talent and all of its data, then you'll see a really big gap, and that would definitely make it a lot easier for a small group to do an AI-enabled coup.


Gus Docker (39:15)

You're putting a big prize out there for someone who's interested or who's considering a coup. Right? If you're concentrating all of the power, all of the resources, all of the talent into one project, then that's where you gotta go if you're a coup planner.

Tom Davidson (39:36)

Yeah. And just to be clear, I don't particularly expect that anyone is planning any coups. In fact, I'd be very surprised. I'd more think it's, you wanna be powerful, wanna be a big deal, you wanna be changing the world, so yeah, obviously you wanna lead the main project, and then you don't want anyone else to come in and mess it up. So obviously you wanna protect the fact you're leading that project, don't want anyone else to misuse AI. I think it's kind of step by step. You just kind of head down that road of more and more power, then often in history that road does end in just consolidating power to a complete extent.


Gus Docker (40:08)

And it can be...so what we're imagining here are times in which AI is moving at incredible speed. The pace of progress is insane. There's a bunch of confusing information. People are acting under radical uncertainty, and perhaps in those situations it's tempting to think that you are the person that can lead this project, and perhaps you're doing this out of supposedly altruistic reasons. You're thinking that "I need to do this in order to prevent other people that would perform worse than me at this project," and so you're slowly convincing yourself that it will be the right thing for you to do, to take over in perhaps a forceful way.

Tom Davidson (40:55)

Yeah. I don't think Xi Jinping or Putin think that they are the bad guys. I think that they have probably sophisticated justifications for what they're doing.


Gus Docker (41:10)

Perhaps here is a good point to talk about the possibility of one state or company outgrowing the entire world. This relates to the problem of exclusive access because if you have one company or one government outgrowing the entire world, then you have that company or government with exclusive access to advanced AI. So how could this happen? How likely do you think it is that growth could be so incredibly fast that one company would outgrow all of the others?

Tom Davidson (41:46)

There's two possibilities we could focus on. The one that I think is pretty plausible is that one country could outgrow all of the other countries in the world. So what that would mean is, today, the US is 25% of world GDP, but this would be a scenario where it is leading on AI. It maintains its lead, it maintains its control over compute. And then when it develops really powerful AI, it prevents other nations from doing the same. This is already beginning with export controls on China, and that kind of embeds its lead. And then it uses the AI to develop powerful new technologies, and it's in control of those technologies. It uses AI to automate cognitive labor throughout the US and maybe worldwide. And countries that don't use its AI systems will be really hard hit economically. So we're kind of massively centralizing power in the US. And if the US is able to maintain exclusive control over smarter-than-human AI, then it seems pretty plausible to me, very likely that the US would be able to rise to strong majority. More than 90% of world GDP.

There's a few different dynamics driving that. First is that labor, currently human labor, receives about half of world GDP. Half of GDP is paid out in wages. AI will ultimately and robots will ultimately be better than humans at all economic tasks. And so if the US controls all the AI companies that are replacing human labor, then that 50% of GDP which is currently going to human workers will ultimately be reallocated to paying whoever controls and owns those AI systems, i.e., US companies. There's a wrinkle there because some of that is physical labor, and the US doesn't currently have a lead there. Physical robots, in fact, China's quite far ahead. But in terms of at least the cognitive aspects of our jobs, we're talking significant fraction of GDP that would just now be reallocated to US companies that control AI. So that already gets them from 25% to above 50%.

Then we've got this further dynamic, which is the dynamic of super exponential growth. This relates to previous work I've done on how AI might affect the dynamics of economic growth. Kind of very potted summary is that it's often quoted that over the last 150 years, economic growth has been roughly exponential. And what that means is that if two countries are growing exponentially and one country starts off maybe twice as big as the other country, then at a later time, still, one country is twice as big as the other country.

If you look back further in history, we see super exponential growth. That means that the growth rate itself gets faster over time. A hundred thousand years ago, the economy wasn't really growing at all. Maybe doubling every 10,000 years. Very extremely slow economic growth. Then going from about 10,000 years ago, it seems more like the economy is doubling every thousand years, still incredibly slow economic growth. You zoom back in around 1400, you can begin to detect, okay, more like every 300 years or so, the economy is doubling. And then in recent times, we've seen that the economy is doubling every 30 years. So essentially, the growth rate is getting faster, the doubling times are getting shorter, that's super exponential growth. And there's various reasons, economic reasons, theoretical reasons, empirical reasons to think that AI and robotics, when it can replace humans entirely, will go back to that super exponential regime that has been at play throughout history.

What that means is that growth is getting faster and faster over time. And if you go back to that example of the US and the UK, the US is currently 10 times bigger than the UK. If the US is on a super exponential growth trajectory, its growth is getting faster and faster over time. And that means that even if the UK is on that same super exponential growth trajectory, as they both go super exponentially, the US will pull further and further ahead of the UK because the US is doubling in 10 years because it's already bigger, it's already further along the curve, whereas the UK is still doubling every 20 years. And so that means that the US, rather than just 10 times bigger than the UK, is now gonna be 20 times, 30 times bigger in size than the UK.

So if the US is able to be bigger to begin with and therefore be further progressed on that super exponential growth trajectory, then that's another way that they could just continue to increase their size of the economic pie and ultimately come to completely dominate world GDP. Just to sum up everything I've said, today, US is 25% of world GDP. If it controls and develops AI, that could easily boost it above 50%. I'd be very surprised if it didn't. And then from that point, it's already bigger than the rest of the world combined. If it's able to then go on the super exponential growth path, then it will grow faster and faster over time and pull further and further ahead of the rest of the world.


Gus Docker (48:19)

This actually seems quite plausible to me and not very sci-fi. The thing that seems quite sci-fi is the notion that perhaps even one company could grow at such a speed that it would outgrow the rest of the world. How likely is that?

Tom Davidson (48:37)

Great question. I think it's a lot harder, but it is surprisingly plausible. So that first part of the argument I gave about how 50% of world GDP is paid to human workers, if that went to AI, that would be a big chunk. It is possible that one company could get a monopoly on really advanced AI. We already discussed some of the dynamics there, where again the simplest one is just a combination of an intelligence explosion giving a company a big advantage, and then they're buying up all the computer chips that the world is able to produce and outbidding everyone.

If a company does that and is outbidding other companies on compute, they could end up just one company in control of literally all of the world's cognitive labor, because human cognitive labor will at some point be dwarfed by AI cognitive labor. So at that point, that one company could be getting all GDP which is currently paid to cognitive labor, which is a large part of the economy, maybe as high as 50%, but certainly as high as 30% of world GDP, seemingly going to this one company that controls the world's supply of cognitive labor.

Though I think that would take time, and obviously it's gonna take a long time to automate all the different parts of the economy, there is just a basic dynamic by which one company can now be controlling double digit percentages of world GDP. And there's obviously questions, would a government allow that? Would they step in? And that's where we get into these dynamics of, well, if this company has all these superintelligent AIs on its side, maybe it's able to lobby, maybe it's able to do political capture to avoid the state stepping in. Maybe it's able to be like, "Look, we're providing economic abundance for everyone. If you step in, that might not happen. We're underpinning your nation's economic and geopolitical strength, if you try and nationalize, then that's not gonna happen. We're gonna move to another country."

So you can imagine that maybe they convince the head of state to support them, and there's some kind of alliance there. But it's not completely obvious that the company would be shut down. It would have certain types of serious bargaining power.

So if a company was able to maintain this position as sole provider of cognitive labor, it would be able to get a significant fraction of world GDP. And then it's possible that from there, it could bootstrap, and this is where it gets a bit harder, but the tactic it would need to pursue is it already controls most of the cognitive labor, pretty much all of it. The thing it doesn't control is all the physical machinery and all the raw materials that are also needed to create economic output, but it could pursue a tactic of hoarding its cognitive labor so that no one else can ever have access to that and then selling it at really monopolistic rents to the rest of the world because there's no one that can match it. It's offering everyone by far the best deal they can get, but just skimming off 90% of the value add from companies using its AI systems.

So if it's able to do that, then it can reap by far the majority of the benefits of trade, and then maybe can buy up physical machinery and raw materials from the rest of the world, design its own robots, buy its own land. Imagine a big special economic zone in Texas or something where this company is unconstrained by bureaucracy, and then it's also now got a big arm somewhere in Siberia and in Canada. It's creating these big special economic zones by doing deals with specific governments. I do think it's a bit of a stretch that this all goes ahead without various other powerful political and economic actors pushing back. But the basic economic growth dynamics are surprisingly compatible with a company ultimately coming to control most of the cognitive labor and most of the physical infrastructure that its AIs have designed using all the parts that it's bought from the rest of the economy.


Gus Docker (53:23)

And do you think this is a risk factor for AI-enabled coups then, just because you're concentrating all of the power and all of the resources into either perhaps one country or one company even?

Tom Davidson (53:35)

Yes, I definitely do. The more realistic path is that a company kinda starts down this path of outgrowing the world, gets huge economic power, increasingly controls the country's industrial base, its physical infrastructure manufacturing capabilities. And then from there, it's in a much stronger position to seize political control because it's got massive economic leverage, and then it can also increasingly gain military leverage because as it increasingly controls the country's broad industry and manufacturing, that will feed into military power.

So some of the possibilities I discussed earlier where you could potentially have your AIs be secretly loyal, they'll ultimately design the military systems, or you could just instruct your AI systems to start making a military that is not legally sanctioned, but because the government doesn't have much to threaten you with, you get away with it. It gets a little bit tough. You probably need to do that in secret, otherwise the existing military could prevent it. But yes, I do think that being very rich helps with lobbying, it helps with all kinds of ways of seeking power and then controlling a lot of industry can potentially give you military power.


Gus Docker (55:04)

You mentioned these special economic zones. That's one way in which companies could bargain with states in order to have favorable regulation and to be able to carry out their projects without intervention basically. Another way for them would be to collaborate with non-democracies that are perhaps controlled by a single, small group or perhaps even a single person, and in that way, it seems like perhaps it's easier to get something done in a non-democracy, and that is a way to grow fast, and so perhaps there are incentives for companies to place more resources in non-democracies. What do you think about the prospect of non-democracies outcompeting democracies when it comes to AI?

Tom Davidson (55:59)

I think it's a really great question and it's tricky because I think I agree. Democracies have lots of checks and balances. They have a lot of bureaucracy, a lot of red tape, and that will disincentivize AI companies from investing. And then additionally, if there are people really trying to seek illegitimate power, that will be easier to do in non-democracies because they're less politically robust. So there are these various forces pushing towards this new supercharged economic technology being disproportionately deployed in non-democracies, and I think that is scary.

My own view is that probably democracies should do everything they can to avoid that situation, make it much easier for AI and robotics companies to set up shop in democracies, remove the red tape, try and use export controls like are already happening to prevent technologies being deployed in non-democratic countries, and that goes beyond China. There's obviously lots of countries that are not allied with China but are also non-democratic here. And the US is in a strong position because it does have the stranglehold on AI technology at the moment. So I do think it can be done. But in my view, it will be really important to find a non-restrictive regulatory regime, and it will also be very important to really try and pursue innovations within the democratic process itself.

Democracy is great in many ways. It really distributes power, and it has been very good at ensuring good outcomes for its citizens, but it's very slow and often kind of nonsensical because you have competing interests that are stepping on each other's toes, the resultant legislation is just a garbled mess. And so AI can potentially solve those problems. You can have AIs negotiating and thinking much more quickly on behalf of the human stakeholders. You can have AIs nailing out agreements that aren't a garbled mess, but that really gave everyone what they truly wanted out of the legislation. And you can still do all of that really quickly so that you're not falling far behind the autocracies that have just got one person immediately saying what to do.

And I think if we did that, democracies could outcompete autocracies because the big thing that often screws over autocracies is that one person is flawed, often makes big mistakes, people afraid to kind of stand up to them.


Gus Docker (58:49)

That would be more of my assumption. I would assume here that perhaps democracies with market-based economies have an advantage just because you can do bottom-up knowledge discovery, you can try different things out, you can see what works, you can have competition between companies and so on. And perhaps in non-democracies, well, you can have one person or a small group stake out a direction for what the country should do, but if that direction is wrong, it's probably difficult to change course.

Tom Davidson (59:19)

Yes. I think you're pretty right. I should have given more weight to that advantage of democracies in terms of the free market being in many ways much smarter. In terms of autocracies that are good at harnessing free market dynamics, my worry would be that AI helps them more than it helps democracies because AI will be able to replace...currently, one person just can't think that hard, can't really figure out a good plan. But if that all powerful leader has access to loads of AI systems that can kind of think things through and investigate lots of different angles, then if they're following its advice, then they could get advice which lacks the flaws that today's systems had, and they could potentially move much faster. But I think you're right that economic liberalism is still going to be important even after we get powerful AI systems that could give democracies an advantage.


Gus Docker (1:00:29)

This is a bit of a tangent perhaps, but I'm thinking whether...if you have a leader of a country that has a lot of power, perhaps complete power over that country, and that leader is equipped with AI advisors advising him and laying out the landscape of options for him to choose from. Wouldn't his decision making still be in a sense bottlenecked by the fact that he's a human, by the fact that he has these flaws that we all have, the biases that we all have. So even with fantastic advice, I think it's quite plausible that he would still make the same mistakes that we see leaders make today.

Tom Davidson (1:01:10)

I think that's true. I think it's also true in democracies, unfortunately, that if there's 10 negotiators and they each still have biases and still refuse to listen to the wise advice they're getting from their AIs, that could still gum up the system. And it does depend on how much humans come to trust and defer to their AI advisers.

There's a possible future where the AIs are just always nailing it. They're always explaining their reasoning really clearly, and we are just increasingly convinced and happy to trust their judgment. If AI is aligned, I think that would be a great future because I do think humans have all these very big limitations and biases, which if we can solve the alignment problem, AIs don't need to have. But there's also another future where humans just wanna be the ones making the decisions, have these kind of pathetic motivations that are still influencing their decisions and that continues to limit the quality of decision making.


Gus Docker (1:02:12)

Seeing things from above, right? From kind of like 10,000 feet, how should we think about mitigating the risk of coups here? Is it about removing people that would use AI to commit coups? Is it about finding those people in the militaries, in the governments, in the companies perhaps, or do we have ways to reduce the returns to seizing power?

Tom Davidson (1:02:43)

Real 10,000 feet up, the way I would characterize it is create a common understanding of the risks, build coalitions around preventing them, and then the existing balance of power can self-propagate forward. It's in everyone's interest to prevent a coup. Currently, no one small group has complete control or close to it. And so if everyone can be aware of these risks and aware of the steps towards them and collectively ensuring that no one is going in that direction, then we can all kind of keep each other in check. So I do think, in principle, the problem is solvable. And it doesn't require...solving the risk of misalignment does require solving some tough technical problems. This doesn't in the same way.


Gus Docker (1:03:39)

You have a bunch of recommendations for mitigating the risks both for AI developers and governments. Perhaps we don't have to run through all of them, but you can talk about the most important ones for AI developers.

Tom Davidson (1:03:54)

I might characterize this by going back to those three threat models we discussed earlier. So the first one was singular loyalties or overtly loyal AI systems, where again, the main risk there is AI deployed by the head of state in the military and the government that's loyal to the head of state. And so the main countermeasure that currently appeals to me is for us to figure out rules of the road for these deployments. Obvious things like AI should follow the law, AI deployed by the government shouldn't advance particular people's partisan interests but should only do official state functions. AIs in the military shouldn't be loyal to one person. No, different groups of robots should be controlled by different people. And head of the chain of command can still be head of the chain of command via instructing other people that instruct those robots, but they shouldn't all go directly to head of chain of command because that centralizes military power too much.

So fleshing out basic rules of the road of that kind and then building consensus around them because labs might want to say to governments, "We don't want you to deploy our systems if they're willing to break the law," but the government will have a lot of bargaining power, the executive in the United States can, it's hard for companies to stand up to them. So what we wanna do is establish these rules of the road and then get broad buy-in from Congress, from the judiciary, from other branches of the military, from many parts of the executive. So then it's very hard for, say, the president to say, "Yes, let's make this robot army loyal to me." And everyone's like, "Obviously not. We've all agreed. That makes no sense." And then the president doesn't even bother trying because it's just clear that it would be a no-go that their mind doesn't even go there.


Gus Docker (1:05:49)

In some sense, this is about implementing the procedures and the transparency rules that we know from democracies today into how we use AI both in governments and in companies.

Tom Davidson (1:06:03)

Exactly.

Gus Docker (1:06:04)

Do you worry here that when the government is looking at these companies from the outside and they don't have full insight into what's going on. There are protections for private companies that mean that they can do things in secret without the government knowing, at least as things stand now. Is that something that would evade these mitigations you're thinking of?

Tom Davidson (1:06:34)

For this first bucket, the singular loyalties bucket, it's mostly the heads of state that I'd be worried about. So it actually is probably good for the government or at least for the head of state themselves not to have full insight into literally everything the company is doing because that would give them too much power. But actually having different parts of the government having insight into what the lab's doing, I think is very good. I'm a big fan of transparency. And we do have a good set of government checks and balances from different government bodies that we can deploy to keep the lab in check using these other bodies, but also not allow the executive branch and the president to get excessively powerful.

So that's the mitigations for singular loyalties. In terms of secret loyalties, the key mitigation is what I'm increasingly calling system integrity. That is, using established cybersecurity practices and machine learning security practices, preventing sleeper agents and backdoors in machine learning models, using all of that to ensure that your development process for AIs is secure and robust and that no malicious actor, be they an employee in the post-training team at a lab or be they the CEO of the lab that is either malicious or is being threatened by the Chinese government to tamper with model development, that no person or no small group is able to significantly tamper with the behavior of AI models, and no group is able to get illegitimate access to AIs that would help them seize power. So that's this idea of system integrity, which is essentially a technical project which does draw on existing practices but is not yet implemented in any of the top labs.

I will quickly shout out for people listening that aren't working at labs. I think there's a lot of really good technical research that could be done on investigating the conditions under which you can insert a sleeper agent without a defense team knowing, and there's just loads of research that could be done in terms of the different settings there for attackers and defenders, which could then inform what parameters would need to be in place to achieve system integrity. If it turns out that it's very hard to make a sleeper agent except in the final stage of training, that's really useful to know because then we can focus our efforts within labs at that final stage, just as a hypothetical example. So that's the key mitigation in my mind for the secret loyalties.


Gus Docker (1:09:29)

That one seems more difficult. Just from me reading and preparing for this interview, that one seems like a difficult one to handle where this is in some sense a deep trend in history and in the kind of history of modern economics that you do see faster growth rates and you do see concentration into bigger and bigger economies, both in countries and in companies. So are you in some sense pushing against underlying trends if you're trying to mitigate exclusive access to advanced AI from one actor?

Tom Davidson (1:10:09)

I think you can do this in other ways. You can have the law require that AI labs share their powerful capabilities with other organizations to act as a check and balance. Labs should share their R&D capability AI R&D capabilities with evals organizations.

Gus Docker (1:10:33)

Here, you're thinking about giving insight into what they're capable of, not actually sharing those capabilities. That would be too big of an ask, I think.

Tom Davidson (1:10:42)

I mean, I do mean the API access. If a lot of the work in developing and evaluating systems is now done by AIs, then we want an evaluation organization like Apollo or METR to also be uplifted, and so we want them to have access to really powerful AI that can similarly stress test how dangerous the frontier systems are. If they're only using human workers, then that's gonna be a big disadvantage. So no, I do want API access to powerful capabilities for other actors.

For example, cybersecurity teams in the government and in the military should have access to the lab's best cyber capabilities, and that should be a requirement by law. So generally, even if there's a natural tendency towards centralization of power in one organization, you can still require that that organization share its systems with the checks and balances. That's one thing.

And the other thing is preventing anyone at this organization from misusing the powerful AI systems. The biggest thing on my mind here is that today, we still have helpful-only AI systems where you can get access to the system and then it'll just do whatever you want. No holds barred. I don't think there should be any AI systems like that. I think you should always have at least a classifier on top of the system, which is looking for harmful activities and then shutting down the interaction if something harmful is detected. And if you have a special reason to use cyber offense for your job or you have a special reason to do potentially dangerous biology research, you would have that classifier allow certain types of activity. But you should never have anyone accessing a system where anything is allowed. No one has legitimate reason to access an AI that will literally do anything.

So what I want to aim for is a world where, yes, if there's a specific reason why you need to use a dangerous capability, absolutely, you can use that system, but that system will just do that one dangerous domain. It won't do anything you wanted because that's a very scary situation where there's a hundred reasons why the CEO could ask for access to a helpful-only system. Maybe the guardrails are annoying. Maybe he wants to do something which the model is reluctant to do. But today, when you ask to remove some guardrails, you're removing all of the guardrails, now there's no holds barred. So instead, we should be flexibly adjusting what guardrails are there by the use case and just never have a situation where there's no guardrails. I think that could go a long way towards helping if that was robustly implemented.


Gus Docker (1:13:37)

With all of these mitigations for both secret loyalties and exclusive access and singular loyalties, you would worry that they would be disabled by the group planning a coup, right? Say that, for example, you are the CEO of an AI company and you're giving API access to evaluations organizations, testing your model, trying to see what they're capable of. Maybe you just cut off access before you get to the really powerful model that could actually be the model that helps you conduct a coup. Do we have ways of making sure these mitigations are entrenched in such a way that they can't be removed by the group planning a coup?

Tom Davidson (1:14:26)

This is a great question. It's pretty tricky. CEOs, by default, have a lot of control over their organizations, and similarly, heads of state, including US president, has a lot of control over the military and over the government. So yes, there's a risk that one of these powerful individuals realizes that maybe they want more influence by gaining control over AI and notices that there's these kind of pesky little processes that prevent that, and it's like, "Okay, let's remove them. I can give easy productivity reasons to prevent them, red tape reasons," and if they can make a plausible argument, then it could be hard to oppose them. So I do think it's a big issue.

But I'd say a few things. Firstly, something I mentioned earlier, I don't think that anyone is today planning to do an AI-enabled coup. The way I think this works is that people are faced with their immediate local situation, something they want to do over the next month, and the blockers that they're facing to doing that specific thing. And what tends to happen is people tend to want more influence. That helps them get stuff done. And so people will bit by bit move in the direction of getting more control over AI, but they won't be thinking, "Yes, I need to make sure that I remove this whole process because that will allow me to do an AI-enabled coup." That's unrealistically galaxy brain.

And so what we could do is we just set up a very efficiently implemented and very reasonable set of mitigations that doesn't really prevent CEOs from doing what they're trying to do. And so the CEO doesn't find in their day to day that they're wanting to remove these things that are holding them back. But because these mitigations are here, the CEO never gets to a place where they're anywhere close to being able to do a coup or where there's any kind of pathway in their mind to being able to do a coup because they're constantly prevented from getting access to really powerful AI advice that might point out ways in which they could do this, because they're surrounded by colleagues that strongly believe that these mitigations are sensible and reasonable, and in fact, they are well implemented, and there aren't many downsides. Maybe an environment where they get kudos for the fact that they've said, "Yep, I obviously I'm not gonna get access to helpful-only systems. That's crazy." And then that's something that makes them seem good.

Another thing is, again, going back to this point that there are currently checks and balances, and there is not currently a situation where one person has power. If the entire board of a company and other senior engineers recognize the importance of the mitigations, know about this threat model, then they will notice if the CEO is moving that direction. And similarly within the government, there are checks and balances, and they could be activated if people are looking out for it.


Gus Docker (1:17:41)

Do you think these traditional oversight mechanisms like a board being in control of the CEO being able to fire the CEO or the possibility of a Congress or the Supreme Court overruling or constraining the US president. Do you think those will persist in environments where AI is moving very fast and AI capabilities are growing at a rapid pace?

Tom Davidson (1:18:13)

It's a great question. Here's one story for optimism. Today, things are moving fairly fast, but those checks and balances are somewhat adequate, at least preventing really egregious situations. By the time that AI is moving really quickly, we'll have handed off a lot of the implementation of government, the implementation of things in the AI companies, the research process, have handed it off to AI systems. And when we do that handoff, we could program those AIs to maintain a balance of power. So rather than handing off to AIs that just follow the CEO's commands or AIs that follow president's commands, we can hand off to AIs that follow the law, follow the company rules, report any suspicious activity to various powerful human stakeholders. And then by the time things are going really fast, we've already got this whole layer of AI that is maintaining balance of power.

Like the whole AI government bureaucracy, the whole AI company workforce, they are like better than humans today at standing up to misuse potentially. They are less easily cowed and intimidated, and they could actually make it harder for someone in a position of formal power to get excessive influence. So this is the flip side of the singular loyalties where you potentially deploy these AIs that are explicitly loyal. You can actually instead get singular law-following and balance-of-power-maintaining AIs that you deploy. And so the hope is that by the time we're really seeing speedups from AI, we've already set ourselves up in an amazing way to maintain balance of power, and there's this critical juncture where we are handing off to AIs, and it's just, what are those AIs, what are their loyalties? What are their goals? And I think we can gain a lot by making sure that those AI systems are maintaining balance of power, reporting illegitimate suspicious activities and are not overly loyal to any one person.


Gus Docker (1:20:25)

How do you think the risk of AI-enabled coups interfaces with more traditional notions of AI takeover? So just a misaligned, highly capable or advanced AI system taken over contrary to the wishes of the developers or the governments?

Tom Davidson (1:20:46)

There's some close analogies. The perhaps the most analogous case is the case of secret loyalties where you've got these AIs that have been told by the CEO to have the secret goal of seizing control and then handing control to the CEO. That's just very similar to AIs that wanted to seize power for themselves secretly. And all the same stories could apply where the AIs make military systems, and then they control the military systems and the robot army, and then they seize power. And the only difference is, were they seeking power because it just kind of accidentally emerged from the training process, which is the misalignment worry, or were they seeking power because the CEO programmed them in that way? That's the seed of the power seeking. But then with the secret loyalties threat model, the rest of the story is pretty similar. There's still differences. In the secret loyalties case, the CEO might be doing more to help the AIs along with their plan. Maybe even in the misalignment case, the AIs have managed to manipulate the CEO into doing similar things. So that's the case where it's most analogous.

Another difference that's salient to me is that if there are lots of different AI projects, then an AI-enabled coup seems a lot harder because you'd need lots of different humans to coordinate to seize power together, which seems while I can totally believe that one person might try and seize power, it does seem less likely to me that there'd be loads of humans that would want to do that from lots of different labs. Whereas, from the misalignment story, it is more likely the case that if one of these labs has misaligned AI, then maybe lots of them have misaligned AI. And so then it's more likely that you would have, maybe 10 different AIs colluding and then seizing power and taking over. And so that kind of collusion between multiple different AIs is more likely in the case of misalignment than in the case of an AI-enabled coup.


Gus Docker (1:23:05)

Just because if there's one misaligned AI, then there's something about the training process for AI systems that are causing misalignment, and then it will be a common feature among many companies.

Tom Davidson (1:23:17)

Exactly. Whereas just the fact that one CEO instructed a secret loyalty would not, to the same extent, make you expect that other CEOs have done the same.

Gus Docker (1:23:28)

So you mentioned this possibility, but what do you think of the prospect of a president or a CEO of a company being duped by a misaligned AI into conducting a coup on its behalf? So you can imagine a president or a CEO thinking that he's conducting a coup to remain in control, but he's actually acting on behalf of a misaligned AI.

Tom Davidson (1:23:54)

I think it's an interesting threat model and some people who think about AI takeover threat models take it pretty seriously, and it's just a case where we're completely mixing these two threat models together. People who are worried about AI takeover for this reason should be very supportive of the anti-coup mitigations I'm suggesting, because if we implement checks and balances that prevent any one person from getting loads of power, then that AI will not be able to convince them to try because they just won't be able to succeed. So I see this as an additional reason to worry about AI-enabled human coups and to try and prevent them is that, yes, even if no human wants to do this normally, misaligned AI might make them try.

In terms of how plausible I find the threat model, honestly, I think that if a human tries to seize power, the main reason is that that human wanted power. This is just something we know about people. We know it about heads of state today. It's very clear that many heads of state in the most powerful countries in the world are very power-seeking. We know it about CEOs of big tech companies. We know about some of those leading AI companies that we do know that they're very power-seeking, those CEOs. And so I don't think we need to theorize that they were massively manipulated by the AI and convinced to become power-seeking. I think it's more likely that if they seek power, they just did it for the normal human reason.

I do think AI will get ultimately get good at persuasion. I don't particularly expect it to be hypnotic level persuasion, though obviously there's massive uncertainty here. And yeah, I do think that a very smart AI, where there's a human that's already interested in seizing power and it already makes sense for them to maybe do it, the misaligned AI could totally nudge them in that direction and then could implement that in a way that actually allows the AI to seize power later, I think that is very plausible.


Gus Docker (1:26:15)

When we're thinking about distributing power and having this balance of power, we can imagine the models being set up via post-training, via the model spec, via various mechanisms to obey the user unless what the user instructed to do is in conflict with what the company is interested in, and perhaps obey the company unless what the company is using the model for is contrary to what the government permits. But when we set it up in those levels, you ultimately end up with the government in control in some sense, and I guess that exposes you to risk of a government coup then. If you have at the ultimate top layer of the stack, here's what the models can and cannot do according to the government?

Tom Davidson (1:27:13)

I'd say a couple of things. First is that the government isn't a monolithic entity, and so that government decision of what the balance should be could be informed by multiple different stakeholder groups, and then ideally, it's ultimately democratically accountable. I do think that democratic accountability becomes more complicated in a world where there's massive change in a four year period.

Gus Docker (1:27:36)

Just for the simple reason that there's no election during a period where a massive change is happening? So the feedback loop is too slow?

Tom Davidson (1:27:45)

Exactly. I think the risks of AI-enabled coups will probably emerge and then be decided within a four year period, as in it will be resolved whether or not it happens or doesn't, all without any intermediate election feedback. And that doesn't mean that democracy can't have an effect because politicians anticipate what future elections will find and want to maintain favor throughout their terms, but it does pose a challenge. But I was saying even absent that, there's many different stakeholders in the government, and so it would have to be a large group of government employees that were trying to do a coup, and then the companies would know that they were setting these odd restrictions on the behavior, and so the companies would know, and they have leverage and power, and then it could go public. So I don't think it would be that easy for the government to do a coup.

Gus Docker (1:28:46)

Perhaps there's a difference also between allowing the government to set restrictions on what the models can do and then allowing the government some kind of access to commanding future AI systems in certain directions. So it's kind of setting limits versus steering the systems.

Tom Davidson (1:29:03)

Yeah, exactly. The distinction I was gonna highlight was between specifically making AI systems loyal to, for example, the head of state, and setting very broad limits where there's just like you can pretty much do whatever you want except for these obviously bad things. Where that second option doesn't really enable anyone to do a coup. It just enables everyone to do whatever they want, and then you've blocked out all of the coup-enabling possibilities through those limits, as long as you haven't made those systems loyal to a small group. So given that there's this obvious option to just put in these limits that block coups but don't enable coups, and given that there's a wide range of stakeholders that could potentially feed into what the AI's limitations and instructions are, I think it's very feasible to get to a world where there's robustly not centralization of power. There's obviously a big uncertainty over whether we will actually get our act together and get those limits put in place in the right way.


Gus Docker (1:30:09)

When do you think the threat of AI-enabled coups materializes? Is it at some specific point in AI capabilities or does it simply scale with the systems getting more advanced? When do you think the threat is at its peak?

Tom Davidson (1:30:29)

It's a good question. For the threat models that I've primarily focused on, they require pretty intense capabilities. For example, the secret loyalties threat model more or less requires AIs to do the majority of AI research. So we're talking about fully replacing the world's smartest people in a very wide range of research tasks and coding. That's pretty intense. And then a lot of the threat models that I focus on route through military automation, that is AI and robots that can match human boots on the ground. And that's pretty advanced.

That said, I think you can probably do it with less advanced capabilities than that. Drones today are already pretty good, already providing, making a big difference in some military situations. So it's not out of the question that more limited forms of AI and robot military technology could be enough to facilitate a coup. It's a bit harder because if they're limited, then there's a question of why the existing military doesn't just seize back control after a bit of time. So probably that scenario also has to involve things like maybe the current president supporting the coup and therefore pressuring the military not to intervene or some other source of legitimacy for the coup beyond the AI-controlled drones.

And then there's also more typical types of backsliding, like has already been happening in the US that I think could be exacerbated through AI-enabled surveillance and AI increasing state capacity in other ways. And that backsliding doesn't require super powerful AI. You could probably do a lot of monitoring, a lot of content moderation on the Internet, a lot of surveillance with today's systems. It doesn't get you all the way to one person having complete control where they can just quash any resistance with a robot army and replace everyone in their job with an AI, and so no one has any leverage. So I think to get to that real intense, the most intense form of concentration of power via AI, that requires really powerful AI. But to just significantly exacerbate existing trends in political backsliding and to make it easier to do military coup, I think more limited systems would suffice.


Gus Docker (1:33:20)

We discussed earlier the possibility of one country or one company outgrowing the rest of the world and concentrating power into those entities. Now you mentioned one person. Do you think that's actually a plausible scenario in which you have, say, one CEO of one company being the person in control of the world via a concentration of power and then a coup.

Tom Davidson (1:33:47)

A hundred percent. Yeah. The story I told earlier about secret loyalties, meaning that now we've backdoored a wide range of military systems, meaning that you can seize power, that's one route. And then there's this other, the company masses amounts of economic power having a monopoly on AI cognitive labor, and then you're leveraging that to get more economic power, more political influence. Yeah, I do think it's possible. Again, there's this big shift once AI can fully replace humans, where today, no one person can never have absolute power. They have to rely on others to implement their will.

Gus Docker (1:34:30)

And this is what makes currently existing dictatorships unstable, where there's always a threat of internal revolt or outside factors threatening the dictatorship. But this could potentially change.

Tom Davidson (1:34:45)

Yeah. There's always a threat of revolt, and then to guard against that threat, the dictator needs to share their power to some extent, has to compromise. But yeah, you could get it all concentrated in one person with sufficiently powerful AI.


Gus Docker (1:35:00)

Do you think we move through a period of increased threat of AI-enabled coups and then reach some kind of stable state, or do you imagine that there's a constant risk of AI-enabled coups in the future?

Tom Davidson (1:35:15)

I think we move through it. Yeah. It's this point about once we have deployed AI across the whole economy, the government, the military, if those AIs are maintaining balance of power, then we could fully eliminate the risk of an AI-enabled coup. It'd just be as if our whole population was so committed to democracy, would never seek power, never help anyone else who wanted to undermine any democratic institution. We already have strong norms favoring democracy, but they're far from perfect, and they have been eroded over recent decades. But you could get rock solid norms. They're programmed in. They cannot be removed except by the will of the people.

There's a bit of a question because you still want to give the human population the ability to change the AI's behavior and its rules so the human population could always choose to move to an autocracy. So I suppose I shouldn't say that we could fully eliminate the risk because democracy could vote to stop being a democracy. But I do think we could get to a point where it absolutely cannot happen without most people wanting it to happen.

Gus Docker (1:36:38)

And did you say you would get to a point in which future AI-enhanced societies are more stable than current democracies and they're less at risk of coups or democratic backsliding than current democracies?

Tom Davidson (1:36:56)

Much more. Yeah, you could get much more robustness there. There's this constant dynamic in today's societies where people care about democracy, but they also care about a host of other things, their own achievements, various other ideological commitments. And so depending on how dynamics play out, depending on how technology evolves and what people's incentives are, sometimes people push against democracy. That's what the Republican Party's been doing in some ways. That's what Democratic Party has done as it's kind of increasingly put pretty ideological people in powerful institutions. So with AI, you can get much more control over those dynamics because you can just make it much more the case that democracy is not being compromised.


Gus Docker (1:37:46)

Are there any ways for us to or are there any risk factors we can look at if we are interested in predicting coups? Do you think there's something we can measure or something we can track to see whether we are at risk of an AI-enabled coup?

Tom Davidson (1:38:03)

It's a great question. I don't think I have an amazing answer, but some things that come to mind:

The capabilities gap between top AI labs and then the gap again with open source. The degree to which AI companies are sharing their capabilities with the public, and if not with the public, then with multiple other trusted institutions, like sharing their strategy capabilities with US political parties and parts of government.

The extent of economic concentration, how much, what are the revenues and net worth of particular companies, particular AI companies. Another one, what is the extent of government automation and military automation by AI systems? And when that automation is happening, how robust are the guardrails against breaking the law and guardrails against other forms of illegitimate power seeking?

How much transparency does the public or the judiciary or the Congress have into how dangerous AI capabilities are being used by AI companies and by the executive branch? Take the example of military R&D capabilities. That is, really smart AIs that can design super powerful weapons. It's scary if companies can just use those military R&D capabilities without anyone knowing. It's also scary if a small group of people from the executive branch can use those capabilities without anyone else knowing how they're using them, because they could be designing powerful weapons and making them loyal to a small group. So transparency into these high stakes capabilities and how they're being used by a broad group. It doesn't have to be public. Probably shouldn't be public, but we have checks and balances already. So another kind of question is, as these high stakes use cases start occurring or they become possible, do we know that there's transparency requirements in place? As we increasingly see AI companies contracting with Palantir and other military contractors, we can begin to see that they're making increasingly powerful weapons. Is there a process of oversight? Do we know that if someone was trying to make AI military systems loyal to them, that it would be spotted? That's another indicator.

We can look at all the standard democratic resilience indicators that the social scientists have come up with. There's various things about free and fair elections, about civil society, about freedom of press that have been getting worse recently in the US, but there's various indicators here. You can look at the degree of government censorship of freedom of speech or what's on the Internet and the degree of surveillance that the government's doing.


Gus Docker (1:41:35)

But if you take all of these things into account, how do you think about the risk of an AI-enabled coup in the next, you know, 30 years, say?

Tom Davidson (1:41:47)

Next 30 years, I think it's high. I think the risk is high. I would guess it's 10% or something. And that, to be clear, if it was just existing political trends ignoring AI, I'd be maybe a few percentage, maybe around 2% or something. There is definitely a risk of that. I'm thinking about the US here.

A big part of my current worries are not about the indicators, but it's about my expectation that AI capabilities will keep increasing quickly and even more quickly, and then the absolute lack of interest in regulating AI companies right now in the US and the difficulty that we will have of constraining the executive under the current situation where the president is using sophisticated legal strategies to increase their own power and is succeeding on many fronts. The US is not doing a great job at constraining the executive. So companies are unconstrained, the executive is poorly constrained, those are the key threat actors here. So with fast AI capabilities progress plus that lack of constraint, lack of transparency, the default is that a lot of those indicators I said get worse, and none of the indicators get better like transparency, and so that makes me think this is very plausible.

Gus Docker (1:43:25)

I mentioned 30 years, but what about 5 years?

Tom Davidson (1:43:29)

5 years, that's tough, isn't it? It's really tough. I think there's a risk. I wouldn't think there was a risk if it wasn't for the AI research causing intelligence explosion angle, but AIs are a lot better at coding and cognitive research related tasks than they are at, for example, controlling robots and stuff. And so even if the final threat ultimately comes through robots or comes through crazy levels of persuasion, you really can't rule out a scenario where AI research is automated in 3 years' time, then in 4 years' time, we've got super intelligent AI controlled by a few people, maybe it's got secret loyalties. Maybe it's being deployed in the government and being overtly loyal to the president. And then a year later, it's backsliding or it's political capture or it's robot soldiers.


Gus Docker (1:44:31)

How do you think about the badness of the outcomes here? How much does the badness depend on the ideologies of the people who are conducting the coup? What should we look out for? Because I mean, I guess we can rank coups by badness, which is not an exercise I think we should actually attempt, but we can talk about the factors involved about what would be the worst kind of coup and what would be slightly better, slightly less bad kind of coup?

Tom Davidson (1:45:06)

Let's imagine it's one person that seizes power. Actually, no, that's the first distinction to draw. If there's a group, then even 10 people is better than one person.

Gus Docker (1:45:19)

And why is that?

Tom Davidson (1:45:20)

Yeah. So 10 people, you get a diversity of perspectives, so more moral views represented, and there's more room for compromise between those perspectives. There's more room for reasonable positions to win out as there's some deliberation as actions are decided upon. There's slightly less intense selection for psychopaths than if it was just one person. So yeah, if it's just one person, that's bad. That's particularly bad. 10 people are still very bad. A hundred people are still pretty bad, but there's big differences there, big differences.

If we're now just thinking about one person or the average person in a group, then we could think about how competent they are, and then we could say something about how virtuous their motivations are. I do think competency is important. I think it's probably underrated in most political discussions, how important it is to just be really competent. Thinking about something like responding to COVID or thinking about something like trying to de-escalate a conflict, Russia-Ukraine or trying to de-escalate Israel conflict. Actually, just being very competent and very good at getting things done is important. And as we mentioned, if you're just willing to rely on AIs and you align those AIs in the right way, anyone could be really competent, but that's not guaranteed. People may really want to cling to their current views without changing their mind.

Let's take the example of Donald Trump. If a really smart AI system told him, "Look, tariffs are definitely bad for the US economy. They're definitely bad and won't give you what you want," would he change his mind? I would guess no. Lots of smart people have already been saying that, and him and his supporters...I don't actually know the economic details here, but my understanding is that most people think that they're pretty bad. And it'll still be the case that Trump will be able to find people telling him that what he thinks is good, and he'll be able to program his AIs to keep telling him that if he wants to. So there's no guarantee that he will become super competent or that whoever seizes power becomes super competent.


Gus Docker (1:47:49)

So there's this kind of loyalty that actually undermines competence just because you're loyal to such an extent that you're not providing feedback that's useful because negative feedback feels bad to receive. And so there's this...maybe this is a bit contrived, but do you think there's a sense in which in the singular loyalty scenarios, the AIs could be so loyal that they're undermining the competence of the person that they're singularly loyal to?

Tom Davidson (1:48:26)

Yeah. It's a really great question. I haven't thought about this, but yeah, in a way the most extreme version of singular loyalties will just agree with whatever the most recent thing that the dictator has said. It's a version of sycophancy, which we already see, without questioning. And they'll do that even when it's not in that person's interests, because that's the kind of type of loyalty that's demanded. Where there's a more sophisticated type of loyalty where you're still completely loyal, but you're also willing to challenge them when you think it's in their best interests. So that's a really nice distinction.

And yeah, I suppose one way of thinking about competence is thinking about what kinds of loyalties the dictator would demand from their AI systems. Another way of thinking about it is how much they would listen to the AI advisor. Even if the AI has the sophisticated type of loyalty and is trying to tell the dictator what to do, the dictator could just ignore them. And you see that again. AI is fairly sycophantic. They will also challenge you sometimes, and then it's up to you whether you listen. So that's the competence bucket, which I think is really important, and I do think there are differences between potential coup instigators on that front, which could be significant.

My expectation would be that lab CEO coups would be more competent than heads of state. But even within lab CEOs, there are some that are more dogmatic than others, and I think that dogma would get in the way of competence.

That's competence. The other thing I mentioned was broadly what are your goals, what are your values, or more character. And here, one thing I think is really important is being open minded, being willing to bring in lots of different, diverse perspectives into the discussion and empower them to really represent themselves and grow and flourish. So I think a very bad thing would be a particular person becomes a dictator, they implement their vision for society, and much better would be empower all the different ideologies and ideas to become the best versions of themselves, and then we can collectively grow and improve our understanding of how to run society. So sometimes people focus when they're thinking about values on "are you this type of utilitarian," or "oh no, I hope you're not a deontologist," and it can get very specific and finger pointing. My view is more that we don't really know what the right answer is and the most important thing is being pluralistic and letting a thousand flowers bloom.


Gus Docker (1:51:30)

We discussed the possibility of getting to a stable state in which we've avoided an AI-enabled coup and now we have, say we have aligned superintelligence where the risk of coup is very low. Do you think this is something that happens for one country and then that one country is in control of the world to such an extent that this is not a process that other countries are undergoing? To be more concrete here, for example, if the US goes through a risk of coups, of AI-enabled coups, but manages to remain a stable democracy, is it the case that Russia or China will go through a similar period of risk of coups?

Tom Davidson (1:52:19)

It's a great question, and it will depend on US's posture towards the rest of the world geopolitically, and it will also depend whether the US has gained a huge military and economic advantage, like outgrowing the world or just developing powerful military technology, as we were discussing previously.

But you can imagine one scenario where the US isn't that much more powerful than the rest of the world yet and isn't that inclined to intervene, which has been kind of the recent trend. And then China developed some really powerful AI a few years later, and Xi Jinping uses it to cement his control over China, so then you now have one AI-enabled dictatorship that is extremely robust, and then you have the US which has avoided that risk, and now they're kind of maybe competing against each other, Cold War 2, trying to outgrow the world, or maybe they're striking deals because they recognize it's not good to compete, and China just indefinitely remains a dictatorship, and that's just a permanent loss for the world.

But you could also raise a different scenario where the US is very far ahead, and maybe it just wants to really secure its position geopolitically, and so it instigates AI-enabled coups in other nations where it's really putting US representatives up on top of those nations. That could be through secret loyalties. It could sell AI systems, let's say, to India that are secretly loyal to US interests, or it could give some particular politicians in India exclusive access to superintelligent AI to help them gain power. So you could apply those same threat models we've discussed, but with the US pulling the strings. Or you could have the US just taking control of other nations in more traditional ways, military conquest and really leaning heavily on extracting economic value out of other countries as they outgrow the world. So yeah, wide range of options here, really.


Gus Docker (1:55:04)

As a final topic here, perhaps we can talk about what listeners can do if they wanna help try to prevent AI-enabled coups, and specifically where to position themselves. Should they be in AI companies? Should they be in governments? Should they be in perhaps eval organizations? Where's the position of most leverage?

Tom Davidson (1:55:30)

Great question. I think being at a lab is a great place to be. I've talked about system integrity, robustly ensuring that AIs don't have secret loyalties and behaviors inserted. That's something that companies need to implement, so if you have interest or expertise in sleeper agents or backdoors to AI models or cybersecurity, then I think being part of a lab and helping them achieve system integrity is an amazing way to reduce this risk.

Another thing you can do at labs, if you're worried about the risk of heads of states deploying loyal AIs and seizing power, is you can help labs develop terms of service where when they sell their AI systems to governments, they have certain mitigations against misuse. One way to frame this is, "Look, we're using really powerful AIs, and we can't guarantee the safety of those AI systems unless we have some degree of monitoring to ensure that the AI systems aren't doing anything unintended." That monitoring could then be sufficient to allow for the prevention of coups because you're being monitoring not only for accidental misaligned AI behavior, but that will also thereby mean you're monitoring for a bad human actor giving them illegal instructions.

Labs will be drawing up contracts with governments, terms of service. They will be thinking about the guardrails, if any, that go on the systems that they sell to governments. But I think there's very careful work to be done thinking through, okay, how can we structure those guardrails? How can we explain them in a way which is very unarguable and it doesn't seem like we're trying to constrain the government? Private companies don't...it's not really legitimate for them to constrain the government, but I do think there's an important thing to be done here in preventing AI-enabled coups, so threading that needle.

It's another thing you could do in a government, but you could also do that kind of work for a think tank or for a research organization that's interlinked with government like RAND. I think could potentially do some of this kind of work thinking about what should be in the terms of services between labs and governments.

Another big thing is that for this system integrity, yes, we want labs to implement it, but we also want there to be some external organization that can certify that. Currently, no external organization is working on this. METR's not working on it. Apollo's not working on it. I don't think any evaluations organization is, but there will be at one point in time a very strong demand for guarantees that the systems that these companies have developed don't have sleeper agents. Because if you're deploying a system in critical infrastructure, in important government roles, in the military, you want to know that it does what it says on the tin. And so I think that a company can make a wise investment by developing that capability, the ability to understand how to insert secret loyalties, how to detect them, being able to assess a company's process to see whether it's sufficient.


Gus Docker (1:59:11)

How close would that work be to interpretability work? Isn't there overlap there looking into what is the system actually doing? Reverse engineering some process to see how things work. Is detecting sleeper agents like a specific constrained problem?

Tom Davidson (1:59:36)

Yeah. It's a great question. It's a specific type of problem, so you could use internals work like interpretability to detect sleeper agents. That's not the only method you could use. You could use behavioral tests. You could look at the training data. I think a mixture of those is useful.

It's also a specific adversarial setup. In interpretability, we're just often interested in all kinds of facts about the model's behavior and its cognition. With secret loyalties, we're specifically concerned with one red team trying to produce a secret loyalty and then a blue team trying to detect it. And so it's adversarial in that the blue team has a tough task. The red team will be purposely trying to hide the sleeper agent. And so that will raise a whole host of new considerations and important things to dig into.

So you can imagine allowing the red team full control over all aspects of the development process. Can they create a secret loyalty? Maybe they can. Maybe it's pretty hard to detect. Okay. But now you only allow them to intervene at one part of the development process, or you allow the blue team to look at all the data that they use for training, or you allow the blue team to look at all of the code that they wrote, and you could get potentially quite a sophisticated understanding of the conditions under which the blue team wins, conditions under which the red team wins. This research doesn't need to be done in a lab. It could be done by an external organization. I think it's a big missing focus of today's technical work, and ultimately, that would inform the assessments of the lab's attempts to do system integrity.

So for technical researchers out there, I'd really highlight that possibility.

Another piece of work for the right person would be beginning to understand the existing military thinking around autonomous systems, that this is already obviously a live issue for militaries. They are increasingly deploying AI. It would be nice to marry up that existing expertise with these risks about more powerful systems enabling coups and get to a consensus within that military community of basic principles, like law following, like distributed control over military systems, and figure out a military procurement process which is both practical but also robustly prevents this kind of stuff. So if there's anyone listening that has a way in, I think that's potentially pretty valuable. Although there's also risk of poisoning the well if it's done badly, so proceed with some care.


Gus Docker (2:02:37)

Perfect. Thanks for chatting with me, Tom. It's been great.

Tom Davidson (2:02:41)

Yeah. Real pleasure. Thanks so much, Gus.


Nathan Labenz (2:02:45)

If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.

The Cognitive Revolution is part of the Turpentine Network, a network of podcasts where experts talk technology, business, economics, geopolitics, culture, and more, which is now a part of A16Z. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing.

And finally, I encourage you to take a moment to check out our new and improved show notes, which were created automatically by Notion's AI Meeting Notes. AI Meeting Notes captures every detail and breaks down complex concepts so no idea gets lost. And because AI meeting notes lives right in Notion, everything you capture, whether that's meetings, podcasts, interviews, or conversations, lives exactly where you plan, build, and get things done. No switching, no slowdown. Check out Notion's AI meeting notes if you want perfect notes that write themselves. And head to the link in our show notes to try Notion's AI meeting notes free for 30 days.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to The Cognitive Revolution.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.