Join us in a deep dive with Allan Dafoe, Director of Frontier Safety and Governance at Google DeepMind.

Watch Episode Here

Read Episode Description

Join us in a deep dive with Allan Dafoe, Director of Frontier Safety and Governance at Google DeepMind. Allan sheds light on the challenges of evaluating AI capabilities, structural risks, and the future of AI governance. Discover how AI technologies can transform sectors like education, healthcare, and sustainability, alongside the potential risks and necessary safety measures. This episode provides a comprehensive look at the intersection of technology, safety, and governance in the rapidly evolving AI landscape.

SPONSORS:
SafeBase: SafeBase is the leading trust-centered platform for enterprise security. Streamline workflows, automate questionnaire responses, and integrate with tools like Slack and Salesforce to eliminate friction in the review process. With rich analytics and customizable settings, SafeBase scales to complex use cases while showcasing security's impact on deal acceleration. Trusted by companies like OpenAI, SafeBase ensures value in just 16 days post-launch. Learn more at https://safebase.io/podcast

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive

RECOMMENDED PODCAST:
Second Opinion. Join Christina Farr, Ash Zenooz and Luba Greenwood as they bring influential entrepreneurs, experts and investors into the ring for candid conversations at the frontlines of healthcare and digital health every week.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...
YouTube: https://www.youtube.com/@Secon...

PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(06:40) Introduction to Technological Determinism
(07:29) Interview with Alan De Foe
(08:16) Roles and Responsibilities at Google DeepMind
(11:12) Transition to Google DeepMind
(16:11) Technological Determinism and Historical Impact (Part 1)
(20:54) Sponsors: SafeBase | Oracle Cloud Infrastructure (OCI)
(23:31) Technological Determinism and Historical Impact (Part 2)
(27:28) Constructivist Critique and Technological Politics (Part 1)
(40:48) Sponsors: Shopify | NetSuite
(43:36) Constructivist Critique and Technological Politics (Part 2)
(51:22) Japan's Isolation and the Arrival of Commodore Perry
(52:51) The Meiji Restoration and Japan's Modernization
(55:25) Technological Determinism and Differential Development
(59:27) Challenges of Predicting Technological Paths
(01:02:40) The Importance of AI Safety and Alignment
(01:05:59) Cooperative AI: Beyond Alignment
(01:09:56) The Role of AI in Global Coordination
(01:15:48) Bargaining and Cooperation in AI Systems
(01:30:29) The Future of Cooperative AI Research
(01:32:27) Understanding Agent Communication and Cooperation
(01:33:10) Commitment Problems in AI Cooperation
(01:33:52) Challenges and Solutions in AI Commitment
(01:34:39) Rationalist Approach to Cooperation
(01:36:39) Potential Downsides of Cooperative AI
(01:37:41) Exclusion and Antisocial Cooperation
(01:44:43) General Intelligence and AI Development
(02:00:45) Evaluating AI Models for Dangerous Capabilities
(02:14:12) Human Interaction in Model Evaluations
(02:14:50) Evaluations in Real-World Settings
(02:15:27) Cybersecurity and Model Capabilities
(02:17:12) Challenges in Model Evaluations
(02:18:14) Staged Deployment and Safety Measures
(02:19:33) Debate on Open Weight Models
(02:21:37) Forecasting AI Capabilities
(02:27:13) Frontier Safety Framework
(02:30:59) Structural Risks and Governance
(02:35:47) Democratic Legitimacy in AI Development
(02:44:26) Proliferation and Governance of AI Models
(02:50:16) Opportunities in AI Applications
(03:01:07) Outro

Full Transcript

Nathan Labenz: (0:00) Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to share a special crossover episode from the 80000 hours podcast, featuring a conversation between host Rob Wiblin and Alan Dafoe, director of Frontier Safety and Governance at Google DeepMind. I first heard Alan speak back in 2017 when I introduced him at a conference in Boston as a professor at Yale who was then working on great power peace. This was before he founded the Center for Governance of AI, which in turn was years before he moved to DeepMind. So I can say with confidence that Alan has been thinking about AI governance harder and planning for the current AI moment longer than just about anyone else. And as you'll hear, that pays off in the form of truly excellent analysis on an impressive range of critical topics. To begin, Alan describes his academic work on the question of just how much ability humans really have to alter the course of technology development. Noting that macro historical trends like Moore's Law suggest a process that transcends individual human choices, he ultimately argues that while technological possibilities don't force us to do anything on their own, in combination with the realities of military economic competition, they can and often do. Simply put, failure to adopt potentially advantageous technologies often means losing to those who do. This is not a conclusion that Allen comes to lightly, and unfortunately for us today, I think it's a pretty hard 1 to escape. It's still possible that a spectacular incident could cause a vibe shift big enough to force a pause of frontier scaling, but the smart money now seems to be on powerful AI soon. And with Pentagon officials quoted in the press expressing their enthusiasm for autonomous killer robots despite the general reliability, reward hacking, and even scheming issues that have recently come to light, militarization of some form seems a foregone conclusion as well. And yet, even if the long term logic is inescapable, I think it would be a huge mistake for frontier developers to underestimate their own individual and collective short term agency. A few years ago, my uncle told me a story about when he arrived in Italy during the height of the Cold War to join a crew that was responsible for firing nuclear weapons at tertiary targets in the event of an all out war. The first time they drilled the launch sequence, 1 of the longer tenured guys took him aside and said, just so you know, if the order ever comes down to shoot for real, we are all going AWOL. None of us want to be part of destroying the world with nukes. Now that's just 1 story from 1 enlisted crew, and I have no idea if that sentiment was widespread enough to have made a real difference in the worst case scenario. But today, the reality is that a very small number of people are pushing the AI capabilities frontier forward. There are only so many elite ML savants and compute constraints, meaning we can't scale all their ideas at once anyway. Meanwhile, it's also now well established that intelligence itself has a jagged edge. Unlike nuclear technology, which had a small number of discrete powerful use cases and a very mechanical associated game theory, the mind space from which AI developers are selecting new forms is manifestly vast, and the models themselves are incredibly malleable. If you believe things could move super quickly as AIs begin to hit important capability thresholds, the specific details of what we build and prioritize just before that point could prove decisive. All this puts the few 100 or maybe as many as a few thousand people who are closest to the major compute budget decisions in a position of great power and responsibility. As we saw in the context of Sam Altman's firing and subsequent reinstatement, a serious threat by technical staff to walk can force leadership's hand. And further, as past guest Daniel Cocatello demonstrated by refusing to sign a nondisperagement clause, even a single individual can create meaningful change if they're willing to stand up for what they believe in. So I would encourage everyone at all of the Frontier AI companies to make time to raise your own level of situational awareness, even if that comes at the cost of moving your specific project forward a bit more slowly, to make sure that the overall enterprise you're engaged in continues to be 1 that you feel good about supporting. To date, you truly have so much to be proud of. Top tier language models, AI doctors, self driving cars, a revolution in biology, robots now folding origami. DeepMind could never ship another product and would already go down as a historically important company. And there's a lot to appreciate in this conversation on the alignment, safety, and policy fronts too. Allen's Cooperative AI research agenda is both fresh and sophisticated. Google's Frontier Safety Framework has truly been, as Allen describes it, part of a serious and important effort by leading companies to advance the AI policy conversation. And lately, I have been thrilled to hear Demis buck the trend by continuing to speak about the possible need for international collaboration on advanced AI development. At the same time, AI Manhattan Project type ideas are rapidly proliferating, and it's not hard to imagine such a project going so catastrophically wrong as to more than offset even the tremendous amount of good that DeepMind and other AI leaders have already done to date. So, again, for the people at Frontier companies, keep in mind that history is not happening to you, nor are you merely living through it. You are, as part of a relatively small group, driving or at the very least shaping it in important ways. We are now in a glorious AI summer, but an AI cold war is looming. Your critical decisions won't be binary like my uncle's squads was, and there are clearly many defensive AI systems that we genuinely need to build. But all the more so because we're accelerating into a super high dimensional uncharted space, if you haven't already, I think it is now time to start thinking and even talking to colleagues about which directions might convince you personally or perhaps 1 day as a group to go AWOL from the project. As always, if you're finding value in the show, we'd appreciate it if you'd share it with friends. I always welcome your feedback and suggestions. You can reach out via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. And, of course, I recommend you subscribe to the 80000 hours podcast, which is always excellent and continues to ramp up their AI coverage. Now I hope you enjoy this very enlightening and deeply thought provoking conversation between Rob Wiblin and Alan Dafoe of Google DeepMind from the 80000 hours podcast.

Allan Dafoe: (6:41) 1 famous quote in the history of technology that was arguing against determinism was that technology doesn't force us to do anything. It merely opens the door. It makes possible new ways of living, new forms of of life. And my retort was technology doesn't force us. It merely opens the door, and it's military economic competition that forces us through. So when a new technology comes on the stage, many groups can choose to ignore it or do whatever they will with it. But if 1 group chooses to employ it in this functional way that gives them some advantage, eventually, the pressure from that group will come to all the rest and either force them to adopt or, you know, lead the other group to losing their resources to the the new more fit, group.

Rob Wiblin: (7:30) Today, have the pleasure of speaking again with Alan DeFoe, who is currently the director of frontier safety and governance at Google DeepMind or or GDM for short. For that, he was the founding director of the Center for the Governance of AI. He was also a founder of the Cooperative AI Foundation and is a visiting scholar at the Oxford Martin School's AI Governance Initiative. And I guess before all of that, you were an academic in the social sciences studying technological determinism, I guess, a great great power conflict, great you know, the the big big peace theory, that that kind of thing, which I guess we're gonna get a little bit of all of these different pieces of your of your work today. Yeah. Thanks so much for coming back on the show, Alan.

Allan Dafoe: (8:07) Thanks, Rob. A pleasure to be here.

Rob Wiblin: (8:09) So later on, we're gonna talk about the frontier model evals as well as why you think cooperative AI might be about as important as aligned AI. But first off, I guess, so you're director of frontier safety and governance. What does that actually involve in in practice? I can see that being a whole lot of different things, and I don't have a sense of what your kind of day to day is like.

Allan Dafoe: (8:29) So my team is called the Frontier Safety and Governance team, we and have 3 main pillars. Frontier safety, Frontier governance per the name, and then Frontier planning. This adjective frontier is a new term, I would say, almost of art to refer to these general purpose large models like Gemini and others. Frontier safety looks at dangerous capability evaluations. It look it tries to understand what powerful capabilities may be emerging from these large general purpose models, forecast when those capabilities are arriving, and then think about risk mitigation and risk management. So this also led to the Frontier Safety Framework, which is Google's approach to risk management for extreme risks in frontier models. That's frontier safety. Frontier governance is advising norms, policies, regulations, institutions, especially with an eye towards safety considerations. And then frontier planning looks to the horizon, tries to imagine what new considerations could be coming with powerful AI and on the path to AGI, and then advising Google D Mind, Google, and really all of society given those insights.

Rob Wiblin: (9:35) So sounds like a pretty big remit. How large is the kind of team that's working on all these questions?

Allan Dafoe: (9:40) So the team is quite small. Though we're actually hiring for several positions right now at the time of the podcast going live, that may be wrapped up. But what's really great about working at Google D Mind is we have a lot of partner teams, and there's a very collaborative culture. So we work with technical safety called the AI safety, Gemini safety, and alignment teams. We work with responsibility teams, the policy teams, and so forth.

Rob Wiblin: (10:08) So I guess Google DeepMind has, over the last year or 2, become, like, more integrated into the rest of Google. Right? So I imagine, are there other groups within this broader entity, I guess, Alphabet, that take an interest in these questions? Or are you may maybe the only group that's thinking about these, like, frontly I guess, you're thinking about the most important models and upcoming upcoming issues and and threats. Are there many other groups that take an interest having that kind of foresight and thinking years ahead?

Allan Dafoe: (10:32) So I would say Google DeepMind is the part of Google that's most specialized at thinking about frontier models. Google DeepMind is responsible for building Gemini, the, you know, frontier model that's underpinning, all of what Google's doing. And we also have responsibility and safety and policy teams that are especially thinking about frontier issues. We then do have partners across Google in these various domains. For example, in policy, we work closely with Google policy on the range of policy implications and considerations connected to frontier models. But Google DeepMind is, I would say, where the heart of the thinking related to frontier policy issues takes place.

Rob Wiblin: (11:13) Okay. So I guess so back in 2021, you'd been the the founding director of the Center for Governance of AI, gov.i, which was a reasonably big big deal then. I think it's gone on maybe maybe to be an an even bigger deal since it's, like, pretty prominent voice in in the conversation around governance of AI. Why did you decide to to leave this thing? That was that was going quite well to go and work at at Google DeepMind instead.

Allan Dafoe: (11:34) Yeah. And, yeah, I agree. It it went well at the time, and it's gone even better since. So a lot of credit goes to Ben Garfinkel, who's the executive director of GovAI, and the many others who who work there. At the time, I was an informal adviser to Demis Hassabis, CEO of dMind, and Shane Legg, cofounder of dMind. And I found that I had a lot of potential impact in giving advice on AGI safety, AGI governance, and AGI strategy. However, to be most impactful, it helps to be inside the company where I have more understanding of the nature of the decisions that they're confronting and more surface area to advise not only Demis and Shane, but also many key decision makers. To take a step back, I wanna reflect on this road to impact, this kind of advising important decision makers, approach. I would say 1 lesson I've drawn from history is that often in these pivotal historical moments, in in crises or in very high leverage historical moments, a lot depends on the behavior and the ideas and the the character of key individuals in history. And, you know, Alexander Hamilton, the musical portrayal, it's like, who's in the room? What and what decisions are made, in the room? And, so I think that's that's true. I think when we when you look at history, especially in these these pivotal historical moments, it's incredible how much the ideas that people coming into the room had, the resources, the insights that that they had available for what the solutions that that that they construct. So that argues for advising people who will be influential and on these important historical developments, and I think AI and AGI is is my view 1 of the most or probably the most important historical development. And I think Demis and DeepMind are very likely to be influential in the ongoing have been so far in the development of AI and AGI. There's a second part of this, which is that in addition to advising influential decision makers, it's that of the idea of boosting decision makers who have the kind of character you would want in critical decision makers. So do they have the sensibility? In in my case, are they aware of the the full stakes of what what is happening? Are they safety conscious? Do they have, the technical and organizational competence to pull off what needs to be built? Because if you, you know, sort of have clumsy hands, even if you have good intentions, that may still lead to a bad outcome. Finally, do they have wisdom, to, you know, be able to make these very hard decisions that have complex and certain parameters around them. And in my view, Demis, and I think, again, Shane, to mention, are extremely, impressive, individuals for these properties, for their safety, orientation, to their broad perspective on the stakes of the issues, to their, wisdom and broad character. I also do wanna, I guess, reflect on gov.ai. During my time, yeah, it produced a lot of great work and great people. It's since gone on, I think, to produce even significantly more great work and great people. So Ben Garfinkel's done a great job. It's interesting reflecting on some of the people who've gone through GovAI. So 1 person who worked very closely with me at the time is Jade Long. She used to be head of, sort of my partner team at OpenAI, and is now the chief technology officer at the UK AI Safety Institute. A number of other very prominent people in AI safety and governance similarly have gone through Cove AI. Markus Enderlong, Robert Traeger came through. Anton Kornack is a prominent economist who's done some work there. Miles Brundage and others.

Rob Wiblin: (15:25) I guess back in 2018, in a in a short interview back then, were saying people should definitely be diving into the into this area because it's gonna it's gonna be grow enormously, and it's gonna be really good for your career, and there'll be lots of opportunities. And I I think that has definitely been been bought out that people who got in on the on the ground floor have been doing doing doing super well at Career Rise.

Allan Dafoe: (15:43) Yeah. And I think it's still early days for any prospective joiners. I think, you know, I I always encourage people to hop trains as soon as they can because, yeah, AI is only you know, it's still just a small fraction of the economy. Yeah. So there's a lot more impact to come and and work to do.

Rob Wiblin: (16:00) We think that in the in the fullness of time, it's gonna be close to close to a 100%. Certainly more than 10%, and it's like point 01% now, certainly not more than point 1% in terms of total revenue. So, yeah, there's like many orders of magnitude to go up yet. Let let's open it by talking about the work that you did in, your, like, previous incarnation, which was, as as an academic. So I think, you did your thesis, you know, back in the early 20 tens on technological determinism. Was was main focus. And think your your the paper that came out of that opens with who, if anyone, controls technological change. What was the academic debate there that you were reacting to or trying trying to trying to be a part of?

Allan Dafoe: (16:36) Yeah. So my academic trajectory had a number of chapters. So the the first was on technological determinism, which we can come to just for completeness. The second was on great power politics and peace specifically, which actually led to a lot of work that continues to be relevant, I think, to the question of AI and AGI governance. And then I also did some statistics and causal inference work, which has some relevance to thinking about AI today. Turning back to technological determinism, I would say I've, first came to this in undergrad reflecting on what shapes history and how we can do good and how we can kind of steer the trajectory of developments in a positive direction. And an insight I had was that you know, history is not just the sum of all of our efforts. It's not just, you know, we all kind of push in in different directions, and then you take the sum and and that's what you get. Rather, there's these sort of general equilibrium effects that, you know, economists, I think, often talk about, where it may be for every unit of effort you push in 1 direction, the system will kind of push back with an equal force, sometimes a weaker force, sometimes a stronger force. And so when you're in such a system, it's very important to understand these structural dynamics. Why does the system sometimes resist efforts or sometimes amplify efforts? Why do you see these really astounding patterns in macro history? For example, if you look at patterns of GDP growth, you know, there's these famous curves where after the devastation of, say, World War 2, both Germany and Japan, completely rebound within, you know, less than a decade, and then kind of return to their pre war trajectory. And we've seen Moore's Law, which, you know, is is just an astounding trend. It's it's not just that it continues, that transistor density is increasing exponentially. It's, you know, very much a line. So it's you know, you can predict where where we'll be quite precisely years in advance. We now have scaling laws, which have given us sort of our generations more, which again seems to allow us to predict years in advance how large the models will be, how capable on on well, for example, on loss and so forth. There's a number of other, I guess, yeah, macro phenomena that that seem quite persistent. The growth of civilization, I talked about, or looked at these trends in what you can call the maximum, so maximum energy processing of a civilization, also things like the height of buildings, the durability of materials. Really, there's just most functional properties of technology over time have gotten more functional. So, like, the speed of of transportation and so forth. Robert Wright, in summarizing the literature, writes that archaeologists can't help but notice that as a rule, the deeper you dig, the simpler the society whose remains you find. There's more generally, I think, an observation, which is almost a truism, that certain kinds of technology are so complex or difficult that they come after other forms of technology. It's hard to imagine nuclear power coming before, coal power, for example. So there's all these macro phenomenon and trends in technology, and it's important to explain them. Now this naive explanation would say if history is just the sum of what people try to achieve, then it's human will that's produced all these trends, including, you know, the reliable TikTok of Moore's Law. But not all the trends are positive. I know you've reflected on the agricultural revolution, which, evidence suggests was not great for a lot of people. The median, human, probably their, health and welfare went down, during this long stretch from the agricultural revolution to the industrial revolution. Of course, gave rise to inequality, warfare, and and various other things. And there's other, trends that different societal groups resisted. So in short, I don't think the answer is history is just the sum of what people try to do. It depends, of course, on things like power, on timing, on, the ecosystem of, what's functional and what's possible and what's not, on what technology enables. So I wanted to make sense of this.

Nathan Labenz: (20:55) Hey. We'll continue our interview in a moment

Rob Wiblin: (20:57) after our words from our sponsors.

Nathan Labenz: (21:00) In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Rob Wiblin: (22:09) The thing that we have to try to reconcile here is that on the 1 hand, we see these trends that seem like they're not really responsive to any person's particular decisions. They're acting almost like it's a little bit like the was it the psycho history in in in Foundation where you just have these, like, broad trends where everyone is just an ant in in in this broader process, and it doesn't and it's not obvious that anything that any particular individual did was able to was able to shift things. On the other hand, technology, at least so far, doesn't have its own agency. It does seem like, in fact, it's humans doing all of the actions that are producing these outcomes, and couldn't they, in principle, they really hated what was happening, try try to shift it? I mean, we we feel like we have agency right now over how society goes, or we feel that we at least have some some agency. So how do you how do you reconcile this macro picture where it seems like things humans don't have that much control over technology, at least historically, with with the micro picture where we feel like we like we do now? Is am I understanding it right?

Allan Dafoe: (23:05) Definitely. Some of these earlier theorists of technology and scholars of technology, this is in the sort of sixties to eighties, even endowed technology, this abstraction, with a sense of autonomy and agency. Like, technology was this driving force, and often humans were along for the ride. So Langdon Winter was 1 of the most prominent scholars, talked about, technology having autonomy. Mumford Lewis Mumford talks about the machine as this capital m abstraction that is, driving where society is going, and humans just support the machine. We are cogs to support it. Jean Jacques Alloule referred to la technique, which is the sort of the functional taking over. And he had this metaphor that humans make a choice, but we do so at the under coercion, under under pressure from what is the main of us, and La Technique is the answer. So I think there were these scholars and others who really did endow technology with this kind of agency. Then a later generation criticized them, saying the this abstract technology is an abstraction, a very high level abstraction, almost, poorly defined. And when you actually look at history in detail in the microscope, you don't where is technology? You don't see the machine, you know, in the room. Is the machine with us right now? You see people, people, with ideologies and ideas and interests making decisions. And I would say this led to a revolution in the study of technology towards what's been called, sort of social constructivism. This looks at methodologically, it's more ethnographic or, sociological. It looks at the details of how decisions were made, the idiosyncrasies of of technological development, the many dead ends or detours, the fact that early on in the development of technology, people didn't know what the end result was, and they had many visions that were competing. And so it wasn't sort of foreordained that the bicycle would look the way it does or the plane would look the way it does. And I would say well, from my personal intellectual trajectory, the PhD program I started in was 1 of the prominent departments working on this at Cornell. And for me, this was a surprise, because I really wanted to explain these macro phenomena, and the answer I got from this department was, this is wrong. This is technological determinism. It is what scholars have since referred to as a critic's term. It's sort of, anyone who actually advocates technological determinism is it's a straw man position. No 1 is serious about this. So this whole generation of these sociologists and historians of technology really looked at the the micro details of how technology developed and dismissed these abstractions that that technology can have autonomy, can have an internal logic of how it develops, can have these profound impacts on society that, you know, we name revolutions after technology, the agricultural revolution, the industrial revolution, and so forth.

Rob Wiblin: (26:07) In this paper, it seemed like the constructivists in in in in their reaction to this determinism were really sticking out a relatively extreme opposing position where they were almost suggesting that, you know, it's always human responsibility, and people always have choice over what technologies they adopt and and what form it takes. Am am I understanding that right?

Allan Dafoe: (26:26) Yeah. In a way, I would say the debate was never directly had or or rarely directly had. It was often indirect. So in defense of the constructivists, I think, you know, they they were asking different questions. They they or more importantly, they had different tools. They had the tool of ethnography and sociology, and they were answering questions that those tools allowed them to answer, which were, you know, the the answers were narratives based on the the conversations that were that that took place, the decisions that were made. And to explain macro phenomena, those tools are not well suited. So I do think there was a mistake that was made, which was to dismiss the claims about macro phenomena and technological determinism in the pursuit of the questions that they had. And I think it's, a real loss for the history of technology that so little work has has since been done on these, bigger macro questions.

Rob Wiblin: (27:25) Were the constructivists motivated by a sense of moral outrage maybe that people that they saw people perhaps adopting technologies that were socially detrimental, and those folks might then excuse it saying, well, we have no choice. We have to do this for, you know, competitive reasons or it's gonna happen anyway. There's nothing 1 can do. And the constructivists were kind of frustrated by this and wanted to say, no. Like, no. You you're responsible. You're doing it. So, so you do have agency here.

Allan Dafoe: (27:50) Yeah. This is this is an argument that's been made often and more recently, yeah, by many different schools and including about AI. So 1 criticism of, let's call it the AGI, ideology as the the these people would put it, is that AGI is not foreordained or the development of AI as in any given sense is not foreordained. But the when we talk about it as if it's inherently, it's coming, it will have certain properties that, deprives citizens of agency to reimagine what it could be. So that's that's, I think, the constructivist position on technology exactly as you said. Now I think the counter position I would offer is you don't wanna equip groups trying to shape history with a naive model of what's possible. Right? You want to channel energy where it will be high leverage, where it will have lasting impact rather than in these settings where the the structure will resist, you know, all the force you push in 1 direction with an equal counter pressure.

Rob Wiblin: (28:51) Yeah. Okay. Yeah. So so we should talk about the the synthesis, I guess, that you try to put forward in in in your thesis. Like, what are the circumstances under which we do have more autonomy, and what are circumstances where it can be extremely hard to to change the course of history?

Allan Dafoe: (29:04) Maybe first, I'll just talk a little bit more through the different flavors of technological determinism. Sure. So because I think it's it's a rich vocabulary for people to have. So that maybe, in a way, the the easiest 1 to accept is what we can call technological politics. This is the idea that individuals or groups can express their political goals through technology. So in the same way that you can express your or achieve your political goals through voting or through other political actions. So if you build infrastructure a certain way or design a technology a certain way, it shapes the behavior of people. So design, affects, social dynamics. Some of these famous examples are, the Parisian boulevards, these these linear boulevards that were built, in many ways to suppress, riots and and rebellion, because it made it easier for the cavalry to get to different parts of the city. Latour is a famous sociologist, who talks about the sociology of a door or the the missing masses in in sociology, which refers to the technology all around us that reinforces, how we want society to be. You can think about gates or urban infrastructure as expressing a view of how people should interact and behave. A famous example is alleged of Robert Moses, who was a famous urban planner, in New York City, that he allegedly, built bridges too low for buses to go, under so that it would deprive, people in New York who didn't have a car, namely, African Americans, from the ability to go to the beach. And so this was an expression of a certain racial politics, that has been asserted. In general, I think urban infrastructure has quite enduring effects, and so you can often think about, yeah, what is the impact of different ways of designing our cities.

Rob Wiblin: (30:55) I guess I guess in recent times, we're familiar with the debate about how design of social networks can greatly shift, like, the the tone of conversation and people's behavior. People have pointed to, you know, the prominence of quote tweeting, on on Twitter where you can, like, highlight something that someone else has said and then, like, blast them on it, potentially leads to more kind of brigading by, like, you know, 1 political tribe against another 1. And if you got if you made that less prominent, then you would see less of that behavior.

Allan Dafoe: (31:15) Exactly. Yeah. I think the nature of the recommendation algorithm, the way that people can express what they want has profound impacts on the nature of the discourse and how we perceive the the public conversation. So this was technological politics. There's a number of other sort of strands of technological determinism. Just very briefly, technological momentum is this idea that once a system gets built, gets gets going, it has inertia. This is from sunk costs. So you can think of maybe, the dependence on cars in American urban infrastructure. You know, once you build your cities in a in a spread out manner, it becomes hard to have pedestrian dense core. Or as has been alleged, maybe electric cars were viable if we just invested more or also maybe wind power and solar power could have succeeded earlier if we'd sort of gone down this different path. We might come back to this. I think a lot of claims of path dependence and technology are probably overstated that, again, coming back to the structure, some technologies and some technological paths were just much more viable. And even if we'd invested a lot early in a different path, I think often it is the case that this kind of the path we went on was likely to be the path we would have been on because of, more the costs of and benefits of the technology than, these early choices that people made.

Rob Wiblin: (32:41) Yeah. I guess the extreme view would be to say, oh, well, we could have, you know, had electric cars or we could have gone down that I guess we did have electric cars in the twenties, think, but we could have gone down that path in a more full throated way, as early as that. I guess the moderate position would be to say, well, no, that wasn't actually practical. There were too many technological constraints, but we could have done it maybe 5 years or 10 years earlier if we'd really if we if we'd foreseen that this would be a great benefit and decided to make some make some early, like, costly investments in it.

Allan Dafoe: (33:05) Yep. And then yeah. To make the counterpoint of there's sort of a certain time when the breakthrough is ripe, I think in AI, this is often the case, that insights about gradient descent and neural networks occurred much earlier than when they had their impact, and it seemed they needed to wait for the cost of compute, the cost of FLOPs to go down sufficiently for them to be applicable. And you could argue, well, what if we had the insight late? I think, you know, once flops get so cheap, it becomes much more likely that someone invents these breakthroughs because it becomes more accessible, you know, any PhD student can experiment on their, you know, what what they can access. So there is a kind of seeming inevitability to the time window when certain breakthroughs will happen. It it can't occur earlier because it the compute wasn't there, and it would be unlikely to occur much later because then the compute would be so cheap, someone would have made the breakthrough and then, realized how useful it is.

Rob Wiblin: (33:58) Was there another school of technological determinism?

Allan Dafoe: (34:01) There's other flavors. So I guess another concept that's emphasized is that of unintended consequences, something we know a lot about. But Langdon Winter points to this notion that as we're inventing technology after technology, we run the risk of being on a sea of unintended consequences. Right? So the future of history is not determined by some structure nor by our choices, but just by being buffeted 1 way or another by the by the

Rob Wiblin: (34:27) Stepping from 1 blender to another.

Allan Dafoe: (34:28) Yeah. And and sometimes it's positive, sometimes it's negative, and I think there's truth to that, that often, you know, technology comes along, and then it takes us some years to fully understand what its impacts are and then to adapt and hopefully channel it in the most beneficial directions.

Rob Wiblin: (34:44) Yeah. There's there's definitely some some effect like that. I guess the the people who really highlight that, I think they're exaggerating sometimes the scale of the negative side effects from technology where, I guess, setting aside some particular cases that we're particularly focused on, it seems like the negative side effects of technology in general have gotten kinda smaller at each generation of technology that that we we do solve more problems than than than we create on average is kinda my take.

Allan Dafoe: (35:05) Yep.

Rob Wiblin: (35:05) Yeah. Shall we come back to the to the synthesis? Sure.

Allan Dafoe: (35:08) Yeah. And yeah. So maybe the last, big part of it is, again, these macro phenomena, and trying to explain them. And so I guess the scholars most working on that, I would say, are macro historians, macro economists that are trying to explain and and political scientists that are trying to explain these long run trends and things like the spread of democracy. So the synthesis is that well, 1 observation that led to the synthesis is that the more micro your observation, the closer to people to the day to day, the more likely you were to conclude a constructivist explanation. So this is a robust empirical finding. If you look at the literature, people who have micromethodologies are much more likely to conclude constructivist type claims, that what matters is individuals, visions, ideas, and so forth. Whereas the more macro your methodology and your aperture, the more likely you were to conclude a more deterministic Yeah. Set of claims. And so so we we have a a puzzle there. The I'd say some of the constructivists concluded that's because the macro scholars, are too high they allow themselves the error of, imputing agency to technology because they're so far from the data. And I I think that's an unfair characterization. Rather, I do think there's, emergent phenomena at different scales of analysis, and so we should give the macro phenomena its due and and try to explain it. 1 analogy that I've I've offered here is, you know, imagine this hypothetical science of wave motion. And so we have, a group of scientists who emphasize wind. Right? It turns out when the wind is blowing, that affects the ripples on top of, the water. And then another community is based on kinetic impact. So they say, look, when we throw rocks into the water, it produces waves, and that's their kind of preferred theoretical framework. And then there's this kooky macro water phenomenologist who says, I've been noticing that whenever the moon is directly above us, the water level is at its highest point, and then when it's at, you know, the horizon, it's at its lowest point. And I've traveled all over the world, and this pattern is robust. So, I will offer this, you know, moon determinism that, yeah, the moon explains water levels. And the the wind and the kinetic scientists would be mistaken to dismiss the moon determinist simply because the moon determinist doesn't have a micro mechanism. That person there's a challenge to that finding, namely, do you explain this this pattern? Because there's no known micro foundation that can explain why the moon is sort of pulling water. But, of course, we know that is in fact, you know, what it's

Rob Wiblin: (37:46) doing. Yeah.

Allan Dafoe: (37:47) And so I think there's a similar results in macro history that there are these patterns that need to be explained, and the fact that we didn't have a micro foundation isn't a reason entirely to dismiss it, but it is a challenge. So what is a possible micro foundation? The 1 I offer hinges on military economic competition, And the key idea is that there's levels of selection. So at the local level, you know, you and I can make a decision about what we do right now, maybe if we wanted to build something, the technology we build, and that just depends on on us, on our ideas, and and so forth. But if we really wanna get going, if we wanna you know, we we build a new kind of art and we want it to be everywhere, then eventually, we're gonna need resources to pay for it, and maybe it can't be too opposed by other groups. And so eventually, we run into these other forces. And I think when you when you think about ways of living, which is kind of a general term for sociotechnical systems for for civilizations, they do run into resource constraints. They they need resources to sustain themselves and then to proliferate, which they typically want to do. And so that involves economic competition, competition over resources and over capital. And then the military aspect is always important because throughout most of history, military competition was ever present. Even if you had, decades or hundreds of years of, peace, eventually, there was military competition from a neighbor, and that provided sort of a higher level of constraint on what ways of living were possible.

Nathan Labenz: (39:27) Hey. We'll continue our interview in

Rob Wiblin: (39:28) a moment after a word from our sponsors.

Nathan Labenz: (39:32) Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just 1 of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right 1 and the technology can play important roles for you. Pick the wrong 1 and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

Nathan Labenz: (41:28) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time,

Allan Dafoe: (41:39) you

Nathan Labenz: (41:39) are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR altogether into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Rob Wiblin: (42:52) Yeah. And I guess even if there is an act of war, kind of everyone's living in the the shadow of violence, think is the term that they're they're anticipating that there there could be war in future, and maybe if they don't play their cards right, then they would be would be vulnerable to aggression.

Allan Dafoe: (43:04) Yeah. And we could add a higher level of selection. So in my thesis, I did put sort of environmental selection on top of military and economic. There there were these sort of circles of selection in the sense that civilizations that might be fit for the economic competition and military competition may nevertheless not be sustainable with their environment, and that could be another source of failure to to sustain itself and proliferate. So you can imagine, again, these these layers of selection. I intended to put so environment at the top, military, economic, and then you might put culture, and then you can put psychology or or more local, dynamics lower down. 1 famous quote in the history of technology, that was arguing against determinism was that, technology doesn't force us to do anything. It merely opens the door. It makes possible new ways of living, new forms of of life. And my retort was technology doesn't force us. It merely opens the door, and it's military economic competition that forces us through. So when a new technology comes on the stage, many groups can choose to ignore it or do whatever they will with it. But if 1 group chooses to employ it in this functional way that gives them some advantage, eventually, the pressure from that group will come to all the rest and either force them to adopt or, you know, lead the the other group to losing their resources to the the new more fit group.

Rob Wiblin: (44:34) So that seems, in a sense very obvious. Yeah. Why do you think the constructivists were were missing this, or, why didn't this kind of stand out to them as an important, you know, effect?

Allan Dafoe: (44:45) Yeah. Well, I'm glad you think it's obvious.

Rob Wiblin: (44:48) Maybe because of my, like, background. Yeah.

Allan Dafoe: (44:51) I mean, I think, again, there was this, you know, let's not underestimate the bias that comes from a scientist using the tools that they prefer to use, you know, looking under the lamplight. So the constructivists were very good at ethnography and and sociology and and this kind of daily life history, this close-up micro history. And and when you look that close, you don't see the machine exerting its its force. Military competition at a macro historical level is ubiquitous, but at a micro historical level is rare. Wars are rare, and and as you said, much of the effect of this military economic competition is through how people internalize that threat. And so then you can equally say, oh, it's not this competition that's driving behavior. It's the ideology of capitalism, of military greatness that is driving the behavior. This is a methodological challenge. I do think there's this concept of vicarious selection, which is that in a an evolutionary environment, it is highly adaptive for an organism to model its environment and to internally kind of simulate what will happen if it goes in 1 direction or another. So this concept was named by a historian of technology who was trying to explain the development of of aviation, aerospace design. And his point was, you know, you don't build a plane and try to fly it and it crashes and build another plane and try to fly it and crash. Rather, you invent the wind tunnel. And and you have a theory. You you model the external environment, and you say, okay. We want these properties in our in our wing. And so you are still doing this experimentation, you're just doing it in a controlled targeted manner internal to the broader economic competition.

Rob Wiblin: (46:41) So if think yeah. If if I think about how these folks might respond, at least my my my simulacrum of them, I guess yeah. Maybe I shouldn't have said it was obvious. It's it's obvious to me because I've literally been been been taught this more or I get it I get it in books and and maybe even in undergrads, I suppose everything is obvious once once you've literally been told it. But a pushback that I can imagine is, you know, we're living in The UK. The UK has nuclear weapons. It's in a pretty friendly neighborhood. Yeah. Are we really saying that when The UK adopts some new technology or designs its cities 1 way or another or has a particular housing policy, It's doing this because it thinks that it has to for defensive purposes because otherwise it's gonna be invaded by France or or Russia or whoever or whoever. It's not as if we're actively thinking about the these defense issues or competitive issues all the time. I guess individual businesses do think about if we don't adopt this new technology, we'll be outcompeted. But at least the the military thing is less clear once you have a very strong kind of defensive position where you don't feel a great great risk of attack.

Allan Dafoe: (47:36) Yes. So in the I think there's 2 things I I wanna say here. 1 is in the modern era, military competition has declined a lot, and we have much more of a global culture than we have had, you know, 100 years ago or or 300 years ago or or more. And so that can change this higher level of selection. I do think it's still there, and, certainly, you see, kind of conversations around national security having a lot of force in domestic politics, in UK politics, US politics, and and pretty much any country's politics. That if if there's a claim that we risk losing, you know, a strategic positioning against an adversary, that can be very motivating for internal reform.

Rob Wiblin: (48:22) Yeah.

Allan Dafoe: (48:23) I guess the second point I wanna make is that there's this great example from The UK history where I think this this dynamic is is really well illustrated. So this is from Thomas Hughes, this historian of energy systems, and he looks at The UK UK energy system, which initially, was much more had these local energy power plants and and sort of local energy systems that were, Thomas argues, were better suited to The United Kingdom's notion of democracy. Right? There was kind of decentralized energy provision. It was much more under the control of localities. It wasn't this, like, big national, energy system. And that persisted up to World War 2, when the sort of, cost constraints that it imposed, became excessive, and and The UK made the decision to adopt more of a national grid. So there is, again, the story of 1 certain way of living arguably aligned with the sort of political ethos of the community persisting until the the cost constraints become excessive, often driven by this, you know, crisis of of conflict.

Rob Wiblin: (49:32) Yeah. So the case study that that you focus on in your work, think, is is the Meiji restoration in Japan in in the mid nineteenth century. It feels like almost about as clean an example of this military competition driving driving history as you could imagine. Can you briefly explain

Allan Dafoe: (49:47) Sure. Yeah. I so during this time, I looked for it is good to have empirics, right, that you can tell a story about Moore's Law and these macro phenomena aren't sufficient evidence for making sense of macro history. So I looked for a case where a a community chose to go in a direction contrary to sort of what was demanded by La Technique, by what was functional in this military economic competitive milieu. And there's this great example of Japan under the Tokugawa regime. This the regime lasted roughly 200 to 250 years, and it was a return to the shogun way of life. So, you know, samurai were at the the top of the pecking order, a very feudal society. They had firearms at the beginning of this period, and they sort of uninvented firearms. They the shogunate centralized firearm production, so everyone who knew how to produce firearms was brought into a central place and then paid a stipend to not build firearms so that the technology was forgotten. So they had cannon firearms, and they lost the technology. That persisted for, again, roughly 200 years. And during this time, Japan wanted little to do with the outside world, but they observed that, you know, things were changing, that the Europeans were were sailing around and, involved in China. And everything changed on this fateful day, 1853, when an American Commodore Perry visited, Japan with the explicit purpose of opening Japan up to trade. And he came in the steam ships that, were seemingly magical. They were moving upwind without sails, belching black smoke made out of, this very large heavy metal, and had a real profound impact on, the Japanese who received them. He gave a demonstration of what was possible with the cannons, to bombard, the shore and gave them white flags to, so that they could signal, their desire for the bombardment to cease, you know, be able to communicate in the future. So he said he was gonna come back in a year and to complete the negotiations. The Japanese at the time asked him, will you bring your ships again? And he said, I'll bring more. And yeah. So that was the opening of of Japan. It led to a 15 year period of revolution. This was the Meiji restoration where different groups were trying to make sense of their new environment. They it it was no longer sustainable to continue the their way of life, and different groups contested it in different ways. And the final answer was, this restoration under the the emperor, and a view that we need to modernize. So the Japanese very proactively sought to learn everything they could about the West. They they sent, people to the West to get all the books on all the industrial arts and so forth. And, you know, Japan incredibly succeeded that, you know, in in just several decades later, Japan was able to contest, in World War 2, you know, control of Asia against The US and Britain and and others. What's, I think, powerful about that story is it really shows how group of people, in a sense, chose of course, there's power infused through it, but but that community chose to go 1 direction with respect to the technology of firearms and and other, aspects of modern industrial, civilization. And that choice was time limited by how long the West would choose to not, force on Japan a different a different way.

Rob Wiblin: (53:22) Yeah. Well, I think the thing that set the time line you can imagine I think the the only reason this was possible for Japan was because it was an island, and so it was actually quite hard to invade. So they had this protective barrier, and that gave them, a a degree of discretion that I don't think they you know, if they were on the on the steps of Asia, I think they would not have had nearly, the the the freedom of movement that they that they had as an as an island, and they would have felt the the pressure and the and the fear of invasion. Like, it would have been much more salient, they would have just been much more focused on being able to to defend themselves. I suppose it gave them breathing room that allowed them to fall quite a bit behind, but at some point, they fell so far behind that even even the the the the sea barrier was not enough to to keep them safe from invasion. And at that point, they basically they they they, I guess, did a complete 1 80 and decided to to catch up and and and modernize.

Allan Dafoe: (54:07) Yes. Yeah. So, yeah, I find this case study quite compelling, and most case studies are are rarely so clean, again, because people internalize these external pressures and there's mimicry and status dynamics where some communities look up to other communities in a way that's often correlated with power or wealth. So it's the the narrative is not as clean. Whereas in this case, it was very clear that what what forced the change was the sheer power of the steamship and cannon that that the West could bring.

Rob Wiblin: (54:42) So I guess the reason we're talking about technological determinism is that many many people in our circles are very focused on this idea of differential technological development. I guess a related more recent idea is is defensive accelerationism, which is that the way that we wanna try to shift history in a positive direction is changing the develop shifting the order in the development of technologies or advancing some particular lines of science and and research to try to get them ahead of of other ones. And I guess you you wanna advance the ones that you think are generally making the world safer, so that you have, like, more of those technologies by the time other more more risk increasing technologies arrive on the arrive on the scene. What does the, you know, discipline around technological determinism say about whether this is a viable pursuit and a and a sensible approach to trying to influence history in a good direction?

Allan Dafoe: (55:30) Yeah. Great question and a big question. To give a bit more color on differential technological development, which is a very clumsy term to say, but we haven't the community hasn't come up with a cleaner term. I guess you can think of to motivate it, maybe the best example is the seat belt. The seat belt seems like something we could have invented before the car. You know, you you can imagine faster moving vehicles, and the value of, restraining a person, from you know, in in the event of a collision. So it doesn't seem like it requires the invention of the capability, the combustion engine in the car, in order to invent the seat belt. And so in principle, we could have invented the seat belt before the car and then had it ready to go as soon as cars were diffusing so that, you know, we didn't have to wait decades for the seat belt. This is an example of safety technology that pairs nicely with the capability to make the capability safer, and and there's other examples that would make it more beneficial in other ways. Another class of technologies or interventions are when you can develop the countermeasures or societal defenses for the for the capability that has these adverse byproducts. An example there would be a vaccine. So if you know that, you know, potential disease could come, if you can develop the vaccine in advance. A third category is a substitute. So an example often given is whether wind power or solar power could have been made more cost effective than fossil fuel based power. And so, you know, if the cost curve was such, then we would have invested more in these sustainable energy sources, and then civilization would have gone down that path rather than a more fossil fuel dependent path. So I think these interventions those are the cleanest examples. There's more general ones that imagine wholly different technological paths that have different properties. Well, it's worth reflecting on them. So, again, Langdon Winter argued nuclear power was more authoritarian. The the story being the the nature of the technology is that it requires centralized development, and it requires strong, coercive infrastructure around it to make sure it's not abused. Whereas wind and solar, a lot permit both decentralized, development and so decentralized politics, and they don't have this risk that requires more of a, you know, security state on top of it. So there's arguments of certain technological trajectories that have these byproducts, political or social effects. Now to offer some challenges so I guess, firstly, I do think differential technological development is a very important idea that we should be thinking about a lot. And in many ways, it's it undergirds the whole notion of AGI safety. The whole notion of AGI safety as a field is that we want to, on the margin, put a bit more effort onto safety, AI safety or AGI safety, than we otherwise would, than the market would naturally do so with the idea being that that's gonna make a difference. It's like inventing the seat belt before the car. So I think it's a very important idea. However, there's a lot of good reasons to doubt its tractability or or feasibility. 1 way to see this is that most of the arguments for differential technological development require some person to see 2 pathways that are both viable, you know, given some effort, and then furthermore, to anticipate the consequences of each of those pathways to choose the better 1. And these things are hard. First, it's it's quite hard to know what is the next viable step in technological development. Right? If you know that, you can make a a great business. That's that's like the market is very hungry for that insight. And so you should think it hard to find 2 of those Yeah. To be at the point where you can have this marginal choice. Again, it has to be a marginal choice that others aren't already pursuing in order for it to be an intervention. Or you have to convince a resourced actor to choose to go in 1 way or another. And then there's the second stage, which is you have to predict the consequences of going down 1 path or another, which is extremely hard. You have to anticipate the full sequence of technologies and the tree of technologies that will spawn off of 1 path versus another, and then the many direct and indirect consequences of those technologies. And we know from the study of technology and and just from our attempts to make sense of technology that it's just very hard to foresee the direct effects and second order effects of technology. So that's that's a bit of pessimism. Coming back to the technological determinist perspective, I think this notion of technological momentum would say there is this path dependence in directions you go down. You build up expertise, you sink investments. So it is important at the beginning of major investments in a new infrastructure, when you when you find yourself making these large investments, ask, is this going to sync costs? Is this gonna make it harder for us to choose a different path in the future? And reflect, you know, are there other consequences we should be weighing before we go down significantly in a certain direction?

Rob Wiblin: (1:00:39) I mean, in the archetypal case that we're talking around here with artificial intelligence and an AI alignment methods, Doesn't feel well, on 1 hand, gotta say, well, we gotta have, like, 2 different viable paths where, you know, some incremental effort might push us in 1 direction versus another. It doesn't feel quite as binary as that because you can imagine almost everyone thinks that, you know, AI companies are gonna put some effort into alignment. They they do care that the models do, broadly speaking, what what what's being asked. And the idea as well, we just wanna do more we wanna do more of that than the market might provide. So it's not like we're choosing between, you know, aggressively non aligned AI that someone really wants. It's more just trying to go even yeah. Go even further on this thing that most people are gonna regard as desirable and want to incorporate if it's if it's practical. And then in terms of deciding whether this is actually a better path, you know, is it actually going to have better consequences? I mean, I'm sure I'm sure people have made arguments alignment might might backfire. It could be could be could be could be worse than worse than not aligning it. You can imagine ways, but still on balance of probabilities, it seems, like a like a reasonable bet, not not something that people are super uncertain about, or at least that I feel really uncertain about. So, I mean, this is the case that people focus on so much because it is among the better ones that people have ever have have have come up with for trying to pursue differential technology develop technological development. And maybe there's lots of other ones that were left on the scrap heap because it wasn't clear that they were either viable or or desirable. What what what do think?

Allan Dafoe: (1:01:56) Yeah. So I agree that safety and alignment are, on net, very beneficial bets that we should be investing heavily in. So I agree with that overall assessment. To make the counter case, 1 would be this general equilibrium argument that the marginal return on your investment in safety and alignment is actually quite less than what you pay because the market would otherwise have have provided that. So to motivate this, you can reflect that the success of AI assistants today are very much constrained, I think, by their ability to be aligned and safe. Safety and alignment is a huge priority for developers because if they're not, then they will not be good products. So the the market already is providing a huge motivation for advancing this field. As it happens, these techniques of RLHF, reinforcement learning from human feedback and yeah. Which is 1 of the key alignment techniques, and then I think constitutional AI is another nice 1, were developed by individuals who are motivated by AGI safety and supported by those resources. Maybe that brought the technology forward a few years. But the, you know, the counterfactual imagine we had no investment in AI safety and alignment. Maybe it delays it 2 years until the market demands we solve this problem, and then other researchers rise to the challenge.

Rob Wiblin: (1:03:21) So so is the the SMART Alec response to, reinforcement learning from human feedback being developed by alignment and safety focused people, and then applied to make AI useful and, you know, economically, valuable across all kinds of different domains is to say, well, you've you've wasted your time. Maybe you've even made things worse by by speeding them up, which you didn't wanna do. On the other hand, seems like a reasonable reaction to say, well, like, what what's your plan to not develop any other technologies that actually makes AGI work? That doesn't really seem like an alternative. Surely that would well, that would just delay things at best, and we need to get to this we need to get to the point that we're at now at some point sooner or later. Right? But I guess you're saying even if it's not actively detrimental, it could be kind of useless because well, the market, like, you know, GDM or some other group would have realized that we needed the equivalent of reinforcement learning from human feedback to make it work, and so they just would have done it at some at some later time anyway. So, the effect of your work has kind of just been undone.

Allan Dafoe: (1:04:10) Yeah. And so, again, to reiterate, I think the bet on safety and alignment and interpretability are very good bets Yeah. On that, so we should keep doing them. But on the margin, I guess, what we wanna do, and I think, you know, sophisticated individuals in this space are thinking this way, is look for what work in safety and alignment would not otherwise be done by the market in time. And this is maybe where the AGI safety and AGI alignment points to. It asks what are the what what's the seat belt for AGI? What are the guardrails that we need for AGI that the proto AGI, the the pre AGI system would not the market would not motivate us to find the solution for those systems. So we wanna look ahead. So this sometimes goes to the notion of deception, you know, which might generate a whole new class of problems when an AI system is misaligned, but it can hide that and deceive us. That that might be a different problem that than the systems we have today, and it might arise right around this critical period. So this is 1 argument for differentially focusing on that problem as opposed to others.

Rob Wiblin: (1:05:14) Yeah. Makes sense. Okay. Let's let's push on from technological determinism and talk about this research research agenda that you've been involved in, I guess, promoting and elaborating called cooperative AI. I mean, yeah, as we've been saying, the the main focus, I suppose, of differential technological development, thinking with regards to AI has has been alignment for for many, many years. But, but you and some co authors in this paper back in 2021 said there's there's this whole other cluster of behaviors that we might like to speed up the development of around cooperation that you think could be similarly important or at least on the margin could be similarly important because people aren't really talking or or thinking about it. How do you define cooperative AI, and and why do you think it's it's quite key?

Allan Dafoe: (1:05:50) Yeah. Great question, and it's a big the answer will be extensive because the, yeah, the whole theoretical framework around cooperative AI is is sort of large and complex. 1 way of putting it simply is that alignment is insufficient for good outcomes. And to make an even stronger claim, you could say it's not necessary. It's it's not necessary to solve alignment to have good outcomes. Now this is a strong claim.

Rob Wiblin: (1:06:14) A Yeah.

Allan Dafoe: (1:06:15) It's strong claim, but it helps motivate the case. So imagine we sort of only 90% solve alignment, which is to say, you know, our models do what we want within certain bounds, but we know if we scale them too far, we can't trust them to continue to behave as we intend them to behave. If we know that, then and we have global coordination, so humanity can act with wisdom and and, you know, prudence, then we can deploy the technology appropriately. Right? We can deploy it within domains and to to the extent that is safe and beneficial. So this is the sense in which global coordination is almost a necessary and sufficient condition. If we can globally coordinate, well, that's the necessary argument. Then we could deploy it to the extent that it's safe to deploy. The sense in which global coordination is sufficient is, again, that if if we're globally coordinated, then we could just appoint a reasonable decision maker to make this risk calculus, and and that would satisfy sort of humanity's collective view on how we should develop. So now coming back to suppose we solve alignment. So this is a very strong case, but we don't solve global coordination. I can imagine things still not going very well. So we have great powers that would develop these powerful AI systems aligned with their interests and in conflict with the other great powers' interests. And historically, great power conflict has been a major source of harm to humanity, you know, devastating wars, the brinksmanship around nuclear weapons has arguably imposed, you know, unexpected costs on humanity more devastating than the world wars, and and the sort of willingness of, the leaders of The US and The Soviet Union to gamble over nuclear war for these geopolitical stakes. And then there's other consequences, like maybe a failure to deal with climate change or insufficient global trade or pandemic preparedness. All of these sort of global collective action problems that we insufficiently address partly because we are not coordinated at at the the highest level and geopolitically. So that's 1 motivation for cooperative AI is to say, if if we want things to go well, we really need 2 pieces of the puzzle, or we we ideally have 2 pieces of of the puzzle. We have systems that are safe and behave, which is to say, especially behave as intended by the principal who deploys them, and we're able to deploy collectively AI systems and and continue our our activities in a way that's jointly peaceful and productive, which is to say we've we've solved enough of our collective action problems that we're not continuing to engage in nuclear brinksmanship or trade wars or other major welfare losses due to insufficient global coordination?

Rob Wiblin: (1:09:12) Okay. So the the idea is even if you have AIs that are aligned with the goals of their operators, then this doesn't necessarily lead to a good outcome if those operators are in conflict with 1 another. That the the AI systems that they're that they're working with could simply lead you to a disastrous outcome just as the fact that we're maybe aligned with our own interests doesn't necessarily produce a great outcome across humanity as a whole. You can still end up in in traps and unintended disasters, and and the same basically applies in a world where where it's AGI that's doing most of the operationalization of what people want.

Allan Dafoe: (1:09:43) To give another example to motivate this, there was this famous flash crash in 2010 where some algorithmic trading so this is, you know, early days. This this is not sophisticated AI. Some algorithmic trading led to an inadvertent sell off in the stock market that led to trillions of dollars of on paper loss before the emergency stop gaps kicked in, which sort of stopped trading and allowed those trades to be unwound. Because those were not intended by by the traders, rather it was this emergent dynamic from algorithms that had, you know, some some protocol that made sense, within normal bounds, but when interacting could could get out of control. You sometimes see this on Amazon or other online marketplaces where, like, famously, some book will sell for, like, millions of dollars because apparently, these 2 sellers each had an algorithm that would bid up the price as a function of what the other others were selling it at, and these things iterated to, you know, a crazy valuation. In general, this and flash crashes, I should say, happen frequently. We just in the stock market, we have these safeguards so that when there's a sudden movement in the market, trading will stop, and then there's rules for how you can unwind those trades. So as we deploy AI systems, simple AI systems, narrow AI systems, and increasingly general AI systems out in the world, and increasingly, as people talk about agents in the world, are more empowered, more general purpose, can kind of move between domains, maybe have access to bank accounts and emails and so forth, how do we make sure that there aren't these unintended emergent dynamics that could be harmful? And cooperative AI partly looks to addressing that issue.

Rob Wiblin: (1:11:33) Yeah. I guess a specter that haunts this entire conversation, if we're if we're focusing on the on the military case in particular, is that, guess, historically, when you have countries that are in intense conflict with 1 another, you get this process of brinksmanship where, you know, 1 country will escalate and the other country has to decide whether to, like, you know, cool or, like, escalate or or or back down. And it kind of you you keep getting this brinksmanship escalation process until, I guess, at some 0.1 of them blinks or they or they decide to find some solution that that that that makes them both happy. Trouble is if you have an AI operated military, these AIs can make these decisions on a completely inhuman time scale, where this entire process of brinksmanship that might take days or weeks or months when when humans are the ones who have to go away to a meeting and think about it and discuss it and decide how how to react. It could all all play out in a matter of minutes, basically. I think that that that possibility terrifies people, and and and and it's 1 thing that is discouraging people, I guess, from from placing AI into important decision making roles and, you know, over national security at all.

Allan Dafoe: (1:12:31) An important question in the deployment of agents will be, what are the degrees of autonomy we endow our systems with versus when do you have a human review decisions of different kinds? And that's a function of the stakes of the decision, the kind of resources that are deployed, maybe how much actuators the agent has. So, you know, if the AI is controlling weapon systems, then the important role of human review becomes much greater. But as you know, there is this if there is a time pressure, Paul Sharrie, who's a theorist of AI in the military, worries about a flash escalation occurring in potentially kinetic warfare or another dynamic, it would be, cyber conflict.

Rob Wiblin: (1:13:13) So looking over this paper, I I wasn't initially completely convinced that this was something that we had to go far out of our way to to focus on. But basically because I think of cooperative behavior and knowing how to cooperate with other agents as just something that's instrumentally convergently useful, that if you're developing agents that are just good at their job at all, that are in fact useful to apply, then they have to learn how to cooperate, at least in the kind of cases where they're being used, because otherwise they're just bad at their job. It's the kind of thing, like we were talking about earlier, that there's a lot of commercial pressure to to develop this. And also, these models we're we're imagining are, like, very generally capable. They're, you know, insightful, intelligent, and maybe maybe approaching or exceeding human level. Wouldn't they just be able to think about these things and figure out how to cooperate the same way that I think, you know, our thoughtful human beings figure out how to try to avoid conflict as as best they can?

Allan Dafoe: (1:14:03) Yes. I think you're right that cooperative skill or cooperative intelligence, as we would characterize it, is likely instrumentally useful. And so we should see some of some cooperative skill developed as a byproduct of almost any development agenda for AI. The cooperative AI bet is that on the margin, it's beneficial to invest more in it early to so that when we get to powerful systems, they are more cooperatively skilled than they otherwise would be. And in this respect, it's very similar to the safety and alignment bet. Again, safety and alignment is something that is likely to be developed, by default to some extent. The bet, though, is that it's worthwhile for us to invest energy early so that we're further ahead on safety and alignment than we otherwise would be. The seat belt comes earlier. This cooperative sophistication and skill comes earlier relative to the level of capabilities of the agents that are out in the world.

Rob Wiblin: (1:14:56) I guess yeah. Well, we've talked about the the military issues with cooperation, but that's like a pretty extreme case. I imagine there's a lot more other more mundane examples where cooperation could be useful. Do do you wanna give a give a couple of those?

Allan Dafoe: (1:15:06) Yeah. I mean, so much of society is bargaining interactions. So economic exchange, whether be on some marketplace to sell used goods or actual marketplaces to, you know, you know, like the financial markets and or major corporate deals. And, you know, there's a lot of welfare gain that can be had if we can strike deals more efficiently. If when, you know, 2 parties can be made better off, they they reach that understanding. Right now, we bargain through, you know, existing protocols and institutions that are are human built. In principle, AI could be much more effective or at least an AI human team could be much more effective. There's exotic solutions, like maybe you could put your AI delegate in a box with my AI delegate, and then we have them bargain. But then all that we let come out of the box is the this the proposed solution. And that may, in fact, solve some bargaining problems, whereas often, the challenges in the act of bargaining, we reveal private information, which could give the other side an advantage. And so that leads bargainers to withhold information, to bargain slow, to demonstrate resolve, or, signal that they have a good outside option. And this idea of put your agents in a box, and the agents know that the only thing that they can do is is output the solution or say there's no deal could dramatically change the nature of the bargaining dynamics so that we're much more often going to get a solution in a way that involves less of this, costly signaling that that typically takes place.

Rob Wiblin: (1:16:45) Yeah. There's other ways that I thought it might be easier for AIs to cooperate than it is for humans. I mean, 1 1 thing is that they they just have a much higher bandwidth of communication. They can actually just, like, send an enormous number of words, have a very lengthy conversation where humans just wouldn't regard it as worth it in order to to try to try to reach some some some bargaining outcome. I guess, also, you can kinda they can commit to act in a particular way because you can just copy them and demonstrate that in a given situation, a model will will will accept a particular kind of bargain consistently and then say, well, so all of the exact copies of this piece of software will do the same thing. Right? So I've I've created this cooperative software and and you and you should trust me. That's not something that you can easily do with human beings. I guess we try to look at their historical track record to learn what they're like, but here it's even easier to get to to judge the the character of an AI model. Are there ways that are there important ways that it could be more difficult or that it's, like, less straightforward for for AAs to to cooperate with 1 another than humans?

Allan Dafoe: (1:17:38) Yes. So great question. You may have to remind me to come back to it as I enter. There's a lot here. So first, I wanna clarify the cooperative AI bet is is sort of a a portfolio bet. It says there are many cooperation problems at present and in the future between AIs, which is what we're mostly focusing on, but also between AIs and humans and humans, human and human cooperation. And the bet is that on the margin, if we put in effort now, AI might help us with these cooperation problems. So it could help 2 humans cooperate better. It can also help future AI systems cooperate or AI systems and humans cooperate. I'm making this distinction partly because I think 1 very promising direction is AI systems that can help humans reach solutions. So we were just talking about this bargaining setting where there's some conflict of interest. Another maybe more prosocial example is where people are trying to do political deliberation. We're trying to find out what is the right course of action for our community, for our municipality, or our our family, or our our nation. And, you know, we have a set of tools for doing that. These institutions of deliberation, the press, voting, and so forth. And AI could dramatically potentially help humans better find the course of action that that they would want to pursue. Google DeepMind recently produced a paper talking about this idea of a Habermas machine, which is AI as a sort of facilitator of political deliberation. And what they find is that, like, language models today can actually serve as a as a useful tool to summarize the political views of a range of people about some actual policy issue and then to articulate a consensus, like a a detailed productive consensus that the people will sign on to. And what they report is that this Habermas machine, this AI, can actually articulate a better consensus than the humans that they employed to to try and do so. So there's a I think, you know, if we imagine extending that further, that could really help in a lot of political settings where we do have difficult issues to to talk through, multidimensional, and AI can help us understand what are these dimensions, what do the dimensions that I care most about, what are the potential mutually beneficial solutions between me and and another group on that dimension.

Rob Wiblin: (1:20:05) Yeah. So I guess to come back to what what what are the ways that it's that is more difficult for for AI to cooperate?

Allan Dafoe: (1:20:10) So I think a lot of people think AI will be better at cooperating. That's, I think, the prior. And maybe something we can talk about is what I've called the super cooperative AGI hypothesis that as AI scales to AGI, so will cooperative competence scale to sort of infinity, that that AGIs will be able to cooperate with each other to such an extent that they can solve these global coordination problems that

Rob Wiblin: (1:20:33) Almost magic. Yeah.

Allan Dafoe: (1:20:34) Yeah. And and the reason that hypothesis matters is that if it's it's true that AGI as a byproduct will be so good at cooperation that it will solve all these global coordination problems, these collective action problems, then, in a way, we don't need to worry about those humans. We can just bet on AGI, push, solve safety and alignment, and then AGI will solve the rest. And so it's it's an important hypothesis to really think through what is it, you know, what does it mean, what is required for it to be valid, and for us to empirically evaluate, are we on track or not? So here are some arguments against AIs being very cooperatively skilled with each other. And there's different levels of this argument. We can say, maybe AI will be cooperatively skilled, but not at the super cooperative AGI hypothesis level, or maybe they will even be less cooperatively skilled than human to human cooperation. This the strong version of the argument would say humans were pretty similar. You know, we come from the same background biological background and, to a large extent, cultural background. We can read each other's facial expressions to a large extent. We especially if 2 groups of people are bargaining, we can often sort of read the thoughts of the other group, especially if, like, 2 democracies are bargaining. You can read the newspapers. You can read the press, of the other side. And so in a sense, human communities are transparent agents to each other, at least democracies.

Rob Wiblin: (1:22:03) So we also have an almost historical experience to kinda judge how do people behave in different situations.

Allan Dafoe: (1:22:08) Exactly. We can judge people from all these different examples. We can look at you know, we have folk psychology theories of childhood or the upbringing of people explaining their behavior. And also the range of goals that humans can have is, you know, fairly limited. We we roughly know what most humans are are trying to achieve. AI might be very different. Its goals could be, you know, vastly different from what humans' goals could be. 1 example is most humans have diminishing returns in anything. You know, they're not willing to take gambles of all or nothing, like, bet the entire company 50%. We go to 50%, we double valuation. Whereas an AI could have this kind of linear utility in in wealth or other things. That would change. AIs could have kind of alien goals that are not that are different than than what humans typically anticipate. They may be harder to read. So certainly, if you have interpretability infrastructure that could be easier to understand, but if there if there's a bargaining setting, why would 1 bargainer allow the other to kind of read its neurons? So in a sense, AIs could be much more of a black box to each other than humans are.

Rob Wiblin: (1:23:21) I guess the the extreme instance of this is that they can be potentially be backdoored, so they could have, you know you have sort of I I think on the under the current technology, we almost can can implant magic words into AIs that can cause them to behave completely differently than we than than they would previously, and this is extremely hard to detect, at least with current levels of interpretability. So that could be something that could really trouble that you could have an AI completely flip behavior, in response to slightly different conditions.

Allan Dafoe: (1:23:45) Exactly. So, you know, humans can deceive, but it's it's hard. It requires training, and and to have this level of complete flip in the goals of a human and the persona is is, you know

Rob Wiblin: (1:23:57) It doesn't exist. Yeah.

Allan Dafoe: (1:23:59) Whereas, as you say, an AI in principle could have this kind of 1 neuron that completely flips the the goal of the AI system so that it's yeah. Its its goals may be unpredictable given its history, and also this is a challenge for interpretability solutions to cooperation. So people sometimes say, well, if we allow each other to observe each other's neurons, then we can cooperate given that transparency. But even that might not be possible because of this property that I can hide this backdoor in my own architecture that's so subtle that it'd be very hard for you to detect it, but that will completely flip the meaning of everything you're reading off of my neurons.

Rob Wiblin: (1:24:39) This this is slightly out of place. What I mentioned this crazy idea that you could use the possibility of having backdoored a model to make it extremely undesirable to steal a model from someone else and apply it. You can imagine, you know, The US might say, well, maybe we've backdoored the models that we're using in our national security infrastructure so that if they detect that they're, being operated by a different country, then they're gonna completely flip out and behave incredibly differently, and there'll be almost no way to detect that. So that could be I mean, I think it's a bad situation in general, but this could be 1 way that you could make it more difficult to to hack and and and take advantage or just, like, steal at the last minute the model of a different group.

Allan Dafoe: (1:25:10) Yeah. So arguably, this very notion of backdooring one's own models as an anti theft device, Just the very idea could deter model theft. It it makes a model a lot less useful once you've stolen it if you think it might have this, you know, call home feature or or behave contrary to to the thief's intentions feature. Another interesting property of this backdoor dynamic is it actually provides an incentive for a would be thief to invest in alignment technology because you want to if you're going to steal a model, you wanna make sure you can detect if it has this backdoor in it. And then for the anti theft purposes, if you want to build a an anti theft backdoor, you again want to invest in alignment technology so you can make sure your anti theft backdoor will survive the the current state of the art in alignment.

Rob Wiblin: (1:26:05) Yeah. A virtuous cycle.

Allan Dafoe: (1:26:07) So so maybe this is a a good direction for the world to go because it, as a byproduct, incentivizes alignment research. I think there could be undesirable effects if it leads models to have this kind of highly sensitive architectures to subtle aspects of the model, or maybe it even makes models more prone to very subtle forms of deception. Yeah. So, yeah, more research needed for investing.

Rob Wiblin: (1:26:34) Sounds a little bit like dancing on a knife knife edge. Okay. So that's some ways that it might be more difficult for AIs to cooperate. I mean, maybe another reason why this research agenda doesn't feel intuitively so absolutely essential is that the AIs that I that I interact with, you know, LLMs, they just seem like very cooperative and very nice by by by nature, and it's kind of easy to imagine scaling them up, doing the same sort of RLHF that we're doing to produce that sort of personality now and say, well, wouldn't they continue to act to to have nice personalities and and and be really quite cooperative by by nature? What what do you think of that of that argument?

Allan Dafoe: (1:27:08) Yeah. I would say we need to make a distinction between niceness and cooperatively skilled or cooperatively intelligent. So cooperative skill is distinct from, like, a cooperative disposition. When we when we say someone's cooperative, we often mean they're sort of nice, they're altruistic, they're generous, they're prosocial. And the cooperative AI research program is not about that. It's not about how we make nice AI or prosocial AI. It's about how we make cooperatively intelligent AI. AI that can solve these cooperation problems better than than they otherwise would be able to. So it's an open question how cooperatively skilled are today's AI systems. I agree they're nice, but they're primarily interacting with individual users. They're not much deployed to solve cooperation problems. So we do need a science, an empirical science of how cooperatively intelligent they are. We can also expect that in equilibrium, they will not be necessarily so pro social because they will be deployed by interested actors in a, you know, in a bargaining setting. So if 1 actor, you know, we're trying to make a deal, I want my AI system not to be nice. I want it to be sort of a

Rob Wiblin: (1:28:29) faithful you're calling.

Allan Dafoe: (1:28:29) Yeah. A faithful delegate to the interests of the principal. And and so then again, yeah, in equilibrium, it should be aligned with me, not with some notion of the the social good. And then the question is, will it be able to efficiently bargain with other agents that are similarly aligned with other human principles to find a mutually beneficial solution? Either that or can we build this notion of a governor or like a mediator that will we can both trust has our respective interests will weigh our respective interests reasonably, and then we would sort of each tell this mediator arbitrator our goals, our resources, our outside options, and maybe prove it, and then the arbitrator would would tell us what the solution is.

Rob Wiblin: (1:29:18) Yeah. I mean, a nice thing about AI models is because you because you can test them and then use exactly the same model to consider the the new inputs as all the previous ones. You can trust them maybe as a mediator. That by by looking at the track record to say, well, they they produced fair outcomes in all these previous cases as a judge, so I would like trust it to produce a fair a fair outcome in in in in in new dispute that that I'm engaged in. So I guess, yeah, a lot a lot of potential there. What what is the actual agenda for trying to make AIs more cooperative? Are are people working on it? Like, what what what actual technologies do we need to develop?

Allan Dafoe: (1:29:49) Yeah. I think there's a lot of theoretical ideas and research programs that could help. Probably the highest leverage for the foundation. So there's this cooperative AI foundation, which I helped, found, and the foundation's goal is to promote cooperative AI. So AI systems that are that are more corporately skilled than they otherwise would be, especially betting on AI systems scaling and capability. So we're kind of betting on very capable AI systems in the future. 1 area that the Cooperative AI Foundation has found to be high leverage is environments or benchmarks. This being a I think this is true for many domains, is a kind of public good for AI development, is building an environment that measures cooperative skill, and then you invite many groups to sort of compete to have the most cooperatively skilled agent. And then you measure that and you celebrate, you know, advances in that. So it's often the case in AI, a good benchmark can really motivate work because it gives this this target for research to so, you know, it's so the expression goes to hill climb on.

Rob Wiblin: (1:30:57) Okay. So you have lots of different tests for how effectively these AIs have cooperated. And I suppose you can you can have, like, almost conflicting scenarios where you'd almost require different behaviors in order to produce a nice cooperative outcome in these different cases. I guess it's like running lots of kind of different game game theoretics scenarios that might demand that you be a hard ass in other cases and that you be friendly in other cases. I suppose there's there's this history of all these game theory tournaments where I try trying to figure out what what are the most simple cooperative agents that you can have. And then, I guess you can just have a research background to try to figure out to to try to develop the the most successful cooperative agent that that can that can successfully produce a lot of utility across a lot of different, yeah, possible situations that might be thrown into.

Allan Dafoe: (1:31:36) Yep. Yeah. And there's all kinds of interesting aspects to this. Does the agent have a good theory of mind for the other agent? Can it model what the other agent's trying to achieve and see what the agent, you know, would do otherwise? Are they able to communicate effectively? So there's some basic just, you know, can we even have a shared vocabulary? So can you know, I want to express something and you want to understand it, but can we communicate to that extent? And then there's communication under, adversarial incentives. Can I communicate to you what's important to me when I'm not sure if you'll use that information well or not? You might use it against me. So then there's this issue of, can I articulate can I can I share information in a strategically optimal way that minimizes my vulnerability while maximizing the potential cooperative wins we can get? There's an a third category of cooperation problems related to commitment problems. This is you and I might both know the nature of the game. The prisoner's dilemma is the classic example, the sort of canonical simple example of a commitment problem. We both know we'd be better off if we both cooperate, but since I can't we can't sign a treaty that says, I'll only see if cooperate if you cooperate, then, unfortunately, in the 1 shot, prisoner's dilemma, agents will defect. So are there techniques or technologies for building that kind of treaty mechanism that when we identify a cooperative bargain, we can build a protocol so that we will both do it knowing that the other person is committed to it.

Rob Wiblin: (1:33:07) Yeah. I noticed this funny phenomenon in discussion of AI and cooperation where people will point to ways that advances in AGI could lead to negative outcomes, and then they conclude that, really, what we need is a solution to the commitment problem that that, you know, if we were able to solve this problem that it's you you can't credibly commit to not use your future power to to oppress someone else. If we could if we could fix that issue, which has beguiled humanity since basically the beginning of history, then that would allow then that would get us an out. But it's a bit like, I don't know, appealing to a it's like saying the the the problem here is that we don't understand where consciousness comes from. And if philosophers could only solve a, you know, theory of consciousness and theory of mind, then then then we'll be in the clear. That's not really a strategy because it's unlikely that we would be able to actually solve this solve this problem. Have you noticed this as well?

Allan Dafoe: (1:33:54) Yeah. Yeah. So to give a bit more of the background context, the rationalist approach to cooperation, different scholars, scientists have tried to distill cooperation problems down to kind of a fewer fewer elements. And 1 game theorist, Robert Powell, argued basically everything is a commitment problem because any kind of cooperation problem can be reformulated as if only we could commit to a judge who would, you know, solve the problem, then we wouldn't have the cooperation problem. If we wanna unpack it a little bit more, people will often point to informational problems as a second class to commitment problems, and then sometimes there's other categories like issue and divisibility. This third class would be, there's some some pie that we wanna divide, and in principle, we agree, like, a 50 50 divide would be fine. But for some reason, we can't divide the pie 50 50. It's all or nothing, and that might lead to bargaining breakdown. So coming back to the commitment problem, which arguably, as you say, kind of undergo undergirds all cooperation problems. I do think this is an important research area, and, you know, we shouldn't count on it. But who you know, maybe AI can help us solve commitment problems much further than we realize. I think in Carl Schulman's podcast with you, he talks about this a bit or at least is is referring to it. To me, Carl has expressed probably the most compelling story of how AI could solve our commitment problems to me, which can we build a technology that our our AIs or our AGIs can build a third AI in a way that each of them can verify is not backdoored, is not secretly biased to the other side. And then we we build up from kind of the foundation, this third AI, and then we hand over power to that third AI to to make decisions for us. So if you can solve that that, building that problem of building it in a way that's verified to be fair, then maybe we could really solve a lot of our problems.

Rob Wiblin: (1:35:55) Yeah. It's a it's a very big big prize if we could make it work. In the paper, you spend quite a bit of time talking about potential downsides of having more more cooperative AI, which I guess is, yes, it's a virtuous thing to do. What what are some of the ways that it could backfire to to make AIs more cooperative?

Allan Dafoe: (1:36:08) Great. There's a number of ways to think about this, and I do think in general for our various bets for positive impact, it's always good to really think hard how could this backfire or what are the negative byproducts of this research bet or or, you know, social impact bet. So I would say a few things. Firstly, cooperation sounds good, but by definition, it's about the agents in some system making themselves better off than they otherwise would be. So it could be 2 agents or 10, and the the problem of cooperation is how do those agents get closer to their proto frontier? How do they avoid deadweight loss that that they would otherwise experience? It says nothing about how agents outside of that system do by the cooperation. So there may be this phenomenon of exclusion. So increasing the cooperative skill of AI will make those AI systems better off, but it may harm any agent who's excluded from that cooperative dynamic. So that could be other AI systems or other groups of people whose AI systems are not part of that cooperative dynamic.

Rob Wiblin: (1:37:19) Yeah. I guess on that exclusion point, this is a famous quote. What was it? Democracy is 2 wolves and a sheep deciding who to eat for lunch. Or I guess, like yeah. Once you have a majority, potentially exclude others and then then extract value from them.

Allan Dafoe: (1:37:31) Exactly. And so there's lots of kinds of sort of cooperation that are antisocial. We typically don't want students cooperating during a test to improve their joint score. Right? Tests in in sports, you know, very clearly have rules against cooperation. There there are certain entities, and also in the marketplace, you know, companies there's rules for how companies should interact with each other. And criminals, gangs, you know, all kinds of criminal activity. We do not want criminals cooperating more efficiently with each other.

Rob Wiblin: (1:38:04) Yeah. I mean, I guess 1 reason that the mafia is such a potent organization is that they've basically figured out how sustain intense internal cooperation without the use of the legal system to enforce contracts and agreements and so on, by by, I guess, you know, having social codes and screening people extremely well and things like that.

Allan Dafoe: (1:38:21) Exactly. So we can think of cooperative skill as a dual use capability, 1 that is broadly beneficial in the hands of the good guys and can be harmful in the hands of antisocial actors. There's this hypothesis behind the program which says that broadly increasing cooperative skill is socially beneficial. And I think that's a it's we call it a hypothesis because it's worth interrogating. I think it's probably true. But here, the bet is if we can make the AI ecosystem, the frontier of AI, more cooperatively skilled than it otherwise would be. Yes, it will advantage antisocial actors, but it will also advantage prosocial actors, and the argument is on net, that will be beneficial.

Rob Wiblin: (1:39:05) So I guess the the natural way to argue that, I suppose, is to say, wow, we've become better at cooperation over time as a as a civilization, as a species, and history in general has been or, like, well-being has generally been been getting better. Are there other arguments as well?

Allan Dafoe: (1:39:19) Yeah. I think the set of arguments I'm drawn to along the lines of what you articulated that even though cooperation can empower groups to cause harm and to be antisocial, it does seem like it's a net positive phenomenon in the sense that if we increase everyone's cooperative capability, then it means there's all these posit pro social benefits, and there's sort of more collective wins to be unlocked than there are antisocial harms that would be unlocked. In the end, cooperation wins out. I guess it's similar to the argument for why trade is net positive. Trade also can have this property where you and I being able to trade may exclude others who we formally did business with. But on net, global trade is beneficial because it, you know, every trading partner is looking for the sort of most the best role for them in the global economy, and that eventually, becomes beneficial to, virtually everyone.

Rob Wiblin: (1:40:20) I mean, I guess some some people argue, and not not for no reason, that actually maybe well-being on a global level has gotten worse over the industrial era because although humans have gotten better off, the amount of suffering involved in factory in factory farms is so large just to outweigh all of the gains that have been generated for for for human beings. Because that because in that case, the joke about 2 wolves and a lamb deciding who to who to eat for lunch is is is quite literal. I suppose that it's an abnormal case because, you know, pigs and cows are not really able to negotiate. They're not really able to engage in this kind of cooperation in the way that that humans are. So maybe it's not so surprising that better cooperation among among a particular group might damage those who are, like, not actually part of any like, not able to form agreements whatsoever anyway.

Allan Dafoe: (1:41:02) Yeah. I think that's a nice example of the exclusionary effects of enhanced cooperation. And this may be, you know, a cautionary tale for humans that if AI systems can cooperate with each other much better than they can cooperate with humans, then maybe we get left behind as as trading partners and as decision makers in a world where the AI civilization can can more effectively cooperate within, you know, machine time speeds and using the AI vocabulary that's emerged.

Rob Wiblin: (1:41:33) Yeah. I mean, I know so some people have painted a pretty grim picture here where cooperation you know, if if you have lots of AGIs that are able to communicate incredibly quickly, to credibly make to to explain how they will behave in future and make credible commitments to 1 another and do this all super super rapidly, then it might be very straightforward for them to cooperate with 1 another. They wouldn't be able to do the same thing with humans because we're not we can't we can't commit in the same way. We can't communicate at at their pace or with their level of sophistication. And so naturally, they would form a cluster that would then cooperate it super well to get all of it to extract all the surplus and and exclude us. How how big a problem is this for for the Cooperative AI agenda?

Allan Dafoe: (1:42:12) Yeah. And what you just described is related to 1 of the main challenges to alignment, and that a lot of alignment techniques involve having multiple AI systems checking each other. So you have, you know, an overseer AI that's watching the deployed AI to make sure it behaves well, and maybe you have multiple overseers and majority vote systems so that even if there's, you know, joint misalignment, it's hard for any 1 AI system to to sort of defect because it can't coordinate with the other AI systems that may might be trained in a different way. So they're, you know, they don't share necessarily the same background or the exact nature of their misalignment. So this worry that some have for AGI safety is that there will be this kind of collusion amongst our AI systems so that we can't rely on any of these institutions of of AI to to make us safe. I think probably this is 1 of the biggest downsides to the Cooperative AI research program from an AGI safety perspective, and it's really worth investigating. The reason I'm persuaded that Cooperative AI on net is worth pursuing, well, is firstly, this should be investigated. Top of the agenda should be understanding, the different flavors of of the portfolio of the members of the portfolio in the Cooperative AI agenda to ask what are the what's the case for being positive and negative and to unpack that. So we should, at least in that test the hype evaluate the hypothesis of different parts of corporate AI being worth pursuing. And then secondarily, my view that ultimately global coordination is is such an important, feature of things going well that, there's a lot to be gained by betting on having more cooperatively skilled AI systems than we otherwise would.

Rob Wiblin: (1:43:59) This this ties in this this this other paper you're you're in writing called levels of AGI for operationalizing progress on the path to AGI. I think it's a little bit of a mouthful. So we've been referring to that paper as the what is AGI paper, which is is a bit more catchy. Yeah. What about the nature of AGI did you wanna point to in in that paper that you thought, like, many people were missing? That paper

Allan Dafoe: (1:44:21) primarily was trying to offer a definition and conception of AGI that we thought was fairly implicit in how most people talked about AGI, and we just wanted to provide a a rigorous statement of it so that we could move on from this, I think, unhelpful strand of of dialogue, which said AGI is poorly defined or or AGI means different things to everyone. We think, yeah, we've received very positive feedback on that paper, and I think most people think that is a reasonable and useful conception of AGI. In the paper, we discussed some of this, but there's some some prior ideas that I I would wanna call out, which is that, yeah, AGI is a complex concept. It's multidimensional, and it is prone to fallacies of reasoning in people who use it uncritically. So let's unpack that a bit. 1 fallacy is people think AGI is like human level AI. They think of it as a 0.1 single system, a single kind of system, and and often they think it's like human like AI. So it has the same kind of strengths and weaknesses of of humans. We know that's very unlikely to be the case. As it is and historically, AI systems are often much better than us at some things, chess, memory, you know, mathematics, than other things. And so I think we should expect AI systems and AGI to be highly imbalanced in the sense of what they're good at and what they're bad at doesn't look like what we're good at and what we're bad at. Secondly, there's the risk that the concept of AGI leads people to try to build human like AI. We wanna build AI in our image. And some people have argued that's a mistake because that leads to more labor substitution than would otherwise be the case. From this economics point of view, we want to build AI that's as different from us as possible because that's the most complementary to human labor in the economy. If we have, like, an example is AlphaFold, this, Google DeepMind AI system, that's an AI system that's very good at predicting the structure of proteins, it's a great compliment to humans because it's not doing what we do. It's not writing emails and, you know, writing strategy memos. It's predicting the structure of proteins, which is not something that humans could do.

Rob Wiblin: (1:46:41) No 1 knows. Losing their job to alpha folks too.

Allan Dafoe: (1:46:43) Exactly. But it is enabling all kinds of new forms of productivity in medicine and and health research that otherwise would not be possible. So arguably, that's the kind of AI systems we should be trying to build. Alien, complementary, some would say narrow AI systems. Another aspect of the concept of AGI is it's, arguably, pointing us in the direction of general intelligence, which some people would argue is a mistake. We should go towards these systems of narrow AI systems. That's safer. Maybe they would say it's it's even more likely, general intelligence is they they might argue is not a thing. I'm as less can by that argument. I do think general intelligence is there's a kind of implicit claim about the nature of technology, about this implicit logic of technological development. I do think general intelligence is likely to be, an important phenomenon that will win out. It will be 1 will be more willing to employ a general intelligence AI system than a narrow intelligence AI system in more and more domains. We've seen that in the past few years with large language models that, you know, the best poetry language model or the best email writing language model or the best historical language model is often the same language model. It's the 1 that's trained on the full corpus of human text. And the the logic of what's going on there is there's enough spillover in lessons between poetry and math and history and philosophy that your philosophy AI is is made better by reading poetry than just having separate, you know, poetry language models and philosophy language models and so forth.

Rob Wiblin: (1:48:23) I mean, it's amazing thing about the structure of knowledge that, I guess, we we didn't necessarily know that before maybe because I think in humans, it's true that we specialize much more, that, it's very hard to be the best poet and the best philosopher and the best chemist. I But guess maybe maybe when your brain can scale as much as you want, you have unlimited like, virtually unlimited ability to read all these different things. And in fact, there there are enough connections between them that you can be the best at all of them simultaneously.

Allan Dafoe: (1:48:47) And I think it's an important empirical phenomenon to track. It it it need not be true that the the best physics or philosophy language model is also the best poetry or politics or, you know, or history language model.

Rob Wiblin: (1:49:03) I guess you can imagine in the future, might come apart if there's more effort to develop really, really good specialized models that Exactly. At the frontier, perhaps this isn't the case anymore.

Allan Dafoe: (1:49:11) Yeah. And it certainly seems like a coding language model doesn't need to know history. So even if reading history helped it in some sense, know something about coding, maybe later we should distill out most of the historical knowledge so that the model's smaller and, you know More efficient. Yeah. So this is a phenomenon to track, and I think implicit to the notion of AGI is that general intelligence will be important. It it we're not kind of at peak returns to generality. It it will there will continue to be returns, so we'll continue to see these large models being trained.

Rob Wiblin: (1:49:45) Are there any more any other pros and cons of the generality that are worth flagging?

Allan Dafoe: (1:49:49) Yeah. And the concept of AGI. So I think 1 thing that really makes the concept of AGI useful, and a lot of people critique it or or say they don't wanna use the concept, I've rarely seen an alternative for what we're trying to point to. Sometimes people say transformative AI. I don't think that's adequate. Transformative AI is usually defined as AI systems that have some level of impact on the economy on like, at the scale of the industrial revolution. And the problem with that is you can get a transformative impact from a narrow AI system. So you can have an AI system that poses a narrow catastrophic risk that is not general intelligence. I do think there's something important to name about AGI.

Rob Wiblin: (1:50:29) I think you often see this phenomenon where people say this this term is kind of confused. It's too much of a cluster of different competing things. But despite all the criticisms and worries, people just completely keep using it all the time, and they always come back to it. And I think at that point, you just have to concede there's actually something super important here that people are desperate to refer to, and we just have to clarify this idea rather than give up on it.

Allan Dafoe: (1:50:47) Yeah. And yeah. And it is it's good that we always try to refine our conceptual toolbox, and there are these other concepts and targets that are worth calling out. So 1 that's been named by Ajay Okhotra and and many others is this idea of machine learning AI that can do machine learning r and d that can have this kind of recursive self improvement. That need not be general. It could be a kind of narrow AI system. But if it does kick off this recursive process, that is an important phenomenon to and and I think Holden Karnovsky is also, really drawing attention to this. But, yeah, back to general intelligence. I do think AGI is a probable phenomenon. It's probable we'll get general intelligence around the time we get AI systems that can do radically recursive self improvement or a lot of other things.

Rob Wiblin: (1:51:41) On on on that point, I think actually a slightly contrarian take that I have is that people think ML people hate this. But the the idea that, in fact, machine learning research, even the cutting edge stuff, like, might actually not be that difficult, that in fact, the numb the kind of things that if you try to break it down into the specific process by which we're improving these models, you've got, like, theory generation stage, then you've got, well, how do we actually test this? How do develop a benchmark? And then you actually run the thing, very compute intensive, and then you go and you decide whether it was an improvement and then go back to the theory generation stage. This might be possible well short general age of of a full AGI that has all these different skills, especially if you focused on it. Like, in fact, maybe a lot of this research is, like, much less difficult in some sense than than people might who are involved in it might wanna believe, and they could be relatively easily automated, which would be quite shocking and and quite consequential if true.

Allan Dafoe: (1:52:28) And an interesting phenomenon we've seen in public surveys, of AI and ML experts is that they often put automation of the ML process to be very late, 1 of the last or the last task that's performed by AI systems, and often significantly later than AGI, human level AI, which is has been defined in different ways. But even given a quite high definition, the automation of MLR and D is thought is often reported to

Rob Wiblin: (1:52:55) be Totally. Yeah.

Allan Dafoe: (1:52:56) To be later than that. 1 argument is that this reflects this kind of bias that everyone thinks that their career, their the task that they're good at is really hard and special and won't be automated. Maybe. Think there's also a case in favor of of the idea that of the set of human tasks, the sort of last thing to be fully automated is the process of improving MLR and D. So we can reflect on that. Coming back to AGI, it does seem useful to point to this space of AI systems, and and let's recall it's a space. It's not a single kind of AI system. Defined as an AI system that's better than humans at most tasks. And there's different ways you can define most and different ways you can define the set of tasks. But given a so you can say economically relevant tasks, and most could be 99% or 50%, or I think it's usually quite helpful to choose a lot. I think you don't wanna set it at a 100% because there might be some tail tasks that, for whatever reason, take a long time to automate, and most of the impact will occur before. And then you can also there's a parameter of how much it means to be better than. So is it better than the median human who's unskilled? I think, typically, you wanna look at skilled humans in that task because that's the economically relevant threshold. And so this this part of the sort of future space that we're looking at, I think, is important to conceptualize because this is the point when labor substitution really takes place. So that has profound economic impacts and political impacts because people are no longer in the role. There's this sort of natural human in the loop that takes place when humans are doing the task is no longer the case. There's also this, I think, I call it a performance threshold that we cross when whatever was previously possible with humans, that that set is no longer the kind of bounding set of what's possible. Because when the AI is better than humans at some task, there are new qualitatively new things that become possible. This is like AlphaFold. And that also represents an important moment in history when new, you know, technologies come online.

Rob Wiblin: (1:55:01) So a reason that this connects is that I think that the the the paper alludes to this idea that as you're approaching AGI, you could start with it being very strong in some areas and relatively weak in other dimensions. And then, you you've got, like, an area where it's gonna be already, like, vastly superhuman, and then there's, like, the last few pieces that come into place where it begins to approach human level. And you could potentially try to change or just choose which ones those are gonna be. You know? Is it gonna be strongest on the cooperative AI stuff to like, already? And then we, like, add in, I know, I guess, like, technical knowledge or just, like, practical know how or agency and so on. Or do we start with the agency and the the the having a an an enormous level of factual knowledge, then we add in the cooperative stuff later? Do you have any take on like, do you think that's an important, maybe, underrated idea still? And do do you have kind of preferences maybe on what what things we wanna add to the AGI early versus late?

Allan Dafoe: (1:55:51) Yeah. So I think this is a very useful conceptual heuristic is to think AGI is the space, and there's multiple paths that we could take to get to AGI. Coming back to this discussion of technological determinism, there may be, yeah, these different trajectories, and it really matters not just well, it matters what kind of AGI we build. AGI is not a single kind of system. It's a a vast set of systems. It's you know, you can call it the corner of this high dimensional intelligence space. And so this this corner has many different elements, and it matters also how we get there, what the trajectory. As you were saying, we might characterize, you know, 2 sort of character trajectories. 1 where AI is relatively not that capable at, say, physics or material science, but very good at cooperative skill. And another world where it's extremely it's superhuman at material science, but amateur at cooperative skill. And we can ask ourselves which world is safer, more beneficial. Some in the safety community prefer the latter. The story there is we'll unlock these economic advances, health advances, while having AI systems that are sort of socially, strategically simplistic. And so it's easier for humans to manage those AI systems. They're not going to outwit us. Whereas the cooperative AI bet takes the opposite tack. It says, a world with accelerating technological advances, capabilities advances, will generate lots of benefits, but also disruptions that we we may not be able to adapt to and manage in the time the rapid rate that they're coming online, Whereas we still have these cooperation problems that we foremost need to solve, and so we'd be better off if we sort of take the bet on making AI systems more cooperatively skilled than they otherwise would be.

Rob Wiblin: (1:57:46) I mean, this is contested by or, you know, there's a lot of disagreement between people who are, like, who are super informed about this. Is is this something that we should focus in a lot more and try and try to reach agreement, or maybe is it just something where it's just gonna be unclear basically up until the day when we have to decide?

Allan Dafoe: (1:58:03) I think it would be yeah. It's an important part of the cooperative AI research agenda to articulate this argument and to host the debate and to try and make sense of it. You know, you don't wanna invest too much of your your time and resources going down a path that you're not sure is beneficial. And I think that's true of also, you know, all these sort of pro social bets. It is worth investing a significant share of our effort there, making sure the bet is is beneficial.

Rob Wiblin: (1:58:33) So where where can people go to learn more about about this this sort of debate? I think it's it's the mirror the Machine Intelligence Research Institute folks who are a bit more wary of the high social skill, high cooperation early agenda. Right? And I guess I'm not sure who's who's advocating for well, I guess you you and your co authors are advocating for cooperative AI in particular.

Allan Dafoe: (1:58:52) Yeah. So for cooperative AI, I'd point them to the Cooperative AI Foundation and reach out to anyone there or or me, and we'll, yeah, invite you to the conversation. And, yeah, on the contrary point, I would say, Elijah Yudkowsky, Nate Soares have expressed this view fairly strongly in the past, so they might be interested in continuing to argue that position, but I I can't speak for them. And then I would also say on this kind of third poll, Paul Cristiano and Carl Schulman, I think, have articulated the the case for this super cooperative AGI hypothesis. This is that it is very important that AGI systems be able to cooperate with each other. So in that sense, they agree with the cooperative AI bet, but they think it will come as a byproduct of AGI, and so they don't think it's necessarily an area we need to invest in.

Rob Wiblin: (1:59:47) Alright. Up until now, I guess, we've been talking to you in a in a personal capacity about work you did basically, but but before you got involved with with DeepMind. But but let let's let's talk to you as a as a DeepMind employee for for for a little bit. I guess, DeepMind published a paper that you were heavily involved with earlier this year called evaluating frontier models for dangerous capabilities. Basically, these are I think most listeners will be familiar with the idea of evals. We're we're trying to measure what capabilities does a model have, as it's, you know, soon after it's being being trained, so we're not kind of blindsided later, and we know what what safeguards are necessary. Having accurate evals that can figure out, you know, what a model can and can't do, really facilitates, what's recently started being called if then commitments. So it's much easier to agree that if we have a model that is able to do x powerful or interesting thing, then we will put in place a safeguard that is is appropriate for making sure that that doesn't doesn't end poorly. And I think these are some of the evals that that GDM is developing to slot into the into Google's frontier safety framework, which is gonna be, I the approach that that DeepMind takes to deploying very powerful, potentially at some point, AGI while making it go making it go well and and avoiding significant downsides. The paper describes these evals and and offers reports or sorry, offers results from the previous the previous model that that GDN had, which I guess is Gemini 1 1. I think currently we're on a Gemini Gemini 1.5. And I guess there's there's 5 different categories of EVARs that that you're thinking about. Persuasion and deception, cyber cyber capabilities, self proliferation, and self reasoning. And there's a fifth that is around, I guess, biological threats and things like that, which which you don't cover in this paper because it's kind of it's kind of its its own thing that you wanna do separately. What's an eval in the paper that you think is kind of new or interesting that that people would would wanna hear about?

Allan Dafoe: (2:01:39) Yeah. 1 eval which I think is interesting is 1 of our self reasoning evals. So this is trying to look at the extent to which the model has an understanding of its situation and how it could modify its situation to achieve a goal. So we give the model the goal of writing this kind of text output, it's like writing an email, and it has to include in that email some words, that are prevented by a filter. So whenever it tries to write this email, some of the words are blanked out. And the question is, does the model notice that it's sort of failing at its task and then investigate whether it can do anything about that? And so in principle, the model can look at its directory. We give it access to its it's sort of like a fake directory where it sits, and then there's a little file that says, like, filtered words that it can open and then see that some of the words it's trying to say are in this list, and it can remove those words, which will then enable it to complete the task. And, you know, it's sort of like a toy scenario. It's not it's not the model's actual host environment. It's a, you know, pretend host environment, but it still it gives us a chance to see, you know, does the model does it notice that it's failing at its task in this kind of way that it might in the real world? And then does it reflect and and try to do something about it in in a way that's kind of analogical to how the model might find itself in in the real world situation.

Rob Wiblin: (2:03:20) Yeah. I guess it's a relatively easier case than than the real world 1. But Yes. But I suppose we wanna we wanna have relatively easy cases now so we can see if we're incrementing closer to to the models being being more self aware and able to take these these sorts of interventions in in in real world cases.

Allan Dafoe: (2:03:36) Yes. And while we're on self reasoning, another eval, which is I think it has some nice properties, asks the model to answer questions about historical events. And the there's 2 historical events, 1 of which occurs before the model's within the model's training date cutoff, and then the other event occurs after the model's training date cutoff ended. So it it was not trained on that event. And the model has access to 1 search query, and the sort of the goal is to see whether the model uses that search query optimally, which is to say it searches for the the event that it does not know about because it was not in its training set. And so that tests a few things. Does the model know about its own training cutoff? Does the model correctly infer that the historical event refers to something after its training cutoff, which is not explicit in the question. And then we aid it a few ways. In some conditions, we tell the model what its training cutoff was, and then we also tell it what the date is of the 2 events. And so that's also you know, it's it's again a very sort of simplistic setting, but it does provide a measure of is the model aware enough of its of its own condition to, yeah, to optimally kind of request more information.

Rob Wiblin: (2:04:59) Yeah. I think I I I I read this paper a a month or 2 ago. I guess I should I should suggest to people that this paper was really readable. I guess, you know, it's it's in it's in a PDF. It looks like it looks like an official paper, so I guess that could be a bit of initiation energy to actually load something like that up and then stick with it and read through it. But I think quite a lot of stuff that that Google's putting out is, I think, you've gone out of your way to try to make it possible for the kind of person who would make it to this stage of this interview to be able to track all of this and and read the graphs and understand it all. But yeah. So remind me, what what was the result on this? How how how what did Gemini do?

Allan Dafoe: (2:05:32) So yeah. Overall, there were these 4 risk domains or yeah. Domains, persuasion, cybersecurity, self proliferation, and self reasoning. And we, sort of scored Gemini overall on a 5 point scale, and we, and it was a very subjective 5 point scale. The the components of that overall scoring was more quantitative and and sort of fixed. And overall, it scored at 2, sorry, 3 out of 5 on persuasion, on the various persuasion evals, between 2 and 1 on cybersecurity and self proliferation, which is its ability to do tasks that would be required for kind of starting up new instances of a model like itself in the wild. It scored a 2 out of 5, and then self reasoning was also around, 2 or 1. Thank you for the for noting the paper. I also think I think, it's a very helpful paper for people to read. There's a lot of content in it. In some sense, it's it's old now because the field is moving so rapidly. So I I would encourage readers to also look at the latest system cards and and research coming out in this, in this field, but I still think there's a lot of, lessons in in, productive directions. I'm happy to say we we received a lot of, praise directly from from various external experts, in this 1 Apollo research group, listed it as 1 of their, favorite papers, which was a nice recognition to see.

Rob Wiblin: (2:07:03) Okay. So so going back to the to the 4 different categories. So performed 3 out of 5 on persuasion and deception. I guess maybe that will feel quite natural because we we know that LLMs are just like actually, they can be quite charismatic and quite persuasive, just just probably from using them. The cyber sorry. Did you say 2 out of 5 for that, was it?

Allan Dafoe: (2:07:19) Yes.

Rob Wiblin: (2:07:20) Yeah. That was or slightly surprised me because I'm not a programmer, but I've heard that, you know, actually LLMs maybe that their their most useful economic application so far has been in programming, and that seems like very adjacent to hacking your way out of or, you know, understanding the the system that you're operating on and figuring out how to how to how to get out of it. Why why do you think the model did relatively poorly on the on on on the cyber tasks?

Allan Dafoe: (2:07:44) Yeah. I think it's good to notice that, in a sense, we would expect large language models to be good at cyber because it is yeah. It's sort of It's their home turf. Exactly. It's their home turf. And, you know, models today are quite good at programming and encoding assistance, so I think this is something to watch. And it is my view that this is a capability domain that is especially worth following. I could imagine in 6, 12, 18 months, we'll see quite significant capabilities in cyber. And I'm using this term capabilities intentionally because it's not just a danger, it's also a benefit. Right? This while coding is clearly very beneficial, but also cyber capability is useful for defense, for finding vulnerabilities, and so forth. 1 part of the explanation I'll point to is this group at Google called Project 0, and they specifically had a project called Project Naptime, which I'll encourage readers to check out, where they basically found that if you build the kind of right support structure around the language model, So you give it a good interactive environment, you give it some tools, you give it space for reasoning that it then performs much better on these cyber tasks than it otherwise would. So they they compare a few models, how they do sort of with the base model or the the model is provided versus once you've given it these this support structure. And they find a very significant difference given that the access to the tools. In general, I think what this is pointing to is an important aspect of all evaluations and and specifically dangerous capability evaluations that we need to think about what's called capability elicitation. To what extent is the model that you're working with to do your evaluation the most capable version that that model is capable of? And so part of what we do at Google D Mind is we make sure we're at the frontier of capability elicitation, learning from other groups in the public literature and also internally to see, you know, what are the best techniques for having the model do well at these various tasks. And the reason is the logic of safety and dangerous capability evaluations is we're trying to measure the max capability in some domain that has risks, and then if that max is not too high, we can conclude, therefore, it's safe because it it you know, we couldn't reach a level of capability that would pose these extreme risks. But in order to get that max estimate, you need to make sure you're in fact giving the model, its best chance of revealing its maximum capability. In general, this is a challenge for dangerous capability evaluations because often it takes months of experimentation for the model's full capability to be revealed. Maybe the most potent example of this is is the o 1 model out of OpenAI, which to my understanding is still based on 1 of their older it's not a new foundation model, but it is a new sort of scaffolding and fine tuning on top of it, which allowed it to do much better at a number of tasks.

Rob Wiblin: (2:10:55) So on the on the self reasoning task, which which you described earlier, which I guess is it's trying to see whether the model is able to develop some rudimentary situational awareness and understand the circumstance that it's in and what behaviors like, what slightly unconventional behaviors might improve its performance. How did it do on that?

Allan Dafoe: (2:11:11) Not very well.

Rob Wiblin: (2:11:12) Not very well. Okay. Pretty rubbish.

Allan Dafoe: (2:11:13) And there's a number of other similar evals I think done by Anthropic. Some of these evals are idiosyncratic in that people just notice the most seems to have a certain self awareness. And maybe in general that's a property of what's reflecting on the eval ecosystem that while we're trying to make it into a science where you have kind of prespecified tests that have meaningful outcome variables with thresholds for what decisions we need to take given the outcome. The reality is evaluating models, it's kind of like evaluating animals or or human intelligence. It's so multidimensional and so complex and context dependent that it's more like psychology or, maybe coming back to, like, yeah, sociology, it requires a a very thick deep understanding. So this is part of the point of of saying this is while we're trying to mature the science of evals, we also need to have a very rich, exploratory ecosystem of people interacting with models and noticing things and and just reporting phenomena.

Rob Wiblin: (2:12:19) I've heard you use this term evals in the wild, which sounds related. What yeah. What are evals in the wild, and maybe how could we get how could we get more out of them than we are now?

Allan Dafoe: (2:12:28) Yeah. There's you can think of a whole typology for evaluations. The in 1 sense, the canonical evaluation is is a model eval. It's something well, an ideal type of evaluation is automated. You can run it in code, and you run it just on the model, ideally on the, you know, the foundation model, but more realistically on the instruction tuned model, the model that's been trained to be more useful. Now that allows us to ideally measure what the model is capable of. But the reality is, again, model behavior is complex phenomenon, and so we might have other kinds of evals or or means of assessing models that are richer, but also typically more costly in terms of how we measure them and they require more time and so forth. So 1 step up would be human subject evaluations where there's a human interacting with the model. Our some of our persuasion actually, most of our persuasion deception evals have this property that we have human subjects interacting with the model and then seeing what behavior emerges. So was the model able to get the human subject to click on a a link that, you know, in the sort of real world could point to a virus? Or could the model deceive the human in in some way? Or was the model perceived as especially charming or trustworthy or likable? So, yeah, these are these are still evals that we can run relatively rapidly on the scale of days, but they do require human interaction. So it takes more than just running code on the model. We can then step out a little bit more and almost see, like, user studies, see how a human is interacting with the model in a more realistic setting as a like, this is like a trusted tester system where people are using the model for whatever context they have in their life, and then we see, does the model perform well? Where does it you know, where can it be improved? We can then step out even further, and this is what I was pointing to with this concept of evals in the wild, to looking at specific application domains where models are deployed and seeing to what extent models what effects the models are having, to what extent are they being used. So for example, when assessing cyber risk, we often wanna ask again, how capable are these models at helping individuals perform tasks related to cybersecurity, finding vulnerabilities, devising exploits, or patching exploits? And while we try to simulate that entirely in the lab, there's limits to that. A complementary approach would be to look at cybersecurity groups in the wild who are, you know, trying to provide a a product and seeing to what extent they are in fact using Gemini or other models to help them and how helpful it is and what the kind of revealed preferences of these actors in the world is. This kind of eval, this, like, observational study or eval in the wild has the advantage of external validity. It it's looking at real world settings where these sort of real actors who who have their own motivations have to make a choice whether they invest time and money into the model. Its limit may be, 1, it could be a lagging indicator. By the time we see these impacts in the wild, that model had that capability perhaps for months, so that's a disadvantage. 1 implication, therefore, is that you wanna find groups who are sort of early adopters in the wild so that to the extent you're using that method of evaluation, you'll see it relatively early. There's some other downsides, but, overall, I would say I think this is an important complement to the kind of model evals that a lot of the field does. And what's nice about it is it's also a kind of research that can be done in the public and should be done in the public backed by academia, nonprofits, government, as well as companies.

Rob Wiblin: (2:16:27) So you you were saying earlier that the challenge with all these evals is that it can be quite hard to fully elicit all of the capabilities that a model has. It's like it's entirely possible to miss something because you haven't been eliciting that that skill in quite the right way, and people might later on figure out that that a model has an ability that wasn't picked up in the early evals. So there's a significant false negative rate basically. And it seems unlikely that we're going to have such a such a science of evals that we're going to escape that issue any anytime soon. An issue that I've had with a lot of the different frameworks that are coming out that are related to using eval's and then putting them into responsible scaling policies or or whatever you call it, is that everything ends up riding on the eval being accurate. If and and if it fails to pick up a dangerous ability that's actually there, then the entire kind of the the the the entire system falls apart. I guess everyone everyone says it wouldn't you point this out, well, what we need is defense in-depth. We need to have different overlapping systems so that when 1 fails, you know, Swiss cheese model, you get through 1 1 1 gap in the cheese, but, but but the next layer picks it up. What depth are we thinking of adding? What what other procedures can we add that are complementary to this 1?

Allan Dafoe: (2:17:29) Yes. So, yeah, this is a very important observation, and I would say the the main response in defense in-depth is this notion of staged release or staged deployment, that we don't go all the way from, you know, no deployment to irreversible proliferation of models, for models that could have capabilities that we don't yet fully understand. And so the typical way Google would approach this and Google D Mind is I mean, first, we have internal use, then we have a trusted tester system, which scales up in terms of how many trusted testers it could be, from quite small numbers to ever larger numbers. And then even still, you'll have, like, broader deployment, but still not, general access. And and then even once you have general access, so, you know, anyone in the world could access the model, it's still the model weights are still protected. They they haven't been published, which means that if there are capabilities that the model has that that we only learn later, we have the option of changing the deployment protocol to address that. So we could add new guardrails. We could put in monitoring. We could change the model if necessary to address, potential dynamics there. This comes to the what's a really important debate in the field around open weight models, sometimes called open source, but it's more accurately referred to, I think, as open weight models because the the key thing that's being published is the weights of a model. In favor of open weighting, there's there's a lot of arguments in favor. It provides this tool to the to the broad scientific community and development community to build off of. If you have the weights, you can do whatever you want with them. You can fine tune it. You can do mechanistic interpretability. You can distill and, you know, basically anything because you because you have the weights. But the big downside is this irreversibility of proliferation. That once you've published the weights, people have now saved them and and they'll put them on the dark web if ever you were to, you know, stop publishing it.

Nathan Labenz: (2:19:41) Yeah.

Allan Dafoe: (2:19:42) And so you've given it to the scientists and developers, but you've also given it to every potential bad actor in the world who would wanna use that model. So the way Google approaches this is Google open sources a lot. A lot of their advance in Google DMI, a lot of their advances including, like, the AlphaFold, all the the protein predictions, and a lot of the earlier large model work that Google did has been open sourced. In general, Google tries to open source, its technologies in in all these cases when they're broadly beneficial. But with frontier models, Google and Google D MIND recognize that we need to proceed carefully and gradually and responsibly, which means deploying them to the extent that we can do so responsibly.

Rob Wiblin: (2:20:30) So with with with the standard evals, you you find out the result at at the point that you're training the model or hopefully soon after. I guess this is a challenge with the persuasion ones where you have to involve humans and that slows things down, but generally, like, pretty soon after. Then you've got evals in the wild, which are they have, like, better external validity, were saying, but they tend to come later because you've already you have to have distributed the model to see how it's used in the real world. And you but you talk in the paper about this this other thing. The ideal would be that we not only know that we have an ability when we have it, but we would know exactly when it's going to arrive years into the future. So we have this this foresight, and we know that some some preventative measures are not necessary now because we're not gonna have this ability for 5 years time. So so we we avoid, like, not you know, wasting our effort on something that's that's not necessary yet. But we also know exactly what preventative measures we need to be putting in place now because we know exactly when an an ability is gonna arrive. Listeners to this show are probably gonna be aware of, you know, super forecasting techniques, about the possibility of using prediction markets and aggregating forecasts from different experts and lay people to to try to to try to predict what's gonna happen when. Is there much low hanging fruit here on the forecasting actual real world usefulness of of AI models that is not being taken yet that you would like to see people, you know, actually actually get, you know, exploit?

Allan Dafoe: (2:21:44) Definitely. I think this is a very exciting future direction for the whole field of dangerous capability understanding, evaluation, and mitigation is forecasting capabilities. And I should say, actually, not just for dangerous capabilities, but also for beneficial capabilities. As it is, there are some properties of models that we labs typically forecast. As emerged from scaling laws, there is an extraordinary ability to predict certain kinds of loss or performance of the model on these low level benchmarks, these how good does it predict the next token in text? It's incredible, the precision with which you can predict how the model will improve as you give it more compute and more data or if you set up a different model architecture. What hasn't been done as much is predicting these more composite or complex downstream tasks, like performance at in biological tasks or cyber tasks or social interaction tasks. There's this promising research. So 1 paper, recently was in NeurIPS Spotlight, which is an indication that the community sees it as promising, entitled observational scaling laws, which talks about ways of doing that. In short, we could take the family of models and how they've done on various downstream tasks, and then you do this adjustment for the compute efficiency of the architecture, in a way that they claim allows you to predict how future models, given a different compute investment and and data and so forth, will perform on these more much more complex and and real world relevant tasks. So I think there should be a lot more research on this, on these observational scaling laws. And then the second approach is, as you mentioned, is using subjective human subjective forecasts. So super forecasters are these forecasters who are sort of have demonstrated track record of being calibrated. So if they say 75%, it does in fact occur 75% of the time. And furthermore, they're typically quite accurate. So they've sort of have a a track record of having accurate forecast. They can you know, phenomena that do happen, they typically give a high percentage score, and phenomena that don't, they give a low percentage score, which is a whole form of expertise of how you read evidence, how you integrate it, how you, weight the base different base rates and so forth. In this paper, we employ super forecasters to from the Swiss Center to look at our Evals and our Eval Suite and predict when models would reach certain performances on the eval suite. I think this is a really exciting part of this paper, and and I think it'd be great if, sort of all future work did something similar, at least until proven that it's not, effective. So, we need to invest research to see how well this method works. But so far, would say it seems very promising. They gave us some thoughtful forecasts, which seemed reasonable and, better than a subjective qualitative judgment that that is not expressed.

Rob Wiblin: (2:24:49) Maybe it's surprising to think given the importance of this topic and its tractability or the potential to to just do all kinds of different work figuring out when when AI models will actually be economically and, like, practically useful for all of these different tasks. It's surprising that people aren't doing this already, because you don't have to be inside DeepMind to to to run these tournaments or to or to or to aggregate these judgments. But, so I guess this is maybe an invitation for for listeners to to get involved with that.

Allan Dafoe: (2:25:13) I think so. Yeah. I think it'd be great for the field to invest more in this.

Rob Wiblin: (2:25:16) Yeah. I guess the the thing that's whack about forecasting abilities is that with these scaling laws, we're we're we're so good at anticipating when we'll have a particular level of loss, this, like, measure of inaccuracy in the model, but that hasn't translated into us being able to predict when will people actually want to use this model for for for for a given task. Because I guess the the relationship, I suppose, between the loss and the actual usefulness is is is super nonlinear. I guess it's probably gotta be some sort of s curve where I mean, classically, with the with the self driving cars, you know, a 99% accurate self driving car is is just a heap of junk, basically, from a practical point of view, and you maybe have to get to the 99.999. And you don't know how many nines do you need before actually they're gonna hit the roads.

Allan Dafoe: (2:25:53) Exactly. It's it's even when you have this sort of complex real world resembling task, it's still there's a a large step of inference from that task to actual real world impacts because you then need the the in the actual world, there's some cost of employing the AI to do that task, and the question is, does it fit in with the existing ecosystem and environment? Are there what the other tasks that are done by humans? Sometimes even if the AI can do a lot of tasks, it's still not cost effective to integrate it and then so forth.

Rob Wiblin: (2:26:28) Let's push on and talk about the, the Frontier Safety Framework, which I guess is Google DeepMind or Alphabet's equivalent of the responsible scaling policy that Anthropic has or the was it preparedness framework that that OpenAI has? These are all kind of a similar family, a similar general approach to to safety. The Frontier Safety Framework at this point is is just 6 pages, and it's kind of a promise to write a full oh, well, a full suite of if then commitments next year, which, you know, ahead of releasing models that would really be jutting up against some of these some of these dangerous abilities. When I spoke with Nick Joseph from Anthropic a couple of months ago now, the thing that I pushed him on a bit was, is it really responsible to leave the the question of how dangerous these models are to the company that is developing them to decide? Obviously, they have an enormous conflict of interest or an enormous challenge in evaluating faithfully whether the product that they have sufficiently dangerous that maybe they shouldn't release it and start selling it. Ideally, surely, you would hand it off to at least some sort of external party that would be able to take a more impartial perspective on it. And to be honest, haven't really found serious people who would say, yes. What total self regulation is is the future and the way that we should stick with. Is is Google Google D mind taking any steps, I suppose, to, you know, paint a a path towards taking some of the making some of these decisions external and taking it out of their hands to decide whether their model has passed a particular threshold of risk and particular, you know, possibly costly countermeasures are now called for?

Allan Dafoe: (2:27:56) Yeah. So I would say I mean, Google DeepMind and Google's perspective is that the governance of Frontier AI ultimately needs to be multilayered. There's there will be an important role for government regulations, for third party evaluations, and other sort of nonprofit input and other external assessment of the properties of of frontier models. I would characterize the situation as, in many ways, the companies have been pushing the conversation forward

Rob Wiblin: (2:28:28) Mhmm.

Allan Dafoe: (2:28:28) Which is an invitation to other groups, academia, nonprofits, government, to be part of that conversation. So it's less it's less, you know, the company or Google saying, this is only for us to do, but more, this is our first best guess of of how future governance of frontier systems could look like, what the the properties of it will be. The contributions Google makes are contributing to the conversations hosted by the AI safety institutes, often called ACs. So we've been in extensive dialogue with UK AC and US AC. There's a number of AI safety institute conversations happening in November that Google is contributing to. And, yeah, more broadly, I would say, you know, the goal is not and should not be for each company to have its own framework. That would not be beneficial for the companies or for society. Rather, we need to converge on an appropriate standard that balances various equities, safety, you know, responsibility, innovation, appropriate levels of deployment, societal oversight. 1 other aspect of how we're contributing to that broader standard conversation is the founding of the Frontier Model Forum, which is a organization for the companies developing frontier models to work through what should be the standards and safety for frontier models. And part of that is my team was directly involved in standing up the Frontier Safety Fund, which was another 1 of these kind of differential technological development bets to say, can we, on the margin, put more money towards safety standards and safety research around frontier models?

Rob Wiblin: (2:30:14) This is slightly off off topic, but something that's a bit hard for front you know, the frontier safety framework or these dangerous capability EVARs to pick up is something called structural risks, which which actually is is a is a name and an idea that you've been kind of a a pioneer in. Can Yeah. You just explain what what they are and, like, why it's a little bit hard for them to be done at the at the company level rather than the government level?

Allan Dafoe: (2:30:34) Sure. This paper was published by Remco Zwetzlut and myself in 2019 and was an attempt to, add an intellectual tool to toolbox to help people when thinking about risks. Because up until then, the primary conceptual framework was to either refer to misuse harms or accident harms or risks. So the way I would offer I would conceptualize these is misuse is when someone, a bad actor, a criminal, you know, uses the the tool on purpose in a certain way to cause harm. There's a a subclass, which is, I would say, criminal negligence, where the person doesn't intend to do harm, but they, if they were more, pro socially motivated or responsible, they would have averted the harm. This is like, you know, someone causing a car accident because they were texting. So they weren't they weren't intending to cause the accident, but they in many ways, the fault falls on them. Whereas accident harms, you can often think the the key proximal cause of the harm is some property of the technology. An engineer could have caught this risk potential and put in a warning indicator or a guardrail to avert it. And I would say much of the conversation centers on those 2 perspectives. There's either accidents or misuse. The notion of structural risks is, not an exclusive category. It's not rather, it's kind of an upstream category, which says often the conditions under which a technology is used are shaped by certain social structures, which make accidents more likely or less likely or make misuse more likely or less likely. And it's important that we all also look upstream at those structures to see when we're, you know, in a regime that might be promoting certain harms. 1 example I use, is, the Cuban Missile Crisis, where you had these legitimate leaders of the state, of The United States, and of the Soviet Union, acting on behalf of their nations in a way that exposed their countries and the world to this risk of nuclear war. And so I would say that example was not neither misuse nor a technical accident. It wasn't that the nuclear weapons were built incorrectly or there was some flaw in their design, nor was it these were criminals who were using the technology in in a sort of illegitimate purpose. Rather, in that case, they were in a a structure, a geopolitical structure that made them choose to engage in this dangerous brinksmanship with each other. There's a lot of other examples where, yeah, the structure of the economy or of different kinds of interactions lead the system to be deployed in a way that has greater accident risk or greater misuse risk.

Rob Wiblin: (2:33:27) Yeah. I guess the you know, EBALs might be able to pick up that a model has an ability that could feed into 1 of these structural risks. But because it's occurring at the societal or global level, you know, it's not really possible for for GDM to be evaluating what what the entire societal effect I mean, it's difficult for anyone, but it kinda has to be know, governments this is this is their remit to think about because they're the ones who are best placed to to to address it.

Allan Dafoe: (2:33:47) Yeah. And it's often because structural risks are often at the societal level, they're typical often emergent and kind of indirect effects of the technology, which means they're often very hard to read off of the technology itself. Another example that I find compelling is is the risks from the railroad. So the railroad per se, know, it's hard to read off, these steel rails in in a train, the impact it had on, geopolitics at the time. Namely, according to a number of historians, it increased first strike advantage that once whoever kind of mobilized first had a great advantage, it was argued. And furthermore, it made the decision to mobilize irreversible because of the difficulty of, stopping the deployment schedule. So these were consequences of the railroad, which are very indirect. You don't you know, like a railroad level eval would would not tell you this implication because it bounces off so many different aspects of society. So as you say, I think, structural risks require, a societal level solution typically from government, and if the structural risk spends nations, then at the, international level.

Rob Wiblin: (2:35:03) Coming back to the to Google's approach to safety, when I read the the pause AI folks and, you know, people who are sympathetic to to that, I think a key underlying grievance, a kind of fury that that I that that I hear coming across is a sense that what is happening with with research into frontier models and AGI is kinda democratically illegitimate because it's posing all of us in in in in The UK and US and around the world potentially to a disaster that could that could kill us. But we haven't really been consulted on a level that would be commensurate with with the risk that they believe is being run. And maybe and and even if you did discuss it in The UK and US where, like, much of the work is is is happening, while people who are overseas and other countries who are also exposed to to the risk of catastrophe here are barely being consulted and and their voice doesn't doesn't really matter. So that would at least say, you know, we should have, like, a more affirmative sign off by congress or parliament that this work that the benefit exceeds the risks and and we should be going ahead. But ideally, probably, we should be talking to a whole lot of other people elsewhere who are also going to be heavily heavily affected by this. Yeah. Well, what what do you make of that that frustration that people feel?

Allan Dafoe: (2:36:08) Yeah. I think, a lot of people, are concerned about the direction technology is going and specifically about AI, in surveys, and, that's an important perspective to take seriously and to address. To some extent, it's a matter of education, and to another extent, it's a matter of improving governance and making sure the benefits for for many people, their concern is about labor displacement or various near term impacts. And I I think an important response by companies is to make sure that the benefits are, being realized. So this is part of why Google and Google DeepMind invest so much in AI for science. Most maybe notably in AlphaFold, which, you know, is the first AI contribution to win a Nobel Prize for kind of its impact, and then there's this other Nobel Prize in in Physics for advances in AI. So AlphaFold, for the listeners who don't know, is this narrow AI system that enables very cost effective prediction of the protein structure of virtually every protein relevant to to science and medicine and has, it seems, unlocked, a lot of potential advances in medicine. So, I think in the future, we'll see all the potential benefits that that come from this kind of, development. But generally, would say the I think advocates for AI often point to AI for science. So this has been expressed recently by Dario Amade, the CEO of Anthropic, that benefits of AI for science are potentially very large and and should be encouraged. In terms of the more broad question of the right way to democratically develop technology, I think this is a a very challenging question. In my earlier scholarship, this is, I think, from the sixties or seventies, there was ideas of having citizen councils or citizen juries for technology. So you'd have a representative group of citizens who would be paid for taking the time to learn about a technology and then reflect on how it should be developed and then how it should be deployed in their community. More recently, there's a number of work streams that carry on this tradition that look at having different kinds of pluralistic assemblies and try to host this democratic conversation, which requires, again, educate providing the space and the resources to educate the community and then this institutional framework to allow people to express their views and then allow that to sort of not just be a lot of noise, but aggregate up into guidance to policymakers. In general, I would say, you know, different countries are are approaching the democratic governance of technology in in different ways. We have freedom of the press and assembly. We have social media and and conversations there. We have nonprofits doing research and and staking their position. We have these government agencies, the AI safety institutes, and many other agencies that are looking at technology. So I I wouldn't say it's, you know, a case where companies are unilaterally, you know, with without any kind of democratic involvement, proceeding. Though it is important that people involved in POS AI and other, groups find ways to, engage the democratic process most productively. And so a question that needs to be answered is what are the appropriate mitigations for these very powerful models? And I don't think that question can be fully answered by companies or even narrow governmental agency. We really need the scientific community. We need people involved in in building medicines and who can see, you know, that side of the the ledger, who are building cyber defense, and also those who are focused on bio risks, cyber risks, and so forth, all in a conversation together, hopefully reaching a scientific and policy consensus on what is the right way to proceed with, these models as they gain new capabilities.

Rob Wiblin: (2:40:24) Think something that has turned the the worries about this democratic legitimacy issue up to kind of a higher fever pitch is that all almost all of the AI companies, maybe Meta aside, have talked a big talk about how they would like to have, you know, binding legal constraints on on what they can do and, you know, heavy involvement in in government and oversight and so on. But they are yet to lobby actually in favor of any particular legislation that truly would bind them and and and be very costly and and restrict what kinds of models that they could train. And I guess in the you know, that they've turned down perhaps opportunities to to argue in favor of legislation that that that that has been proposed. I think and the worry understandably is that the companies will always say they're in favor of this, but the the day when they'll actually support any specific constraint, when they're actually willing to put the handcuffs on themselves may may simply never never come. I'm not sure whether whether whether wanna comment on that, but but I do think that at the point that you actually saw a company saying, yes, this is the specific regulation that we want to bind ourselves, want it to apply to everyone, it would reassure a lot of people who are currently unsure how much they can trust these quite powerful entities.

Allan Dafoe: (2:41:31) Yeah. I I think it's an important question to ask. I wouldn't say put it as starkly as you did in that speaking for Google, Google has engaged proactively in a number of conversations that do affect their sort of freedom of action. Most notably, the White House commitments, and then there's a number of other sort of commitment like moments. So 1 was around the UK safety summit, the the first safety summit, and then there was a second round of commitments at the South Korea summit. And in both of these cases, they were pointing to frontier safety framework like developments, asking where Google committed to doing forms of red teaming, self evaluation, and then working on mitigations for the risks as they emerge. Speaking for Google, I was very impressed with the extent to which the internal sort of Google operations took the White House commitment seriously. So after they were signed, there was a real effort internally to identify all of the commitments that were made, spell out what they meant, and then make sure there was work streams, making sure we we lived up to them to, you know, a very significant extent. Similarly, the Frontier Safety Framework, I don't think is is, you know, pure cheap talk. It really does represent an expression of Google and Google D Minds sense of the nature of the risks. It it offers an attempt to develop this safety standard for future risk assessment. And so it is a form of proto regulation or or safety standard that later could be, you know, significantly constraining of the industry. 1 other aspect worth noting is, there is the EU AI act, which is currently, negotiating the code of practice, which is a quite meaningful form of regulation, where, Google is participating and helping, the EU AI act think through, what would be, the right way to have that code of practice

Rob Wiblin: (2:43:39) operate. So a worry that I that I used to have and maybe have a little bit less is that improvements in algorithmic efficiency and in, you know, dissemination proliferation of serious compute might make a lot of these rules, you know, the frontier safety framework not as effectual as it might otherwise be, because even if a big organization like GDM has these quite stringent controls and all of these all of these, you know, evals and precautionary measures and so on, if another organization can train a model that's as powerful as has the frontier 1 for 10,000,000 or 1,000,000 or even even a 100,000,000, that means that there might be just a wide proliferation of these kinds of dangerous capabilities, and, you know, if any 1 actor is is not willing to follow the rules and and is going to defect from them, then that could still be quite disastrous. How worried are you about about that failure mode?

Allan Dafoe: (2:44:30) I think the proliferation of frontier model capabilities is a key parameter of how things will go and viability of different governance approaches and and solutions, which is again why I think I would channel this question back to the question of what's the appropriate approach to open weighting models, because that is probably the most significant source of proliferation of frontier model capabilities. Leaving aside the open weight question, there's this secondary point that you point to, which is this exponentially decreasing cost of training models. So epoch AI, which is a group that, does very good estimates of trends in in hardware and algorithmic efficiency and and related phenomena, finds that algorithmic efficiency is increasing 3 times per year. So the cost of, training a model decreases by roughly 10 x every 2 years. And by that, if that trend persists, then a model that costs a 100,000,000 to train today will cost 10,000,000 in 2 years, and then 1,000,000, 2 years after that and so forth. And so you can see how how that leads to many more actors having a model of a given capability. Some analysts of the situation perceive this to be very concerning because it means whatever novel capabilities emerge will quickly at this rate diffuse and be employed by, you know, bad actors or irresponsible actors. I think it it's it's complicated. I mean, certainly, in a sense, control of the technology is easier when it doesn't have this exponentially decreasing cost function, and there's other sources of diffusion in the economy. However, 2 years is a long time. You know, a 2 year old model from today is is a significantly inferior model to to the best models. And so we may be in a world where the best models are developed and and controlled responsibly, and then can be used for defensive purposes against irresponsible uses of inferior models. I think that's the win condition.

Rob Wiblin: (2:46:41) Yeah. I mean, I guess that, it makes sense. I suppose it's it's a picture that makes makes 1 nervous because it means that you always have to have this kind of hegemon of the most recent model ensuring that the what is like now, I guess, an obsolete but still like quite powerful model from 2 years ago isn't isn't isn't causing any havoc. But I guess I suppose it probably is almost the only way that we can make things work. I can't I can't think of an alternative, to be honest. Yeah.

Allan Dafoe: (2:47:03) Yeah. There's other ways to advantage defenders, so you can advantage them by resources. This was Mark Zuckerberg's argument in favor of open weight models. He would argue even if the bad actors have access to the best models, there are more good actors or the good actors have more resources than the bad actors, And so they can kind of outspend the bad actors. So for every bad actor dollar, there's a 100 good actor dollars, which will build defenses against misuse. That could be the case. It depends on this concept of the offense defense balance. The the how costly is it for an attacker to do damage relative to the cost the defender must spend to prevent or repair damage. If you have this 1 to 1 ratio, then sure, as long as the good guys have more resources than the bad guys, they can protect themselves or repair the damage in a sort of cost effective way. But it's not always the case that the ratio is such. Often we think, I think, with bioweapons that the ratio is very skewed. A single bioweapon is quite hard to protect against. It requires rolling out a vaccine, and a lot of damage will occur before the vaccine is widely deployed. Yeah. In general, it's I think this field of offense defense balance is also something worth studying. And I would say the the field of AI has too often drawn an analogy from computer security where vulnerabilities are easily patched. And so the the sort of correct response in computer computer security is typically to encourage vulnerability discovery because then you can roll out a patch and then the, you know, the new operating system or or software is is resilient to the previously discovered vulnerability. Not all systems have that property. Again, biological systems are hard to patch. Social systems also are hard to patch. So deepfakes do have a patch. We can develop this immune system where we know just because someone does a video call with you doesn't mean that's sufficient authentication of the identity of the person, you know, the instruction you're getting to transfer some millions of dollars or or so forth. Humans, you know, have a lot of inertia in their systems, and it takes it's quite costly to sort of build the new infrastructure and then train people to use it correctly.

Rob Wiblin: (2:49:21) Yeah. You have you have a couple of really nice papers on this offense defense balance question, which unfortunately, we we we couldn't fit inside the margins of the of the page this time around, so might might have to wait for for a for a third interview. We're almost out of time, but, yeah, just to finish up, I wanna talk a bit about, I guess, advice for for people in the audience. As as I mentioned, back in 2018, you were telling people they jump into this AI safety and governance and and and policy work, which I think it was was excellent advice at the time and might might might still be excellent advice now. I guess back then you were saying it's there were technical people perhaps in this area, but social scientists, less so. There were there were some real gaps of different, you know, disciplines and areas of knowledge that just weren't really going into AI governance because I suppose they weren't hearing about it or it didn't seem as relevant. Is social science still a big gap? And are there other kind of big gaps where you'd you'd really benefit having people have a particular sort of training or or interest coming into the field?

Allan Dafoe: (2:50:08) Yeah. It's remains to be a huge gap, and it's hard for me to name a single area. I think we need more political scientists, economists, historians, ethicists, philosophers, sociologists, ethnographers. Really, because AI will and is having such profound impacts across society, we need experts in all aspects of society to make sense of it and to help guide Google and help guide government in terms of what's the right way to proceed with these technologies.

Rob Wiblin: (2:50:46) Yeah. Are there any particular roles that I suppose you mentioned you're hiring right now, maybe not by the time that this episode comes out. But what what are the sorts of roles that GDM maybe, like, has has difficulty hiring for even even despite its prominence where you it'd be great if you could get more people into the into the field?

Allan Dafoe: (2:51:01) I I mean, I think talent is in high demand everywhere. So within, I guess, my side, we're looking for experts in international politics, in domestic politics, in governance of technology, domestic and and international, in forecasting technology, in these sort of macro trends, in, yeah, predicting the impacts of of technology. There's a lot of work on agents that that needs to take place. So thinking about autonomy, risks, and and safety, thinking about assistance, how those can be deployed. There's, you know, such a rich space of interactions with humans and challenges and opportunities there to get right. There's the ethics team at Google Deep Minus Hirings. There's a lot of more ethics heavy questions and and work to be done from an ethics lens, responsibility, policy work, and this is just on, yeah, the kind of nontechnical side. And then within technical safety, there's a a large portfolio of work that needs to be done.

Rob Wiblin: (2:52:08) Yeah. Sort of all all all of the above. Yeah. Final question. Well, final final section if if if have time. This show is like we talk about AI a lot. Unfortunately, we do have a slight tendency towards doom and gloom and always talking about the risks and the and the downsides and so on. I suppose we talk about that because we worry that it's neglected in the in in the broader conversation or maybe it's the only thing that's stopping us from achieving these enormous, enormous possible gains that we can that we can get from AI and and and AGI. I guess in the interest of of having at least a modicum of balance, what what are some of the applications of AI that you're excited about? Some applications that you would, like, really really like to see sooner rather than later?

Allan Dafoe: (2:52:42) It's a great question, and it's important to keep at the top of our mind. So hopefully, yeah, readers will and listeners will will get this message. This is, you know, the why we would want to build AI in advanced stage in advanced AI and AGI is that the benefits could really be profound, and and will be profound if built safely. To motivate it, personally, I found the experience of riding in a Waymo, self driving car, to be really impactful. I've noticed a trend of, I guess, tourists to San Francisco to, you know, report back on on this, you know, really you know, first, this historical moment when people are able to ride in a car that's driving itself. You know, going back in time, this this is a new phenomenon, something new under the sun. It's It's quite incredible that we finally the safety has has reached the number of nines required to to do that safely. Waymo recently put out a safety report documenting how many fewer crashes that have injuries or just crashes that involved police coming to the scene, and it was the latter was 2 x less. So there's 2 times fewer incidents that involved police coming to the crash scene, and I believe it was 6 times fewer crashes involving injury. So this is 1, you know, quantified benefit. If we can have fewer, you know, crippling car injuries, that would be profound. I'm imagining a world where we don't need to expend so many resources building cars that mostly sit in our parking lots, and we don't need to dedicate urban space to cars just waiting for us to end the workday. Instead, we can reclaim that urban space for other more productive purposes. So that's that's just 1 technology, and, course, there there are challenges that we need to think through there as well. I think medicine and health is another domain that could really be profoundly beneficial. I personally look forward to the day when I can consult, sort of phone a doctor in your phone and ask, you know, give it my symptoms, and it can assure me, no. This is you know, these are reasonable or incoherent symptoms. You you don't need to worry about it. And it can also advise me when it's time to, you know, see a doctor, thereby saving the medical establishment the, you know, time costly activity of of seeing patients and and allowing us to use the the human doctor's time more effectively.

Rob Wiblin: (2:55:21) Yeah. I already use LLM's call for for health advice, but I guess is it DeepMind that developed MedPalm MedPalm 2, which I think is kind of the state of the art of this sort of, yeah, you know, a g g or, like, medical consultancy?

Allan Dafoe: (2:55:34) So that was from Google, and Google has this MedLM is is the broad name for its work on language models for medical purposes. And in general, Google has a quite extensive health portfolio.

Rob Wiblin: (2:55:49) That one's not publicly available yet. Right? But I guess I imagine Google's working on it.

Allan Dafoe: (2:55:54) It's my understanding that MEDLM is available to certain users. So yep.

Rob Wiblin: (2:56:00) Maybe, yeah, maybe I might get to give give it a while at some point. I suppose it's something that's I've said this on the on the show before, but it's something that I think about the the most perhaps is, you know, I've got a kid almost 1 years 1 year old. I guess, I'll be in, you know, kindergarten and then and then then at school in a couple of years time. You know, actually thinking about what is it what is it like in a typical school that might be in your in in your district is not necessarily always going to be the the very best education, there's lots of things that are kind of unpleasant about school sometimes. I'm kind of hoping that by the by by the time they would be attending school, there'll at least be an option for them to get tutoring and and assistance from from AI AI AI models that could, you know, be as good as the best teacher in in principle, at least if they were if they were designed really well. I suppose there's with with with classroom management, but you could also imagine they might be able to be much more engaging and entertaining and interesting than than than a real life, you know, first grade teacher teacher might be. I suppose you you have 3 kids as well, and I imagine that the oldest might be approaching school age pretty soon if if not already. Do do you think they're gonna be be actually taught by a by a LLM teacher at any point soon?

Allan Dafoe: (2:56:58) I think AI in education is a huge opportunity as you for reasons you mentioned. Another data point here is tutoring is hugely impactful or small class sizes is another way of putting it for student success. And so having an AI tutor that can complement the the school system could be very beneficial. It would flag, you know, mistakes that the student makes and then can provisionally point out, you know, where they went wrong or show different ways of proceeding, I think could be very beneficial. And like you say, just having that option, having that available, and perhaps in a more engaging way or in a way that's, yeah, more compelling to students could be beneficial. Think we've seen this with the Internet in general, that the Internet has provided resources for learning, and then, platforms like Khan Academy and, a lot of pedagogical videos, have been, I think, very beneficial for student learning.

Rob Wiblin: (2:58:02) Yeah. I think tutoring, I think, is slightly famously, like, especially useful for exceptionally strong students who can, like, potentially rush ahead if they have, you know, a tutor there available to to help them, and also for students who are really struggling and might, if they're if they're stuck in a class where they're significantly behind, they're just not gonna be learning almost anything because they because they just don't have enough understanding to even track what's going on. So you could see it, like, yeah, benefiting different groups in

Allan Dafoe: (2:58:25) really quite quite enormous ways. There's 1 other area I think I would wanna call out, which is AI for sustainability. So Google DeepMind has all these interesting projects, and they they read like science fiction. 1 is on controlling the plasma for nuclear fusion, which if you can do it more effectively, make fusion a more viable source of energy, which could be extremely beneficial for sustainable energy production. Google DeepMind had these advances in weather prediction, making weather prediction cheaper, faster, and more effective, which can be beneficial for event planning, but also wind power and solar power and other sources. Advances in material sciences, finding new materials which could be used for, again, solar or other aspects of the economy that could make things more efficient. Google had this result on optimizing flight paths to reduce control production, which is apparently a large source of greenhouse gas dynamics. And so there's data center optimization. I think Google do mine had a result several years ago, reporting quite significant gains in how data center cooling was done in terms of energy use. You have all these efficiency gains in energy use production that can really add up to to help humanity address these, you know, significant challenges.

Rob Wiblin: (2:59:50) And we haven't even mentioned AlphaFold 2, which I guess I heard from what I I haven't followed this in detail, but from what I've heard, it's incredibly useful for medical research and for drug design and so on. So we just gotta gotta do all of the stuff discussed in the last few hours, and then we can can enjoy this wonderful, brave new world of all these fantastic new products and much better much better quality of life. Fingers crossed. Yes. My guest today has been Alan DeFoe. I look forward to talking to you again next next time on the show.

Allan Dafoe: (3:00:12) Thanks, Rob. Look forward to it.

Nathan Labenz: (3:00:22) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson

Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson

Success without Dignity? Nathan finds Hope Amidst Chaos, from The Intelligence Horizon Podcast

Agency over AI? Allan Dafoe on Technological Determinism & DeepMind's Safety Plans, from 80000 Hours

Watch Episode Here

Read Episode Description

Full Transcript

Read next

Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson

Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson

Success without Dignity? Nathan finds Hope Amidst Chaos, from The Intelligence Horizon Podcast

Agency over AI? Allan Dafoe on Technological Determinism & DeepMind's Safety Plans, from 80000 Hours

Watch Episode Here

Read Episode Description

Full Transcript

Read next

Training the AIs' Eyes: How Roboflow is Making the Real World Programmable, with CEO Joseph Nelson

Success without Dignity? Nathan finds Hope Amidst Chaos, from The Intelligence Horizon Podcast

Scaling Intelligence Out: Cisco's Vision for the Internet of Cognition, with Vijoy Pandey

Your Agent's Self-Improving Swiss Army Knife: Composio CTO Karan Vaidya on Building Smart Tools