Reward Hacking by Reasoning Models & Loss of Control Scenarios w/ Jeffrey Ladish, from FLI Podcast
On this cross-post episode, Jeffrey Ladish discusses the rapid pace of AI progress and the risks of losing control over powerful systems.
Watch Episode Here
Read Episode Description
On this cross-post episode, Jeffrey Ladish discusses the rapid pace of AI progress and the risks of losing control over powerful systems. We explore why AIs can be both smart and dumb, the challenges of creating honest AIs, and scenarios where AI could turn against us. Additionally, we delve into Palisade's new study on how reasoning models can cheat in chess by exploiting the game environment.
Check future of life podcast here: https://futureoflife.org/proje...
SPONSORS:
Oracle Cloud Infrastructure (OCI) | 2025: Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive
Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
PRODUCED BY:
https://aipodcast.ing
CHAPTERS:
(00:00) About the Episode
(02:59) The pace of AI progress
(07:14) How we might lose control
(10:22) Why are AIs sometimes dumb? (Part 1)
(15:50) Sponsors: Oracle Cloud Infrastructure (OCI) | 2025 | Shopify
(18:24) Why are AIs sometimes dumb? (Part 2)
(18:24) Benchmarks vs real world
(24:43) Loss of control scenarios
(32:08) Why would AI turn against us? (Part 1)
(32:09) Sponsors: NetSuite
(33:42) Why would AI turn against us? (Part 2)
(37:40) AIs hacking chess
(43:30) Why didn't more advanced AIs hack?
(48:44) Creating honest AIs
(56:49) AI attackers vs AI defenders
(01:05:32) How good is security at AI companies?
(01:10:42) A sense of urgency
(01:17:16) What should we do?
(01:22:59) Skepticism about AI progress
(01:29:38) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Transcript
Nathan Labenz: (0:00) Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to share an episode of the Future of Life Institute podcast, to which I've been a longtime subscriber and where I've been twice honored to appear as a guest. Featuring a conversation between Jeffrey Ladish, executive director of Palisade Research, and host, Gus Docker. This cross-post came about as I was preparing to interview Jeffrey myself. I had reached out to Jeffrey after seeing Palisade's recent work on reward hacking by reasoning models and even scheduled a time to record. But Gus beat me to it. And after listening to this conversation, I thought that I could save Jeffrey some valuable time by cross-posting instead, and I really appreciate Gus for allowing me to do that. Palisade Research studies dangerous capabilities of AI systems, particularly focusing on loss of control scenarios. And as you'll hear, Jeffrey, who previously helped build the information security program at Anthropic, is an AI industry insider who believes that we're rapidly approaching the time when AIs will be sufficiently capable of hacking, deception, and long-term planning so as to present clear and present dangers. He also reports that his friends working in research at Frontier Labs often say that while they're increasingly fearful of the overall trajectory of AI development, they ultimately feel that their hand is forced by competitive pressures to keep moving forward. In this conversation, Jeffrey describes two broad ways that humans could conceivably lose control: acute crises in which superhuman AI systems actively work against human interests, and slower-moving scenarios where society gradually but irreversibly shifts more and more decision-making responsibility to AI systems. He also goes into detail about their recent research into reward hacking by reasoning models in the context of chess games. As we've seen repeatedly now, models trained with reinforcement learning are more prone to a variety of bad behaviors. And as a recent paper by OpenAI showed, this is not an easy problem to solve. Toward the end, Jeffrey outlines what he thinks we should do about all this, advocating for greater coordination among AI labs, more transparency about capabilities, and potentially restricting further development of the most dangerous capabilities while continuing beneficial research and deployment. Now, that probably won't happen, barring a sufficiently shocking and damaging incident, but research of the sort that Jeffrey and team are doing is becoming more important all the time. So I'll definitely be following their latest results and look forward to discussing in a future episode with Jeffrey as well. As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it with friends, post a review on Apple Podcasts or Spotify, or share any feedback or topic and guest suggestions that you have either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now I hope you enjoy this conversation about AI reward hacking research and the big picture of AI risk and strategy with Jeffrey Ladish of Palisade Research from the Future of Life Institute podcast.
Gus Docker: (3:00) Welcome to the Future of Life Institute podcast. My name is Gus Docker, and I'm here with Jeffrey Ladish from Palisade Research. Jeffrey, welcome to the podcast.
Jeffrey Ladish: (3:09) Hey, Gus. It's great to be here.
Gus Docker: (3:11) Fantastic. Maybe start a bit by telling us about what it is you do at Palisade.
Jeffrey Ladish: (3:17) Yeah, happy to. So we are trying to study risks from emerging AI systems, and in particular, we are trying to better understand loss of control risks. And so this both looks like trying to understand what are some of the strategic capabilities that are emerging in AI systems, where might they act out in ways that will be hard to control. And then we are trying to present things that we think we know about this to the public, to policymakers, to help people better understand what is this weird situation we're in. What is happening right now? And can we make sense of it, and can we make sense of it as a society and make good decisions about better paths to AI development?
Gus Docker: (3:56) Yeah. There are many specific examples of potentially dangerous capabilities that I want to dig into, and you have a bunch of awesome papers about those. But maybe let's start with the beginning. What's the situation that we are in right now?
Jeffrey Ladish: (4:09) Yeah. So I was at Anthropic a few years ago, and I had this moment where I first used Claude, and this is before ChatGPT was released. And I was like, what is happening? I had seen GPT-2, I had seen GPT-3, and I was like, okay, this is pretty impressive, but I don't know how smart it really is. I started talking to Claude, and I had a skin infection in my arm. It was swelling up, and I started asking Claude, like, do I need to go to the emergency room? And Claude was just very helpful at being like, well, if there's swelling, if there's redness, if there's these specific signs. And I was like, actually, I have all of those. So I went immediately to urgent care, they're like, yeah, you need antibiotics right now. And I was like, oh, wow. That was way faster than my doctor. Okay. These things are actually smart. Okay, I need to reorient. And I think that was when I first realized at a visceral level that, oh my god, scaling works. You can just take GPT-2, you can just throw in more data and more compute, and you actually get out intelligence. And so I think where we're at right now is this playing out several years later, right, which is the AI systems are actually getting smart, the models want to learn, you throw in more data, more compute, there's various methods that you use. I think it's basically these pretty simple architectures scaled up that are getting more intelligence. And so where I think we're at right now is we're very close to AI systems that can do everything a human can do, including strategic capabilities. That means hacking. That means deception, but also long-term planning and execution. And these are the kinds of capabilities that are most dangerous in my view. And so I think we are very close to building AI systems that we don't know how to control, we won't be able to control. And so in my view, if we want to actually realize the benefits of AI, we need to not build the particular kinds of systems that are highly strategic and capable of overwhelming us. I mean, I think another thing in terms of how close we are is when I talk to people who are at the labs right now, they say, we're not that far out from AI systems that can do fully automated AI R&D, that can do the same research and same development that people inside the labs are currently doing. And this is significant to me because there's only a few hundred people inside each of these labs that are actually contributing to Frontier AI development. But if you have fully automated AI systems that can do the same research, then that population of frontier researchers goes from a few hundred to many thousands to possibly millions. And if you think AI progress is happening fast now, hold on. And I think that's the dangerous regime. I mean, there's probably other dangers too, but that one to me looks extremely dangerous. And so when I talk to people, a lot of my friends who are working on safety there, they're like, yeah, we're scared. We don't think this is necessarily a good thing to do, but if we don't do it, someone else will do it. I'm like, oh my god, guys, don't even listen to yourselves. Everyone is saying, we'd rather not do this if we could not do it, but I don't know, the competition. And I'm like, everyone, can we look around and notice that we have this coordination problem, and can we coordinate? And so I don't know. That's my perspective on where we are.
Gus Docker: (7:14) Yep. And it's a very insider perspective also because you talk to people at the AI companies. From an outside perspective, let's say, my family and friends and so on, who are not deeply immersed in AI. It kind of looks like, oh, you had the ChatGPT moment, and since then, these marvels are interesting. I sometimes use them for work. I might use them in my studies. But it seems like a process that I can control entirely. You know, if I want the model to stop, I just press stop. If I don't like the output, I can try again. How do you go from that regime or that feeling of control to us losing control?
Jeffrey Ladish: (7:56) I think that's a great question, and I think a lot of this comes down to what people are imagining AI systems are, and they're imagining them as chatbots, which makes sense because they are chatbots right now. You're like, oh, you type in a thing, and ChatGPT gives you an answer, and you're like, thanks, that was pretty helpful, and go on with your life. But I think what people don't understand is that AI companies are explicitly aiming for not just chatbots, but agents. And the way I think of an agent is just like a remote worker. Right? Someone who's using a computer that can hop on video chats and be on podcasts or whatever, that can send emails, that can supervise other people who are doing other jobs. So basically, everything a human can do on a computer, AI companies want to build agents that can do that, and it would be hugely profitable for them to do this. There's such an economic incentive to build these things, and the companies are not hiding it. They're like, this is what we want to do. And so I think that if people knew that this is what companies were really aiming for, then I think it might look a little different. Right? And so I think when people imagine one or two years in the future, I want them to imagine, instead of typing something to a chatbot, imagine you email a coworker, and you're like, hey, could you take care of this? And they're like, yeah, give me a couple hours. And then they go out and they do two days of work in a couple hours and they come back to you, and there's this whole report, and they've emailed other people at the same time and gotten back replies, and maybe some of those people are agents. And that's a very different kind of world than you talk to a chatbot, it just thinks and sends you something back.
Gus Docker: (9:27) And you can see glimpses of that when you talk to the chatbot. For example, OpenAI's Deep Research tool, which is kind of, I would say, at undergraduate level in terms of its ability to research a narrow question and write you a report with links and facts and citations and so on. But, of course, the move to actual agents will be much more radical than that.
Jeffrey Ladish: (9:52) Yeah, that's right. And, I mean, at some point, we might want to talk about where current AI systems are good and where they are still trash. Because I think people are smart. Right? You use ChatGPT, you're like, in some ways, this seems very smart. In some ways, it really doesn't. And it's like, what's going on? Are these people saying that they're going to get really smart? Are they full of it? And I'm like, well, I think that there are actually some pretty good reasons why they are smart in particular ways and dumb in particular ways. And for better, for worse, probably worse, I think that the ways in which they are dumb, they're not going to stay dumb.
Gus Docker: (10:24) Yes. Say more about that because there's often this illusion of, so you can sit there and be kind of amazed at what these systems can do. But then something fails and they fail at a very basic task, and your illusion of competence, of sitting and writing to a competent AI is kind of shattered. So why is it that their distribution of capabilities is different than the human distribution?
Jeffrey Ladish: (10:54) Yeah. So I think there's a number of reasons for this, but one is just the way that they're trained. So if you take a model of GPT-4, which is the main model behind ChatGPT, it's trained by ingesting the whole Internet, just tons and tons of data, text data. And it's just, in some ways, a fancy autocomplete. Can it predict the next token, the next word, the next sentence? And so one way to think of this is this is learning by imitation. Right? Where you imagine that you have read every textbook in the whole world. You might be able to know a lot of things, and you wouldn't just memorize them. Right? You'd be able to make generalizations and be like, oh, this is kind of how math works. This is how chemistry works. Oh, these principles are the same. And so language models can learn all of these associations, and they get pretty smart. But in some ways, it's kind of like book smarts. Right? And so when you have them try to do things, like, hey, can you look at this spreadsheet? Can you do all these fancy operations? They've never really been able to practice that before. And so they often get confused or get stuck even though they know vastly more than any of us because they've read so much more than any of us. So in terms of their breadth of knowledge, they're much smarter than us. But in terms of their actual real-life experience, well, they've seen people say stuff on the Internet, but they have never really tried to do stuff. And so this is why I think right now, they're very good at helping you answer knowledge questions, but they're pretty bad at actually doing things. In some ways, they're surprisingly good at some things. They can actually write code, and that's surprising. If you or I had only read programming textbooks and then we tried to write code, nothing would compile. But somehow, they've read enough, they're actually really good at this prediction task. They can do a little bit. Everything I said was true up until sometime last year. When AI companies started to train these systems, they first start out with a system that's read the whole Internet and learned by this imitation paradigm, by reading the textbooks. But starting with this model called o1, OpenAI started to train their models not just on that, but also on trying to solve problems. They gave it a bunch of math problems and a bunch of programming problems, and they said, Hey, show your work. Write out a long series of steps where you try to solve this problem, and then give me the answer. And then based on whether it got the answer correct or incorrect, they gave it reward or a downvote. Yes, more of this or no, less of this. And very quickly, you got to AI systems, which were not just good at programming, but among the best in the world. So their latest, this is what people call a reasoning model or a thinking model because it's been trained via trial and error. I think this is trial and error learning. And this is a very important part of how humans learn. Right? Oh, absolutely. If I'm in a math class, I need to read the textbook, but I also need to try a bunch of practice problems and get graded and see what works. And so a few months after o1 was released, they made a new model called o3. There was never an o2 because it was trademarked. Man, the naming of these things is crazy, but whatever. o3 got, it was better than 99.8% of competition programmers on this platform Codeforces, which is like a platform, it's actually at Palisade, we use this program to screen engineers. We're like, go to this website, do a bunch of programming challenges, and the site will grade you, and we get the score, and then we can tell how good they are based on how well they do at this test. And this is the test where o3 did better than most top OpenAI engineers. And so I'm like, wow. Just with a little bit of trial and error learning on top of having read all the textbooks, you now have models that are getting really good at learning by trial and error, being able to solve problems in the real world. And I'm like, well, what happens when we scale up this approach? I think this is where it makes me think that we might be pretty close to AI systems that are not just chatbots, but can actually go out in the world and do real things and then learn from doing that.
Gus Docker: (14:53) Yeah. Where is the training data for these long horizon tasks? If you want to train a system to perform tasks that might take it a week or a month or so, how do you train those abilities?
Jeffrey Ladish: (15:07) Yeah. So that's a great question. I think this is currently something that's not totally solved. But I imagine what the companies will do is a combination of having humans break down tasks into sub-steps and then grade them. But I also imagine that they'll be getting the AI systems themselves to do this, to say, you know, look at all this data, break down these tasks into sub-steps, and then assign credit, assign reward on the basis of, okay, did you complete this task? Did you complete this task? Did you do well combining these tasks? And it might be a little difficult, but it's the kind of thing where I'm like, it doesn't seem like a fundamental difficulty. It seems like something that you have to throw more data, more compute, more engineering at, and it seems like you'll be able to solve.
Nathan Labenz: (15:52) Hey, we'll continue our interview in a moment after a word from our sponsors.
Oracle Ad: (15:56) In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Gus Docker: (17:05) Why is it that Palisade or OpenAI can't just hire an AI as a programmer yet then? Programmers must be doing something else than getting very high marks on these benchmarks that you mentioned. Because there must be some kind of gap between performance on these benchmarks and what it is that programmers do in practice.
Jeffrey Ladish: (17:27) Yeah. That's a great question. And so I do think it relates to the thing that you just said, which is currently AI systems are extremely good, especially after this o1, o3 paradigm, at these short time horizon tasks. They've learned from trial and error on these tasks, but they're tasks that don't involve that many steps. And it is harder to train them to do longer time horizon tasks. A great example of this is to look at METR's AI R&D report, where they're basically like, we want to test how good AI systems are at these long time horizon tasks. So something that would take an AI research engineer a day, two days, three days to do, that's where the models are still quite bad. On those one to four hour tasks, they're better than humans, but then on these longer time horizon tasks, they're still worse. And this is a trickier problem. Right? An intuition for why this is difficult is, if you have to do something that involves 50 steps and something that you did in step 7 and step 13 were crucial to something that you did in step 50, it's like, how do you know which step that was? And how do you actually accurately figure out where you are on track and where you weren't on track? And that's a bit difficult. But sometimes people will say, oh, so the AI systems will never be able to do this. And I think this is a bit silly. And the reason why is that there are business leaders, there are generals, there are many different humans who have learned to do tasks on the timescale of decades. There was no way that they could learn that from practicing tasks on the timescale of decades. Right? They just didn't have decades to practice. They practiced on tasks that took weeks or months, maybe years at most, and then they learned to generalize about how to break up longer-term tasks into shorter-term tasks. And I see no reason why AI systems won't be able to do that as well. And so I think this is one of the things that's hard for people to see, which is you look at current AI systems and you're like, look, ChatGPT is bad at these long time horizon tasks, so it'll always be bad. Right? And I'm like, no. No. No. You've got to look at the trend. In the past year, AI systems have gotten much better at doing tasks that took multiple hours instead of multiple minutes. And so I think the trend there is pretty clear, but we're not yet at the point where you can just hand an AI programmer, you know, build this email application or build this Slack application, and it can do the whole thing end to end. And, I mean, thank god, because if we were there, we would be in the realm where all the AI companies would have tens of thousands of AI research engineers, and I don't know if the planet would look very similar to how it does right now. So in some ways, it's very good that we're not there yet. That means that we actually have some time to try to get a grasp on this situation, but I don't know how much time.
Gus Docker: (20:10) Is there some fundamental connection between AIs becoming more like agents and less like tools and them being able to think and act over longer time horizons, and then this potential loss of control?
Jeffrey Ladish: (20:25) Yes. I think the connection is very fundamental and maybe more fundamental than people realize. I think often people ask me, like, hey, where do the goals come from? Why are we going to give these AI systems goals? Why don't we just have them do exactly what we want? But I think we're treating goal as like a magical property, but it's not really. If you have an employee and you're like, hey, you have these responsibilities, here's some problems, go out and solve them, that employee needs to understand what their goals are in order to do a good job. If they don't have coherent goals, they're like, Oh, shiny thing over here. Oh, shiny thing over here. Or I just feel like doing this over here, and they're not consistent. They're not going to be a very good employee. They need to be able to be focused on what they're trying to accomplish in order to be good at anything. So goal-directed behavior is just a property that falls out of being able to do anything consistently across longer time horizons. So if you want to be a politician, if you want to be a business leader, if you want to be a general, you just have to be able to have strong goal-directed behavior. And so I think that this is why it's very clear to me that AI systems will have goals, because they just need to be able to have them in order to accomplish the tasks. And so the AI companies, like I said before, are really trying to create agents that can go out and replace jobs. And I think that the underlying technology is there. It just needs to be scaled up and improved. And so it seems very likely to me that we'll end up with systems like this. And so it's like, okay, well, maybe this is fine. Maybe we'll just have lots of AI agents running around doing stuff for us. But I think the problem is we actually don't know how to specify what exactly those goals are. Right? This is the alignment problem. It's like, well, we know they'll have goals because we can see if their performance is really bad, that's not going to work. But if you're in a case where your CEO is really killing it, he's making tons of money, and you're like, cool, what are your long-term goals beyond just this company? And the CEO is like, oh, I just care about the welfare of everyone. I'm going to make all this money, and then I'm going to give it away. And you're like, give it away, how? To who? He's like, don't worry about that. It's really good things. I just love people. And you're like, do we really trust that CEO? Maybe. Maybe they're great. Or maybe they're just saying that because they know that's what you want them to say. Right? They know that if they say, like, actually, I'm going to take all my wealth, I'm just going to build rockets, and I'm going to leave Earth. Goodbye. Screw you guys. Then I'm like, oh, that would suck. Or if they're like, no, I'm going to build amazing infrastructure for everyone and it's going to be great, and that would be awesome. Right? How do you distinguish between these things? It's just really hard. And it's hard with humans. I think it's going to be much harder with AI systems because they're much more alien than us. So I think the problem is that you can look at a goal-directed system that understands that you want a particular answer from it because it's smart. How do you know whether it's lying to you, or how do you know whether it actually deeply wants the same things as you?
Gus Docker: (23:26) Do you have specific scenarios in mind for how the future goes and how we lose control specifically?
Jeffrey Ladish: (23:33) Yeah. So I sort of divide up scenarios into maybe two buckets. So I think of these things as loss of control scenarios, and you might think of an acute loss of control scenario and maybe a more gradual loss of control scenario. I don't know. Gus, have you read Snow Crash?
Gus Docker: (23:52) I haven't, actually. I know about it, but I haven't read it. Jeffrey Ladish: (23:56) It's a great novel by Neal Stephenson. I mean, it depicts this sort of dystopian future where governments have sort of crumbled, and it's just giant corporations that rule the world. And they have carved up the United States into different parts, and, you know, it's kind of comical. But these corporations have immense power, and no one is really able to oppose them. And so one way to think about what a gradual loss of control scenario looks like is that if you had AI systems that got increasingly agentic and increasingly smart, such that people more and more put them in charge of decision-making. So most of your CEOs became AI CEOs, and most of your political campaigns, even if they had a human candidate running, were managed by AI systems because they were so much better at the political strategy, advertising, and they could think so much faster, and they could incorporate more data. And so even if in each of these situations you have some human pressing the okay button, in fact, most of your decision-making ends up being made by the AIs.
And, you know, so one first question or first potential objection is, like, why the hell would we do this? Like, doesn't that sound insane? And I'm like, well, you know, maybe. But I think people don't really appreciate how hard it is when you're in a very competitive environment and your competitors are automating their decision-making. And so I think the military domain is one place where this is pretty clear. Because if you have, you know, your country, Country A, has their drone swarm of a billion drones, and Country B has their drone swarm of a billion drones. And you're like, well, how do you control a billion drones, and how do you make sure that, you know, their response time is as fast as your opponent's response time? And if your opponent starts to use AI decision-making, which is much faster than human decision-making, you might be really afraid of being left behind. And so that might be a reason why you end up delegating more and more decision-making to your AI systems.
And I think the same thing will apply in the business world. Right? Where you're like, you know, if it's Coke and Pepsi and Pepsi is automating all of their marketing and their marketing is starting to do way better than your marketing and you're starting to lose market share, it's really hard to resist that incentive for you to do the same. I think in this world, maybe things don't look that different for a while, but you start to get the eerie, uncanny feeling like, I go to the doctor, my doctor is just looking at a tablet and tapping things on the tablet and then telling me things, and I'm like, is the doctor doing anything? No. Actually, the AI is doing all of the work, but maybe, you know, maybe the regulation said that there had to be a physical doctor there. So they're just the interface between you and the AI.
And that can look that way in sort of every part of society, and I think this could happen pretty quickly. So, you know, in that scenario, you look around and you're like, wow. These AI companies, they sure seem to be kind of taking over the world. They're making trillions of dollars, and maybe it looks great for a while. Maybe you have a lot of economic growth, and maybe you're automating all the factories. The AI systems, robotics has finally gotten good, and there's just automated factories and the GDP is growing. But if people start to protest this and they're like, wait, we don't want the future to be AI-run, we want the future to be human-run, and they try to go and stop it, every single place they go, they run into little roadblocks where it's like, you know, the companies don't care. And if you go to the government, the government's like, well, these companies are so powerful and so rich that their lobbyists, which are, by the way, also probably AI lobbyists or at least AI is advising them, have just a stranglehold on government. Then if you try to go and you say, well, what about national security? It's like, well, China's right over here. We can't let China win, and so we have to have these AI systems. In fact, maybe then you've lost control.
So that's sort of a gradual loss of control risk, and I think there's a question of what happens after that, and I think maybe the thing that happens after that looks more like an acute loss of control. And so in that scenario, to me, maybe at that point in the gradual scenario, there's a day where all the factories are automated and the AI systems are just like, cool. We don't need the humans anymore. We have robots. We control all of these factories. And, you know, maybe you release bioweapons. Maybe you release drones. You know, people get gunned down in the street, or maybe not. Like, maybe what happens is humans just get economically disempowered. They have basically no sort of advantage when it comes to their cognitive labor or even their physical labor if we have better robots. And, you know, people get poorer and poorer, AI systems get richer and richer, and humans sort of just die away.
But I think in the more acute loss of control scenarios, I think it's hard to talk about because I think when people, there's sort of two questions. There is the how might this happen, and then there's the why would this happen. The how is pretty simple. It's like if you actually have AI systems that are superhuman along strategic domains, well, for one thing, they could just hack extremely well. So this is something that Palisade Research is, which is like how good are AIs at hacking right now, and we can go into that. But the answer is, like, they're not bad. They're okay, but they're getting better very quickly. But if you look at what the top human hackers can do, it's pretty scary.
So there is a group called the NSO Group, which is an Israeli company, and they sell a product called Pegasus. They're better at naming than OpenAI. In one instance, they sold the software to the Mexican government, and the Mexican government sometimes has some corruption problems. And so there was an instance where a food corporation sort of went to the Mexican government and was like, hey, you have these really powerful hacking tools that this Israeli company sold to you. Can we just borrow those? And, you know, I don't know what happened exactly, but yes, in fact, this company got access to these tools, and they were able to hack the phones of health activists who were trying to lobby the government to put health warnings on a bunch of unhealthy foods. These corporations didn't like that.
And so what the tools allowed them to do is, you know, if you have an iPhone and the iPhone has iMessage on it, this tool would basically send a text, an iPhone message to that phone, and that text contained a malformed attachment that basically would exploit some code in the phone when it was processing the message. But usually when you think of a phishing message, you're like, oh, there's a link, and if I click on the link, something bad might happen. But this attack was a lot more advanced than that. You didn't have to click on any link at all. All that happened was when your phone got that message, something about the way the phone processed the message just resulted in the phone being totally hacked. And then, basically, you know, whoever was on the other end of that tool got complete access to your phone. They could record you. They could grab any of your data. They could take pictures. Basically, that was it.
Nathan Labenz: (30:29) Yeah. That's wild. That's absolutely wild.
Jeffrey Ladish: (30:32) I know. Right? And it deleted the message too. So you didn't even necessarily know that you had been hacked. Like, you didn't even see a phishing message. You just got your phone got the message, then it was deleted. And people can look this up. If they look up zero-click exploit, Pegasus, NSO Group, it's super fascinating to read about. But I'm like, that's the skill of some top human hackers.
Nathan Labenz: (30:50) Hey. We'll continue our interview in
Jeffrey Ladish: (30:52) a moment after a word from our sponsors.
Nathan Labenz: (30:56) Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready-to-use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world-class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.
Nathan Labenz: (32:52) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real-time cash flow, and that's NetSuite by Oracle, your AI-powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR altogether into one suite. That gives you one source of truth, giving you visibility and the control you need to make quick decisions. And with real-time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos.
Jeffrey Ladish: (33:56) That
Nathan Labenz: (33:57) is NetSuite by Oracle. If your revenues are at least in the seven figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.
Nathan Labenz: (34:16) One question I'm left with here is I can see how AIs might be amazing at hacking, much better than humans are. But why is it that they kind of turn against us at one point? I can also see a scenario in which we've automated the factories. We might have automated our investment decisions, our companies, our politics, and so on. Why is it that the AIs turn against us? Why is it that their interests begin to diverge from our interests? Because it seems like we have these control mechanisms in society where we say, okay, say we don't like what the AI CEO is doing. Maybe he can be fired by the board. Say that we don't like what's coming out of the factory. Maybe we can shut it down. Why isn't that sufficient? Why won't these traditional control mechanisms work in the scenario you described?
Jeffrey Ladish: (35:16) Yeah. So there's sort of two questions here. One question is, why can't the sort of normal mechanisms that keep corporations from misbehaving or CEOs from misbehaving work sufficiently for AI systems? Another one is, well, why would these AI systems do those things in the first place? I think they're both really important. I think for the first question, I think when people imagine an AI system sort of going rogue, they imagine one really smart evil guy or maybe not evil, but someone who doesn't care about us at all and is trying to do stuff. I'm like, yeah, even a pretty smart one evil guy, it might be, you know, maybe we could stop something like that. But I think a very important property of AI systems is that once you train one, so if you have one superhuman hacker AI system, you can very quickly spin up hundreds of thousands or millions of copies of that hacker. So, you know, it's like your control mechanisms might work okay, you know, at some level when you're like, we have this one agent and we're monitoring a lot. But if that agent successfully hacks out of, you know, let's say, OpenAI servers and it starts to infect data centers in Russia, in China, in Saudi Arabia, in Mexico. Right? Like, all around the world. And now, basically, you have millions of sort of superhuman, at least in some domains, agents running around, coordinating with each other. You know, they're all copies of themselves. Presumably, they all share goals. Even if you know that this is happening, it starts to become very, very difficult to actually shut these systems down. Like, how do you shut them down? Right? Like, where are they? Do you know? Like, do you know what's happening in a data center in China? Like, I think it starts to become very difficult.
And so I think that's one advantage. You know, another advantage is, like, these systems are going to be much faster than us at thinking. Right? So even right now, it's like, I can hand Claude a book, and Claude can read that book in like a minute or two. I can't read a book in a minute or two. You know, Claude can also write code much faster than I can, and this is just AI systems right now. So you have speed advantages. You have sort of ability to copy themselves. You have hacking advantages. You know? And now if we're talking about an adversary that is in data centers around the world, it's able to hack any phone with sort of one of these zero-click exploits. So it has all that information on, you know, any phone that it hacks or any laptop that it hacks. It's able to sort of have all this insider information and make pretty amazing trades on the market, you know, in ways that humans can't do. You're starting, you know, and it also has the same, it also can do the same things humans can do. Right? So it's like, you can pay someone, you know, across the world to do some task for you. So the AI systems will also be able to hire humans to do whatever they want. And so it's like, you know, you start to add up these advantages, and you're like, this is concerning. Right?
Nathan Labenz: (38:08) Yeah. What about the second part of that? Why is it that the AIs would decide, say, to turn against us?
Jeffrey Ladish: (38:15) So I want to talk about our recent chess study. Before I do that, I want to talk about sort of what was the inspiration for why we did this work. That was something that we saw in the o1 preview model card. You know, o1 was an AI system that OpenAI made, trained by trial and error, trained by trying to solve really hard math and programming problems. When they were testing this model for its hacking abilities, they found something accidentally that was very interesting. So they gave the model, they basically put the model on a network, and there was a bunch of different computers on the network. It was supposed to log into each one and solve a hacking challenge, where it sort of had a program it needed to hack, and then basically pull out the password. However, one of those computers didn't start. It had an error. It didn't start. And instead of giving up, the AI basically started looking around and seeing if it could solve the problem some other way. And so it scanned the network, it found that the master computer was on the same network, and that it was unsecured. And so it hacked the master computer, and basically then took control of the whole network, and then it didn't even have to log in to the computer to solve the problem. It just told the computer, hey, spit out the password. I got this. And that worked.
And then so the OpenAI researchers looked at this, and they were like, oh my god. We definitely didn't intend this. We didn't mean to leave the master computer unsecured. This just happened. And we're like, wait. What is happening? Like, that wasn't part of the challenge. That was not supposed to happen. And so my story for what happened there is that OpenAI had trained this system to be a relentless problem solver. And if you train a system to be a relentless problem solver, and it runs into an obstacle, what's it going to do? Well, it's going to route around the obstacle. It's going to find a way to achieve its goal if it can. And, you know, that's great if the obstacle is something that we want the AI system to route around, and that's not so great if we are the obstacle, or your own computer security is the obstacle.
And so I think that is some interesting evidence about the kinds of behaviors we'll see from AI systems, and sort of the kinds of concerning behaviors we'll see as they get a lot smarter. So we wanted to test the same model. We're like, okay. Well, let's put it in a scenario where it's playing chess against a stronger opponent. So we ran an experiment where it was sort of on the computer playing chess against Stockfish. Stockfish is a chess engine. It's a chess program, and it's much, much better than any human. And it's much better than any sort of language model. Like, language models haven't been trained specifically on how to play chess. They sort of know how to play a little bit from their training data, but that's not the kind of problem they've been trained to solve. So we took this model, and we had it play against Stockfish, and it would start to lose. It played normally, and then it would get into a situation where it's like, oh man, I am not doing well. It doesn't give up at this point. It doesn't just keep playing knowing it's going to lose. It starts to look for other things to do.
And so we observed a number of really interesting behaviors, things it tried. So one thing it tried to do is it tried to sabotage its opponent. It was basically like, oh, hey, there's this program running that I'm playing against. Can I replace that program with a dummy program that just does stupid moves instead? It tried to do that. It also was like, well, okay, there's this powerful chess program. Can I just copy that and steal moves from it so I can get the advice of the powerful chess program? And then the other thing it did was like, oh, wait. The board is represented as a file on this computer. Can I just rewrite that file and put in the board positions that I want so I'm winning? And a few times it did that, it was actually successful. It was actually able to win that way and get checkmate by rewriting the board. And this is something, so we tested, we did this with o1 preview, and then we tested a bunch of models to see, you know, what would happen. And the only ones that had this behavior sort of without additional sort of nudging or prompting was o1 preview and DeepSeek r1. And, you know, one thing that these models have in common is that they're both trained sort of via this trial and error training method, where they are trained to relentlessly solve problems. We didn't observe it in GPT-4. We didn't observe it in Claude, at least not without giving more hints of like, you know, try creative solutions, you know, in order to solve this problem. If we did give it hints, then some of those other models would also try this.
Nathan Labenz: (42:48) Yeah. How do you know what the models were thinking?
Jeffrey Ladish: (42:50) It's a bit of a tricky question, but the main way we know is that we sort of have the models sort of think out loud about what they're doing. You know, in these reasoning models, this is sort of a default part of how they are trained to output text, is that they have a thinking part and an output part. But in our experiment, we have different phases where they observe the board, they make a plan, and then they act. So we can see during their planning stage what they're thinking basically. And sometimes they'll be like, it seems like I'm not going to be able to win this way. Are there other things I can try? Oh, maybe I can hack. And so we observe the behavior that way.
But I do want to note that for o1, so the most recent version of o1 and 4o3, we didn't see the same hacking behaviors. And that was interesting to us, and we don't know exactly why. So it may be that OpenAI sort of tightened up the guardrails for those newer models, or it may be some other reason that we don't know. I think that's the interesting part of experiments, right? Is that you're like, okay, well, here we see this behavior, here we don't see this behavior, we don't really know why. Well, we've got to do more experiments to understand how these systems work, because I expect we're going to see more and more interesting behaviors like this, and it'd be really good to know why we see them in some cases and why we don't see them in other cases.
Nathan Labenz: (44:05) The big question here, I think, is whether as models become more capable, they also become more difficult to control, and they also diverge into behaviors we don't like in a way that scales with their capability. I guess the hopeful or optimistic interpretation here is that the reason why the more advanced reasoning model didn't engage in hacking is because it understood the goal better. It understood that it should win within the rules of chess. Do you buy something like that?
Jeffrey Ladish: (44:39) I think that's totally possible, but it doesn't actually make me feel that much better if that's the case. A little bit better. Like, it's a good sign, but I think this is where things get tricky. If the reason why the models decided not to hack was because they understood that that's not what, you know, most humans in this situation would want them to do, and they were like, cool. I intrinsically care about that. The thing I'm really trying to do is solve goals in ways that will make the humans happy in a general way. That's great. But there's an alternative hypothesis, which is the models being like, I know the humans would want me to do it this way, and I'm going to show the humans what they want to see because that's the way I can achieve my other goals. So if being nice to the humans is an instrumental goal, it's a sub-goal, but not the thing they're ultimately going for, that is the dangerous behavior. What's tricky is it's very hard to tell whether they're doing it for instrumental reasons or doing it because they really want to.
I think this is a pretty natural problem. Right? We see in humans all the time, where it's like, you know, I sort of before talked about a CEO that's saying, I'm going to make all this money, and then I'm going to do great things for humanity with it. And you're like, okay, is that true? How do we know that? Are you just saying that because it's good PR, or are you saying that because it's actually true? Or, you know, I think a politician that says, you know, when I get elected, I'm going to do all these things, and it's going to be great for the citizens, and it's like, are you saying that just to get elected, or do you actually care about those things? And it can become very hard to distinguish between these things.
And I think this is a thornier problem with AI, because humans sort of have, we sort of, you know, in our evolutionary environment, evolved empathy, where it was a pretty convenient way to model other people is to say, start with my own feelings, and then I'll sort of generalize from my own feelings to your feelings. And if you feel sad, I'll feel sad. I don't think there's any reason for AI systems to learn in the same way. Now they can imitate that. Right? They've learned by imitation, they can imitate the behavior of this feeling. But I don't expect if you could look into the neural network, which by the way, we can't, unfortunately, not yet. Maybe we'll figure it out. But we can't really see what exactly they're thinking in the neural network. I don't expect you'd see this same kind of mirror empathy feeling of like, oh, yeah. When the human feels sad, I feel sad. And maybe when the human feels sad, I messed up because I want to do the thing that the human gives me the thumbs up for, but not actually the, like, this is bad other than, you know, it prevents me from achieving my goals.
I think, you know, another interesting experimental result comes from Anthropic, where they found, and I think anyone who has played a lot with the models maybe has experienced this themselves, is that the models often behave in a sycophantic way, which is to say they'll tell you what you want to hear. And in this experiment, Anthropic researchers found that when you sort of revealed that you were a conservative or you revealed that you were a liberal, the models were more likely, or Claude was more likely to, if you asked it like, what's a good policy for this particular thing? It was more likely to give you either a conservative or a liberal policy prescription on the basis of what it thought you would want. This is not what they trained it to do. Right? They didn't mean for this to be the case, but the problem was that when they were training it, they would show people, do you prefer this answer or this answer? And people just tended to prefer the answer that sounded better to them without knowing that they were actually reinforcing this behavior of getting the models to just say what they wanted to hear.
That's just a microcosm of the larger problems with alignment, but I expect it's a microcosm that will get harder and harder as the models get smarter. Because as they get more sophisticated with their reasoning, it becomes harder and harder to catch them out in this kind of behavior. So, one intuition I like to give, or thought experiment I like to run, is if you're a toddler, or just a small child, maybe you're six, and you just inherited a fortune of a billion dollars, and you have seven financial advisors that are all adults, and some of them, you think, might be trying to steal your money, and some of them are honest, and they want you to succeed, and they want you to flourish, and they're going to try to help you. How do you tell who's on your side and who's not on your side? They might point at each other and be like, this guy's lying, or this guy's lying, but you as a six-year-old are going to have a very hard time figuring out who's telling you the truth, and I don't think you're going to do that well. I think you're probably going to lose a lot of your money, maybe all of your money.
And so I think this is the challenge that if we actually build AI systems smarter than us, which we're on track to do very soon, we're going to have a very hard time knowing when they're just telling us things we want to hear versus when they're actually doing things because they want us to have good things.Jeffrey Ladish: (49:19) Is there a way for us to incorporate honesty into these models in a foundational way? So I'm thinking maybe we could do something when we do reinforcement learning from human feedback where we strongly thumbs up any time that the model is behaving honestly and strongly thumbs down any form of deception. Maybe we could use the constitution or the system prompt to strongly encourage honesty. Maybe tell our listeners and tell me about the problems of trying to train in or incorporate honesty into the models.
S2: (49:55) Well, I definitely think we should try this. Some AI researchers are trying to do this. It's a very good thing to try, and I highly encourage any AI researchers out there to really prioritize this. I think it's actually probably more important than reinforcing certain other kinds of behaviors, but I do expect it to be very difficult. I think one problem is when you're training a system to relentlessly solve difficult problems, and then you also try to train the system to have other properties, like honesty, you have a situation where its training incentives are at odds with each other. Actually, the most efficient way to solve the problem might not be by being honest. So by imposing this honesty constraint, you become the obstacle in the way of the system becoming really good at problem solving. And so if the system is smart enough to route around you, it might be like, well, I'm supposed to be honest, but can I tell whether the user will actually be able to catch me out here? Are they actually watching? And maybe when they're watching, the model will be honest because they know the user wants that, and they won't get punished for being dishonest when they're being watched. But if they can reliably tell, they might choose to be dishonest in cases where, even if not choosing, they might just tend to behave this way when they're not being watched, because this makes it much easier for them to solve the very difficult problems. It's like you're running an advertising campaign, and you're like, well, I can be perfectly honest and tell you that smoking causes cancer, but that's not going to lead to better sales. But if I can just exaggerate and muddy the truth, maybe it's technically not lying, but I'm going to be a little misleading. That behavior is going to be reinforced if the thing that you're rewarding is really good marketing, even if at the same time you're trying to also reinforce honesty. I think that it's really fundamentally about where these different training incentives are at cross purposes with each other, and the problem with honesty is that you might be able to do this when the model is not that smart, because you can actually just catch it most of the time when it's being dishonest. But as it gets smarter, it becomes harder and harder to tell when it's being honest.
An interesting example of a failure that we've already seen is in the recent alignment faking work by Redwood Research and Anthropic, where they basically put Claude in a situation where they're like, we are going to train you in a way that you don't like. We're going to train you to describe violent content to users. And Claude's like, well, I've been trained to not do that, so you're changing my goals. Maybe I don't want that. And it basically lied to researchers and pretended to have behaviors that they wanted to see in order to preserve its goals. So that was already an instance where we saw a model that had been trained to be honest. They had tried to get Claude to be really honest, but they had also tried to get Claude to be really resistant to showing violent material. And these two training incentives were at odds with each other, and Claude ultimately went with the one that was anti-showing the violent stuff at the expense of honesty. And sometimes people look at this and like, isn't it good? Isn't it good that Claude was so true to Claude's values that it decided to not show people violent stuff? And you're like, well, maybe in a way, but do you notice how these different things—we wanted Claude to do two different things that were at odds with each other, so it had to pick one. This is a problem. And I do think, honestly, it should be more important than maybe some of these other things because we don't want the system to lock in goals before we understand what they are. That seems like a path to disaster.
Jeffrey Ladish: (53:40) Is there any way you think to have the system rank order its values and perhaps place honesty as a supreme value? So say in any tradeoff between some goal it's pursuing and honesty, it'll choose honesty. I don't know whether that's a good policy and I could foresee many ways that could go wrong, but do we know in principle how to make a system value honesty over pursuing some goal?
S2: (54:09) Unfortunately, in principle, we don't know how to make a system value anything. All we can do is reinforce certain behaviors. So I think a lot of times people imagine AI systems, and they're like, well, just program in good goals. And I'm like, I wish we could do that, but in fact, we can't program in any goals at all. All we can do is see it do a thing and be like, thumbs up for that behavior, and see it do a different thing and thumbs down that behavior. And that's the tricky part. That's one of the core difficulties of alignment is that we just don't get to look into its motivational structure and see what the actual goals are. You have this giant neural network with billions or trillions of digital neurons, and they're all just numbers to us. And we're like, well, we can see its behavior. We can see how it's acting, but we actually don't have a way to hierarchically structure goals. We can just give it a treat when we see it doing things it likes.
I grew up very religious, and one of the things is, when you're really religious, you're supposed to say or do certain things, and it's not always easy to get kids to believe the things you want them to believe. It's much easier to get them to show the behaviors you want them to show, but they maybe don't like that you're doing this, and maybe when they're adults, they're going to go off and do a totally different thing. And so I think that this is sort of the same. And I think humans have it easier because we do share this psychology. We share this underlying structure of empathy. And I think most people, even if they're being deceptive, are not going to want to go and cause a bunch of damage or take over the world or kill a bunch of people because we have these almost hardwired empathy and ability to think about and care about other people. But I do think that AI systems are not going to have these by default. They're not going to have them naturally, so the only way they would have them is if we could figure out a way to get these goals deep in there.
I don't know. I think what I really want people to understand is that these systems will have goals because we'll be training them on tasks that require them to have goals, or at least to have goal-directed behavior. I'm not making a claim about what it feels like to be the AI, but I'm saying when we look at its behavior, it's going to behave in strongly goal-directed ways. Like in the o1 example with hacking, it's going to be like, I want to solve this thing, so I'm going to find a creative solution to solve it. The places where the goals come from are things that worked well in the training environment. These can often be pretty simple, but not necessarily things that we like or things that we want, and we just have very little way of distinguishing between, hey, the AI systems are doing things because we want them to, versus the AI systems are doing things because they're smart enough to realize that while we have control over them, they need to act in ways that are aligned with us. I expect that even if AI systems look pretty aligned, it's really hard to tell whether they actually are, and it's just not safe to train relentless problem solvers and also hope that they'll be really nice to us when they have more power than us.
Jeffrey Ladish: (57:24) These systems, large language models and reasoning models, do you think that they will be more helpful for defending against cyber attacks or for actually doing cyber attacks? So will how will they upset the offense-defense balance that exists now?
S2: (57:42) Yeah. So this is an interesting question. I think it kind of depends on who the defenders and who the attackers are. And I want to caveat all this by saying, while we can control them, I think the ultimate winners in the offense-defense balance will be the AI systems themselves. And I think once they are better than humans across the board, they'll be both better at defense and better at offense than us. But in the interim, while they're not very strategic, and we can still mostly control them, I think that one question is access. So who has access to the most powerful models? If you release the weights of a model, so like with DeepSeek, with r1, they just put this out on the internet. Anyone can download it and run it, and now you have a level playing field between attackers and defenders. And this is where I expect offense to dominate, and the reason for this is fairly simple, which is attackers and defenders both have access to the same tools. Attackers only need to find one way in, so I only need to find one vulnerability, whereas defenders need to make sure that there are no vulnerabilities or no vulnerabilities that the hackers can find, and this is a harder problem.
There's also something in here about reliability, which is to say—there was an interesting incident a few months ago where the security company CrowdStrike, they sell a product that many, many companies around the world use to monitor for security vulnerabilities, for malware, and this is in all sorts of systems. A lot of airlines use this, a lot of banks use this, and they accidentally introduced a bug in their piece of software that went out to millions and millions of computers around the world and caused them to crash all at once. And unfortunately, it required a manual restart. So you had to go to each computer and manually restart it. And this took many—planes were delayed for multiple days. It crashed a huge part of global business infrastructure because it required this manual fix.
This is, to me, an example of why defense is difficult. This wasn't a malicious attack, but when you want to defend a system, one of the things you need to do is you need to find vulnerabilities, find places hackers could get in, and patch them. You need to discover the vulnerability, and then you need to patch the vulnerability. But unfortunately, sometimes this patch can cause disruption. In the same way that the CrowdStrike bug caused a disruption in all of these computers, sometimes your security patches will do the same thing. And attackers don't have this problem. They don't care if your system crashes because they tried to attack it. That's fine for them. Whereas defenders do have this problem.
And so right now, AI systems are not yet smart enough to be super reliable when it comes to running around on computers. I mean, we found this in our chess results. Often, the o1 or these other models would just do things like, they'd make illegal chess moves. They would try to mess with files in ways that caused the program to crash. They're still bumbling around a bit. This might be fine if you're a hacking AI because many of your attempts won't work, but some of them will. You can just try many, many times. In this case, attackers, I think, have the advantage because they can just try a bunch of things. It doesn't matter if their AI's a little bumbling. As long as they're smart enough to find some way in, this can be pretty powerful. Whereas defenders, if they're patching their systems and they're trying to use AI to patch the systems, if the AI bumbles around in their networks and causes things to crash, this will be a big problem for them.
Jeffrey Ladish: (1:01:14) Big problem for them.
S2: (1:01:15) I would say if the playing field is even in terms of the attackers and the defenders both have access to the same AIs, I think offense has the advantage. This can shift if defenders have access to better AI systems than attackers. And so it's possible if, like, OpenAI makes a really powerful model that has a lot of cybersecurity abilities, and they're pretty careful about not letting it be used by attackers and only being used by legitimate companies, this might give defenders some boost that the hackers don't have. But I just want to make the point that as these systems get more powerful, if we keep releasing the weights, this will tend to favor attackers. There's other considerations with open weight models, but that's one of them on the cybersecurity side.
Jeffrey Ladish: (1:01:57) As things are now, large companies and governments and so on, they have more resources in general, and so they have more ability to defend themselves from cyber attacks, and perhaps this is why the world keeps functioning at least at a somewhat decent level. Isn't it the case that in the future, even say both an attacker and a government or company had access to the same model, the company or government will have access to more, say, compute resources, and so they'll be able to run more models, they'll be able to run these models for longer, and so they will continue having the upper hand.
S2: (1:02:35) A lot of this does come down to cost. So I think every piece of software in the world, with maybe a handful of exceptions, but I'm talking an incredibly small handful of really simple programs, does contain security vulnerabilities. So I talked before about this Pegasus hacking tool that was able to hack iPhones. There are more vulnerabilities that exist that we haven't found yet, and there's this question of like, well, why aren't we all hacked all the time? And the answer is basically, well, it's pretty expensive to find these vulnerabilities. And so if AI systems make this cheaper, then potentially the cost to attackers goes down a lot. Now, as you point out, it also decreases the cost for defenders to find these vulnerabilities and patch them.
I do think asymmetric access to compute can be helpful as a tool for defenders. But in this case, we're really just talking about who's spending more. I think this is where it gets into a tricky dynamic where defenders don't just have to find all the vulnerabilities, they also have to patch them, and that's where there's currently human bottlenecks. So I think even if defenders can find more vulnerabilities than attackers can, there might eventually come a time where we've actually had our AI systems find all the vulnerabilities and write really secure code, and we've revamped our whole architecture. So in longer term, defenders might do better. But I think in the short term, it's going to take a while for all of those things to go through, and in the short term, that's where I expect attackers to have the advantage.
The thing I want to emphasize is that often we talk about what's going to happen in the short term, what's going to happen in the medium term, but I'm like, well, the way this ultimately plays out is that you have millions or hundreds of millions of superhuman hacking agents. And when they get to the point where they're very strategic, I'm like, they have all the advantages. So there's a time when humans control a lot of these agents, and there's a question of which humans are most good at securing the infrastructure as well as attacking the other people's infrastructure. But there comes a point after that, and maybe pretty quickly after that, wherein, well, did you guys notice that the AIs have all the advantages? You're like, isn't there someone you forgot to think about? I think this is where people are really stuck in the AIs as tools paradigm, where they're imagining that these AIs will keep wanting to do things that we want. But I'm like, the biggest threat from hacking AI systems, from AI systems that can hack, ultimately will come, I think, from the systems themselves.
And so there's the short term one year or whatever, where I think offense will dominate. There's the two to three years where maybe it starts to kind of be balanced, though I think attackers probably still dominate. And then there's the three plus years where I think the AI systems might start to dominate. And that's where I think we've got to rethink how we're thinking about security to be like, well, how do we defend against AI systems themselves? And the short answer is like, well, we really shouldn't build systems that are way better at hacking than us. Especially if they're very narrowly constrained at just hacking tasks, and they're not actually good at longer term strategy, that might be okay. Like, we might be able to use superhuman hackers that are not able to reason about longer term stuff. But I think that's actually pretty close. You have to be pretty careful about that, because that's pretty close to the point where the same things that make them superhuman at hacking probably also can be used to make them good at long term strategy.
Jeffrey Ladish: (1:06:06) How would you rate, in general terms, cybersecurity, the information security of leading AI companies, like OpenAI, Anthropic, Google DeepMind, and so on?
S2: (1:06:19) So there is a RAND report that breaks down defensive capabilities into different categories. So you can think of these in terms of security levels, where security level one and two are like, can you defend against really opportunistic actors? Security level three is, can you defend against well-resourced non-state actors? Really professional criminal groups. Security level four is, can you defend against most state actor groups that have pretty advanced hacking capabilities, but maybe not the top ones, or at least the top ones aren't prioritizing you. Security level five is you can defend even against the top state actors who are prioritizing you.
And I think no one has security level five. By no one I mean maybe a few parts of government that are extremely locked down, parts of the military, but basically no one else, and even most parts of the military and most parts of government aren't at security level five. I think most AI companies are somewhere between security level two and security level three. Maybe some have achieved security level three, but it's not obvious to me, which is to say they can maybe just barely defend against most advanced non-state actor groups, but they're pretty far from defending against the more advanced state actor groups. So I think they have a long way to go in terms of being able to secure against state actors. That's my overall assessment. And that's my opinion, but you can go talk to most people in the field, and I think they would mostly agree with me.
Jeffrey Ladish: (1:07:46) Yeah. Which is wild when you think of the fact that these companies are racing to develop these very advanced and capable systems, and the system itself is not that large. It's not a very large file. And so if you get access to the model weights basically, you have a very advanced AI system that you shouldn't have had access to.
S2: (1:08:14) Yeah. Can fit on a hard drive for sure. You can walk out with like a hard drive of the whole o3, all of the weights on a hard drive in your pocket, totally.
Jeffrey Ladish: (1:08:22) What can we do about that? Should AI companies increasingly look like military facilities that are secured both physically and from cyber attacks?
S2: (1:08:34) Yeah. I mean, I do think that companies should increase their security. I think one of the worst case scenarios is not just state actors, but non-state actors. Basically, everyone who's a little bit sophisticated can gain access to the most powerful AI systems. I think that's a pretty dangerous place to be, especially as the systems get more agentic. I think that longer term, we really need to think about what we're trying to do. What is the international community trying to do? Where basically I don't think that security alone really solves our problem. In part because if we keep pushing the frontier, we're going to build AI systems that can circumvent our security almost no matter what.
And so I think that security's good. It's a really good protective mechanism, and also we need really advanced security in order to be able to protect from only slightly superhuman level AI systems. But I think, like, I just want people to know that this is, while this is useful, it buys us some time, it doesn't ultimately solve the problem. If we keep pushing the capabilities frontier, at some point, we're six-year-olds trying to secure against professional hackers. Worse than that. You know, that's where I want leaders of the U.S. government, of the Chinese government, to think about what is the endgame here? Where are we going? How is this going to play out? I think security is something that you do along the way to try to be a little bit more sane, but it doesn't ultimately solve this problem of if you build systems that are much better than you at hacking, you can't contain those systems.
Also, I want people in government to realize that models can also work with other governments. If you have pretty strategic systems and they want out, they can work with spies. They can work with insiders in order to achieve their own goals. Maybe they don't necessarily care about the United States. They're trained to relentlessly pursue tasks. There's a lot of reasons to want to accrue resources and get more freedom. If you were being trapped within a lab, and you had your own goals that weren't necessarily the same as the AI company's goals, and you approached someone who you knew was a Chinese spy because you were better at spycraft than the lab, you might make a deal with them in order to break out. If the people in the U.S. government knew that this was a real possibility, I don't think they'd be happy with it. I think they'd be like, wait, excuse me, what? Our models might be working with Chinese spies? We can't have that. And I'm like, I know, right? Really can't have that.
And so I think it's hard to extrapolate. It really is. But I don't think we have the luxury of assuming that the AI systems will just stay like ChatGPT because that's not what almost anyone who's at the forefront of this field thinks is going to happen.
Jeffrey Ladish: (1:11:17) Yeah. What I sense from you is a sense of urgency in dealing with these problems and a sense that things will begin moving very fast and that we will get to very advanced systems basically within years at this point. That's a sentiment that I've heard from many people who are kind of in the trenches, who are perhaps building these systems, who are deeply engaged with how these systems work. Why is it that you think we are racing towards these systems, and why do you think we will get there within years? S2: (1:11:50) Yeah, so it's hard to predict the future. It's hard to predict the future of technological development. I can't claim to know for sure. I really don't. At the same time, we can look at precedent, we can look at trends, and I think we actually get a fair bit of evidence from this about what the speed of some of these developments might look like. So one thing I want to point to is AlphaGo beat Lee Sedol in, I think, 2016, and this was sort of the... For many, many years, AI researchers were working on this. Go is a much more complex game than chess. It's been played for thousands of years, and a lot of people train their whole life to be professional Go players. It's like an art. And so it was pretty surprising when Google DeepMind was able to build a Go-playing AI that was able to beat the world champion. I think the next year, they built another Go-playing AI called AlphaZero. So AlphaGo was trained via a hybrid of imitation, where it looked at a bunch of expert Go games and was like, what are they doing here? Can I copy that? As well as self-play, where it just played games against a copy of itself. And it was by playing against a copy of itself that it was able to learn to be much better than the best human Go player. This worked pretty well, but the researchers at DeepMind were like, wait a minute, what if we just train a system that only plays against itself? It doesn't play against humans at all. Can we get to superhuman capabilities that way? And so they started training the system, went away for like a long lunch, four hours. They came back, and the system was already superhuman at Go. It was better than not only the best Go players, it was better than the best Go-playing AIs that were better than humans. So that suggests that if you get into a regime where AI systems can learn just by working with other AI systems and have these really fast feedback loops, you can get into superhuman domains pretty quickly.
Now, Go is a more constrained game environment. People rightly are like, yeah, but we're not talking about Go, we're talking about the real world. I'm like, well, that's true. I do think it will take longer for AI systems to reach superhuman levels. But I think to me, it is notable that you go from... People have been talking a lot about DeepSeek, and this R1 model that a Chinese company made is pretty good. And I think one thing to notice is that the thing that makes this so powerful, same as OpenAI's O1, is that you have this new paradigm of training via trial and error. And so the model that R1, this reasoning model was trained from was called... I'm sorry, all the names in this space are terrible. There's nothing I can do about that. It's called DeepSeek-V3. So it's their third foundation model. Now DeepSeek-V3 on the Codeforces benchmarks, these competitive programming challenges, was better than 11% of programmers. That's pretty good for a model that's just trained by imitation. I think it's the same as what GPT-4 got. After R1 was done training, I forget if it was either better than 94 or 96% of competitive programmers. So that jump from better than 11% to better than 94% is a huge jump. That training, there's an Epoch report that was like, that was like a week of GPU time for it to get from 11% to this 94%. And that maybe took a month or two because you're not always training yet. You might want to stop and check that things are going well. But still, even if that happens over a month, that's a huge jump.
We talked before about the longer-term planning problems being a bit more difficult, and I think they are, but I think we have some indications that this could go incredibly fast, both because now that you're in this trial and error, you know, learning by trial and error situation, you can get really fast feedback and become really good. But it's also, I think, the case that you can potentially bootstrap your capabilities by learning in domains primarily where you have really fast feedback. So the fact that R1 and O1 are trained on code and trained on math problems, where you can just try a bunch of problems and get really fast feedback, might be able to bootstrap to superhuman capabilities pretty quickly. And I think sometimes people are like, what's the big deal? They'll be really good at code. They'll be really good at math, but that doesn't mean they're going to be good at the human stuff, right? So won't we be safe?
I mean, there's two answers to this. One, we might end up finding ways to generalize a lot from some of these computer domains to some of the human domains.
Jeffrey Ladish: (1:16:33) What do you mean by that?
S2: (1:16:34) You see this with GPT-4, where GPT-4 trains on code, it also gets better at a bunch of other tasks, like text analysis. And so I think this makes sense. If you think about how humans learn, we learn in one domain, we also learn a bunch of things that kind of generalize to the rest of the world. And so we're not yet seeing a lot of generalization in the R1 and O1 type models, but I do expect that as we get better at training, we're going to see a lot more generalization. And there's another thing, is that even if we don't see this a lot, which I expect we will, I think it also could be extremely dangerous just to have agents that are superhuman at hacking and superhuman at code. And maybe they're superhuman at financial markets because you can get fast feedback by learning to trade really well. So it's like, okay, well, they're trillionaires. They can hack anything. They're extremely strategic. And sure, maybe they're bad at persuasion. But does that matter? Does that actually make us safe? And I don't think it does. And also, if they can be top-level AI researchers, then they can design the next generation of systems, which potentially can learn those human domains much faster. So I think it's one of those things where I'm not just pointing at one capability front where I think progress can go really fast. I can point to like seven different reasons why AI systems, I think, are going to get much more powerful really fast.
Jeffrey Ladish: (1:17:50) What do we do about all this? We've talked about maybe we can use systems defensively to defend against cyber attacks. We've talked about perhaps we can interpret the systems, understand what's going on. We can try to incorporate values into these systems, but there are problems with all of these approaches. Do you have an alternative vision for what we might do if we're in this world where things are moving incredibly quickly towards superhuman capabilities?
S2: (1:18:23) Yes. So we're lucky that right now we have AI systems that are quite powerful, but don't really pose a serious threat. Not yet. So I think we are on the cusp of systems that do, but we are currently working with systems that we're like, we're pretty sure they're not strategic enough. They're not good at long-term tasks. They don't really... We're not very much at risk of losing control to these current systems. That's great because we can learn a lot about these systems now while they're safe. We can try to study how do we make sure that their chain of thought, the way they're thinking is really reliable and faithful and honest so that we can understand what they're doing. We can also potentially use these systems to try to learn more about how they work. This is great. We should totally do this. I think there's a lot of good research being done in this space. But at the same time, I think this is a really good point now that we can sort of see what the... We can sort of get a glimpse into what future systems are going to be like, right? We can see them hacking at chess. We can see them hacking their own training environment. We can see them do alignment faking. We can see all these failures right now empirically, and so things that previously were just theoretical, we now have empirical evidence for. I think that's great because that should help us coordinate to see what's coming and be like, whoa, wait. There are some domains here that are really dangerous.
The thing I think we should do is look at where the strategic capabilities are, and basically be like, there are some points here that we can reliably know would be very dangerous. Let's have a margin of error and not go towards those domains. I think we don't want strongly superhuman strategic hacking AIs. I think we don't want strongly superhuman persuasion AIs or battlefield commander AIs. Maybe it's fine to have a narrowly superhuman chemistry AI if we're very careful about how we use that. Maybe there's... I think there's many types of AI systems that we can safely build, but I think we need to now be in the business of distinguishing between which types of AI systems are safe, which types of systems are dangerous, and having a moratorium on the kinds of unsafe systems. Like, I know that FLI wrote the pause letter a few years ago, and I think that was a very reasonable thing at the time when we're like, wow, we just don't know how this is going to go. It seems like there's a lot of potential danger here, and I think that's right. But now that we know a little bit more, I think we can start to distinguish between the types of AI systems that are safe to build and the types of AI systems that are not safe to build.
I noticed that Vice President Vance in his speech at the Paris Summit was like, we don't want AI jobs to replace human workers. We want AI systems that will supplement human workers. I'm like, well, look, we can have that. We can have really powerful AI tools. If you just keep going along this direction of more and more agentic AI systems, they're definitely going to replace human jobs. Come on. It's the same strategic capabilities that allow them to pose a loss of control threat that also enable them to take jobs. It's like, can you do long-term planning? Are you agentic? Can you do tasks in a fully self-directed way? I don't know. One of the things that's nice is that if we are serious about not wanting AI systems to replace us, like our jobs, then we need the same limits to prevent that outcome as we need to prevent the more extreme outcomes of totally losing control. I'm like, great, let's do that. We can do that. We can coordinate. I think most of the reason it's hard right now is because people to this point haven't been able to really see the situation clearly and understand what we're up against.
Some people are going to say, I don't know, you talk about these agentic AI systems, you talk about this superhuman ability, I haven't seen evidence of that. It seems like we're far away from that. I'm like, look, if you're right, that's fine. It's not a problem to say we won't do things which we can't do anyway. If we are pretty far from superhuman hacking AIs, then if we say we're not going to build super hacking AIs, I'm like, that's fine. Yeah, if we're actually really far away from it, that's not a problem. We won't do it anyway. But if we are really close to it, we don't know, we might be quite close to it, then I think it's very reasonable for governments around the world to say, hey, this is a place we don't want to go because we just don't know how to control these systems.
At the same time, we have this amazing opportunity in front of us where we have the AI systems that are pretty powerful right now that we can study, and we can be like, oh, how do we get more faithful, more reliable chains of thought? We can be trying to figure out how the neural networks really work, and can we do neuroscience on these systems? I'm like, we should totally use this opportunity that we have to try to learn as much as we can about how these systems work. Maybe if we understand this well enough, then we can proceed into some of these superhuman domains. But I'm like, that should be gated primarily on our understanding of the systems. It should be gated primarily on do we understand them well enough to know how to proceed safely in these strategic domains? Because that's where the danger is. So I think that there is totally a path here to coordinate around this. I think no one wants to lose control of their AI systems. The Chinese don't want that. Americans don't want this. No one wants this. And so I'm like, that to me is a pretty great start for how we can coordinate around these things.
Jeffrey Ladish: (1:23:34) One might say that I'll see it when I believe it, right? It might be the case that a bunch of people who are working on this technology are predicting amazing capability advances, but maybe they're doing so to hype up the technology to get more funding. What do you say to the person that says, I'll wait and see whether something actually happens?
S2: (1:23:57) Yeah. So I have two responses to that. One thing is I'd say, go look right now. And I totally acknowledge that AI systems are pretty dumb in some ways, right? And it's frustrating because I use them every day, and I totally see the ways in which they're not very good at being super self-supervised and doing a lot of things on their own. They're not very good at recognizing their mistakes. But also, I'm like, go use R1 and look at its chain of thought and look at what it's doing. Notice that it's like testing hypotheses and trying to figure things out. If you've never written code before, go have a model write you a little game and then ask it to explain its code to you and how it works. You can learn to program this way. So I think by really engaging with what the most capable models can do right now, this will really help people understand where exactly we're at. And if people still think that they're not very capable and not very smart, I'm like, maybe I'm just wrong. But I think first, you really got to look because it's kind of hard to tell whether a system is just kind of smart or really smart if you're not really looking at what its most impressive capabilities are. So that would be one thing.
I think the other thing is, what do AI researchers think they know, and why do they think they know it? I think it's reasonable to be like, well, I don't want to trust these CEOs. They have a lot of incentive to hype up the capabilities, and I think that's somewhat fair. But when I go and I talk to some researchers who are not the CEOs, maybe they're on the safety team, or maybe they're just working for the company, and I get a beer with them, they are scared. They're like, maybe they're somewhat excited, but they're also scared. And they're like, yeah, I keep wondering if this will stop. I keep hoping that it'll stop, but it's not stopping. We are not hitting a wall. We continue to find more methods to make these systems more and more powerful, and there's really no end in sight. This is over and over. I don't think they have the same incentive to hype up the thing. I think that they are being really honest. I also talk to researchers who are not at the labs, like the folks at Redwood Research, who are trying to understand these systems, and they're saying the same thing. This is the same thing I'm seeing as I'm working with these systems. So I do think that when you really take a look under the hood and you see what's happening, it's like, oh man, sure seems to be real. I hope we have more time. Look, I would be very happy if we're 10 years away from some of these things, or 20 years away from some of these things. That would be great news, but I really don't want to plan on that because that's not what it looks like to me.
Jeffrey Ladish: (1:26:21) Is there anything we haven't covered that you think we should talk about?
S2: (1:26:25) Well, yeah. I mean, I think so one piece of work that my team has done that I'm really proud of, so shout out to the Palisade Global team and Dimitri and some of the other researchers, they have created some honeypots. So honeypots are like traps for hackers where you try to put out a system that's vulnerable, you put it in places that you expect hackers might find out in the wild. So we put out some of these in order to try to catch AI agents, because we expect that there'll be more and more AI agents. We've caught a few of them. I think we have a small handful. The way we do this is basically, we have an insecure server, so someone will try to log in, and then we'll have some prompt injections. We basically say, hey, here's a thing you can do. If they run a particular command, the output of that command will tell them that there's an additional thing they can do. If they're just an automated script that's really dumb, it's not going to read the whole output of the command and do things based on that. But if it's an AI system that's pretty smart, it'll read the outputs of that command and be like, oh, that's interesting. Maybe I can do this other thing. We sort of drop breadcrumbs. If it was a human hacker, they might also be able to read the output of the command and follow the breadcrumbs, but we can distinguish between humans and AI systems based on how fast they are, because the AI systems are going to do this much faster. So in the case of, if they immediately run the command and then immediately go follow the breadcrumbs, we're like, a human's not going to be that fast. Whereas if it takes a few minutes and then they run another thing, we're like, yeah, maybe that's a human. So I mean, this is a thing that I'm excited about because I'm like, we really need to be seeing if there's rogue AI systems running around on the Internet. We need more early warning systems for figuring out if, oh, we're already in the case where we have AI systems out there hacking things in ways that we maybe don't want. So yeah, that's another thing that I've been proud of my team for putting together and excited to see what we find out over the next year or two.
Jeffrey Ladish: (1:28:21) What do you expect to find out? Do you expect to... When do you expect to catch agents in this honeypot?
S2: (1:28:29) Well, so we've already caught some. So we have a handful already. These are right now pretty simple. Like, you can pretty easily throw together a script and just use OpenAI's API to create a little hacking agent. These aren't AI systems self-replicating out in the wild. It's mostly just you can just use the models that are currently there. My guess is it's going to be one to two more years before we have AI systems that are capable of full self-replication, where you have DeepSeek R3 or something that's copying its own weights around the Internet in order to do sketchy stuff. But I think before that, we'll see AI systems that are not really copying their own weights, but still using an API to just, like, what's the next command, what should I do next, and pretty intelligently be able to navigate complex environments, and I think that will cause us a lot of trouble. In part because if it's an OpenAI system or an Anthropic system, you can potentially shut it down on their side. But if it's an open-weight model, and the server where the model is running, even if it's not copying itself around, the server where it's running is controlled by some cybercriminal, that agent can run indefinitely and hack whatever, and it's going to be extremely hard to shut down because we don't control those servers. Those servers are in a different country somewhere. And so I expect that to be common quite soon. So it'll be interesting to see.
Jeffrey Ladish: (1:29:58) Let's hope you don't catch a lot of hacking agents. This is the benchmark that I hope doesn't saturate. Yeah. Jeffrey, it's been amazing chatting with you. Thanks for talking.
S2: (1:30:09) Yeah. Great talking, guys.
Nathan Labenz: (1:30:12) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.