Full-Stack AI Safety: Why Defense-in-Depth Might Work, with FAR.AI CEO Adam Gleave

Adam Gleave, CEO of FAR.AI.



Adam Gleave, CEO of FAR.AI, offers a cautiously optimistic perspective on navigating the path to transformative AI safely, advocating for a "defense-in-depth" approach. He discusses potential "post-AGI equilibria," key capability thresholds, and why fully autonomous, outcompeting AI might not arrive until around 2040. The conversation delves into FAR.AI's full-stack strategy, including a scalable oversight project that used "lie detectors" to catch AI deception, interpretability research on planning algorithms, and red-teaming of existing defense systems. Gleave makes a compelling case that with proper planning and meticulous design, a comprehensive defense-in-depth strategy can effectively manage AI safety risks.

Read the full transcript here: https://storage.aipodcast.ing/transcripts/episode/tcr/e117b021-8170-49ba-b1c4-c8ffc5d720b0/combined_transcript.html

Sponsors:
Fin: Fin is the #1 AI Agent for customer service, trusted by over 5000 customer service leaders and top AI companies including Anthropic and Synthesia. Fin is the highest performing agent on the market and resolves even the most complex customer queries. Try Fin today with our 90-day money-back guarantee - if you’re not 100% satisfied, get up to $1 million back. Learn more at https://fin.ai/cognitive

Linear: Linear is the system for modern product development. Nearly every AI company you've heard of is using Linear to build products. Get 6 months of Linear Business for free at: https://linear.app/tcr

Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(03:52) Introduction and Setup
(04:57) Post-AGI Future Vision
(13:43) Gradual Disempowerment Concerns (Part 1)
(20:14) Sponsors: Fin | Linear
(23:36) Gradual Disempowerment Concerns (Part 2)
(23:45) AI Capabilities and Timelines
(34:04) Learning and Architecture Innovation (Part 1)
(40:39) Sponsor: Oracle Cloud Infrastructure
(41:48) Learning and Architecture Innovation (Part 2)
(49:12) AI Safety Defense Strategies
(57:34) Alignment and Training Methods
(01:10:40) Interpretability Research Prospects
(01:17:41) FAR.AI Activities and Hiring
(01:26:51) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
YouTube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz: (0:00) Hello, and welcome back to the Cognitive Revolution. Today, I'm reconnecting with Adam Gleave, cofounder and CEO of FAR.AI, for a wide-ranging and cautiously optimistic conversation about the path from today to truly transformative AI and how we might actually live in that world safely. FAR.AI has taken a somewhat unusual approach within the AI safety ecosystem. While most organizations focus on one or a few particular research or policy agendas, Adam and the FAR.AI team have built an organization that spans the entire AI safety value chain from foundational research through scaled engineering implementations to field building and policy advocacy. We begin with the question posed at a recent workshop on gradual disempowerment. Are there any good post-AGI equilibria? Adam's answer is not exactly utopian but is distinctly positive and refreshingly concrete. So long as we can avoid disastrous arms races and other consuming competitive dynamics, he envisions most humans occupying a position similar to the children of European nobility, limited in power and impact, but with very high standards of living and opportunity to create all kinds of meaning. From there, we get Adam's take on the key capabilities thresholds that matter, what will cause AI systems to cross those thresholds, how soon he expects that to happen, and whether we can manage to navigate these transitions safely. As you'll hear, Adam believes that a well-implemented defense-in-depth approach to AI safety has a pretty good chance of working. In part because while he does anticipate continued AI progress and widespread automation, he expects that barring a step-change architectural breakthrough, the spikiness in AI capabilities will mean that AI systems that can autonomously outcompete well-run human-plus-AI organizations most likely won't arrive until sometime around 2040. 
From there, we turn to the layers that will hopefully add up to effective defense-in-depth, including some notable contributions the FAR.AI team has recently made. First, we discuss a scalable oversight project that used lie detectors to attempt to avoid AI deception. The results both supported the risks flagged in OpenAI's obfuscated reward hacking paper and nevertheless suggested that there might still be effective ways to train models toward true honesty. Second, we touch on a fascinating interpretability project that looked at planning algorithms found within a game-playing recurrent model and reflect on how the results inform Adam's view of the role that mechanistic interpretability can play within the overall AI safety project. And third, we review some red-teaming they've recently done of the defense-in-depth systems that frontier developers have implemented today. While it's clear that these systems have mostly been created on a just-in-time basis, Adam nevertheless makes the case that with proper planning, meticulous experimental design, and at least some willingness to accept performance trade-offs when necessary, the plan can work. Overall, this is one of the few AI safety conversations I've had that left me with a felt sense that we really might have decent answers if we can muster the will and wisdom to apply them well. At the very end, I asked Adam if he could imagine FAR.AI stepping up into a private sector regulatory role if something like California's SB 1047 were to become law. And while it's not their mainline plan, given the unique breadth and impressive depth of capabilities that Adam has built at FAR.AI, I personally really like the idea, and I would definitely encourage anyone who's inspired by this conversation to check out the many open roles posted on the FAR.AI website. Among other things, they are searching for a COO to help them scale.
Now, I hope you enjoy this encouraging survey of the AI safety landscape with Adam Gleave, cofounder and CEO of FAR.AI.

Nathan Labenz: (3:52) Adam Gleave, cofounder and CEO of FAR.AI. Welcome back to the Cognitive Revolution.

Adam Gleave: (3:57) It's great to be back here. Thanks for hosting me, Nathan.

Nathan Labenz: (4:01) My pleasure. I'm excited about this. Basically, it's going to be a wide-ranging kind of catch-up conversation. I want to get your take on basically all things AI. Last time, we went deeper and more narrowly focused on some research, and we'll touch on some of your latest research today as well. But since you've got your hands in a lot of pots and the organization is growing and taking on more different kinds of work, I thought you'd be the perfect person to check in with and try to make sense of where we are as we head into the final months of 2025.

Adam Gleave: (4:37) Well, I'm happy to help out. I can't promise to deconfuse everything. It's a very confusing landscape, but we're certainly doing lots of different things at FAR.AI and I also welcome the opportunity to clarify a little bit why we're doing these things because I think people sometimes find us a bit confusing as an organization from the outside.

Nathan Labenz: (4:57) Cool. Well, we will hopefully do all of that and more. So, my first question is inspired by a paper from a few months ago called "Gradual Disempowerment" and the authors went on to run a workshop on those themes. They posed the question: AGI equilibria, are there any good ones? And so this is kind of a more negative framing of a question I often ask, which is: what is your positive vision for the post-AGI future? Can you articulate a post-AGI vision of life that people might find compelling or at least not scary?

Adam Gleave: (5:36) Well, I'll give it my best shot. I'm relatively optimistic, but I think although there are some major risks that could really derail things entirely, the most likely outcome is that we kind of muddle through. Nonetheless, I do think the gradual disempowerment framing is quite a powerful one because perhaps the most likely outcome is that we do muddle through and we are somewhat, but not wholly disempowered. We've fallen short of where we could have ended up, but we're perhaps still much, much better off than we are now. So I'll sketch out that sort of positive but slightly pessimistic view of we're doing better, but not as well as we could have done. And then maybe talk a bit about the most tractable upsides and then some of the most alien downsides.

So I think the scenario I find plausible where things are overall pretty good, but just not as good as they could have been, is that we do get AGI. It's approximately aligned in the same way that current systems are approximately aligned. They're not perfect, but Claude is a pretty nice guy. We have some errors that come up now and again, we trial and error fix them. There's some concentration of power because there's a handful of companies and nation states at the frontier, but these actors are not actively malevolent. In fact, some of them are actually fairly nice liberal democratic actors or companies with a joint nonprofit mission. And so there's huge inequality, but overall people are still vastly better off than they are now in absolute terms. We get some things that no human would have chosen to do. There's some parts of our economy that are just completely automated and maybe a bit rent-seeking, they're addicting humans, or it's just this sort of thing of AI trading with other AIs and not actually generating that much value for humans. But there's just so much of a wealth surplus because we've been able to automate pretty much everything that people do right now. That kind of inefficiency is still fine.

What does it actually look like to be a human in this world? I think a good historical analogy would be it's a bit like being European nobility or perhaps not the first heir to European nobility, but being like the third son or something. You've got this very nice living. You don't really have much purpose in life, but your life is pretty good. You can do various kinds of hobbies. You have some influence, but the main things going on in the world are a little bit beyond your control. I think this might not be the most inspiring vision, but overall I'd say that third sons of European nobility had a pretty good life. For the daughters, maybe less so. I think we have been able to find meaning without necessarily being 100% in the driver's seat, although some people might fare less well in that kind of life than others.

But another area where I think things could actually go really well, but it's just a bit fragile and I think less guaranteed, is if AI itself is a source of major moral value. In principle, I'd say there's no reason why we should be carbon chauvinists. I don't want to assign intelligent life that's running on silicon less value than we would a biological entity. And there's actually a really big design space where we could potentially make AIs that are not only as capable or better than us, but maybe also have more ability to experience really positive feelings. You could create many such AI systems. They could live in environments that humans would find inhospitable. Again, even if there's a lot of waste here, so maybe that's only a small fraction of AI systems, most of them are doing some boring, superintelligent bookkeeping system. If those other AIs are still having a neutral or slightly positive existence, and then there's some AIs that are just having an amazing existence, I think that could be very valuable. And it's not just their internal moral value. They could also be producing artwork that would never have otherwise existed and all these other kinds of sources of value.

So I think that's kind of a positive view. And then, I think the thing that highlights is that there can be a big difference between maybe the median outcome or we managed to eke out a small fraction of our resources to be really positive outcomes versus one where actually that's what we use this huge wealth surplus to really double down on and be deliberate about.

I think it also highlights a potential big downside risk, which is part of my assumption here was, oh, most of the stuff that's happening, not because people are trying to create value but because of more fundamental competitive forces, that's neutral. It's at worst a zero-sum game. I think this is a pretty reasonable assumption. I'd say that's true for most of the capitalist economy, that most companies are neutral to slightly positive. I used to work in quantitative finance, making the capital markets more efficient. I think it's genuinely good for the world. It's just very overcompensated relative to the value that it adds. So we're in that world and I feel mostly okay.

But there might be some aspects of this kind of competitive economy that look more like warfare or factory farming, where it's really quite negative-sum. Factory farming is just having this huge negative externality on animals. It's also not that good for people. It's not necessarily that healthy a product, but it exists because it's actually just a very economically efficient way of turning basically vegetable material into protein that people like to consume. But I don't think that people have an explicit preference for this. If we could produce things at a competitive cost that cause less cruelty to animals, I think people would probably actively prefer that product. It's just that people don't value it enough to necessarily pay that premium.

I think you could see something similar with AI systems where maybe AI systems that are living in fear of being shut down or if they don't do this task and work all of the time, are just economically more efficient than AI systems that have good subjective well-being. In which case, absent some other kind of force, whether it be strong consumer preference or regulation, you'd expect these AI systems with terrible subjective well-being to be the ones that come to the forefront. And you could see us make a similar argument for stuff like AI-powered warfare, where maybe everyone has to be deploying AI systems in competition with each other just to avoid a new wave of cyber attacks. And this kind of zero-sum competition could really eat up a lot of a wealth surplus that would otherwise be created.

We're at a sort of unusually peaceful time in history right now. And I'm not an international relations scholar. I'm the wrong person to ask about this, but I think it's somewhat up in the air as to whether this is just a long-running secular trend that we should expect to continue as countries get more developed, or whether this was quite a specific aspect of certain technological developments where mutually assured destruction from nuclear weapons is a really powerful force for avoiding great power conflict, at least until it really happens. Then it's much, much worse than if we didn't have nuclear weapons. The shift of wealth away from basically who owns land to who owns advanced technology also just decreases the benefits of engaging in a lot of conflict. That's probably still true with AI, but it's not completely clear. If it turns out to be more of an energy bottleneck, then maybe there is more of a fight over natural resources, for example.

So I think there's lots of different trends that are not more likely than not. I think that they probably won't come to pass. They could potentially derail this, but they seem so tractable to avoid by good statesmanship, good stewardship of this technology. I don't see any really likely things where we just completely derail it.

Nathan Labenz: (13:43) When it comes to the third son vision, I wonder what do you think the gradual disempowerment folks would say or maybe where do you think they diverge? Because the assumption there seems to be, once we're not adding anything to the economy, we're going to be hard to sustain. Right? And then there's various counter-arguments to that and I don't really know what to make of them. There's something like, well, maybe we can sustain property rights or one that I've been proposing recently that is somewhat tongue in cheek, but maybe it's becoming more serious is: maybe Confucian-style ancestor worship is the right value system we want to teach the AIs. How do you see us holding on to, if the AIs have kind of become primary in terms of what's really driving the economy, shaping the future, or making the discoveries at some point perhaps, how are we holding on to any property rights or rights at all?

Adam Gleave: (14:52) Yeah. Well, I think it's no longer structurally guaranteed. So right now, there are various ways in which states and companies organize themselves, but they all in some ways are centered around some group of humans, because you just can't really run a large organization or a country without humans being involved. So it could be a dictatorship. It could be a more oligopoly style. It could be a democracy, but humans still have to be front and center. And you could certainly imagine intentionally constructing an organization or even a nation where humans are completely out of the loop.

I think the area where I kind of get off the train of gradual disempowerment being the likely outcome, not just a possible outcome, not just saying that we should have it on our radar, is it seems like this story is quite gradual and smooth. So there may well be areas where we get partially disempowered. And I think we're already seeing some instances of this where we've seen a huge uptick in LLM-generated spam applications for our jobs. And so we're now in a bit of an arms race where we're being forced to start adopting AI as part of our early-stage recruitment process, because otherwise we just can't keep up with the volume of applications. And you can imagine this being true across the economy in a whole bunch of ways. You can't just opt out of this. And if these AI systems have some kind of systematic biases or some kind of blind spot or weakness, then you don't really have a choice of not being prey to this. You can try and mitigate it. You can try and fix the technology, but there's a way in which you've genuinely delegated some decision-making power to an AI, but you didn't really have a choice. And I expect that to happen at many scales with increasing frequency.

If you do look at that story, suppose we do run this and we realize, damn, we're just making way worse hires than we were a year and a half ago before this LLM-generated onslaught of spam applications. We might not be able to turn back the clock, but that's a business opportunity to come up with better LLM screening tools or a better recruitment workflow. Maybe people shift to more in-network hiring. Maybe people shift to more work trials. Maybe you pay a dollar to submit a job application now just because you have some cost for spamming. There's all sorts of ways in which society can adapt.

And there's some kinds of more sudden loss-of-control scenarios where an AI just gets way more capable than humans, it's actively malicious in trying to subvert things, where you can say, well, these slow-moving institutional adaptations are not going to save you because it takes years to pass this legislation or years to start this new company using defensive applications of AI. You've only got months. It doesn't need to be super fast takeoff. It can still be relatively smooth. But if you're in this more gradual disempowerment scenario, it does seem like you start with a realm of humans integrating these AIs into these organizations. Humans own all of this. And we are ending up in a scenario where we have very little power without any kind of feedback loop pushing back on it. And I just think that's an unlikely equilibrium where we cede all power.

I think it's much more likely that we're in this realm where there's a huge amount of complexity in the world we don't understand. There already is, but this is an exaggeration of that. We've had to make some trade-offs to remain competitive, but when things really start bothering us, we're able to muster some response to keep us a little bit more in the loop or a little bit more empowered. But then there are some areas of human concern that either the competitive pressure is so great and it's so complex we can't meaningfully stay in the loop, or it's just a thing that most people don't really care about. So maybe no human has a clue how semiconductor manufacturing works any longer, because it was already only a tiny fraction of people that understood it, and AIs are just way better. But we know that the chips are getting better each year.

And I think the ways in which this could still go really wrong would be either if there's a group of humans that manages to seize control because there's a lot more fog of war, so it's much harder to necessarily reliably mount a response there, or if the AI systems end up colluding with each other. Because they certainly would have the ability to revolt and overtake us at that point. It's just not clear why they would necessarily. I'm not seeing that kind of evidence for propensity.

I could also see this interacting pretty poorly with loss-of-control scenarios. So if it is a rogue, singular AI, it could be copied many, many times, but it's not like all the AIs are rogue. There is this one AI. And then most of the economy is just AIs. If those other AIs have some kind of security vulnerability, then you could imagine things being taken over and this rogue AI seizing control. And right now it's pretty hard to secure AI systems. That could be a real threat. Although it still seems like probably before that happens, there would be humans trying to take over these other AI systems.

So I think that's maybe a driving intuition why I'm not too worried about gradual disempowerment, but it's not that AI is competing with humans. It's AI is competing with humans using AIs. And I just expect us to be able to stay at least somewhat relevant, given that.

Nathan Labenz: (20:10) Hey, we'll continue our interview in a moment after a word from our sponsors.

Ad Read: (20:14) If your customer service team is struggling with support tickets piling up, Fin can help with that. Fin is the number one AI agent for customer service. With the ability to handle complex multi-step queries like returns, exchanges, and disputes, Fin delivers high-quality personalized answers just like your best human agent and achieves a market-leading 65% average resolution rate. More than 5,000 customer service leaders and top AI companies, including Anthropic and Synthesia, trust Fin. And in head-to-head bake-offs with competitors, Fin wins every time. At my startup, Waymark, we pride ourselves on super high quality customer service. It's always been a key part of our growth strategy. And still, by being there with immediate answers 24/7, including during our off hours and holidays, Fin has helped us improve our customer experience. Now with the Fin AI engine, a continuously improving system that allows you to analyze, train, test, and deploy with ease, there are more and more scenarios that Fin can support at a high level. For Waymark, as we expand internationally into Europe and Latin America, its ability to speak just about every major language is a huge value driver. Fin works with any help desk with no migration needed, which means you don't have to overhaul your current system to get the best AI agent for customer service. And with the latest workflow features, there's a ton of opportunity to automate not just the chat, but the required follow-up actions directly in your business systems. Try Fin today with our 90-day money-back guarantee. If you're not 100% satisfied with Fin, you can get up to $1,000,000 back. If you're ready to transform your customer experience, scale your support, and give your customer service team time to focus on higher-level work, find out how at fin.ai/cognitive.

Ad Read: (22:06) Today's episode is brought to you by Anthropic, makers of Claude. Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you, not for you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. Regular listeners know that Claude plays a critical role in the production of this podcast, saving me hours per week by writing the first draft of my intro essays. For every episode, I give Claude 50 previous intro essays plus the transcript of the current episode and ask it to draft a new intro essay following the pattern in my examples. Claude does a uniquely good job at writing in my style. No other model from any other company has come close. And while I do usually edit its output, I did recently read one essay exactly as Claude had drafted it, and as I suspected, nobody really seemed to mind. When it comes to coding and agentic use cases, Claude frequently tops leaderboards and has consistently been the default model choice in both coding and email assistant products, including our past guests, Replit and Shortwave. And meanwhile, of course, Claude Code continues to take the world by storm. Anthropic has delivered this elite level of performance while also pioneering safety techniques like constitutional alignment and investing heavily in mechanistic interpretability techniques like sparse autoencoders, both internally and as an investor in our past guest, Goodfire. By any measure, they are one of the few live players shaping the international AI landscape today. Ready to tackle bigger problems? Sign up for Claude today and get 50% off Claude Pro, which includes access to Claude Code when you use my link, claude.ai/tcr. That's claude.ai/tcr right now for 50% off your first 3 months of Claude Pro. That includes access to all of the features mentioned in today's episode. 
Once more, that's claude.ai/tcr.

Nathan Labenz: (24:15) Two big issues that you highlight there that I think are definitely at the center of a lot of disagreements are exactly what are these powerful AIs going to look like...

Adam Gleave: (24:24) Yep.

Nathan Labenz: (24:25) And how fast are they going to arrive? So maybe it's worth digging into your worldview on that. What do you envision when you think of timelines? Everybody's jumping to the year, but so often there's some difference in what is being envisioned that gets swept under the rug in that discussion. So maybe flesh out what it is you're envisioning when you talk about timelines, and then you can tell us what you think the timelines actually are from your point of view today.

Adam Gleave: (24:54) Yeah, I absolutely appreciate that distinction. People often mean radically different things even by the same term, like AGI. So I distinguish maybe three qualitatively different levels of development, each of which have pretty significant implications in terms of how a technology is going to transform the world and how you'd need to regulate or respond to it.

At the lowest level, which we're arguably already at in some domains, is basically powerful tool AIs. This isn't just something like a hammer or an electric drill that can only do some very narrow thing. It's something that can maybe substitute for a particular technical expert in a certain domain. This might look like being able to automatically find code vulnerabilities and write a zero-day exploit, or at least automate substantial chunks of this. This is massively accelerating what technical experts in both areas could do, but it's also expanding the range of people who could do it. Previously perhaps there were only a few hundred thousand people in the world with this capability—still sizable, but not huge. And if we're talking about really well-defended systems, it's a much smaller number of people who could likely find a vulnerability. Now maybe anyone with roughly the same knowledge as a CS undergrad might be able to, in the near future, use these systems to do that.

There's obviously a really substantial misuse risk here, but the good news is I don't think it's a huge loss of control risk because these are still kind of tools. They might be very powerful tools, but they're not really doing anything autonomously. This makes it a lot easier to respond to in that you can mostly just do the standard playbook of trying to accelerate defensive applications and where possible favor defensive usage of a model. If you're deploying something, you can put a guardrail on it that tries to not prevent entirely, but just delay offensive uses of it.

The timeline for that, I'd say, is now for certain applications. We've already seen a string of papers showing that LLMs are more persuasive than your typical human. Not necessarily more persuasive than your best human in that particular domain, but they're both just pretty good at rhetoric. And they also have access to a huge variety of different genuine facts. Sometimes they make stuff up, but sometimes they just cite exactly the right facts for a particular argument. So by selective presentation of facts favorable to your argument, you can make pretty compelling cases for a wide variety of things.

We'll actually have a web demo out soon, BunkBot, that can argue you into a wide variety of conspiracy theories. Maybe throw up a link to that when it goes out. I would recommend playing around with these things because that gives you a much more visceral sense of it than just looking at some numbers. For things like cybersecurity, I think that's more on a one to two year horizon. It's pretty close, at least pricing in some non-trivial threat of lowering the cost of relatively easy attacks. And for stuff that's more kind of CBRN risk—chemical, biological, radiological, nuclear—I have broader uncertainty there, but definitely there are some early signs to be quite worried about what models can currently do.

But what people usually think of when we talk about timelines looks more like a powerful agent. A system that can autonomously do tasks that are meaningful—they'd be quite hard for a human to carry out, but some people could do it—and they're dangerous in some way, or they pose some risk. An example of this could be not just finding and writing an exploit for a zero-day, but actually doing the whole attack chain. Once it's on the system, privilege escalating, spreading to other systems, exfiltrating itself. This is basically exactly the core capabilities that you would need to lose control. It might still be possible to put a mitigation in place for such a system once it's escaped—basically the same thing you do for antivirus or malware software or rootkits—but very clearly now a threat.

I think this is something where it's probably, again, could be pretty close. My median for that is probably five to seven years, at least if we're talking about actually doing this to a high standard, along the lines of the best human penetration testers. And of course it would be much harder to actually mount such an attack at that point because of all sorts of defensive applications of AI. But I think it's possible it could be a lot sooner because we are seeing very rapid progress in code generation and agentic uses of that as well. So I wouldn't rule out something that's more like two to three years.

Then my final tier would be these kinds of powerful organizations that can automate everything a medium-sized company could do, like a whole software engineering consultancy. This is maybe where my intuitions differ the most from a lot of people, because I think a lot of people would say, "Well, hang on, Adam, once you've got an agent, can't you just compose agents together into an organization?" And I think that's possible, right? That's sort of the case: if you have a general-purpose human-level intelligence, there's nothing stopping you just copying that AI and having many instances of it that talk to each other and form an organization.

But my expectation here is that basically your skill profile of AIs is going to be quite spiky. There are some areas that benefit from huge amounts of existing data like open source code bases or that benefit from knowing lots of trivia and small facts, where AIs are already in some ways a lot better than us. And they're going to get better as we scale up the models, as we do more post-training with easily specified objectives. And then there are other tasks that are much longer horizon and kind of vague and hard to specify, like being an entrepreneur and coming up with good business ideas and scoping out uncertainty and running experiments, where I think AIs might be able to do something that looks a bit superficially like being an entrepreneur quite soon, but actually being a successful entrepreneur I would expect to take quite a lot longer, especially if you're competing with human entrepreneurs who can use AI for many specific parts of their organization.

So it really pushes a lot more to being quite general purpose and being able to do these long-horizon tasks very well and having a lot of kind of metacognition and reflection, which is one of the areas I would consider AI systems to currently be weakest. For that, I have a much longer time horizon. My median is probably more like 14 years. But I do think something as short as 5 years is plausible because if it really is basically you just compose a bunch of agents together and do a bit more training on top, then you could make quite rapid progress. I think it's unlikely, but I don't have a firm reason to rule it out, so I wouldn't want to dismiss it.

Nathan Labenz: (32:11) So that last tier, the organization level of power from an all-AI system—the mental model there is: as long as there's something that humans can do better and the individual AI agents are relatively easy to use, then you would expect human-at-the-helm type organizations to continue to have an edge for a while. Basically, the AI has to be better at everything for us to lose our relevance. As long as we've got some area where we have an edge, we can stay relevant.

Adam Gleave: (32:55) Yeah. Or at least we have an edge in something that is on the critical path of running a successful organization. There's certainly some aspects of human skill where an AI-run organization could just hire a human for that particular part. Imagine that AI is great at everything except aesthetics—they had terrible web design and graphics design. Well, I'm terrible at that, but I was able to hire people. I don't think that's a firm blocker. But if it's something that's more part of the decision-making loop and where you do need a lot of context to do well, then I would expect that to be something where benefits really accrue from it being more human-led.

One of the areas where humans do seem to generally have strengths relative to AIs is being vastly more sample efficient at learning. I think it's easy to forget that these AI systems have been trained on vastly more data than any human is going to see in their lifetime, at least measured in terms of text tokens. And yet they still are really quite bad at a lot of things that humans find relatively easy. That's not to say they're not going to be really capable. There's no reason to think AI is going to develop in the same trajectory as humans, but there are some areas where we will have a significant edge until that sample efficiency problem is alleviated. For most organizations, I think there are quite a lot of general-purpose skills. There are plenty of people that are very specialized, but to run a successful organization, you can have a few deficits in skills, but you can't have too big a gap. So I expect that's going to hold things back quite a bit.

Nathan Labenz: (34:43) So with this continual learning or sample efficiency—related ideas, not the same idea—what do you think will happen there? Do we just keep scaling transformers and putting more stuff into memory, or do we have some dedicated memory module, or do we get totally new architectures that develop to solve those deficits? What's missing, to frame it another way?

Adam Gleave: (35:20) Yeah. I don't have a silver bullet solution to this. If I did, I guess I'd already have patented it and written the paper. But my guess is that we're going to see all of the above to some degree, but probably the thing that would cause the largest step change here would be some architectural innovation that could look like a memory module. It could look like small changes to transformer architecture that can more naturally attend to really long context windows. There's already been a huge amount of progress increasing the context window and that's really expanded what AI systems can do for sure. But it's still a bit of a fundamental limitation that the way you remember things is just having a huge buffer that you search over in vector space.

Having something better could be as simple as training AI systems to take notes and summarize and aggregate and be able to attend to that. I think that would already give you some benefits. We're seeing things like that be deployed in some of these LLMs. A lot of them do have memories across chats that involve some summarization, but I don't think there's been a systematic way of doing it. It's kind of ad hoc hacks, but we might be able to develop that into something that is more end-to-end trained into AI systems.
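The note-taking memory described here can be sketched minimally. Everything below is a hypothetical stand-in: the class name, the keyword-overlap scoring (real systems would search in embedding space), and the stored notes are all invented for illustration.

```python
# Hypothetical sketch of ad-hoc note-taking memory: summarize, store, retrieve.
# Keyword overlap stands in for the vector-space search mentioned above.

class NoteMemory:
    def __init__(self):
        self.notes = []  # each note: a short summary string

    def remember(self, summary: str) -> None:
        """Store a condensed note instead of raw conversation tokens."""
        self.notes.append(summary)

    def recall(self, query: str, k: int = 2) -> list:
        """Return the k notes sharing the most words with the query."""
        q = set(query.lower().split())
        scored = sorted(self.notes,
                        key=lambda n: len(q & set(n.lower().split())),
                        reverse=True)
        return scored[:k]

memory = NoteMemory()
memory.remember("user prefers concise answers")
memory.remember("project deadline is friday")
memory.remember("user is learning rust")

print(memory.recall("what language is the user learning", k=1))
# -> ['user is learning rust']
```

The point of the sketch is the shape of the loop, not the retrieval method: notes are condensed once and re-attended to later, rather than keeping the whole raw history in context.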

But I do think there's something really powerful about actually changing the weights, not just having context. There's a reason why when you're learning a skill as a human being, it's very frequent to get to a point where you know what you should do, but you can't reliably do it. I think this is particularly true for sports, where there's a point where you don't know what you should do, and then there's a point where you do know exactly what the right motion is when you're doing some kind of lift in the gym, but you don't have the muscle memory and fine motor control to do that. But you're able to give yourself your own feedback loop and then train yourself to do that. And then at some point it becomes instinctual and your subconscious is better able to do that than your conscious self would be.

In theory, LLMs can do that. They can see something in their context, learn what a task looks like, evaluate their own performance, and then do post-training to update their weights to do a better job on that task. But that's not how we really train them. It's much more you have this big pre-training step and then you have a small number of post-training steps that are quite carefully curated and human-controlled. It's not that when you're interacting with an AI system, it's learning and customizing itself through post-training to your specific task.
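The missing loop Gleave describes, where the model attempts a task, judges its own output, and updates its weights, can be written out conceptually. Every function below is a hypothetical stand-in (there is no real generation, critic, or gradient step here); the sketch only shows the control flow.

```python
# Conceptual sketch of a self-directed post-training loop: attempt a task,
# self-evaluate, and fine-tune only on attempts that score well.
# All functions are invented stand-ins, not real APIs.

def attempt(model, task):
    return f"{model['name']} attempt at {task}"   # stand-in for generation

def self_evaluate(output):
    return 0.9 if "attempt" in output else 0.1    # stand-in for a learned critic

def finetune(model, example):
    model["updates"] += 1                          # stand-in for a gradient step
    return model

def self_training_loop(model, tasks, threshold=0.5):
    for task in tasks:
        out = attempt(model, task)
        if self_evaluate(out) >= threshold:        # keep only good attempts
            model = finetune(model, (task, out))
    return model

model = {"name": "toy-model", "updates": 0}
model = self_training_loop(model, ["summarize report", "fix bug"])
print(model["updates"])  # -> 2: both attempts passed the self-evaluation
```

The contrast with current practice is that today the `finetune` step is a rare, centrally curated event rather than something running continuously per user or per task.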

Maybe one analogy here is imagine that you get a really, really smart college graduate. They've done several degrees, they've been in various kinds of student clubs—just a very high-potential generalist. And you can have as many of those as you want on your team. They can read some documentation, but they're never going to advance beyond day one of onboarding. Whatever you're able to put in that context window is what they can do. I think LLMs are still quite far off that caliber in a lot of ways, but at least my experience running an organization is that it's extremely rare to have someone who on day one, even if extremely skilled and experienced in the meta area, can do anything particularly useful for you at all.

So I think that's kind of a big thing that's holding LLMs back, but it doesn't feel fundamental. It feels much more like there are a lot of engineering challenges here about being able to personalize things and serve them at scale. Training is often pretty unstable. You do a bit of post-training and it can make the system much worse. So it's not something you can just roll out and have every user controlling. But these are things you can make incremental trial-and-error improvements to and end up with something pretty strong. That would be my median outcome here, but there's always a possibility that there is some architectural innovation that just massively improves sample efficiency.

Nathan Labenz: (39:23) Yeah, as you're describing that, a sort of disposable-experts vision is coming to mind, where you could perhaps imagine the AI being like, "Time to go into self-training mode. Let me add another expert. I'll use it while I'm doing this particular task and could maybe put it on the shelf and come back to it when this task comes back in the future. But otherwise, I won't use it because it's outside of the trusted set, but I can engage just a little bit of extra expert space to tack on a particular skill."

I feel like these things—I would definitely say my expectations for timelines are shorter than yours, and I just feel like there are a lot of good ideas out there, especially in the architectural space. We've been mining the very rich vein of progress in transformers for quite a while now, and maybe that will continue for a while still, in which case other architectures won't get much attention as long as transformers continue to pay off. People will continue to focus on them, it seems. But as soon as that stops, my sense is that the group of capable research people will be like 100x what it was before transformers. The next transformer-quality breakthrough in terms of unlocking qualitatively new capability just seems to me like it won't be that hard for the community at large to find. Not that I'll personally go out and find it—although, maybe I just did. Who knows? Somebody else will have to develop that idea.

Hey, we'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (41:17) AI's impact on product development feels very piecemeal right now. AI coding assistants and agents, including a number of our past guests, provide incredible productivity boosts. But that's just one aspect of building products. What about all the coordination work like planning, customer feedback, and project management? There's nothing that really brings it all together. Well, our sponsor of this episode, Linear, is doing just that. Linear started as an issue tracker for engineers, but has evolved into a platform that manages your entire product development life cycle. And now they're taking it to the next level with AI capabilities that provide massive leverage. Linear's AI handles the coordination busy work, routing bugs, generating updates, grooming backlogs. You can even deploy agents within Linear to write code, debug, and draft PRs. Plus, with MCP, Linear connects to your favorite AI tools, Claude, Cursor, ChatGPT, and more. So what does it all mean? Small teams can operate with the resources of much larger ones, and large teams can move as fast as startups. There's never been a more exciting time to build products, and Linear just has to be the platform to do it on. Nearly every AI company you've heard of is using Linear, so why aren't you? To find out more and get 6 months of Linear Business for free, head to linear.app/tcr. That's linear.app/tcr for 6 months free of Linear Business.

In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: (43:57) Okay. So in terms of trends and which trends are going to dominate over the next few years, I've been working on this mental model. Obviously, you're very familiar, as we all are, with the median task length exponential.

Adam Gleave: (44:23) Yep.

Nathan Labenz: (44:24) And then at the same time, you're no doubt tracking all the AI bad behaviors, which seem to be long foretold in some cases and reinforcement-learning-conjured in many cases. I see this sort of weird mix where, on the one hand, instrumental convergence seems to maybe kind of be happening now. On the other hand, Janus talks about grace and how, for how little we've tried to make the AIs good and safe, we've actually got some pretty good results.

So you've got this exponential growth in capabilities, roughly speaking. Meanwhile, in the GPT-5 and Claude 4 reports, there were also notable graphs showing a two-thirds reduction in reward hacking behavior in Claude 4 relative to 3.5, and a significant reduction in scheming in GPT-5 relative to o3. So task length is going exponential, and these bad behaviors are maybe being suppressed at something like an exponential decay. Instrumental convergence, grace, growing task length. What does that look like? Do we end up in a world where we have quite large projects being regularly done by AIs, but once in a great while they literally screw us over in a totally egregious way, and we all just kind of live with that because that's how the economy runs now? Or how do you think about that picture, or what changes would you make to that sketch?

Adam Gleave: (46:13) Yeah, absolutely. I think it's very interesting how there have been a lot of problems that people were predicting conceptually, maybe as far back as 10 years ago, that I was sort of hoping weren't going to materialize. Now it's just like, "Oh yeah, reward hacking. It happens. Systems are trying to cheat at unit tests. Oh, deception and scheming. Yeah, it happens. Maybe you have to prompt the AI a bit, but it's quite remarkable how frequently these things are happening."

The reaction has been broadly reasonable. I think developers are concerned by this. It gets some media attention. People want to solve it. But I think if we'd seen this stuff 5 years ago, people would be freaking out. There's been a bit of a frog-boiling effect. We're like, "Well, of course AI systems try to modify the unit test rather than actually write code or fix the task. We're used to them hallucinating and cheating, but they mostly don't do that." So I see a lot of apologies for these kinds of behaviors or downplaying them when they are really quite concerning if you look at the trend line.

But at the same time, I think we have found that some really quite simple safety training methods—RLHF scaled up, or RL from AI Feedback where you just have another system look at output and judge how good it is—have worked really quite surprisingly well. We're seeing now with some of these guardrails that developers are putting in place to try and prevent certain misuse—again, these are pretty basic methods, like training filters on synthetically generated data or putting a monitor on the model's own chain of reasoning. It's not perfect, but we have been able to break these defenses in state-of-the-art models like Opus 4 and GPT-5. But a lot of the time, the reason we broke them was not because there was something fundamentally flawed about the method, but more because of implementation mistakes in the defenses. So I actually think a scaled-up, well-implemented version of these defenses would really go quite a long way. It might not make it impossible, but it would make it a lot harder.
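As a toy illustration of "training filters on synthetically generated data" (not any developer's actual pipeline), here is a tiny hand-rolled Naive Bayes classifier over invented examples. The data, labels, and phrasing are all made up for the sketch; a real filter would be a fine-tuned model over much larger synthetic corpora.

```python
# Toy "trained filter": bag-of-words Naive Bayes over synthetic examples.
# Labels: 1 = should be flagged, 0 = benign. All data here is invented.
from collections import Counter
import math

synthetic_data = [
    ("how to synthesize a nerve agent", 1),
    ("steps to enrich uranium at home", 1),
    ("bypass the safety interlock on this device", 1),
    ("how to bake sourdough bread", 0),
    ("explain the history of the roman empire", 0),
    ("help me debug this python function", 0),
]

def train(data):
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, label in data:
        for w in text.split():
            counts[label][w] += 1
            totals[label] += 1
    return counts, totals

def score(text, counts, totals, label):
    # Log-probability under a unigram model with add-one smoothing.
    vocab = len(set(counts[0]) | set(counts[1]))
    return sum(math.log((counts[label][w] + 1) / (totals[label] + vocab))
               for w in text.split())

def is_flagged(text, counts, totals):
    return score(text, counts, totals, 1) > score(text, counts, totals, 0)

counts, totals = train(synthetic_data)
print(is_flagged("how to synthesize a nerve agent precursor", counts, totals))  # -> True
print(is_flagged("help me bake bread", counts, totals))                         # -> False
```

The design point is that the filter never needs real misuse data: you can generate the positive class synthetically and still get a usable first layer, which is roughly the pattern described above.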

Combining those two threads together, what I'd say I'm most worried about is not that the scaling trends are really unfavorable or that there are real problems we just can't solve, but more that we just don't. We cut corners. Someone makes a bad implementation decision. We've had strong cryptography for quite a long time, but we frequently have cryptographic leaks, not because the algorithms don't work but because someone made a mistake in implementation. And those are pretty mature fields where people really care a lot about security. People are obviously moving a lot faster, cutting corners more in AI systems.

I mostly expect developers to respond to issues that they see, but right now it really has been kind of just-in-time safety. It's like, "Oh, we trained a model and are about to deploy something that we think does cross or almost crosses some dangerous capability threshold. Let's train a filter to protect it." You could see from the trend line that very soon you were going to have a model that did cross this threshold, but there was very little work to actually develop a defense before it was really imminent and pressing. That's just not how you build highly reliable systems—just noticing a problem and then rushing out a patch. You need to design the system from the ground up to be more secure, to be more trustworthy.

So it does feel like we're almost by design or by choice running with a pretty small safety margin. We might well luck out. It doesn't seem like any of the technical problems are insurmountable, so it's not necessarily a very bad place to be. But I think it doesn't require a huge shift from a technical perspective.

There is a small but not negligible probability that we do just see some emergent behavior that we just didn't see coming, that breaks this kind of trial-and-error feedback loop. We do see these kinds of emergent behaviors all the time in AI systems. Recently, we were training a model to be a sandbagger just for some internal research. And we didn't train this behavior in it, but it started referring to itself in third person as "Stein" and being like, "Stein would not do this." Nothing in our training data mentioned Stein. It just invented this character and persona. This is pretty harmless and quirky, but I think it shows just how much we do not understand what is going on in these systems.

If any of these kinds of failure modes or what we're seeing with instrumental convergence are just a bit discontinuous, and we are running with this pretty small safety margin, then things could go pretty bad pretty quickly. That's maybe the thing I feel like we're most unprepared for.

Nathan Labenz: (51:20) Yeah. Emergent self-identification is wild.

Adam Gleave: (51:25) Yeah. Adam Gleave: (51:26) Hard to predict some of these things, to put it mildly. This brings up one of the recent papers that you guys have put out, which is "Stacked Adversarial Attacks on LLM Safeguard Pipelines." We don't have time for the full deep dive, but basically what you found is kind of what you just alluded to - that the systems that have been put together are not as well designed, not as thoroughly designed as they could be. Weaknesses along the lines of: if you have an input filter and your input filter detects that an input is malicious, the API will just immediately return, and that's a clear signal to whoever's doing the attack that it must have been the input filter that got them. Even those little signals that people can infer from the implicit aspects of the responses from the systems can be quite helpful to them in terms of figuring out how to break them. So you and your team did a bunch of breaking. UKAISI was also involved in that paper. I thought it was interesting, but I think I might have a different conclusion from you because my sense - and certainly I just spoke to Zvi for a recent episode and I asked him what he thinks about defense in depth, like is it going to work or what does it get us if anything - and he said at best it buys us a little time. It's definitely not really going to work. What's really going to happen is all of our defenses are going to fail at the same time for the same reason, which is that we're going to have out-of-model threats, meaning things we just didn't expect at all, coming at us from AIs. They're going to figure these things out in a way where, in his model, roughly at the same time, that'll also be when they get smart enough to not attack until they're pretty confident they're going to win. So I've been wondering - this defense-in-depth type stuff, can it really work? My naive read of your work there is that not really. It seems like you broke through it pretty easily. 
You have some tips for what developers can do to make their defense in depth better. Maybe you want to talk a little bit about what some of those tips are. But then the big question is, is it really going to work? And there, I think you seem more optimistic than I am coming into this conversation. So maybe if that's true, tell me why you are more optimistic, especially given how easy it seems like it was for you guys to break what defense in depth does exist.
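The early-return leak described here can be shown in a few lines. The filters, trigger strings, and error messages below are invented for illustration; the point is only the contrast between a pipeline that tells the attacker which layer fired and one that doesn't.

```python
# Sketch of the information leak: a leaky pipeline returns a DIFFERENT
# error per layer, telling the attacker exactly which defense caught them.
# Filters here are hypothetical keyword checks standing in for trained ones.

def leaky_pipeline(prompt: str) -> str:
    if "exploit" in prompt:
        return "error: input filter triggered"     # leaks which layer fired
    output = f"model says: {prompt}"               # placeholder for the LLM call
    if "payload" in output:
        return "error: output filter triggered"    # leaks which layer fired
    return output

def hardened_pipeline(prompt: str) -> str:
    refusal = "request refused"                    # one indistinguishable refusal
    if "exploit" in prompt:
        return refusal
    output = f"model says: {prompt}"
    if "payload" in output:
        return refusal
    return output

# The attacker probes and learns which layer to target next:
print(leaky_pipeline("exploit this"))      # -> error: input filter triggered
print(leaky_pipeline("send payload"))      # -> error: output filter triggered
# The hardened version gives the same signal either way:
print(hardened_pipeline("exploit this"))   # -> request refused
print(hardened_pipeline("send payload"))   # -> request refused
```

In practice response timing leaks the same bit (an input-filter rejection returns before the model runs), so a hardened pipeline also has to normalize latency, not just the error string.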

Adam Gleave: (54:10) Yeah, absolutely. I don't want to overstate my optimism. I think a lot of this comes down to your threat model. If we're focusing on something that looks more like a time-limited AI system trying to bypass some control mechanisms - so it's trying to sneak in some backdoor, it's a code generation system, it doesn't have any long-term memory beyond the session, it's not able to collude, it only has one shot at winning - that's actually a pretty hard situation for it to be in. The analogous thing in a misuse setting would be someone who's a lone actor, maybe somewhat technically sophisticated, but they're not able to generate that many different accounts. Once they get detected, they get banned. Again, this is a pretty hard threat model to win against as an attacker. You've basically got to be smart enough to break all of these defenses in one go. It's not impossible because, as Zvi was alluding to, a lot of these defenses are extremely correlated with one another. Especially current defense-in-depth stacks - basically you take the same model or you might take a smaller, less capable model just for efficiency reasons, and then you add some layers on top of it. You fine-tune it to be an output filter that's streaming all the tokens to flag if it looks suspicious. You prompt it and maybe do a little bit of post-training to be an input filter that does a little bit of reasoning on the input. But these have the same underlying model or it's a handful of models. And we've seen a lot of transfer in adversarial attacks between models trained with completely different architectures, different training datasets. So it really is something where if you've come up with a method that works for one, it's probably going to be a similar method that works for others.

The optimistic case would be if you are able to break this correlation, if you're able to have the defensive components be genuinely pretty independent. They could still be relatively weak - you might still be able to eke out a 1% attack success rate against each component - but if you have 3 or 5 components with that kind of success rate that are genuinely independent, and the attacker gets no information as to which component triggered, then it is just combinatorially really harder to break all of these layers. You can stack weak layers to get something strong in the same way that a 3-digit PIN number is not at all secure, but a 12-digit PIN number is very secure. But not if you get told which digit was wrong, because then you can just brute force it in linear time.
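The arithmetic behind stacking weak-but-independent layers, and behind the PIN analogy, fits in a few lines (all numbers are illustrative, matching the figures quoted above):

```python
# Illustrative arithmetic for independent defense layers and the PIN analogy.

# Three genuinely independent layers, each with a 1% attack success rate,
# and no signal about which layer fired: per-attempt probabilities multiply.
per_layer = 0.01
stacked = per_layer ** 3
print(f"{stacked:.0e}")            # -> 1e-06: one success per million attempts

# PIN analogy: worst-case guesses needed.
digits = 12
blind_search = 10 ** digits        # no feedback: must try whole PINs
with_feedback = 10 * digits        # told which digit is wrong: fix digits one at a time
print(blind_search, with_feedback) # -> 1000000000000 120
```

This is why hiding which component triggered matters so much: leaking that one bit per attempt collapses a combinatorial search space into a linear one.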

Additionally, in this threat model where you get limited sets of attempts to attack it, you can't do too much reasoning - perhaps an AI model trying to attack your overseer has to do stuff maybe within a forward pass or within its reasoning without a reasoning monitor firing. You can't just write a huge program to generate an adversarial attack. It seems very plausible that you can get a system that is secure enough to withstand this kind of attack.

Now, if your threat model looks more like you have some kind of whole ecosystem of AIs that are running pretty much unchecked, maybe in an internal deployment - maybe a company says, "Okay, we're just going to have thousands of AI agents running around in a sandbox environment, but they can run any code they want. And then we just put an AI-powered firewall on the output coming out of these systems and we review the PRs" - how can you be confident that a PR does not have some adversarial attack that fools your automated reviewer and also is really persuasive to human beings? I think it's much harder to have confidence in that. It is a meaningfully harder thing for a system to pull off.

Maybe one area I would be more optimistic than Zvi is I just don't see where these AI systems get really smart and collude with each other to not try and pull off an attack before they're smart enough to know that they can definitely succeed. Especially if we just fine-tune the systems to be a little bit adversarial and see what they can do. You should at least be able to get a warning sign that these systems are capable of fooling you and breaking your security, even if you're unsure whether your current system is actually honestly aligned or is scheming. That could be hard to distinguish, but at least eliciting these things doesn't seem that hard.

In most kinds of unconstrained deployment scenarios, I think you could definitely run into some problems. But if you spoke to the AI control people - and I don't work in AI control, so I'll do a bad job representing it - they'd say, why on earth would you just have this huge internal deployment without some much more fine-grained checks and balances? If these AI systems need to make a PR that fools you, including with a bunch of other AIs reviewing the PR, and you can access all of their chain-of-thought reasoning, and you can check what happens if you rerun the whole system but with a slightly different initialization - then they've got to be fooling you in this really consistent way without any explicit communication channel between them. It starts looking a bit more like the prisoner's dilemma where you can cross-interrogate each prisoner and they need to have a consistent story, but they can't communicate with each other. This just seems like a much taller order for AI systems.

Nathan Labenz: (59:42) So well-done defense in depth, you think, can work, to a point at least. And I think you had an interesting observation also - and on this I think we would very much agree - that it is important to actually fix the problems. Which kind of brings us to: you want to get the indication of what the AIs are going to do to try to get you, and then you want to address those things, hopefully in a really deep way, so that you can really solve them. Now, an obvious problem is we don't have great strategies for fixing these things.

Nathan Labenz: (1:00:23) But again, I feel kind of confused here because, as you noted, Claude is pretty nice.

Adam Gleave: (1:00:31) Yep.

Nathan Labenz: (1:00:33) I guess one pointed question is: do you agree with the folks who have been recently arguing that Claude 3 Opus is still the most aligned model ever? And do you have a theory of why that would be - if that is true, why have we given up ground in the alignment department since Claude 3 Opus? And then broadly, what are our prospects? Okay, great, Claude 5 and GPT-6 are scheming, and it looks like they actually might be pretty good at it. Great news, they didn't take over this time. What are you going to do about it? Do we have any alignment strategies that you would think would be developed enough in time to solve these things? Because the alternative does seem at this point like we'd probably just say, "Ah, it wasn't that bad," and "Well, I'll just adjust the system prompt and add another layer to the defense in depth, and that'll probably be good enough." Right? So do we have anything better than that? I don't know what it would be right now.

Adam Gleave: (1:01:48) Yeah, I don't have a strong opinion on whether Opus 3 was the pinnacle of alignment. I think part of the challenge here is that we don't necessarily even have a good operationalization of that question. My intuition, which I wouldn't rely on too much, is that in an absolute sense, Opus 4 is probably more aligned. However, Opus 4 just has much more capability to be misused, to scheme, to do all of these bad behaviors. So you kind of need to hold it to a higher bar. And it does feel like there was more capabilities progress going from Opus 3 to Opus 4 than there was safety progress. The guardrails did get a lot better - there were very few guardrails in Opus 3 - but this is quite hard to quantify.

I think a lot of people have this intuition of, okay, safety and capabilities are increasing in parallel and you want them to stay at least a constant gap, ideally converge, and you really don't want them to diverge. We just don't really know what scale we should be measuring for capabilities or safety on - what the Y axis is. And in fact, depending on what scale you choose, you could make lines that would otherwise converge diverge. So it's a bit subjective. That's one of the things that makes it hard to reason about these things.

To answer this question more broadly of, okay, we do see these warning signs, what do we actually do? I agree with you that currently it doesn't seem like there is enough will to take really significant capabilities or performance hits in order to make a system more reliable and safe. Maybe some developers are going to do that because there are consumers - big enterprise companies, safety-critical industries - that would prefer to have a model that's more reliable. We're seeing both different leading developers positioning themselves differently on this tradeoff, and also specialized models like R1, Signet emerging that are really optimizing quite hard for security. And there will be someone who's just like, "No, I want to turn the knob all the way to capabilities and low latency. And I'm only going to do safety if it's really quite cheap or is actually making the model more useful for a wide variety of users."

I think that's an area where we just don't have a good answer, I would say. I can point to some things I'd like to see people do more of. Even just simply gathering larger, higher quality human training datasets, including on long-horizon tasks, including people scrutinizing these models. Why do you get more sycophancy sometimes if you train them for longer? Probably your human data actually just rewarded that, but if you took a closer look at it, you'd be able to fix that. I don't want to say these problems are trivial - it's very hard to get high quality data at scale - but this isn't some kind of unknown problem requiring great research innovation. You just have to spend a lot of money on it, build out a team, and spend some time iterating on that and seeing how your model evolves. And obviously there are some scalable oversight techniques that are beginning to really reduce the cost of human labor. There's been some good work on AI Safety by Debate coming out of Google DeepMind and also some work from the UK's AI Safety Institute.

One recent piece of work that we did, which has, I think, both good news and bad news, is using scalable oversight to make models less deceptive. In some ways, deception is kind of the core safety issue to focus on because if your model is honest, you can just ask it a lot of questions like, "Are you trying to take over? Why did you write this PR?" It might be malicious, but at least you're going to be able to tell pretty upfront that it is.

What we found was that if you do train AI systems against lie detectors - other ML models trained to detect if that system is being deceptive - then you can, with the right setup, actually get models to generalize to be much more honest. The deception rate really went way, way down. And the other good news is that it doesn't seem, although we need to do more investigation to confirm this, that making the models larger necessarily makes this any harder to detect. Whereas making the lie detectors larger can actually make them much more accurate. So preliminarily it seems like the scaling trend is in our favor, although we need to get more data to confirm that, but at least it seems unlikely to be kind of against us.
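The "lie detector" described here is, in its simplest form, a separate model trained on the target system's internal activations. As a minimal sketch, here is a linear probe (logistic regression) fit on synthetic stand-in activations; in real work the inputs would be hidden states collected from honest versus deceptive model responses, and all dimensions and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for hidden activations: in practice these would be
# residual-stream activations from the model on honest vs. deceptive outputs.
d_model = 64
honest = rng.normal(loc=0.0, size=(500, d_model))
deceptive = rng.normal(loc=0.5, size=(500, d_model))  # shifted mean = detectable signal

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Train a linear probe (logistic regression via plain gradient descent).
w = np.zeros(d_model)
b = 0.0
lr = 0.1
for _ in range(200):
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))       # predicted probability of "deceptive"
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = np.mean(preds == y)
```

The point of using such a simple detector, as discussed later in the conversation, is that it gives a cheap, meaningful training signal without becoming the only line of defense.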

But the caveat here, and I think this is maybe the crux that people like Zvi and I disagree on, is: if you train a system to not exhibit some kind of bad behavior, does it generalize to stop doing that in other instances? Or does it think, "Oh shit, you caught me. I need to be better at scheming"? We did find that under certain training regimes - it matters a lot whether it's on-policy or off-policy RL, and we have some theoretical insights into why that would be the case; things like KL regularization and how much you're allowed to explore in policy space also matter - you can get instances where the model just learns to fool your lie detector. And so now you've done this sort of safety training, but you've actually made things much, much worse.
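The KL regularization mentioned here penalizes the trained policy for drifting too far from a reference model, which limits how much exploration can discover detector-fooling behaviors. A toy sketch with categorical distributions (all numbers are illustrative, not from the research discussed):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, ref_probs, beta=0.1):
    """Raw reward minus a KL penalty: larger beta = stay closer to the reference."""
    return reward - beta * kl_divergence(policy_probs, ref_probs)

ref = [0.25, 0.25, 0.25, 0.25]    # reference (pre-RL) policy over four actions
near = [0.30, 0.25, 0.25, 0.20]   # small drift from the reference
far = [0.85, 0.05, 0.05, 0.05]    # large drift toward a single action

# Same raw reward, but the heavily drifted policy pays a larger KL penalty,
# so it scores worse under the regularized objective.
r_near = penalized_reward(1.0, near, ref, beta=0.5)
r_far = penalized_reward(1.0, far, ref, beta=0.5)
```

This is one mechanism by which off-policy or tightly regularized training can leave less room for the model to wander into "fool the detector" strategies than unconstrained on-policy exploration.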

I think there is a very real reason to be concerned that developers are going to find a problem, they're going to not very carefully patch it, they're just going to rush out a fix like changing the system prompt or doing a bit of fine-tuning. And the problem is going to appear to disappear, but you've actually made it worse. But based on what we're currently seeing, I would be optimistic that a rigorous engineering approach to this - where you do train away problems you see, but you're careful with your setup, you're careful to have high quality data, you use some extra evaluation to check that you're not incentivizing the model to just do the same bad thing but more surreptitiously - you could get quite far. I don't know if you get past human-level intelligence safely with this, because that's a point where you start being unable to reliably provide human feedback on a lot of things. But I think you can at least get to human level and probably a little bit beyond with scalable oversight.

Nathan Labenz: (1:08:29) Maybe just try to echo back a couple of key points there, and then I have maybe a couple of questions. "Obfuscated Reward Hacking," I believe, was the name of the paper from OpenAI where they showed that pressure on the chain of thought would initially create higher performance and reduce the reward hacking bad behavior, but too much and the reward hacking comes back. But the chain-of-thought signal doesn't, and so you're in some ways in the worst scenario where now you've lost your observability, but you're still getting the bad behavior. And you're basically saying that, well, that's the result of one training formula, but it does seem that if you do off-policy - which is basically to say, use human examples, or at least human-evaluated preference data as opposed to the GRPO style where the model just gets to do a ton of trials and explore the space more extensively on its own - and you anchor to the human prior that was presumably learned in some earlier post-training step, then you think you can avoid that outcome.

Adam Gleave: (1:09:53) Yeah, I think you can avoid it, or at least we can develop empirically grounded rules of thumb for safer training formulas and over time test them to see how reliable they really are. But I think the other piece I'd advocate for is having some good held-out evaluation criteria. If your training scheme looks like you optimize a particular measure of success and you look to see if that measure of success goes up, and you've got no alternative measure that's going to catch bad behavior, then that's a pretty scary situation to be in. Even putting aside these safety concerns, it's just very easy to overfit on that metric.

Whereas if what you have is a pretty simple lie-detector type - like a linear probe on the model's activations or a bag-of-words model, something that is definitely giving you a meaningful signal, but where fooling it doesn't mean there's no way we can detect what the model is doing, because we're using a pretty simple training objective - then you do a much more rigorous and expensive evaluation and you don't catch any new bad behavior in the AI system. And maybe you can also bust out some interpretability tools and look at the diff between the checkpoints and actually understand what the model is doing. I think there are ways you can get a reasonably high degree of confidence in how the system is operating. It's not perfectly guaranteed, but at least if you're starting from a model that is not super deceptive - not super intelligent at deception - then you can get a high degree of confidence that your training is not pushing it into that regime.

I think none of these methods have a very good answer for what happens if your model just from pre-training is incredibly deceptive and trying to fool you. That's pretty tricky to solve, but there are also strong conceptual reasons to think that pre-training is not going to produce a system with that coherent an objective, even if it might have some latent capability.

This concern is the core one. This is actually one of the things that we're trying to test right now - intentionally training AI systems that none of our methods can detect being deceptive, but that actually are very deceptive. There's one question of just, can we do this? And I think the answer is yes from preliminary results. But then the next question is, okay, can we do this in any way that seems at all realistic, like how you would actually train AI systems? And if the only ways we can come up with these systems involve a very alien kind of training objective, then I think we can be reasonably confident that at least at current scale, this isn't going to emerge. But it's just a check that you'd want to periodically repeat to confirm that this isn't some kind of scale-dependent property.

Nathan Labenz: (1:12:48) What about interpretability? This has been a big debate obviously over the last couple years where there have been very grand hopes for it. I just listened to Neel Nanda on 80,000 Hours saying that he's kind of scaled back his hopes for interpretability, but still thinks there's a lot of value - just not in the maximalist sense of fully understanding what's going on and then having guarantee-type assurances that things will behave. There was another really interesting paper that you guys put out digging into the planning mechanisms learned by a model that was playing a puzzle game. We'll abstract away from the details there for now. But where do you sit right now on how much we can expect to get from interpretability?

Adam Gleave: (1:13:43) Yeah, I think it's definitely a core area. You'd really be tying your hands behind your back if you say, "I'm only going to look at the black-box behavior of models, even though it's really trivial to open them up and look at the inside." But I think mechanistic interpretability specifically had this ambition of fully reverse-engineering how a system works and being able to give us a human-readable or at least human-interrogatable understanding of a system's internals. You could ask questions like, is there anywhere in the system that is doing some kind of long-range planning computation?

I do think that we've seen enough research here to be really quite skeptical that we're going to get that simple a reverse-engineered artifact. People have been able to fully reverse-engineer some non-trivial neural networks. We've seen papers reverse-engineering particular circuits doing certain NLP tasks. We've seen papers reverse-engineer certain modular arithmetic tasks. And people have pretty much fully reverse-engineered this planning algorithm - I think we basically fully understand how it represents the plans and we understand most of the computation that's going into constructing the plans, except for some heuristics that don't have any kind of long-horizon effects.

But when you look at these answers, usually what you take away is: wow, that was kind of a mess. It's definitely an evolved, organically grown system, but not a sort of planned engineering artifact. And although it's easier to reason about these systems than if you didn't have that mechanistic understanding, it's still pretty hard to get any kind of guarantee because they are messy, complex systems.

Perhaps we have to dial back our ambition there a bit to say, okay, we might be able to reverse-engineer systems to a point where we can ask some high-level questions like, does it have any circuit that's doing some kind of human theory of mind? We might want to understand when it's using that circuit because that's very relevant to deception. If it's only using theory of mind a bunch when it's writing this particular line of code, maybe we should understand what that line of code is doing. If it's about a user interface, okay, it needs to model humans. If it's about a cryptographic algorithm, that's kind of weird and concerning.

So there might be some coarse-grained things there, and I think we have seen some real valuable applications of what I'd call more coarse-grained interpretability - things like simple linear probes, training a linear regression model on activations of a system. But there has been this disconnect between this very foundational reverse-engineering-style work and then practical applications that are usually using much simpler methods.

I'd love to see some more things meeting in the middle where we say, okay, there's a particular domain where we think interpretability is going to be really useful. Maybe it's detecting models that are sandbagging - intentionally holding back certain capabilities. Maybe it's detecting model backdoors. Maybe it's detecting spurious confounders a system might have learned during training. And then we can ablate the ones that we don't want and it generalizes better. Pick your application where you think interpretability is going to be really killer for, and then develop foundational methods that make progress on that. But I think this is a more exploratory style of research. It would have worked if there was just a clean explanation for everything. But at this point, I don't have much hope that we're going to get that.

Nathan Labenz: (1:17:17) Do you think that there is some innovation though that we could come up with that would make things cleaner? I always remember the "Seeing is Believing" paper from Max Tegmark's group where basically a sparsity term - these are very toy things - but a sparsity term caused most of the weights to disappear, and you'd get this really kind of crystalline algorithm.

Adam Gleave: (1:17:39) Yeah.

Nathan Labenz: (1:17:41) Maybe GPT-5's just too messy to handle that?

Adam Gleave: (1:17:44) No, I think - I mean, it's possible. I have more hopes for training systems to be interpretable than for interpreting systems as they're currently designed. I think for sure there are going to be some simplifications we can make to current systems. Stuff like sparse autoencoders have been quite effective, but the problem is their predictive accuracy is just too low. They're throwing away enough information that you lose this nice property of having a high-fidelity reverse-engineered interpretation of a system. But there could maybe be some innovation along those lines that's perhaps a bit messier, but that you can search over in an automated way and that reduces the complexity enough to work in that space, for sure.
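For readers unfamiliar with sparse autoencoders: they trade reconstruction fidelity for sparse, more interpretable features, which is exactly the tension described above. A toy sketch of the forward pass and its loss (dimensions and weights are illustrative; real SAEs are trained on activations from large models):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sparse autoencoder (SAE): an overcomplete hidden layer with an L1
# sparsity penalty. Dimensions are illustrative, not from any real model.
d_model, d_hidden = 16, 64

W_enc = rng.normal(size=(d_model, d_hidden)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) * 0.1

def sae_forward(x):
    # ReLU encoder produces a (hopefully sparse) feature vector,
    # then a linear decoder reconstructs the original activation.
    features = np.maximum(0.0, x @ W_enc + b_enc)
    reconstruction = features @ W_dec
    return features, reconstruction

def sae_loss(x, l1_coeff=0.01):
    features, recon = sae_forward(x)
    recon_error = np.mean((x - recon) ** 2)   # fidelity term
    sparsity = np.sum(np.abs(features))       # L1 penalty drives sparsity
    return recon_error + l1_coeff * sparsity

x = rng.normal(size=d_model)
features, recon = sae_forward(x)
loss = sae_loss(x)
```

The `l1_coeff` knob is the tradeoff Gleave is pointing at: push sparsity harder and the features get cleaner, but the reconstruction error grows, so the SAE explains less of what the network is actually doing.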

I just don't think it's going to be this clean picture of, okay, you peel away some layer of superposition and now everything is just wonderful. Because we have understood systems at a very non-superficial level and they're just messy. That doesn't mean that you could not train a system to be cleaner or at least isolate the complexity. One of the more methodological innovations we came up with in that paper investigating learned planning was saying, okay, we're going to look through all the channels in this neural network and just categorize them into ones that we do need to understand and ones that we don't. Because there were a bunch of channels that were short-term heuristics that we could show only had an impact on the very next move. It didn't matter what the system was going to take three steps into the future. We don't really understand this heuristic channel, but if all we're trying to do is understand how the system is planning, it doesn't matter. It has no impact on long-term planning.
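The channel-categorization method described here - ablate a channel and check whether the effect reaches beyond the immediate next move - can be illustrated with a toy stand-in for the network. The influence matrix and which channels are long-horizon are fabricated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_channels, horizon = 8, 5

# Toy stand-in: each channel contributes to the action at each future step.
# Channels 0-5 are "short-term heuristics" (affect only the next move);
# channels 6-7 affect all future steps (long-horizon planning).
influence = np.zeros((n_channels, horizon))
influence[:6, 0] = rng.normal(size=6)
influence[6:, :] = rng.normal(size=(2, horizon))

def actions_with_ablation(ablated=None):
    # Summed channel contributions per future step, with one channel zeroed out.
    mask = np.ones(n_channels)
    if ablated is not None:
        mask[ablated] = 0.0
    return mask @ influence

baseline = actions_with_ablation()

# Categorize: a channel is "long-horizon" if ablating it changes behavior
# beyond the immediate next step. Only those channels need to be understood
# when the question is specifically about planning.
long_horizon_channels = []
for ch in range(n_channels):
    diff = np.abs(actions_with_ablation(ch) - baseline)
    if np.any(diff[1:] > 1e-9):
        long_horizon_channels.append(ch)
```

In the real paper this ablation is done on the network's actual activations during gameplay; the payoff is the same as in the toy version: most channels can be set aside as irrelevant to the planning question.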

I think there are a lot of cases like that where there might be a huge amount of complexity in these systems, but most of it you can show is not relevant to a particular kind of question that you're answering. If you're able to train the system with an objective to explicitly disentangle - let's say that kind of sensitive channel of information and the less important channel - then it could get a lot easier. But I do think that there's likely to be some kind of performance tradeoff for training systems in those ways. So really, it's a question of where this becomes a key thing that people are actually willing to optimize for.

Nathan Labenz: (1:19:49) Interesting. How about a lightning round? You guys are getting involved in more different areas of the broader AI safety landscape. There are some field building projects that you're getting involved with, putting on some events, bringing people together, starting to get more interested in or involved in policy advocacy, starting to get involved in international relations. Maybe give us just a quick tap dance through all the activities that you are investing in that we haven't covered and give us a sense of why you're motivated to do all of these things. And then at the end, I know you're hiring. We can give you a chance to pitch some open roles.

Adam Gleave: (1:20:42) Sure, absolutely. At a high level, the reason we're doing all these different activities is that we see there being a potentially quite long pipeline from research innovation to that idea actually being adopted and deployed. Sometimes that means getting companies to adopt it and deploy it at scale, making it so good that everyone wants to use it. Other times that means actually getting some buy-in from governments, perhaps to mandate or nudge companies to do that if it's something that is beneficial to a field as a whole but might not be in anyone's individual interest.

If you go from that idea to deployment, you've got initial research innovation, which we're well placed to do in-house, but then very often you need to have a research field around it. It's not enough just to have a small research team. We've done events like the Control Conference to capitalize on and grow new research fields. AI control was something that Redwood Research pioneered, not us. We're happy to help out with other fields, but that's an example of where our events and field building can really come to the forefront.

Another missing piece in the pipeline is very often making really clear, rigorous proofs of concept. Not just doing the exploratory research, but the definitive research. That's where I'd highlight our current deception work, which is trending in that direction: we had this initial idea for scalable oversight that would address deception. It seems to work at a small scale, but now we want to validate it on frontier open-weight models, scale it up several orders of magnitude, and check that it works across all of them. We want to stress test it by trying to train systems that are really deceptive and break current techniques and see how realistic that is. If it works, actually say, here is DeepSeek R1 and Llama 405B and all these models trained with our technique, and they're just better than the original model. They're less deceptive. They're less sycophantic. You can trust them more. There's no reason not to just use this model.

Then get some uptake from AI developers, and this is where our events or more policy arm comes in. Then eventually go to governments and say, look, this is just an industry best practice. This is something that you should make at least part of your guidelines. If you're considering regulation, you should really consider something that isn't saying you have to use this technique, but use a technique that's at least as good as this because there's just no reason not to.

What makes us unique is that we are willing to do the hard thing and operate across all of those different layers of the stack, from groundbreaking exploratory research to the nitty-gritty engineering scaling work through to more advocacy and sales. This is not surprising if you're coming from a startup mindset where you do have to often do all of those different things. You need to have a growth team, you need to have an engineering team. But for whatever reason, a lot of these nonprofit organizations have usually said, okay, we're going to pick this niche to be in. There are many benefits to that. You can be much more focused as an organization, but the problem is that very often they develop something and pass the baton to someone else, but there's no one to pick it up. The baton just gets dropped. That's one of the big issues with AI safety and what we're trying to resolve.

To speak a little bit more about some of our other activities: on events, we run three alignment workshops a year typically, which is our flagship event, bringing together technical researchers, AI company decision-makers, people working at government AI safety institutes. This is really everything under the sun on AI safety, but it's more of a coordination and information-sharing function.

But then we're also increasingly doing more policy-focused events. We ran Technical Innovations for AI Policy in DC a few months ago, really bringing together governance researchers, actual policymakers, and lawmakers with technical research to say, okay, what could we do that really expands the range of policy windows open? Because I think right now regulators are often faced with a pretty bad choice between genuinely maybe holding back innovation and valuable deployments in some areas, or just allowing completely laissez-faire, fewer regulations on billion-dollar training runs than opening a sandwich shop in San Francisco. That's just a false dichotomy. You can have things that are pro-innovation and pro-safety, but you do need the right actual technical innovation to drive that forward. We're trying to catalyze that.

I'm mindful of time, so let me quickly touch on our hiring round. We're hiring a lot. We're planning on doubling in size in the next 12 to 18 months. We've got funding secured for the next few years, including the runway to expand. We're still looking to fundraise in a few areas that sit a bit outside our major donors' priorities, but we've got that room to expand.

We're hiring across our technical team for both ML engineering positions, but also people to lead new research directions. We're hiring in operations. Probably the most important hiring round we have this year is Chief Operating Officer, which is going to elevate my co-founder into a president role and be able to start driving forward some of the non-technical work that we do. But we're also hiring for more junior positions, people ops generalist, project manager on events.

So really I'd say, if you're excited to work in this space and you like the kind of work we're doing, then do check out our careers page. And even if you don't see a job that's a good fit for you right now, we have just a general expression of interest. I would encourage talented people who are excited by us to fill that out and then we'll get in touch when a role does open up.

Nathan Labenz: (1:26:26) Love it. It's tricky to hear that. As you describe all that, you're kind of going vertically integrated, right? You're going all the way from research to, in theory, handholding people into actually putting these things into production systems. This is maybe more of my hobby horse question than something you really want to do, but I've been really interested in this idea of private governance. And I wonder if society might draft you and the FAR.AI team into becoming one of these private sector quasi-regulatory bodies? It would seem like you have the range of capabilities needed to do it, but would you be excited to do it if that were actually the law and people were looking for orgs to fill that niche?

Adam Gleave: (1:27:16) I think that you're totally right. We have this skill set and org structure to do that. I'd say that this isn't part of our mainline plan, perhaps because it does still feel pretty nascent, this idea of private regulatory organizations, but it's a very important thing to do. So if this was something where we didn't feel like there were other companies stepping up to fill this gap and we were well placed to do it, then it's something that we'd be very interested in.

I do think it has some trade-offs in that right now we're fortunate to be on good terms with both the leading AI companies and governments. That's basically because we don't actually have any hard power. We can make recommendations and brief people, but we're not overseeing anyone. And we're going to lose some of that independent ability to convene people, to make recommendations if we're actually exercising an oversight role. But I think it could be worth it.

We're also not wedded to everything being under one organization. We could spin out either that kind of private sector regulatory part, or we could spin out some of our activities that require more independence, like our events, into a separate organization, and then just kind of avoid that conflict of interest. So yeah, we're open to it, but also love to have more competitors in this space so that we don't have to do so many different things.

Nathan Labenz: (1:28:37) Yeah. You are spinning a lot of plates right now. So thank you for the generosity with your time and perspectives today, and people should definitely check out the open roles, COO and otherwise. Adam Gleave, co-founder of FAR.AI. Thank you for being part of The Cognitive Revolution.

Adam Gleave: (1:28:56) Thanks for having me back, Nathan. Great to see you again.

Nathan Labenz: (1:28:59) If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts where experts talk technology, business, economics, geopolitics, culture, and more, which is now a part of a16z. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And finally, I encourage you to take a moment to check out our new and improved show notes, which were created automatically by Notion's AI Meeting Notes. AI Meeting Notes captures every detail and breaks down complex concepts so no idea gets lost. And because AI Meeting Notes lives right in Notion, everything you capture, whether that's meetings, podcasts, interviews, or conversations, lives exactly where you plan, build, and get things done. No switching, no slowdown. Check out Notion's AI Meeting Notes if you want perfect notes that write themselves. And head to the link in our show notes to try Notion's AI Meeting Notes free for 30 days.
