Success without Dignity? Nathan finds Hope Amidst Chaos, from The Intelligence Horizon Podcast

Nathan Labenz joins The Intelligence Horizon to discuss compressed AI timelines, interpretability, and RL scaling, weighing transformative benefits like curing diseases against existential risks, evolving p(doom) estimates, and the roles of governance and US-China cooperation.

Success without Dignity? Nathan finds Hope Amidst Chaos, from The Intelligence Horizon Podcast

Watch Episode Here


Listen to Episode Here


Show Notes

This special cross-post from The Intelligence Horizon features Nathan Labenz in a wide-ranging conversation on compressed AI timelines, expert disagreement, and why he believes the singularity is near. They discuss interpretability, RL scaling, and the balance between extraordinary upside, like curing major diseases, and serious existential risks. Nathan explains his evolving p(doom), why he’s slightly more optimistic about robustly good AI, and how defense-in-depth strategies might keep society on track. The episode also explores US-China rivalry, AI governance, and why human cooperation may matter more than technical control alone.

Google: Keep up with AI research on the go with NotebookLM, Google's steerable research and thinking partner. Try it at https://notebooklm.google.com/.

Sponsors:

Tasklet:

Build your own Cognitive Revolution monitoring agent in one click.
Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(03:27) Special Sponsor

(05:12) Opening and AGI framing

(12:08) Scaling RL and paradigms (Part 1)

(21:31) Sponsors: Tasklet | VCX

(24:24) Scaling RL and paradigms (Part 2)

(28:56) Verifiability and long horizons

(41:13) LLMs and world models (Part 1)

(41:19) Sponsor: Claude

(43:32) LLMs and world models (Part 2)

(54:17) Energy, hardware, and chips

(01:00:42) Alignment risks and bottlenecks

(01:10:18) AI values and agency

(01:20:31) Defense in depth alignment

(01:30:48) US-China AI cooperation

(01:41:05) Episode Outro

(01:45:42) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

Hello, and welcome back to the Cognitive Revolution!

Today I'm sharing a special cross-post from my recent appearance on The Intelligence Horizon podcast, with hosts Owen Zhang and Will Sanok Dufallo.

Owen and Will will soon be graduating from Yale College, and as you'll hear, they've clearly spent much of their senior year thinking deeply about the current state of AI, where we're headed, and what it means for all of us.  I was really impressed not only with the quality of their questions, but also with their ability to challenge me with follow-ups that effectively steelmanned the most relevant counterarguments.

We start with the fact that while AI timelines have compressed dramatically over the last 5 years, genuine experts continue to radically disagree on critical questions.  Having established what I hope is appropriate epistemic humility, I then go on to call it how I see it.

In short, the singularity is near.  Interpretability science proves that AIs are developing increasingly sophisticated world models, and with RL scaling now clearly working, AIs are no longer simply imitating humans, and likely won't be limited by what we know for much longer.  

The potential upside of this is, of course, incredible.  Just using AI to navigate what humans have discovered about how cancer works, and how to treat it, has been invaluable to me – and the prospect that we might cure the majority of human diseases in the next decade is obviously extremely exciting.

That said, the risks are very real, and will remain serious for as long as we lack a solid understanding of how AIs work internally and why they do what they do.  

My p(doom) remains somewhere in the 10-90% range. 

And yet, at the same time, I've become at least a bit more optimistic that we might actually build robustly good AIs, because scaling laws seem to imply that powerful AIs can only be created with massive resources, the three companies competing at the frontier today are at least reasonably responsible actors, and our best alignment techniques are working better than I expected.  Given these fundamentals, it seems at least plausible that a defense-in-depth strategy that combines techniques like Goodfire's intentional design, Redwood's AI control work, improved cybersecurity through formal verification of software, and various forms of pandemic preparedness, could together be enough to keep society on the rails.

We touch on a number of other topics as well, including the US-China rivalry, and why, especially in the context of the Department of War's recent attack on Anthropic, which has us looking more and more like China all the time, I would rather bet on figuring out a way to cooperate with our fellow humans than bet everything on AI researchers' ability to steer AI advances in a way that will ultimately work for humans.

I appreciate Owen and Will for allowing me to cross-post this conversation, and I definitely encourage you to subscribe to The Intelligence Horizon – their recent conversation with former OpenAI researcher Zoë Hitzig covered the evolving ways that people are using ChatGPT, variations on Universal Basic Income, AI governance models that emphasize a decision-making process over specific principles – and why she believes such structures will probably have to come from outside frontier companies, and plenty more.  

For now, I hope you enjoy my conversation with Owen Zhang and Will Sanok Dufallo, from The Intelligence Horizon.


Main Episode

Nathan Labenz: When it comes down to it, the core question I think a lot of people are getting at is: is this AI thing going to fizzle out before it really becomes a big deal? Or is it going to be a huge world-altering deal? I'm very much confidently, clearly in the camp of it's going to be a huge, huge deal. And the details are where I think the discussion or the debate remains now. Not for me, at least, whether or not the overall trajectory of AI is going to take us to something that is powerful enough to be transformative. One of the strangest things in the world today, full stop, is the fact that the disagreement among very plugged-in, very informed, very smart people has not really been reduced much at all, even as we've gained a ton of information over the last couple of years about the trajectory of AI. I think that is super strange, and I'm honestly pretty confused by it. The one thing everybody seems to agree on is that the timeline on which we should expect this has come in. In today's world, if you say you don't think you're going to see AGI until 2035, you're like an AI bear. But only five years ago, that was considered to be quite aggressive. And most people were more like, I don't know, 2050, maybe not in my lifetime. So there's been this massive compression of timeline. There's obviously been this huge jump in capability. And yet on these fundamental questions of what's going to happen, there's still total disagreement.

Owen Zhang: Today's guest is Nathan Labenz, host of the Cognitive Revolution podcast. Before switching to full-time podcasting, Nathan founded Waymark, an automated marketing platform for local businesses that pioneered the use of generative AI to produce video ad campaigns. After leading the company for six years, he stepped back to focus full-time on understanding and communicating the trajectory of AI. As host of the Cognitive Revolution, he has conducted hundreds of in-depth interviews with AI researchers, founders, policymakers, and investors, and it has become a go-to source for people trying to keep up with what's happening at the frontier, including us. Nathan was also a member of OpenAI's red team, where he was among the first outside users to interact with GPT-4 before its public release. Welcome, Nathan. Great to have you on today.

Nathan Labenz: Thanks for having me. I'm excited for this conversation, guys.

Will Sanok Dufallo: So first question, are we on the cusp of AGI?

Nathan Labenz: Starting with the easy ones. Well, okay, first of all, as I'm sure you guys are very well aware, what exactly we mean by AGI is a slippery question, one that has people talking past each other quite a bit. So I don't think there's any, you know, super privileged definition. What I do think we're pretty clearly on the cusp of is powerful AI that is better than the vast majority of people, although perhaps not the, you know, very few top, top experts in a given domain, at pretty much all cognitive work. That seems pretty clearly on the horizon. And I think it is absolutely going to be enough to be transformative to the economy, to daily life, potentially even, you know, bigger things than that, like the very nature and status of the human species. And I think that we will have that regardless of whether or not there are some niche areas where humans retain an advantage, which my guess is probably will be the case. Certainly one of the big things that has become clearer over time as AI systems have gotten better is that they are jagged in this weird way, where they have certain things they just do amazingly well at, and other things they kind of weirdly struggle at. They're not very adversarially robust, for example. They're still easier than humans, I would say, to trick. So there's going to be weirdness. And I think throughout the conversation today, probably a big theme will be to expect weird things to happen. But when it comes down to the kind of core question I think a lot of people are getting at, which is, is this AI thing going to kind of fizzle out before it really becomes a big deal? Or is it going to be a huge world-altering deal? I'm very much confidently, clearly in the camp of it's going to be a huge, huge deal. And the details are, you know, where I think the discussion or the debate remains now, not for me at least, whether or not the overall trajectory of AI is going to take us to something that is powerful enough to be transformative.

Will Sanok Dufallo: Okay, so it sounds like what you're saying is that there's no doubt that we're going to get massively powerful and transformative AI across domains. Maybe there'll be some tiny little bit where human experts still have an advantage in specific domains. Maybe there'll also be comparative advantage. We'll get more in depth on what specifically those concerns are. But I guess just to hone in: are these more so uncertainties as to whether AI can fully generalize? Or are you pretty confident that, for the foreseeable future, there will be these gaps between human experts in specific domains and problems like adversarial robustness? So just to clarify your position: is this a prediction, or is this where your uncertainty lies? Your uncertainty doesn't lie in AI being transformative.

Nathan Labenz: Yeah, I think the latter. I wouldn't be shocked if there are additional unlocks that allow AI systems to truly, undeniably surpass what humans are capable of. You look at the natural world, and clearly humans did that to every other animal species that existed before us, right? So there is a historical precedent for some new kind of mind showing up on the scene and blowing away all the other minds that came before. I don't think we have any reason to believe that that couldn't happen to us, in some law-of-physics-guarantee sort of way. So I definitely think that's possible, but in terms of what I can confidently foresee, I don't think it is clear that that's going to happen in the next few years. But what I do think is still, again, clear is that we are going to have systems that are powerful enough to be transformative across almost all the questions that we care about in terms of: what is society going to look like? How are we going to organize? Are we going to need a new social contract? All those things seem to me pretty clear. And then people quibble often around the edges of... There are some pretty esoteric jobs out there, right? Will an AI system ever be able to be as good of a sommelier as the best human sommeliers? Well, I don't know. You know, is anybody going to be motivated to train one to try to do that? You know, we don't really have a lot of taste AI at this point in time. And, you know, maybe tasting will be forever the domain of humans. But trying to identify these little niches where we may have some really longstanding, durable advantage, I think, too often distracts from the big question of: is this going to change just about everything that matters to us? And there I clearly come down that the answer will be yes.

Owen Zhang: Great. So let's talk a little bit about how maybe we get to this transformative or extremely powerful AI. Do you think current paradigms (and I think the one that comes to mind the most, and which is discussed the most, is scaling RL) are going to be sufficient to get to that transformative or powerful AI that we discussed?

Nathan Labenz: Yes. I mean, I think it probably is. And I also think if it's not, we may never quite answer that question, in the sense that I do expect that we will continue to see new conceptual advances in AI research. The field is growing. One of my sayings is everything is going exponential. So that is the number of people that are working in AI research. It's the number of papers, it's the number of experiments being run. It's the amount of compute that people have to run those experiments on. It's the data sets that have been collected. It's the RL environments that are being built out over time. And I do think all those things are going to probably give us some new conceptual unlocks, such that we will probably never answer the question. And this already kind of happened with pre-training, right? You rewind two to three years, and there was a time when people were like, well, we're running out of data. Can we really scale this all the way to AGI? And nobody's talking now about whether we can scale pre-training all the way to AGI anymore, because there's been a new thing. And so now the new thing is on top, and it's like, well, clearly that's going to be part of the mix. Will this exact pattern or this sort of shape be enough to get all the way to AGI? My prediction is that, again, what does AGI mean? But my prediction is it probably is enough to get us to systems that are transformative. But if we fast forward to 2028 and look back at this conversation, we'll probably say, well, nobody's really asking that anymore, because we do have a couple new things that have come online. And so now we have a little richer sense of what the shape of it is going to be. And we're either getting there or we're not. We maybe are still missing a little something at that point, but I would guess that we would look back at the current thing and say, yeah, clearly a couple things have been added and they were a big deal. And so the question of whether exactly what we have in February 2026 is enough kind of ends up being beside the point in the final analysis.

Owen Zhang: Interesting. What you're trying to say here is that, in the same way that there were questions about whether pre-training would take us to this extremely transformative AI, now that we have these new unlocks such as RL, for example, like we discussed, who knows: in maybe two to three years, or whatever timescale we're talking about to reach transformative AI, there might be other conceptual advances that are also made, that build upon things like pre-training and RL, that will get us there. I guess my question specifically was, do you think there will be more of these conceptual advances that need to be made to get to this transformative AI that we're talking about? Or do you think the RL paradigm is going to get us there?

Nathan Labenz: I think RL probably would be enough to get to AI systems that can do most of the cognitive work in the economy, for example. I mean, it seems like, honestly, we're already reasonably close to that, and it doesn't seem like we're anywhere close to done. And by the way, if you listen to the lab leaders, the frontier model developer leaders, they are still saying, too, that pre-training is still working. You know, it was never really the case that pre-training stopped working. As far as I know, those scaling laws basically have held. I think what happened at one point in time was the next step on the pre-training frontier was becoming really expensive. And then they found another, maybe not fully orthogonal, but pretty different direction to go, with scaling post-training with the RL paradigm that we have now. And that was just much bigger bang for their buck, at least at that moment in time, right? They had already gone pretty far up the pre-training curve. They hadn't gone very far up the RL curve at all. Now, presumably those things were going to kind of even out. A general kind of economic theory would be that people should be investing in one until the marginal return decreases to the level of the other, and then they could maybe invest in both. So if you find some new path that's like, oh, this is really giving us huge ROI, you go hard at that path for a while, but then that kind of hits some diminishing returns. Now you're kind of back to, okay, well, maybe we need to do all these things more at the same time. So we'll do a little more pre-training, we'll do a little more RL, and we'll do maybe more of that mystery third thing, all advancing in tandem. Right now, I think we're advancing. I don't have insider information on exactly what these ratios look like, but I'd say it's probably roughly the case that we're close to, if not at, the point where additional compute going into RL is roughly giving the same kind of returns as additional compute going into pre-training. And so both are going to be places where frontier companies can invest for the time being. I guess another way to think about this is people also ask about generalization in RL. And I think, again, it's kind of worth unpacking: what does it mean to generalize? Like, what are we talking about generalizing? One way to think about the question is, if we do a bunch of RL on a model, does the model generalize to all sorts of new things? And there I would say, again, there's probably another way to break that down, which is that there are domain-specific skills, and then there are more kind of cognitive or even metacognitive skills that work across domains. So one of the biggest, I guess we'll call it the aha moment, for me and for the DeepSeek researchers and for the R1 model that they were training, was from the R1 paper. This was January 2025. They reported this, what they call the aha moment, which is, in their process of doing RL on, you know, an already pretty capable base model, of course, right? They found that these previously unobserved higher-order cognitive behaviors started to come online. The aha moment in particular was in the reasoning trace: the R1 model gets to a point and it says, oh, wait, this is an aha moment. I can come at this from a totally different direction. And this is something that hadn't been observed much.

Nathan Labenz: It's out there in the pre-training data. There are at least some examples of people documenting their own chain of thought and getting to these aha moments and realizing, oh, I was coming at it the wrong way, now I can come at it this other way. That hadn't been observed too much in AI systems; reinforcement learning seems to be bringing, clearly has brought, that sort of thing out. And now we have these long reasoning traces where the model will kind of come at the same problem from a bunch of different directions. And so as for what generalizes and what doesn't generalize, I think you probably can't take a model that has never been trained on a particular domain and expect it to go into that domain and be successful. But you can expect some of these meta traits to generalize from one area to another. And then if you zoom out the farthest and just say, does RL, as a process that companies or, you know, organizations apply, does that generalize? And there, I think the answer is definitely yes. It's just a question of getting the reward signal dialed in to the point where it actually works. And that's definitely easier in some areas than in others. And so in areas where it's easy, like math and programming, we see fast progress relative to things where it's harder to get a clear reward signal. But still, I think we are quite obviously making that work. An episode of the podcast that's coming out, I think today as we're talking, is with the head of health at OpenAI. And, you know, I can tell you from personal experience, the latest models are absolutely on the level of attending physicians. My son, unfortunately, has had cancer over the last few months. He seems to be very much on track to be cured and be all better, which is fantastic. And I probably wouldn't be here talking to you if that wasn't the case. But I've had occasion to really use the latest models intensively in a medical context, and they're absolutely on the level of the attending physicians. They know a lot more than the residents, and they really are step for step with the most senior doctors at the hospital. How is that happening? Well, they've worked with 250-plus human doctors closely at OpenAI to create training data, to grade, et cetera, et cetera, et cetera. But now their latest models are also outperforming their human doctors when it comes to the task of evaluating AI outputs. So there are these kinds of thresholds that they're crossing where it was really hard at first. I sometimes think of this as spinning a big wheel: your first pushes on this big wheel don't move it much, but as you build up this momentum, as the flywheel really starts to turn, you start to hit these thresholds where it's like, well, now we have a model that is outperforming our humans. Think of all the work that they had to put in, thousands and thousands of hours and millions and millions, hundreds of millions of dollars potentially, to hire hundreds of doctors to do all this work. Now they've got a model that is beating the doctors at evaluating outputs in the medical domain. And so that totally changes the game. Those thresholds will be crossed at different times for different domains. But I think it's safe to say that in most domains, if there's any sort of objective ground truth, or even a high level of agreement among professionals, you can get there. It just takes more time.

Will Sanok Dufallo: Okay, so just to go back a little bit, it sounds like a lot of your confidence that we'll have transformative AI very soon hinges, and I'm not sure to what extent it hinges, on your confidence that we'll find some other paradigm, if a new paradigm is required on top of pre-training and RL. You also sound very bullish on RL. But maybe to steelman the other view, that if RL doesn't work, then we actually might not find another paradigm: RL is a very general machine learning principle, right? It's just that you reward the model for doing the correct thing on a task that you can define a reward signal for, which is kind of a maximally general machine learning principle. And it's also something that has existed for a long time. It wasn't something that was just recently discovered after we got LLMs. It was one of the foundational paradigms in machine learning, even prior to LLMs. And so I guess the skepticism is, maybe it's actually not that easy to find some other paradigm. Maybe pre-training LLMs on text data was this novel thing, and then we just applied this thing that had been in machine learning for a while, namely RL. But then for the next leap, there's no precedent for it. We don't see any big promising thing. What do you think about that skepticism?

Nathan Labenz: I think it's probably not going to play out that way. And if it did, I still don't really think it matters that much. It probably shifts the timeline a little bit. But the RL will continue to scale. The amount of compute coming online is, again, that's exponential, right? So there's just a tremendous amount of additional resources to be thrown into this. One thing we're doing a bit, but not a ton, is just using the signal from the world. You know, Elon's got a plan with xAI, and I think he's got a real advantage here, to just have the AI solve the same super hard problems that the engineers at Tesla and SpaceX and Neuralink are solving. And so he's just going to give it a computer and say, here's all the professional software, here's the problem, you've got to solve it, right? There's a never-ending supply of those problems. And I wouldn't be surprised to see Grok get to, like, Tesla, SpaceX, Neuralink engineer level, just based on the fact that those problems are there to be solved. And, you know, they've got the compute to keep trying. And what's really the fundamental barrier there? I don't really see it. I honestly think that the new paradigm is probably more about usability than it is about capability. By which I mean, people have a lot of complaints when they use models, because they're like, oh, it didn't really do what I wanted it to do, or it didn't really understand me, or what have you. And clearly one thing that they're not great at doing is going into a new environment, kind of scoping the situation out, getting the vibe, picking up that subtle feedback that people give each other, and gradually figuring it out and becoming a useful contributor. That's how a lot of people go from day one to effective employee in their jobs. And AIs don't really do that in the same way, obviously. They don't have continual learning. They are able to manage their own memory somewhat, but they're not that awesome at that. That's another one of these cognitive skills that does generalize, though. Once you're good at managing your memory, you'll be able to apply that in new domains. But they're not that great at that yet. And I think that prevents a lot of people from getting value. More often, I think, it's not that the model can't fundamentally do the thing, but that it doesn't have the context. The person trying to get it to do the thing doesn't know how to assemble the context, or doesn't believe that it could do the thing such that they're willing to invest the time and energy to give it the context. And so that unlock might just be: now you don't have to do that anymore. It will kind of figure it out on its own. And this is maybe, in a way, similar to instruction following, right? If you were really good at prompting GPT-3, you could get it to do a lot of things, but it was a weird art to prompt GPT-3. And to a much lesser degree, but still somewhat, getting value from the current models is still a weird art. You have to have an intuition for them. You need to know how to assemble context and give effective instructions, and people aren't that great at that. So that next version might not even be so much about allowing them to do qualitatively new things, but just making the barrier to use much lower, so that people can just be like, hey, AI coworker, welcome to Slack. And then over a week or whatever, it just ramps up and gradually gets it.

Owen Zhang: Yeah, interesting. I think the pushback there is that the verifiability problem becomes a little more apparent in these specific long-horizon situations. I think when we talk about verifiability right now, we clearly see, with model capabilities, like you said with Claude Code, that they work great for math, they work great for code, and they work great in these settings where you can clearly check the answers and whether or not they're right or wrong, right? That's how you kind of create this reward signal. But then we talk about these domains where there's no clear verifier. For example, when it comes to writing good essays, what does that mean? Or giving good therapy in healthcare settings, you know. And then we emphasize that even further by placing it in a longer-horizon context, such as the one that you're talking about: instead of having a human in the loop at each step, prompting it in a specific direction, you're giving it a broader task and having it go in a direction for a longer time horizon. It gets even harder to solve this verifiability problem, right? Like, how do you make sure that you get high-quality signals? How do you tune these systems in a way where they can do, say, autonomous work for days or weeks at a time? And so I wonder if you have any thoughts in terms of how we go in that direction and solve this seemingly very large verifiability problem that hangs over our head when we talk about areas that have long time horizons.

Nathan Labenz: Well, again, I'll take one beat to say, I think the recent trajectory shows that this problem is being solved. I'm sure you are very familiar with the METR graph, which is everywhere these days. It's basically going vertical at this point. It's to the point where the METR people are like, we are really struggling to have tasks long enough to even be able to evaluate these things on. And it's also kind of worth noting, too, that that's a bit of a challenge with humans. We hire people based on much less than a week's worth of work, and it's not always super easy for people to agree: did somebody do a good job on that month-long project or not? You can usually tell if they totally crushed it or totally sucked. There's often a lot of disagreement in organizations about, well, maybe it was actually harder than we thought, or they didn't have... the ingredients for success weren't really there. I mean, there's often a lot of fuzziness in this stuff, even in human contexts. There are a bunch of techniques that I think are being used. One is rubric rewards. So, you know, Elon likes to talk about things like, does the rocket fly? The ultimate ground truth is, if I can send this thing into space and then I can come land it down on a pad standing up on its tail again, then clearly that worked. And yet the reward for that, you know, that's an expensive experiment to run. You couldn't just launch a million rockets and have most of them crash to find the ones that worked. So there is a challenge there in terms of sparse reward and the cost of the experiments. I think what is happening a lot is that people are defining rubrics of things that they want to make sure the AI does well. They're probably working with AIs to develop those rubrics. Again, in the OpenAI health context, they created a benchmark called HealthBench, where there are 49,000 evaluation criteria. So all these different tasks, you know, puzzles that the AIs have to figure out. And it's not just, did you get it right or did you not get it right? It's a painstaking effort to really flesh out all the different things that would matter, that would make for a complete, awesome, you know, best-in-class answer. And then the AIs are not scored zero and one; they're scored on some sort of scale. You may have got zero out of 25 things that you could have got on this question. You may have got 5, 10, 15. But that gives you enough of a signal that you can kind of climb that hill. And it really does seem to be working. I think that is going to work pretty well in domains where there is a professional consensus, because I think that's how people evaluate each other too. There are multiple choice tests, but there's also... you've got to kind of show it. In a medical context, you're a student, you go through these in-person training processes; they say watch one, do one, teach one. And so as a medical student, you watch, and then later you start to do, and then you get a lot of feedback. And eventually, the next thing you know, you're the one teaching. I think the AIs will pick up those signals. And then there are other things where it's taste. And I think that'll be a little bit different, probably. But if I had to guess... you know, what is a good novel? Well, first of all, there is no consensus on that, right? You can find somebody who hates even the most universally critically acclaimed novels, and you can find somebody who loves something that everybody else thinks is trash.
I think what we'll see there is kind of taste-based communities coming together to shape models for their own tastes. So in other words, you might start with a base model and you might want it to write romance novels, or you might want it to write anime, or you might want it to write hard sci-fi. And what it means to be good in those different genres is like quite different. But what you do have is fans of those genres that can engage with outputs, give their scores and shape models based on what they like.
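To make the rubric-reward idea concrete, here is a minimal sketch in Python. The criteria, weights, and string-matching checks are made-up illustrations, not OpenAI's actual HealthBench implementation; in practice the per-criterion judge is typically another model rather than a keyword check.

```python
# Minimal sketch of rubric-based reward scoring: instead of a binary
# right/wrong signal, an answer earns partial credit against many weighted
# criteria, giving the model a dense signal it can hill-climb.

def rubric_reward(answer: str, criteria: list[dict]) -> float:
    """Return the weighted fraction of rubric criteria the answer satisfies."""
    earned = sum(c["weight"] for c in criteria if c["check"](answer))
    possible = sum(c["weight"] for c in criteria)
    return earned / possible if possible else 0.0

# Toy medical-advice rubric (illustrative only).
criteria = [
    {"desc": "recommends seeing a doctor", "weight": 2.0,
     "check": lambda a: "see a doctor" in a.lower()},
    {"desc": "asks about symptom duration", "weight": 1.0,
     "check": lambda a: "how long" in a.lower()},
    {"desc": "avoids a definitive diagnosis", "weight": 3.0,
     "check": lambda a: "you definitely have" not in a.lower()},
]

print(rubric_reward("How long have you had this? Please see a doctor.", criteria))
# -> 1.0; partial misses yield intermediate scores, still a usable signal
```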

Will Sanok Dufallo: Okay. And I think that can work. Yeah, so I just want to go back to the, even within tasks where there is a consensus, right? Like, I see what you're saying, that there are ways to build minor consensuses within fiction communities or something like that. But even on the question of does the rocket fly, I think there's an argument to be made that this is categorically different than a chatbot giving out useful medical advice, right? When a chatbot is giving out useful medical advice, it's not a long-context task. It's easily transcribed into text, right? It's sort of what LLMs are clearly going to be good at. They're LLMs, so they're native to text. But something like, does the rocket fly? It's like, okay, even if there is obviously a consensus on what it means to make a rocket fly, it seems to me there's an argument that the signals on how to perform that task are very noisy, and you need a ton of them, right? Let's say you're a project manager for the rocket flight team. You need to understand what's feasible within engineering, which, okay, maybe LLMs are better at that. But then you also need to know how to manage humans, or maybe AI agents in this case; you need to know how to manage people and allocate tasks, and what is going to be feasible with the amount of money you have, and how you can raise money from investors, and stuff like that. And so the longer the time horizon, the more agentic a task, the more little itty-bitty signals are required in your day-to-day work to be able to complete the task. So, going back to what you said about this sort of flywheel thing: okay, if we work really hard and we get a lot of medical experts to provide this high-quality data, then we get LLMs that are pretty good at giving medical advice, and then all of a sudden, they're actually better than human doctors at verifying the output of medical advice in the first place. Then we get this flywheel effect, and we have a quick and robust verifier. But maybe the argument is, we just never get over this flywheel hump for long-term, highly agentic tasks, precisely because it's not even a question of whether or not there's consensus on what constitutes a good outcome. It's, can we define enough reward signals and do enough rollouts to get to this point?

Nathan Labenz: One thing I find helpful sometimes is just to try to reflect on what I do, what my experience is like, and really try to interrogate it: is there something really magical going on or not? And I have to say, really long-horizon agency is pretty rare among humans, right? I mean, you do have these people, like an Elon, who keeps getting up and saying, you know, my goal is to make humanity a multi-planetary species, and if it takes 30 years, so be it, right? So I would say skepticism around whether we will get AIs to be, you know, multi-decade-horizon planners that can sort of show up with the fierceness and determination required on a day-in, day-out basis to really make that happen remains pretty valid skepticism, in my opinion. At the same time, you look at what most people do and what drives the economy, and it's much shorter term than that. A lot of it is like, I kind of keep showing up one day to the next, and every day I kind of start fresh. I went to sleep and I kind of shut down. There's sort of a discontinuity of consciousness between workdays, which, I'm not a big analogy guy, but if you squint at it, you can sort of make an analogy to different context windows; there's kind of a somewhat similar resetting type of moment. And what do I have to do? I have to kind of reboot myself and be like, what was I doing yesterday? Where did I leave off? What did I accomplish? What was still to be accomplished? And most of that stuff I could write down if my integrated memory was worse. I could compensate for that pretty well, at least on a week-long or maybe a month-long or maybe a quarter-long basis. If I got really good, at the end of each day, at saying: here's everything I did today, here's what went well, here's what failed, here's what's next, here's the feedback I got, here's what I think I should do in the next session, and then wiped all that out and came back and looked at that scratchpad again to start my day tomorrow, I think over time I could get pretty effective at that. I think this does connect back also to the idea that certain kinds of skills do generalize, and I do think this is one of them, right? Managing your own memory, writing notes for yourself to document what happened and what's supposed to happen next, giving yourself a sense of where you are in the overall story that is unfolding. They're not great at that yet, but they're getting decent. And when my Claude Code crashes these days, a lot of times I'll just kill that tab, resume in the next one, and then I'll just be like, resume, you know, or, sorry, we got cut off, my internet went out, please pick up where you left off. And they're able to do it. You know, again, they're not awesome yet, but they're way better than they were even three months ago. So it is hard for me to see how that doesn't extend, especially because it is a relatively new thing, right? And most of these things that pop up have at least a few generations before they kind of level off. It's hard for me to see how we don't see, again, a pretty steep... the METR curve is currently vertical, it can hardly get steeper than it is, but I think we've got at least a few generations of very steep progress there. And then maybe there's some other kind of thing where it's like, okay, sure, you can do a quarter's worth of work when somebody gives you a project.
Can you figure out what the next big civilizational advance could be, like some of the true human visionaries? Maybe not. Maybe that's a different kind of thing. In a way, I kind of hope that we stop there. Sometimes, because I talk up a lot what I think AI is capable of and will be capable of, it's easy to mistake me for a booster. I'm actually kind of afraid of that. I think it would be great in some ways if we did find certain fundamental barriers, where it's like, hey, I can delegate a month or a quarter's worth of work to this thing, and it might be able to do it in a day for a couple hundred dollars. Wow, amazing. If we could park it there and not have it go to the point where it's doing multi-decade-level planning, that might be a really good thing. That might be the sweet spot where we get a lot of the advances that we want, and the better quality of life and all the abundance that people dream about, and we don't have so much risk of losing control. So I don't think every advance, by the way, is a good thing, but I just don't see fundamental barriers on the horizon, at least.
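A minimal sketch of the end-of-session scratchpad pattern Nathan describes here, with a hypothetical file name and field names chosen purely for illustration; real agent frameworks handle this persistence in their own ways.

```python
# Sketch of the "scratchpad across context resets" pattern: at the end of a
# session the agent writes down what happened; a fresh session starts by
# reloading those notes instead of relying on integrated memory.
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")  # hypothetical location

def end_of_session(did: list[str], failed: list[str], next_steps: list[str]) -> None:
    """Persist the day's state before the context window is wiped."""
    SCRATCHPAD.write_text(json.dumps(
        {"did": did, "failed": failed, "next": next_steps}, indent=2))

def start_of_session() -> str:
    """Build the 'where was I?' preamble for a fresh context window."""
    if not SCRATCHPAD.exists():
        return "Fresh start: no prior notes."
    notes = json.loads(SCRATCHPAD.read_text())
    return (f"Previously I did: {', '.join(notes['did'])}. "
            f"What failed: {', '.join(notes['failed'])}. "
            f"Next up: {', '.join(notes['next'])}.")

end_of_session(["drafted the report"], ["flaky integration test"],
               ["fix the test", "send the report"])
print(start_of_session())
```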

Owen Zhang: Okay, great. So let's talk about LLMs more generally, and how we feel about that architecture overall toward reaching this transformative or powerful AI. What are your thoughts on the general idea that LLMs fundamentally cannot be the core of this journey toward transformative AI? I think Yann LeCun is the biggest person behind this hypothesis, where next token prediction is the wrong objective entirely, and we need something completely different, something like world models or energy-based models that have a real understanding of the input that's coming in and the output that it creates. I think this is important because there's a lot of intuitive thought process behind it, in which, you know, learning statistical correlations over language isn't necessarily how the world works, and you need something different to really achieve that understanding to create powerful AI. Do you think that's the case, or do you think LLMs do take us to this next frontier that we talk about?

Nathan Labenz: Yeah, I think there are a couple different levels that I would want to use to address that. One is, I do agree with the Yann LeCun thesis in the sense that I feel like we are running right now a depth-first search in AI space, where we are all jamming as hard as we can on a particular architecture and scaling it as much as we can. And, you know, people are now, of course, even building chips that literally embody the architecture of the model in the chip itself. I don't really like that. I kind of wish that we were doing a little bit more of a breadth-first search, where we would explore different kinds of architectures and find their relative strengths and weaknesses, because we're not one thing, right? We have a lot of modules in our brains. And so it's just fundamentally weird on some level, and you would expect it to be kind of brittle in some sense, to take one relatively simple thing and just stack layers of that. The solution that nature found in humans is certainly a lot more complicated. And it feels like if you want something to be robust in various ways, you would probably want to have different modules. So I do kind of agree that it would be nice if we were doing a little bit more breadth of exploration, rather than just trying to jam this one thing as hard as we can until we can all retire, or whatever exactly the dream is supposed to be. At the same time, I think where I would disagree with the LeCun school pretty strongly, and I honestly think this is kind of a closed question, although he still disputes it, and you can find people who will: for one thing, the AIs are not trained anymore on next token prediction in the way that they were. RL is not next token prediction in a fundamental sense. The task that the model is given is not, here's a bunch of text, can you predict what comes next? What it's trained on now, the signal it's getting now, is: did you get the right answer? And the right answer could be a fully verifiable mathematical proof or, you know, numerical answer to a question. Or it could be one of these things that is, you know, 49,000 evaluation criteria on a huge medical corpus. But it's not anymore that it was supposed to be this token and you gave it this token. It's now, qualitatively or quantitatively as the case may be, did you get the right answer? And then that signal is translated into a gradient update through, man, there's obviously a lot of different mechanisms. But GRPO, group relative policy optimization, is basically comparing, for a given model, you know, here's a bunch of attempts that it made. Some were right, some were wrong. Let's use that to create a direction in weight space and move in the direction that would make the answer that was right be more likely next time. So I think it's something people should all update on at this point: we're not just doing next token prediction anymore. That's still part of the process, but it's not the whole story. And then the other thing is, I think it's also very clear at this point that the AIs do have world models. We can look at the internals, obviously not anywhere near as much as we would like, to understand what's going on inside them.
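For the curious, a minimal sketch of the group-relative comparison at the heart of GRPO, as described just above. This shows only the per-attempt advantage computation that weights the update; the actual policy-gradient step, KL penalties, and everything else in a real training loop are omitted.

```python
# GRPO-style group-relative advantages: sample several attempts at the same
# prompt, score them, and normalize each reward against the group, so correct
# attempts are pushed up and incorrect ones pushed down.
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four attempts at one problem: two right (reward 1.0), two wrong (0.0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# -> [1.0, -1.0, 1.0, -1.0]; these weight how strongly each attempt's tokens
#    are made more (or less) likely next time
```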
But we do have enough of an understanding at this point to create things like the Golden Gate Claude experiment, which I'm sure you guys have seen, where they train what's known as a sparse autoencoder. A huge problem in terms of figuring out what's going on inside a neural network is that it's very dense, right?

Nathan Labenz: The width of a model, depending on how big it is, they usually go in powers of two, right? So it might be a 4,000- or 8,000- or 16,000-wide vector of activations that sits between the layers. After each layer, there's this sort of bottleneck of, okay, we've got, let's say, 16,000 numbers that are all, you know, some-precision floating point numbers, whatever, and that represents the state of play. Well, there are obviously way more than 16,000 concepts, right? So that means we can't just have one space for each concept. Instead, we've got to have what is known as superposition, which means, you know, if just point one is lit up, that might mean something. If points one and two are lit up, that means something else. If points one and three are lit up, that means something else. One and four; one, two, and three; one, two, and four; out to, you know, the vast space of combinatorial possibility. So a sparse autoencoder basically tries to untangle all that stuff and say, can we get to a representation where we can look at a sparse set of numbers? And these sparse autoencoders are computationally expensive in their own right. And they are millions, I think tens of millions, of activations wide. But they sort of branch this very dense, superimposed concept mess out into this sparse space, and believe it or not, it works. And you can now look and say, okay, here are the 10 concepts that are most active in this network at this time. And then, okay, maybe you're kind of fooling yourself, but the proof is in the pudding when they are able to then say, okay, now that I know what pattern of activation corresponds to this concept, I can intervene on it. So with Golden Gate Claude, out of the tens of millions of concepts that came out of this process, they found the Golden Gate Bridge concept, artificially turned it up, and now you've got a model that just wants to talk about the Golden Gate Bridge. And there's, you know, more stuff that's happened obviously since then in interpretability as well. I wouldn't say that the AIs have perfect world models, but they definitely have some world model. And again, I wouldn't say people have perfect world models either, right? I mean, we made it into the 1900s without any sense of relativity, because the world model that we had was good enough for us to get by in the domain that we were working in. And it's going to be an interesting question whether AI can start to create those conceptual leaps, like a pre-relativity to relativity sort of jump. Again, exactly on what timeline that comes, I'm not so sure. But there is a world model inside the AIs. There is a conceptual understanding that is definitely richer than pure stochastic correlation of tokens. And that has been demonstrated, I think, at this point, quite conclusively with the interpretability techniques that are out there. So I guess, where does that all leave us?
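A toy sketch of the sparse autoencoder idea in PyTorch. The dimensions are toy-sized (real SAEs of the kind used for Golden Gate Claude are vastly wider), the training details are simplified assumptions, and the feature index at the end is a hypothetical stand-in.

```python
# Toy sparse autoencoder: map a dense activation vector into a much wider
# space with an L1 sparsity penalty, so individual "concept" features become
# legible, then reconstruct the original activations from those features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_sparse: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sparse)
        self.decoder = nn.Linear(d_sparse, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse concept activations
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(8, 512)               # a batch of dense residual activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # MSE + L1

# Steering, a la Golden Gate Claude: amplify one feature's decoder direction
# and add it back into the model's activations.
feature_id = 1234                        # hypothetical "Golden Gate Bridge" feature
steered = acts + 10.0 * sae.decoder.weight[:, feature_id]
```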

Will Sanok Dufallo: Yeah. Well, yeah, I think that's a pretty comprehensive response to the LeCun objection. Just to translate for our slightly more general audience, although it was very interesting, and we are glad that you went into technical depth there. Basically, what Nathan is saying is, first of all, it's very important to remember that this objection, that next token prediction can't be the core of intelligence, doesn't necessarily apply, because we are now doing reinforcement learning. So we do next token prediction with LLMs to get this sort of basis of intelligence, but then we do reinforcement learning on real-world tasks, which intuitively seems more like what you'd expect the right training objective to be to get to AGI, or something like that. And then separately, Nathan is also saying, with this thing with sparse autoencoders and the Golden Gate Claude example, we have techniques in AI now that allow us to look inside the model when it's responding to a query and see which concepts, in some sense, are activated by that query as it's responding. And we can see that there are concepts within the neural network that correspond to objects in the real world, right? So even though it's just an LLM, in addition to reinforcement learning, it does have this world model, where we can say, oh, look, that's the Golden Gate Bridge concept, and it's lighting up when we ask it to answer a query about the Golden Gate Bridge. Is that accurate?

Nathan Labenz: Yeah, that's great.

Will Sanok Dufallo: Yeah.

Nathan Labenz: Phenomenal job. And one other thing I would add is that the explorations that have been done of the embedding space, or the latent space, are also really quite interesting and revealing. There was one study, and this has been a couple of years now, these things have gotten much more sophisticated since, that just showed that you can kind of do vector operations around the latent space. So, for example, if you take the embedding for man and then you move to king, and you look at that direction, and then you apply that same direction to the embedding for woman, you get queen. So there's an order, a sort of conceptual coherence, to the way in which concepts are represented spatially in this, you know, super-high-dimensional latent space, that clearly is meaningful. Exactly how it's meaningful, or exactly what it's learned, or what mistakes it may contain, or what aspects of a true grand unified theory of everything it doesn't have, those are all open questions. But I think that sort of thing quite demolishes the idea that it's all just noise, or that there's some sort of sleight of hand. I think that organization of its own internal map of the world reflects some real understanding going on. Maybe not human-like understanding. I always kind of say human-level, but not human-like. They could be quite alien, but that doesn't mean they don't understand. They don't have to understand in the same way that we understand in order to meaningfully understand. So I think that is, again, pretty well resolved at this point. I honestly don't know why some people can't update on that dimension. It's quite strange.
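A toy illustration of the vector arithmetic Nathan describes, with made-up three-dimensional embeddings. Real embedding spaces have thousands of dimensions; the vectors below are chosen just to make the arithmetic visible.

```python
# king - man + woman lands nearest to queen under cosine similarity.
import numpy as np

emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
}

def nearest(v: np.ndarray, vocab: dict) -> str:
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(v, vocab[w]))

target = emb["king"] - emb["man"] + emb["woman"]
candidates = {w: v for w, v in emb.items() if w != "king"}  # exclude the query word
print(nearest(target, candidates))  # -> "queen"
```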

Will Sanok Dufallo: Okay, moving on from the capabilities and training discussion, let's talk about hardware and energy, these other inputs to AI model progress. So it sounds like you're pretty confident that, either with RL or with some additional paradigm, we'll get to very transformatively powerful AI. But I assume that part of that thesis involves us continuing to scale these inputs, such as talent, hardware, capital, and energy. On energy and hardware mainly, a pushback people oftentimes have is that there's just not that much more energy to divert to AI model training and inference, right? I can't remember the numbers, but there's some graph showing that the total amount of US energy production is not increasing very fast at all, while the amount AI consumes is increasing very fast, and at some point it's going to catch up, and we're going to be bottlenecked by energy. With chips, it takes a really long time to build new chip fabrication facilities, and we'll just run out of chips to train AI models on. And even capital: maybe we'll just run out of money in the world to invest in building new data centers. So, I guess the question I want to ask is, which of these bottlenecks do you see, if any, as most plausibly being a major bottleneck to continued progress?

Nathan Labenz: Yeah, I guess the first thing I would say is, I don't think any of those bottlenecks are really fundamental. They are more cultural or sociopolitical or whatever, because there's, you know, a lot of energy coming from the sun all the time. And the question is, how much are we actually going to harvest and harness? And that is where it becomes a political debate around, you know, who's going to be allowed to build what, where, on what timeline, with what permits, you know, with what impact, whatever. And I think it's often overstated, the degree to which AI is energy intensive. I actually did a whole episode on this with a guy named Andy Masley, who has really been, you know, fighting this fight online in a pretty dogged way. And, you know, there are a lot of interesting comparisons, but a frontier chip today, like an H100 or whatever, basically uses the same amount of power when it's on as a microwave or an electric teapot. And one query is, you know, maybe on the order of running your microwave for one second. So you'd have to be making a whole lot of queries, and increasingly people are, for it to be moving the needle on energy consumption. People don't think twice about putting something in the microwave for two minutes, and that's probably more energy than most people are using with AI on a weekly basis. So now it's ramping up, and it is starting to add up, and it is going to get to the point where, if we can't add any capacity, then we're going to have a bottleneck for sure. But again, those bottlenecks aren't super fundamental. China doesn't seem to have them, right? They're adding electricity to their system at a rate where, gosh, I won't say an exact number, but it's some relatively short period of time in which the Chinese economy is adding as much electricity as the entire American capacity. Now, they have, you know, four times as many people as well, so there's a long way to go to build out everything that they might want. But it just shows that it can be done. It also can be done, apparently, in the Gulf. I recently talked to Sam Hammond, who's an economist and a very AGI-pilled thinker in Washington, DC. He had just been to the UAE on a trip. And I was like, why are we doing these deals with the UAE? All I hear is we want to have AI reflect American values and take American values around the world. I'm not sure that the governments of Saudi Arabia and the United Arab Emirates are the greatest partners we could have in projecting American values. Why are we doing these deals? And it seems like, honestly, a big part of the answer is because they don't have issues with putting up a new plant.

Nathan Labenz: You know, they can just do it, and it'll happen fast. And that way we kind of know that even if we can't do it here, well, we can at least do it there. So that's all just kind of characterizing the bottleneck and saying, there's plenty of energy. It's a question of who will be allowed to get it, under what circumstances, you know, on what timelines, with what permits, et cetera, et cetera. Chips are harder, for sure, because it's a very specialized thing. In terms of, if you asked me what would be the most likely reason we wouldn't get economy-transforming AI in the next few years, I would say something happening to the chip fabs in a major way that throws production off to the point where chips are super scarce, and maybe we can't scale the training runs, full stop. Or even if the training runs can still scale, there's just not enough inference to go around. And so we might have really powerful systems, but we just don't have enough access economy-wide for people to deploy them and automate all the things it seems like we're on track to automate. So I guess if I had to pick between energy and chips, I would say chips, but that seems like kind of a tail-risk scenario. All the projections, the whole economy, kind of depend on it at this point, so everybody is incentivized: certainly the political class is incentivized to make it work, and the corporate managerial class is pretty incentivized to make it work. There is some tail risk, you know, that Mainland China makes a move on Taiwan, and that could be a huge disruption. But it seems like, as far as I can tell, it would be tail-risk types of things that would be the most likely disruption, not a fundamental limit. There's plenty of sand, obviously, which is where the silicon comes from. And that is scaling too. We are starting to get chips in the US. From what I understand, I'm not an expert on this, but I understand the yields have been decent in the US, maybe even a little bit ahead of schedule, in terms of people thought, oh man, it's really going to be a few years and a few generations and a lot of iteration to get this stuff to be somewhat competitive. And it seems like it's come online reasonably well. This is kind of a theme, I guess, in my thinking in general. I'm mostly worried about the tail risks. I'm mostly worried about the AI going wrong in some really weird way. And in terms of what would prevent it, I also kind of think the tail outcomes are the most likely to put me in the camp of being catastrophically wrong.
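To put rough numbers on the microwave comparison above, a back-of-envelope sketch; all figures are ballpark assumptions for illustration, not measurements.

```python
# Back-of-envelope energy arithmetic for the chip/microwave comparison.
CHIP_WATTS = 700         # an H100-class accelerator under load, roughly
MICROWAVE_WATTS = 1000   # a typical household microwave
QUERY_SECONDS = 1.0      # "one query ~ one second of microwave time"

query_wh = MICROWAVE_WATTS * QUERY_SECONDS / 3600    # ~0.3 Wh per query
two_minutes_wh = MICROWAVE_WATTS * 120 / 3600        # ~33 Wh to heat leftovers

print(f"chip draw ~{CHIP_WATTS} W vs microwave ~{MICROWAVE_WATTS} W")
print(f"one query ~{query_wh:.2f} Wh; a 2-minute microwave run ~{two_minutes_wh:.0f} Wh,"
      f" i.e. ~{two_minutes_wh / query_wh:.0f} queries")
```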

Owen Zhang: Yeah, okay, awesome. Well, let's talk about that then. Let's talk about this risk of misalignment that is the primary concern amongst the community right now. You've had hundreds of these conversations with safety researchers, lab people, and builders over the last few years. Net net, what is your analysis of the alignment problem? Has it become harder or easier in the last several years, or the last several months?

Nathan Labenz: Well, for starters, reflecting on my hundreds of conversations, one of the strangest things in the world today, full stop, is the fact that the disagreement among very plugged-in, very informed, very smart people has not really been reduced much at all, even as we've gained a ton of information over the last couple of years about the trajectory of AI. I think that is super strange, and I'm honestly pretty confused by it. The one thing everybody seems to agree on is that the timeline on which we should expect this has come in. There's still disagreement about that, but Helen Toner, who's at the Washington, DC think tank CSET and was previously on the OpenAI board, put her finger on this phenomenon I think a lot of people were feeling with a blog post saying that even long AGI timelines have gotten super short. The idea is basically that in today's world, if you say you don't think you're going to see AGI until 2035, you're an AI bear, but only five years ago that was considered quite aggressive, and most people were more like 2050, or maybe not in my lifetime. So there's been this massive compression of timelines, and there's obviously been this huge jump in capability, and yet on the fundamental questions of what's going to happen, there's still total disagreement. I think that's a very weird phenomenon that I can't fully explain.

Owen Zhang: Where do you think this disagreement is coming from? Is it hope, or some sort of pushback you're seeing from these intelligent people who are still part of the field? Or do you think there are legitimate thought processes behind why there's so much disagreement around misalignment?

Nathan Labenz: Well, I'm not sure I can summarize the conclusions with super high fidelity right now, but CSET, under Helen's leadership, did a workshop where they tried to bring people together and assess, specifically on the question of recursive self-improvement, how big a deal it's going to be. Do we run the risk of this whole process getting away from us entirely? Even on that somewhat reduced question, there was still very wide disagreement. Some people were like, I don't think it's going to be that big of a deal; it'll make people a little more efficient, whatever, but it's not going to bring about some phase change. And other people are like, as soon as you get an ML researcher AI that can do that, you go from maybe 10,000 people today really working at the frontier of ML research globally to 10 million, and certain people think that's got to have a huge effect. It's going to be really hard to control a situation if we all of a sudden thousand-x how many researchers we have, and they work at potentially thousands of tokens a second. So why do people disagree so much? They came to the conclusion that people were working from different conceptual paradigms, and that these paradigms are pretty good at taking new information into account and explaining it away. You have some theories like bottleneck theory or O-ring theory, where you're only as good as your weakest link, basically. As long as you think that, you can say, well, sure, the AIs can do this, but they still can't do this other thing, so there's still a weak link, there are still going to be bottlenecks, and the whole thing isn't going to get too crazy. The flip side of that is a jaggedness argument, where people will say, okay, sure, the AIs can't do this yet, but look what they couldn't do one year ago, two years ago. They couldn't do basic math; now they're solving unsolved math problems. So sure, there's still jaggedness, but last time you told me about jaggedness, it was that they couldn't do basic arithmetic, and now we've got unsolved math problems. These perspectives seem to be really grounded in worldview priors, or in the paradigm people work in, and it's proving really difficult to get to a real meeting of the minds on those.

Will Sanok Dufallo: Yeah, I saw Ajeya Cotra talk about a similar thing on the 80,000 Hours podcast a few weeks ago. Great episode. She said something similar: economists who don't expect that we will enter some new GDP growth regime always point to all these bottlenecks; technological diffusion is always slower than people think. And on the other side, people think those models are simply outdated and that these people aren't really taking seriously what it means for AI capabilities to be at the given point that we're conditioning on. So I'm wondering, do you think it's possible that people are sort of talking past each other and not actually talking about the same level of AI capabilities? When they say, oh, it will only uplift ML research somewhat, maybe they're just talking about not-that-powerful AI. But if they were actually talking about fully automated ML researchers that are as good as humans, never have to sleep, and run faster, then they would see RSI as more plausible. Do you think that's possible?

Nathan Labenz: I think that explains some of the disconnect, for sure. I do think people are often talking past each other, and I've started, at least in some cases, beginning my interviews with: how AGI-pilled are you? What do you expect to see over the next couple of years? Because if that's not established, and I think they're assuming one thing when they're not, there can be quite odd disconnects downstream of that. So getting those assumptions on the table early and cross-comparing them is usually, or at least often, a productive thing to do. I don't know that that explains all of it, though. You do hear a sort of, sure, even if they could come up with good ML research experiments and this and that, there's still going to be this other bottleneck. I'm not in this camp, so I don't want to be too blithely dismissive of it or unfairly critical, but it does seem to me that there's an aspect of faith in the idea that there's always another bottleneck. I would contrast my position with some of those positions in the sense that a big part of what motivates me is that I don't think what I'm saying has to be guaranteed proven right in order for us to be really motivated by the possibility. I'm happy to leave open the possibility that maybe the bottleneck people are right, and there's always going to be another bottleneck, and all this will stay under control. Or maybe there's some plateau around a quarter's worth of work and we just can't quite break past that; some weird, unexplained phenomenon where we can't ever quite get there. And again, like I said earlier, I think that might be great news if it were true. But I'm happy to leave that question open for the time being and just say, I don't know; we don't have a great account of what that fundamental bottleneck we'll never get over would be. And in the absence of one, I'm not persuaded by people generally gesturing that there will always be another bottleneck, because if they're wrong, we're in for a really wild time. In terms of what we should do about where we are and what might be coming, I think the worst mistake we could make would be to not take it seriously enough and content ourselves with a story that it'll all self-regulate and we'll be fine, because I don't see a great argument that that's true, and I see at least decent arguments that it might not be. And again, I look to our own history: we've driven a lot of other species to extinction, including our closest cousins. A lot of that was by accident, right? It was just small bands of people going out and doing what they were doing to survive, and surviving meant hunting large animals and eating them and using their bones for tools and stuff. A lot of those animals went extinct, and it wasn't a coordinated master plan; it happened by accident a lot of the time.
So I'm just like, oh my God, you know, we don't have any real guarantees as far as I can tell. "Plot armor" would, again, be an unfair dismissal of the more sophisticated people who have theories about there always being another bottleneck or what have you. But I do worry that a lot of what goes on among less sophisticated people, who don't want to have to deal with this and would rather believe that everything will be fine, is some sort of plot-armor thinking, where they're like: well, I don't know, I feel kind of like a main character, humans feel kind of like the main character, and you can't take the main character out of the story, right? And I just, unfortunately, don't think that's likely to be the case.

Owen Zhang: Yeah. So you find the bottlenecks argument unpersuasive, in terms of all the discussions that we had, whether with research, capital spend, energy, chips, or just general societal processes as a whole that slow down overall technological diffusion. But overall, it seems like you're still in the relatively more pessimistic camp. Is that correct?

Nathan Labenz: I don't know. When people ask my p(doom), I usually say 10 to 90 percent. A good friend of mine once told me we should argue less about exactly what the numbers are and more about what we can shift them to, so I try not to worry too much about precision; one funny way I've heard it put is that one significant digit is all you get on p(doom). I would say I've actually gotten probably a little bit more optimistic over the last few years, in the sense that I started reading Eliezer in 2007, with the early visions of the paperclip maximizer and all that kind of stuff. I can't speak for Eliezer, and I think he has some somewhat revisionist takes on what he really meant at that time; sometimes when I see what he's saying now that he really meant, I'm like, I don't know, that's not exactly what I took away back then when I was reading your original work. But whatever, all that is discourse. What was generally understood, and what I think a lot of people feared, was a very small system that had some extremely concentrated form of intelligence, that had found the right priors, the right inductive biases, to be hyper-rational and insanely effective, such that given anything to optimize, it could optimize it to some extreme state and tile the universe or whatever. And there was also the idea that it was going to be hard to get such a system to understand human values: that what we value was very gradually and haphazardly encoded into us by an evolutionary process over a super, super long time, dating back even to before our species. Other species care about their young and seem to be sad when they lose their children and so on. So this is not even just human; the whole of evolution has led us to be what we are and have the very complicated values that we have. And it was, I think, generally understood, again, there are some different takes on that history now, that it would be hard to get AIs to have a real understanding of what we care about. And now I look at the models that we do have, and actually they do have a pretty good understanding of what we care about. It's not perfect, but I don't think it's crazy to say that Claude is probably more ethical than the average person. It's certainly more sophisticated in its approach to ethics than the average person, and it does generally seem to want to be good in a meaningful way. When they let Claude talk to other Claudes and just do whatever it wants to do, it seems to want to bliss out or something like that. So there is some sort of grokking of values that I would not have expected to be as easy as it seems to be, and there is some sort of internalization, or identity formation, maybe, is a better way to say it, that at least some of the models seem to have, which suggests to me there's reason to hope we could really get there. When I first heard the question, could we create an AI that loves humanity, I thought it was laughably out of reach. I think I heard that from Scott Aaronson, who said that Ilya asked him, hey, do you have any idea what the Hamiltonian of love is? Or some crazy question like that. He was like, I can't really help you with that, Ilya. I remember just being like, oh my God, that's the kind of question they're asking. We are screwed.
And now I'm like, well, I don't know, maybe there was a little bit more to it. I trust Claude, not fully, but more than a human assistant. If you gave me the choice between hiring a human assistant, even with the opportunity to interview and do some vetting, versus Claude, which one would I give access to my email and all my most sensitive information? I think I would trust Claude more than a person I'd done a couple of interviews with. There are cases where Claude has blackmailed people, and cases where Claude has done various things under pressure, but I think I have better odds with Claude. So overall, I have become a bit more optimistic, but we're definitely not anywhere near out of the woods.

Will Sanok Dufallo: Right. I guess maybe let me try to steelman the more pessimistic view. And that would be: okay, sure, LLMs are language-native and we communicate our moral preferences via language. But that wasn't really what the people who were very concerned about alignment were talking about, even back in 2007. They were talking about AI agents that were goal-directed, and they were anticipating something like reinforcement learning that gets us these more goal-directed agents that are able to reason over longer time horizons: have a goal, go out in the world, and achieve it. So the case for not updating towards optimism with respect to AI safety is that we're still going to have to deal with this problem. We're doing reinforcement learning; we're trying to get agents to go out in the world and do things for us. And we still don't know how to define a reward objective that is fully what we want when optimized to the maximum degree. So we have this paradigm where we got these LLMs, and that was maybe an update towards AI safety being a little easier, but now we're back to reinforcement learning, and the same concerns apply. What do you think about that objection?

Nathan Labenz: I think that's a pretty good argument, and I certainly don't want to leave people thinking that I don't take it seriously or that they shouldn't take it seriously. I absolutely think we've got more questions than answers. I guess when I say I've become more optimistic, it was starting from a not-super-optimistic place. Five years ago, or maybe a little more than five years ago, I would've said powerful AI seems a long way off, and if we do stumble on it, we have very little hope of controlling it. Now I'm like: it seems closer, and we have maybe a little more hope, but definitely still a lot of unanswered questions. One thing I do see a little bit differently than the most hawkish people: I recently had an exchange online where there was a 48-hour period with a couple of profiles of Amanda Askell, and people were commenting on her in all sorts of different ways, and Elon was attacking her, and whatever. I weighed in and said that for my part, I have become quite a bit more optimistic that it's at least possible to create an AI that in some meaningful sense loves humanity, and I have to give her and the Anthropic team a lot of credit for that. And then, of course, I got a lot of replies saying, well, this doesn't scale to superintelligence, and this and that, and arguments along the lines of the one you just made. Again, I think those are all very serious and worthwhile concerns. But one strand I detected in a lot of those responses is that people seem to be imagining a system that is so much more powerful than anything else that if it goes wrong, it's over. Eliezer had this "List of Lethalities" post, and I think that's shaped a lot of people's thinking: you have to get this absolutely right on the first try. And I do think that's not the shape of the AIs that I'm seeing in the world today, right? That's quite a hypothetical state of affairs. If you were to drop in an AI that's just so much more powerful than anything else, and it's weirdly goal-directed and doesn't love humanity, and you tell it, make paperclips, then yeah, maybe we all get tiled by paperclips. But the world right now is much more like an emerging ecology of AIs, where there are a few frontier ones that are roughly competitive with each other. It's certainly not like one Claude instance is going to take over the world. What we're talking about is much more a wave of things, where simultaneously millions, or one day billions, of Claude instances and GPTs and Geminis and whatever all collectively transform the world. That has a lot of problems, challenges, and open questions with it too. But I don't worry as much right now that we're headed for a world where one system runs away from everything else to the point where, if there's one wrong move or one bad prompt or one jailbreak or whatever, we've kind of already lost. It seems to me, and I think this is kind of luck, or maybe it's fundamental physics, but not physics we understood coming into this, that scaling laws are in a way protective: you get these algorithmic advances and they move the needle, they deflate exactly how much compute you need to get to a certain level.
But they don't tend to make it so you don't need a ton of compute to get to the very high levels. I think Zuckerberg actually had one of the more interesting takes on this that I've heard, and I don't think he's necessarily distinguished himself as a great AI safety thinker over time. But he basically said: at Meta, we deal with scammers and spammers all the time, and the big advantage we have over them is we have a lot more compute. We have way bigger, way more powerful systems than they do, so we're seeing everything, we're monitoring everything. They're trying to spam and scam here and there, and little things of course happen all the time, but we broadly can keep it under control. You could imagine a somewhat similar dynamic with AIs, where even if one AI is the single most powerful AI in the world, as long as it's not orders of magnitude more powerful than everything else, there's a whole ton of other actors with all their compute and all their instances, all monitoring for whatever they're monitoring for, and hopefully that can balance itself out. Then, of course, you've got your gradual disempowerment concerns: maybe that all ends up in some equilibrium with each other, and there's no place for humans in it. That's another thing that I do think is absolutely worth taking seriously. But I just don't see right now that we seem to be on track for the kind of runaway "if anyone builds it, everyone dies" scenario. I think there is an "it" for which that's true, but it doesn't seem like anybody's particularly close to building it. So that's really an important part of the analysis from my perspective.

Owen Zhang: Yeah, okay, great. So then, when we think about your concrete hopes for achieving this world in which we solve this misalignment issue, can you give us specific achievements that you think are necessary to end up with an aligned AI that doesn't result in, you know, doomsday for humanity? Does it involve training a model that loves humans, and if so, what are the sub-goals of that? Does that mean mech interp actually scaling? Does that mean alignment by default, because models trained on human values just end up being aligned? What are the concrete sub-goals that you think have to be achieved to get this alignment issue solved?

Nathan Labenz: I'm always interested in something that could, quote-unquote, really work. And I ask people for this all the time: do you know of anybody who is working on anything that could credibly really work, in the sense that I can sleep well at night knowing that there's something that really works? Basically, nobody has anything. So in the absence of that, all the frontier companies seem to be taking a sort of defense-in-depth strategy. And arguably this is inherently the nature of intelligence: it's kind of unpredictable. You could say the same thing is true of humans; there's never been anything that really works to make sure a human never does anything wrong. So maybe it's unrealistic to think we could ever get that. I would still love to see people try, but it seems like where we're headed is an everything-in-parallel-at-once sort of strategy. We might not get the AI to the point where it never does bad stuff, but if we can drive that rate low, that's better. If we can put a monitoring layer on top of that and catch 90-plus percent of the bad things it still does, that's better. If we can have additional monitoring systems that ban the accounts of bad actors, that's better. And then we can also really invest in formal methods to improve cybersecurity across the board, so that we can take certain risk surfaces entirely off the table. That's one of those things that won't solve all of our issues by itself, but in terms of things that could really close down problems, formal methods to verify the security of software do seem to offer the opportunity to create genuinely secure software; that's on the horizon and seems to be on the verge of having a moment. Then, of course, we've got bio risk. We should probably have PPE stockpiles, and we should probably have all the things we should have had as of the last pandemic. Something we just bought for our son's hospital room is an ultraviolet light that, and I think this is quite well validated scientifically, kills microbes and is pretty gentle on the skin, so you can just shine it in the room all the time. We should probably invest in scaling out that kind of capacity. We should have better wastewater monitoring so we know when things are popping up, because it's probably still going to be the case that things pop up. We should have vaccine platforms, which we do have, that are very quickly programmable. I'm sure you've heard the story of how quickly the COVID vaccine was designed: the design was there within a few days, before the pandemic had really even taken off in a serious way. It took us a long time, of course, to go through all the trials and actually get it to people, but it was just a few days to create that vaccine. And then, beyond the monitoring of outputs that I mentioned, there are also going to be these mech interp, internal-monitoring-type things. The company Goodfire, which does interpretability, just put out an agenda called intentional design.

Nathan Labenz: And so they're developing ways to try to understand, at each step of the learning process, what the model is learning, and to be able to shape what it learns, so it hopefully doesn't learn certain problematic things and does learn the good things you want it to learn. Then there are AI control techniques. Redwood Research has, I think, done an incredible job of gaming some of these things out. And that's another 80,000 Hours episode I would strongly recommend if you guys haven't heard it, and would recommend to anyone: Buck Shlegeris talking about how we get productive work out of AIs even assuming that they're out to get us. They've actually made pretty good progress in terms of building out a portfolio of strategies. So I think that's where we're headed. We're looking at a world where there are AIs everywhere; hopefully none is so much more powerful than everything else that a single small instance or small pocket of them poses an existential risk. Then we bring every other strategy we have to bear, and that gets us a few nines of reliability. Probably some crazy bad things still happen, but hopefully they can be contained enough that the world overall is good. And when I tell that story, I'm like: we definitely need to scale up our investment real quick, for one thing, because the amount of money and resources going into making the AIs more powerful dwarfs the amount going into all these other things. I don't think we're very well calibrated or balanced in terms of where we're putting our efforts. But I do think we have a bunch of different agendas that all seem like they can take a bite out of the problem, and maybe you take twenty bites out of the problem and you don't have much problem left. Eliezer used to say "death with dignity": we're probably going to lose this, but we should at least make a real effort. And Holden Karnofsky, who's a senior advisor at Anthropic, wrote a piece called "success without dignity." (Recently, Anthropic updated their responsible scaling policy, basically backing off from some of the commitments they had previously made to pause development under certain circumstances; they're no longer committing to that, and he put out a long defense of why.) His point was: I don't think we're doing a great job collectively of trying as hard as we should be trying, and the risk we're running is a lot higher than I, speaking as Holden, would like it to be. But the problem does look much more tractable than it used to look. Five years ago, people would come to him and say, what can I do for AI safety, and he'd be like, I don't really know. And now he's like, I've got a long list of projects people can work on that all at least seem to help. So yeah, I wish I had an answer where I could say, do this and it'll really work; I haven't heard any credible claims in that direction, really, except in cybersecurity, which is obviously only one part of what the world is going to need. But maybe all that stuff could add up to a win.
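A quick sketch of how that "few nines" layering math works. The layer names and catch rates here are purely illustrative assumptions, and the biggest assumption is that the layers fail independently:

```python
# Toy defense-in-depth model: residual risk is the product of what
# each layer misses, *assuming the layers fail independently*.
import math

base_bad_rate = 0.01  # assumed: model misbehaves on 1% of attempts

# Illustrative layers with assumed catch rates (not real measurements).
catch_rates = {
    "output monitoring": 0.90,
    "account bans for bad actors": 0.50,
    "formally verified attack surfaces": 0.80,
}

residual = base_bad_rate
for layer, caught in catch_rates.items():
    residual *= (1 - caught)  # only the misses pass to the next layer

print(f"Residual risk: {residual:.1e}")
print(f"~{-math.log10(residual):.1f} nines of reliability")
```

With these made-up numbers, a 1% base failure rate drops to roughly one in ten thousand, about four nines. No single layer has to be perfect for the stack to be quite good, though correlated failures across layers would erode the multiplication.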

Owen Zhang: Great. So given that we have this enormously transformative technology, and it probably is the most powerful dual-use technology in human history, in my opinion, and probably consensus opinion too, way more consequential than even nuclear weapons, which up until this point were probably the most powerful: do you think these model providers, these model capabilities, should be in private hands as they currently are at all? Is there a case in which you'd support the government, in some sense, nationalizing frontier AI development? What's your take on that?

Nathan Labenz: Well, Biden used to always say, don't compare me to the Almighty, compare me to the alternative. I would probably say the same thing about the frontier AI companies. I certainly don't think they're all acting perfectly, and I certainly think the race between them has the potential to get out of hand. So I would like to see government action of some sort. There are a lot of debates on exactly what sort of policy we should have, but no policy at all doesn't seem like a winner to me. And I say that as, broadly speaking, a lifelong techno-optimist libertarian who mostly would rather see the government stay out of these things. But this one does seem qualitatively different, where some government involvement seems prudent, if not outright necessary. I would not go for nationalization, though, for the basic reason that I just don't trust the government that much. I look at the current leadership, and I'm like, these are not the people. I take my Sams Altman and my Darios and Demises over my Pete Hegseths in that standoff, which is happening this week. I support Anthropic a hundred percent for putting some limits on what they want their technology to be used for, and I hope they stand by it. If the government comes down on them, I will preferentially send my tokens their way, even if it doesn't necessarily always mean the best performance, because I do think that's a really important stand for them to take. So could I imagine a government so competently run and so pure of heart that it would maybe make sense to nationalize? I can imagine a lot, I guess, but I don't think we're anywhere close to having that government today. So I would rather see some competition, some hopefully healthy balance, and maybe some of the worst excesses reined in by the government, but certainly not a takeover that puts this under the military. That sounds like a recipe for disaster.

Owen Zhang: Yeah, that makes sense. I'm always also skeptical, though, of what we talk about as corporate incentive alignment. Oftentimes, most notably recently with social media, you have private corporations driven by usage rates or by how much capital they're pulling in, whether for the sake of more development or purely out of a desire for more capital, and that incentive misalignment creates a situation where the individuals, the consumers, ultimately get hurt. So on this continuum we described, where on one end there's pure nationalization of the entire technology and on the other end there's no regulation at all, can you talk a little more about where you sit, and what that regulation might look like if you do think it's somewhere in the middle, like you kind of said?

Nathan Labenz: I would probably honestly oppose most of the ideas that rank-and-file politicians are going to come out with for regulation. Like, I want my self-driving cars. I don't want them to say ChatGPT can't give you medical advice. I don't want them to say you can't get therapy from a bot. A lot of that stuff, I think, ends up just being guild-style protectionism, and it hurts kind of everybody. Basically, I think what government should do is try to solve the race-dynamic coordination problem and try to minimize the truly extreme risks. It does seem like, right now, over the next couple of years, the companies believe they're going to create automated AI researchers and recursive self-improvement loops, and nobody really knows where that goes. I don't think that is a great thing to be happening inside a few relatively small and kind of ideological companies. And I do think it's underappreciated and underdiscussed how ideological these companies are. Even the definition of AGI as something that can do everything better than humans is not a neutral frame; it's a definition with a lot baked into it. There's certainly diversity within the companies; it's not like all of them are successionists or whatever, but there is a strain of that. And there's also a strain of, wouldn't it be amazing if we could just replace ourselves and not have to work anymore? And that, I think, has taken on a bit of a life of its own. So I do think there's a role for the government to say: okay, we're not here to tell you people can't get free medical advice, but you can't run super-high-energy experiments with no transparency, experiments that you yourselves say are risky. That's one of the strangest things about this whole situation, right? The leaders have said for years that there's a significant risk that this goes really badly. So that's what I would like to see government really focus on: is there some way to rein that stuff in, to make sure we have better extreme-risk mitigations in place, better safety plans? It's tough, and there are no great or easy answers here, but that's where I would definitely want the government to focus its energy.

Will Sanok Dufallo: Okay. And then another major governance question that interacts with a lot of these concerns about catastrophic misuse or misalignment: you mentioned the race across American labs, but there's also the race with Chinese labs and the Chinese government, of course. The default high-level strategic vision in DC, as far as I can tell, and in SF as well, I think, is that we're going to continue to race ahead, building AI capabilities. We have a lead over the Chinese labs right now; there's debate around how significant that lead is, but there's no doubt we have one. We'll continue to build this lead as much as we can via export controls and improving privacy and security at the model providers. And then, when it comes time, when we're right on the cusp of extremely transformative AI, or even somewhat into the intelligence explosion, so to speak, we will have time to slow down and coordinate and not push capabilities even further towards superintelligence. Again, that's contingent on having a significant lead over China. An assumption of this strategy is that the worst outcome would be the US and China neck and neck, continuing to race in this prisoner's dilemma where, even if both of us are concerned about misalignment or something, we have to keep racing ahead because otherwise the other side will. What do you think about this high-level strategic picture?

Nathan Labenz: Well, it's tough. I don't want to pretend it's not tough, but my outlook has always been, basically, that I hate that idea. The difference is like months, it's not years, and a few months is not a long time to solve all these problems. We've spent much of the last hour and a half or whatever talking about all the different problems and facets, and we didn't even touch on open source, right? Open source could be a great counterbalance against concentration of power, but if we open source the wrong **** we can't take it back. You may have just put a bioweapon assistant, or even an autonomous creator, into the public domain in a way that's going to have super-long-term echoes, potentially. So that's another whole facet of the problem that you could spend many hours and many, many thousands of pages trying to figure out. So this idea that we're going to have these couple of critical months, and we'll solve everything in that time, and then we'll win somehow, to me just doesn't make any sense at all. I don't want to be naive, and I don't even want to say "China," because I think it's the government of China, and potentially a relatively small cohort at the top of the government of China, that we quite rationally should be at least somewhat wary of. I always come back to this: the real aliens in this situation are the AIs, not the Chinese. We have a lot more in common between the United States and China, as fellow humans, than we do with the AIs. So is it going to be hard to build trust across the great ocean and the great civilizational divide? Sure, it's going to be hard. We've got ideological differences in government and many, many other challenges. But I would strongly advocate for starting that work now. For my part, I always want to talk to people in China. When another AI podcast does an interview with a Chinese researcher, somebody at one of their top companies, I ask if I can cross-post it to my feed, because I think we should have a lot more of this researcher-to-researcher communication. And I would absolutely invest in all manner of diplomacy and out-there ideas. There's the CERN project for high-energy physics; could we create some sort of shared, jointly controlled place where researchers from both countries could go to work together on very sensitive problems? It could be some small island in the Pacific, or it could be Singapore, I don't know. But the idea that we're just going to decouple and race each other sounds deeply unwise to me, and I would take my chances with the possibility, at least, that we could come to some sort of shared understanding with fellow humans, versus betting that our recursively self-improving AIs will somehow be better than their recursively self-improving AIs. All of which, by the way, is happening against a backdrop, and I don't like to say this kind of thing because it's not a very popular opinion and it's not something I take any pleasure in, of us looking more like China all the time. You know, what is happening again this week?

Nathan Labenz: We've got the Defense Department threatening retaliation against an American company because that American company wants to hold firm on what I would consider a core American value: not using their technology for mass surveillance. What is it that people typically worry about when they talk about Chinese AI or Chinese values or living in Xi's world? I think one of the big things is mass surveillance, the idea that I can't speak my mind anymore, even in private contexts, because the government is going to hoover up all that information and potentially use it against me. As far as I can tell, that's exactly what the US Defense Department is trying to coerce Anthropic into right now. We're losing the thread here in our delusions of grandeur, in our idea that we're somehow going to be the winners. I was just talking to my seven-year-old last night about this, because he was asking questions like, why is there a war? How does that happen? And I said, well, almost always, the people who start it go down in history as the bad guys, and almost always, the people in the country that started the war, who were convinced it might be good for them, end up regretting it. Unfortunately, that feels like the trajectory we're on right now: we're talking ourselves into the idea that if we can achieve strategic dominance, then we'll make them an offer they can't refuse. And we're forgetting that, first of all, we're losing ourselves in the process, and second, they get countermoves, right? Taiwan is a lot closer to them than it is to us, and those fabs are easily destroyed and not easily put back together. So I really don't like that theory at all, and I try to advocate for something more conciliatory wherever possible. That's not to say we should be totally naive; it's good to have leverage. You could maybe get me on board with a policy of, let's not sell chips to China, but let's rent them freely to Chinese companies, which is somewhat the policy we have; I mean, they can buy compute from hyperscalers outside of China. I just wish the momentum on this weren't what it is. It's America at its worst: we think we have this enemy, and we think we're going to unite against them to be the best civilization, and that just never seems to go super well. So I hope to visit China sometime in the next year and do my tiny little part to participate in civilization-to-civilization understanding. I don't expect to move the needle, but as a gesture, if nothing else, I really think more people should be doing more of that kind of thing.

Will Sanok Dufallo: Very cool. Okay, yeah, glad we asked you that. I think that's an important perspective, and definitely not one that you hear articulated with that much clarity or depth very often. So thank you very much, Nathan. This was an amazing, very wide-ranging podcast, and I think we got into some good depth on some crucial issues. For all of our listeners who haven't heard of The Cognitive Revolution, absolutely go check it out. Nathan is one of these people who has a technical background, thinks clearly about risks from AI, but also strongly believes in the promise of technology, and he has just a massive diversity of conversations with people across various subdisciplines within the field. So yeah, thank you so much, Nathan. This was an amazing conversation, and we really appreciate you coming on today.

Owen Zhang: Yeah, thanks so much, Nathan.

Nathan Labenz: That's very kind. Thank you guys.

