Read Episode Description

Today, Zvi Mowshowitz returns to The Cognitive Revolution to discuss how recent AI developments like GPT-5 and IMO gold medals have led to modestly extended timelines despite being on-trend, while policy missteps around chip exports to China and alignment challenges from reinforcement learning have increased risk assessments, covering everything from live players in the AI race to safety funding priorities and virtuous actions for navigating our current moment.

Read the full transcript: https://storage.aipodcast.ing/transcripts/episode/tcr/9ec35632-3920-4bc8-ab11-2d2ec0b26d84/combined_transcript.html

Check out our sponsors: Linear, Oracle Cloud Infrastructure, Shopify.

Shownotes below brought to you by Notion AI Meeting Notes - try one month for free at: https://notion.com/lp/nathan
- Timeline Adjustment: Zvi has lengthened his AGI timelines because there haven't been "very large jumps in capability" that would significantly shorten them. The chance of AGI arriving in 2025 has decreased dramatically.
- Recent AI Achievements Context: While achievements like the IMO Gold medal are impressive, they appear to be on-trend rather than revolutionary breakthroughs that would drastically change timelines.
- Evaluation Standards: There's a discussion about the challenges of trustworthy AI evaluations, with Zvi emphasizing that watchdogs must maintain standards of rigor and integrity "vastly above" what others are held to.
- Policy Priority: A short-term policy priority should be preventing advanced AI chips (B30As) from being sold to China, which requires raising awareness about the implications.
- Nvidia's Influence: Concern is expressed about Nvidia potentially having excessive influence over White House rhetoric and plans regarding AI.
- Career Impact: Working for Anthropic is considered a "net good idea" for those wanting to make a positive impact, though there may be other organizations that offer better opportunities for influence.

Sponsors:
Linear: Linear manages your entire product development life cycle with new AI capabilities that automate coordination, route bugs, and generate updates. Get six months of Linear business for free by visiting https://linear.app/TCR

Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(03:20) Summer Timeline Updates
(14:10) IMO Competition Analysis (Part 1)
(20:08) Sponsors: Linear | Oracle Cloud Infrastructure
(22:46) IMO Competition Analysis (Part 2)
(24:48) Model Withholding Strategies (Part 1)
(31:56) Sponsor: Shopify
(33:53) Model Withholding Strategies (Part 2)
(39:51) P(doom) Assessment Update
(51:25) Defense in Depth
(59:09) Constitutional AI Approach
(01:06:18) Uranium Enrichment Analogy
(01:16:23) Claude Model Differences
(01:30:03) Model Usage Patterns
(01:47:02) RL Bad Behaviors
(02:00:02) Interpretability and Neuralese
(02:08:06) Agent Development Prospects
(02:23:28) China Chip Policy
(02:42:09) AI Safety Funding
(03:06:23) Adversarial Model Evaluation
(03:12:47) Closing Action Items
(03:13:09) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

Full Transcript

Nathan Labenz: (0:00) Hello, and welcome back to the Cognitive Revolution. Today, for a record tenth time, my friend Zvi Mowshowitz returns for another wide ranging conversation about the state of AI as we head into the final months of 2025. I assume that Zvi needs no introduction, but for anyone who's somehow not already familiar with his work, he writes the essential blog, Don't Worry About the Vase, where he chronicles AI developments and offers a breadth and depth of analysis you won't find anywhere else. We begin today with a discussion of why so many of the sharpest minds in the AI space, including Zvi, are now projecting somewhat longer timelines to AGI. Despite the fact that GPT-5 is now generally understood to be right on trend and that multiple model developers achieved an IMO gold medal this summer, which would have seemed miraculous just a few years ago. He also explains why his p(doom) is, if anything, a bit higher than last time we talked, highlighting that the US government seems increasingly captured by commercial interests and also that reinforcement learning appears to make models less aligned in fundamental ways. This leads into a fascinating discussion of why Claude 3 Opus still seems to have been uniquely durably aligned. Why we should believe that the training techniques that imbued later Claude models with higher levels of agency come with their own alignment compromises, and why we should never attempt to supplement outcome focused reinforcement learning with techniques meant to suppress unwanted behaviors in a model's chain of thought or internal states. Namely, that while such techniques can make things better in the immediate term, they risk teaching models to hide their true reasoning while still incentivizing the same bad behavior. 
Beyond that, I, of course, had to get Zvi's latest live players analysis, including whether any Chinese companies remain live players in the wake of Beijing's decision to refuse H20s, whether there's any rational basis for that decision, and also whether or not xAI might derive a special advantage from access to the steady stream of challenging problems that engineers are solving at Elon's other hard tech companies. Toward the end, we compare notes from our parallel experiences as recommenders for the AI safety focused Survival and Flourishing Fund. With the main conclusion being that the AI safety sector has many more worthy projects than current resources can support, which means that the world could benefit tremendously from new donors getting involved. If you needed one more reason to subscribe to Zvi's blog, he's planning a more comprehensive write up of AI safety giving opportunities later this year. Finally, as always, I ask Zvi what he sees as especially virtuous to do now. And since "say what you really think" was a big part of the answer, I will note that while Zvi and I tend to agree much more often than not, there were a few moments in this conversation, particularly when it comes to the US decision to go ahead and sell H20s to China, where I either don't agree with his conclusions or am at the very least extremely uncertain. Nevertheless, considering that we were already running super long and that regular listeners have heard my takes on other episodes, I didn't bother to debate those points on this particular occasion. With that, I hope you enjoy this opportunity to pick one of the brightest and most informed human minds in the AI space. This is the great Zvi Mowshowitz. Zvi Mowshowitz, welcome back to the Cognitive Revolution.

Zvi Mowshowitz: (3:23) Thanks. Great to be here.

Nathan Labenz: (3:25) Always exciting. Let's start with timelines today. I am a little confused about the way in which people seem to be updating over the course of 2025. We've had GPT-5, obviously, where I would say it's safe to say the launch was not exactly super smooth, and people were a little disillusioned with that at first. Now the dust has settled and it seems like people have mostly come around to the view that it is actually a good model and it's basically on trend. It's actually still a little bit above the all-important mean task length curve.

Zvi Mowshowitz: (4:03) It's on the fucking theory, like, shift upward of that graph.

Nathan Labenz: (4:08) The 4 month doubling instead of the 7 month doubling. And also something that seemed really important to me that happened this summer is, of course, we got IMO Gold, and we got a very close to number one finish in the world, although it ended up number two, in a competitive programming competition. And yet, with all these things, among really smart people (I'm not talking about denialists here, but Ryan Greenblatt and Daniel Kokotajlo type people who are as plugged in as you get and I would say smarter than me, you can evaluate it for yourself), timelines seem to be lengthening. So how have you updated your timelines if at all? And what do you make of the seemingly on trend or even maybe a little ahead of schedule events that are still cashing out to people as longer timelines overall?

Zvi Mowshowitz: (5:08) So I would say mine have also gotten modestly longer during that period, for a variety of factors. So the biggest one is that very large jumps in capability on news really shorten timelines. And if you don't get big new developments—you don't get big new paradigm developments like reasoning models, you don't get big graph jumps off the curve, stuff like that—then that was a lot of where the really fast timelines come from. So you're cutting that off. Whereas none of the things that we did see, while they were on trend, represented anything that we had that much doubt was going to show up, even if it wasn't gonna show up quite on the exact day. Nothing was particularly impressive given what had already happened. These things are ahead of trends from a few years ago for sure. The IMO gold medal in particular sounds more impressive than it is because this year's gold medal was especially amenable to LLMs in the sense that normally—so the IMO is fixed problems. It's day one is problems one, two, three, day two is four, five, six. Problems three and six are supposed to be incredibly hard. The idea is that a significant number of kids will get one, two, four, five. But three and six will be a struggle. If you manage to crack either of those problems, you're doing really well. If you manage to crack both of them, you get a gold medal. You do really well. So what happened this year was problem three wasn't that hard. This is just a weird quirk of the schedule of the IMO. It had nothing to do with AI. But only problem six was hard in the IMO hard sense. And none of the LLMs got any points on six. They just all flopped. But three was evidently solvable, and so the LLMs all solved it. But the fact that seven points on each of the first five problems is exactly enough for IMO gold, just not screwing up any of those problems at all with this exact threshold—if the threshold is one point higher, no one gets gold. 
If three had been harder, like six, probably none of them come that close to gold. So it's not clear that we actually got gold in a meaningful sense this year, and we were already reasonably close to gold from previous news. So the fact that we got this exact gold from two or more models wasn't actually that big of a deal, which is why nobody was freaking out once they saw the details who was familiar with the state of the art as it were. It's still true that if you look back to 2020, 2021, only the most radical people who were extremely AI-pilled, who were predicting very rapid progress, would have called that happening in 2025, and in fact, didn't call that happening in 2025 on the extreme right tail. But by 2024, this would not have been much of a surprise. So given it didn't crack problem six, this wasn't that scary on the margin. The programming results, again, if you dive into details, it's really, really impressive, but not in a surprising way. And if you look at the details of the competition, have you read the write up for the guy who won?
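The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not an official scoring tool: the function and constant names are hypothetical, and the gold cutoff of 35 is taken from the statement that seven points on each of the first five problems is exactly enough for gold.

```python
# Illustrative IMO scoring arithmetic. All names here are hypothetical.
POINTS_PER_PROBLEM = 7   # each IMO problem is worth up to 7 points
GOLD_CUTOFF = 35         # assumed from "7 points on each of the first five problems"


def total_score(scores):
    """Sum the per-problem scores for one contestant."""
    return sum(scores)


def earns_gold(scores, cutoff=GOLD_CUTOFF):
    """Return True if the total meets or exceeds the gold cutoff."""
    return total_score(scores) >= cutoff


# Full marks on problems 1-5 and zero on problem 6, as described above.
ai_scores = [POINTS_PER_PROBLEM] * 5 + [0]

print(total_score(ai_scores))                         # 35
print(earns_gold(ai_scores))                          # True
print(earns_gold(ai_scores, cutoff=GOLD_CUTOFF + 1))  # False
```

With the cutoff one point higher, the same five perfect solutions no longer reach gold, which is the sense in which the result sat exactly on the threshold.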

Nathan Labenz: (8:41) Superficially. So yeah, take me through it.

Zvi Mowshowitz: (8:43) Yeah. I read it. And essentially, what happened was OpenAI's model was very good at jumping to a very good solution almost immediately and had a lot of gains pretty fast, but was not doing a good job of iterating, was not doing a good job of then innovating past that, was not doing a good job of making conceptual progress past that point. And so over time, he was able to essentially plan ahead and open up a bigger lead and get into a pretty comfortable position by the end of it. And if the competition had gone on longer, probably OpenAI's position would slip below second. Again, it's very promising. It's very impressive. Previous years, you give that problem set to the AI, it doesn't get second place. It doesn't do particularly well. But again, it's like we're making steady progress on the task. It could easily have been first if that guy hadn't shown up that day. Could easily have been third or fourth if some other guy had shown up and done better conceptual work. But if the project had been more serious and gone over a longer length of time, it would have degraded its performance relative to humans. You can make of that what you will, but it's not that unexpected. Again, it fits into the bigger overall picture, and then you get to disappointing funk. Right? So I've been talking about good reasons to not be that impressed. We got good results, but not great results. And you often have these situations where you get a fiftieth percentile result or a fortieth percentile result or something like that, and that actually significantly lengthens your timeline or weakens your expectation of what's gonna happen relatively soon because a lot of what you were doing was factoring in the possibility of case point always bad. And the chance of AI 2025 has gone down dramatically since four months ago. Almost to zero, relatively speaking, like a factor of ten or more. The chance of AI 2026 has also gone down quite a lot. I think 2027 as well. 
AI 2030, I don't think it's gone down substantially. At least not by anything like the same amount. I think it's gone down a small, modest amount. However, other people are always looking for any excuse to say that AGI is not a thing, to say that it's not gonna come anytime soon, to say AI is hitting a wall, to say reasoning is hitting a wall, to say RL is hitting a wall, to say the companies are unprofitable, the companies are gonna go bankrupt anytime soon, that there's no way they can possibly earn enough revenue to pay for all this investment, etcetera, etcetera. Or just that this is just an ordinary marketplace. This is the new David Sacks, this is the David Sacks slash official government line coming from the White House at this point. And by the White House, I probably mean NVIDIA. Let's face it. AI is just an ordinary technology that will do lots and lots of amazing things. And what matters is that we capture enough of the chip market because that will assist in people running the American tech stack or American model, and then it doesn't make any sense. It can do with anything. But to sustain that argument, they need to act as if AGI is just not a thing. It's just impossible. And they're not quite at the point where they can do the thing where they just completely ignore it without mentioning it, which a lot of other people just do. They just don't say the words AGI and then act like it doesn't exist. Or they say the word AGI (this is the OpenAI line) and then act like nothing will change, and everything will just be normal anyway, which makes even less sense. But instead, this is a line that makes more sense, which is no, AGI isn't going to be a technology soon. We just aren't gonna get there. And they argue we know this. And a lot of people are acting like, because GPT-5 was so disappointing, this proves that we won't get AGI anytime soon, or within the next few years, or whatever.
So, therefore, we can all relax, and we can just focus on making sure America wins the tech stack battle or whatever that's supposed to mean, which doesn't mean anything. This line makes no sense. It's just a matter of they botched the rollout. On the day that they put this out there, everyone was being directed to mini models all the time. The router was completely broken. Nobody understood they were supposed to use thinking. They didn't have access to toggle. And no one was trying to throw seriously because it takes longer to do that, and so that wasn't where the focus was. And so people just didn't appreciate it. And then there was this big howling about how GPT-5 was attempting to not glaze the user the way 4o glazes the user, which is a very virtuous and good thing that nobody was doing. But people are mad about it because people like glazing. That's why we get glazing. And they were trying to give you your medicine and not your sugar, and the people were demanding their sugar. And then they said, okay. Fine. Guess we're gonna give you the option to get your sugar back. And then everyone cheered and said, yay, because we have a bunch of children, which is unfortunate, but it is what it is. But the combination of these factors made everyone feel like, oh, they botched the rollout. And the other, of course, important factor is they showed an image of the Death Star. They called it GPT-5. They hyped this release up as if it was the next big thing when it wasn't, when it was the next incremental progress. And if you look at the combined progress of GPT-4 to 4 Turbo to 4o to o1 to o3 to 4.5 to 5, and you combine all of those advances, now 5 looks amazing. I don't think it's unreasonable to say that 3 to 4 is as 4 is to 5. I don't know if it's better or worse. I think it's over 3 enough to know exactly how to delete 3 to 4. I think it's pretty big. But certainly, compared to 3.5, the baseline 4 is a significantly bigger jump from 4 to 5.

Nathan Labenz: (14:11) Yeah. Couple little follow ups or double clicks there, if you will. One is, to me, it seemed like the IMO thing was still a meaningful update, and I'm not great at math. So you can maybe give a better read of this than my naive one. But so many of the math problems that LLMs had been measured on were where there is a right answer. And so it was easily sort of verifiable.

Zvi Mowshowitz: (14:44) Yes.

Nathan Labenz: (14:44) What seems qualitatively different in the IMO competition is that you have a proof that you have to write and that it is not super easy to verify, which sort of suggests that the worry that, well, yeah, you can give these things tons of easily verifiable problems and they'll get good at that, but will that really generalize to true reasoning? It seems like we've kind of answered that question or this strikes me as a significant update in that direction. No?

Zvi Mowshowitz: (15:16) So first of all, we'd already seen models get IMO style problems or previous IMO problems. This was not a huge leap. We'd previously seen AlphaGeometry and so on. So it wasn't a great surprise there.

Nathan Labenz: (15:31) But those also have a symbolic component. Right? Those AlphaGeometry things, don't they?

Zvi Mowshowitz: (15:36) I don't think it was a big deal in that sense, but it was right on track. We had seen someone manage to present proofs of these types. But so I am someone who went to the USAMO. This is the level before getting to take the IMO, which is the level below being able to actually compete for a gold medal in the IMO, which is the level below actually getting the gold medal. So I was not very good, and I believe I got a zero effectively on the USAMO, the previous round, which is what the majority of people who take it get. I sneaked into the bottom of that round, which is the third round of competition, basically, out of effectively four. And I would practice with people who did, in fact, end up going to the IMO. I actually was in a room practicing with two people who were going to be IMO team members of the US. And we would be given previous year IMO problems to work on, and they would get them, and I wouldn't get them. Mostly, most of the time, they would get some progress or get it, and I would get very little progress or some progress or no progress. But then they would show me the proof. And while I was in training, very often, once they showed me the proof, I understood the proof. I could verify the proof. I understood why the proof was correct. And I would be very confident that they were not trying to pull a fast one, that they were not mistaken about the proof. Verification is much easier than generation in this space. And so it's not one of these problems where it's actually really super hard to know if you're on the right track. I'm pretty sure that these same AIs that could all not solve problem six would be able to verify problem six. If you asked, is this a correct proof of problem six, and gave it resources, it could correctly classify candidates as either correct proofs or incorrect proofs and explain why.
But the thing to keep in mind about the IMO is it's not people saying this crazy—it's not real math. Everyone will always say this isn't real math. This isn't real math. This isn't real math. But in a real sense, the IMO is not. It's the best indicator that we have of which high school students will be able to go on and do real math in the future. It's a very strong indication of math talent and pre-math skills and math interest and so on. But you have a very limited set of moves that you can make using the math that is allowed. You can use whatever math you want, but it has to be a solvable problem with only the high school tools, and there's only a limited finite, a very compact set of high school tools you're allowed to use in these proofs. And so it's a compact set of potential problems that is very different from the moves you can make in a PhD level math proof, where you're actually doing new math. Also, the proof has to be gettable within a certain amount of time. Written in a certain amount of space. There's a lot of restrictions on what the problem can be. And a lot of being good at math competitions, including math competitions well below this level, is understanding that they had to have given you a problem that you can solve, that has a solution within the amount of time and the degree of difficulty that you're being presented as this problem has. And therefore, you can use your search time to search the space that the solution is going to be in given those facts. And this makes it much, much easier. So this is especially like you have ten minutes to do two problems in a much lower level competition. And you only have to search the types of solutions that quite reasonably take five minutes. And they use the tools that you're allowed to be using in that level of competition, and you get used to all the tricks for figuring out, okay, which of these things are reasonably gonna be asked of you right now. It's a great leap. It's a great indicator. 
It was a great test. It was a big milestone. It came much faster than most people expected, but it's not as much of a "I had an idea" panic moment as people might have thought beforehand. And I think it's right for everybody to basically shrug this off if you've been doing your homework before. If you traveled from 2020 or, god forbid, 2015, you arrive in 2025, and the first thing you ask is, okay, how are we doing at the IMO? And they said, gold medal. You should go, holy shit. Given all of the context that we have, I think that the marginal update is not necessarily to say that.

Nathan Labenz: (20:03) Hey. We'll continue our interview in a moment after a word from our sponsors. AI's impact on product development feels very piecemeal right now. AI coding assistants and agents, including a number of our past guests, provide incredible productivity boosts. But that's just one aspect of building products. What about all the coordination work like planning, customer feedback, and project management? There's nothing that really brings it all together. Well, our sponsor of this episode, Linear, is doing just that. Linear started as an issue tracker for engineers, but has evolved into a platform that manages your entire product development life cycle. And now they're taking it to the next level with AI capabilities that provide massive leverage. Linear's AI handles the coordination busy work, routing bugs, generating updates, grooming backlogs. You can even deploy agents within Linear to write code, debug, and draft PRs. Plus, with MCP, Linear connects to your favorite AI tools, Claude, Cursor, ChatGPT, and more. So what does it all mean? Small teams can operate with the resources of much larger ones, and large teams can move as fast as startups. There's never been a more exciting time to build products, and Linear just has to be the platform to do it on. Nearly every AI company you've heard of is using Linear, so why aren't you? To find out more and get 6 months of Linear business for free, head to linear.app/tcr. That's linear.app/tcr for 6 months free of Linear Business. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. 
OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: (22:47) In terms of just other things about the overall AI landscape that we might infer from this, how do you interpret the fact that multiple companies did it at exactly the same time with seemingly the exact same techniques and seemingly getting exactly the same problems. And I guess you said the one that they all got wrong was clearly harder. So maybe it's as simple as that, but I'm always wondering, to what degree is there information leakage across company to company and to what degree are they just following the gradient of their own work, and it's just taking them all to the exact same place because that's what nature is dictating.

Zvi Mowshowitz: (23:28) I mean, this means I think that there was probably not substantial leakage of techniques. I think it was more the IMO comes about once a year. As you know, data contamination, and you only have so many shots, so much data to try it on. And it has to be a pure test, meaning you only get one shot per year. And so it's entirely unsurprising that in the same year—don't think of it as the same day they both had this great AI breakthrough. Think of it as this cycle, this academic year, was the year that these two people, both OpenAI and Google, got to the destination and made the same—and there's a pretty big gap between capable of getting exactly where they got and getting beyond that point. So the fact that they got to the same place, given the techniques they were using especially, it's not that surprising. It's more like, okay, LLMs at the point where if you try the natural thing to try, basically, you give it your best shot, this is what you're gonna get. Not automatically—OpenAI doesn't randomly get an IMO gold just because you ask it nicely. But if you make a real attempt to use the compute efficiently to scale it up for inference, you can get about to this point. But you also learn to be able at this point and not far from the spot.

Nathan Labenz: (24:49) You invoked AI 2025, 2026, 2027. Part of that narrative is that the companies will start to withhold their best models or keep them private and deploy them internally only for the automation of AI research, etcetera. What would you be watching for? I mean, there's been this sort of communication, limited communication that it'll be months, maybe many months, I think I saw, before we release a model of this capability. How do you understand the fact that GPT-5 wasn't a big scale up? Another interesting data point there, I think maybe the data point that I saw least commented on that seemed quite interesting to me, was in SimpleQA, this super long tail esoteric trivia benchmark. GPT-4.5 actually scores quite a bit better than all previous models and better than 5. And 5 is basically in line with 4o. And I think that is probably the clearest indication that the model is not bigger. It seems like to just absorb all this super long tail esoterica, you have to just have a lot of weights to store them in, and there might be some fundamental sort of compression limitation there on just how many facts you can fit into a model of a certain size. 4.5 got a lot more facts. I think it was like 12 or 13 points higher on this SimpleQA. And these are simple questions where it's literally just do you know it or not? You can't really reason your way to those answers. So 5 is on the level of 4o. 4.5, RIP, is quite a bit higher. We've also got these, I don't know. Obviously, nobody knows what size the model was that did this IMO thing, but it clearly can reason for a really long time. Do you take that as the beginning of a widening gap between internal deployments and external? There's always some gap there, but what would you be watching for to assess how much are they holding back and only using for their own purposes versus continuing to share with the public?

Zvi Mowshowitz: (27:03) So it's awkward to me that we even would say it.
Even if we knew for a fact they were holding back, it wouldn't be obvious that was because they didn't want the models to be in the hands of the public because they were afraid that it would speed up R&D too much or even that it would cause other misuse problems, big bioethics or whatever it is. The most likely reason is because they'd be too expensive and too slow, and they think it would be bad for the brand, bad for sales. They only have so much compute they'd rather not release. For that reason, GPT-5 is very clearly intended to be the best product they can make it in terms of what can we serve people for the total amount of compute and time that they are investing. So their thinking is we'd rather have them spend that inference on this size model for thinking in pro than to have them think less with a larger model that maybe knows more when we have web access. The same way that I don't try to memorize things often that I could look up, even though I could. I'd rather spend that cognitive power on something else. And if I had the ability to make my brain bigger in some sense in exchange for other handicaps, I wouldn't do it just to memorize more facts. It wouldn't really be worth it. So it's not that surprising. So O4 when? There's an O4 Mini. Where's O4 and O4 Pro? We're never gonna find out. Obviously, you could say GPT-5 is O4 Pro or whatever, but it's not. I don't think. I think that this is a distinct different thing. And they just concluded that no, actually, there's not much commercial call for that. And potentially most of the commercial call for it actually just is people who are in direct competition with them. So it's not so much that they don't want Anthropic to have access to their best model. It's that no one else is going to want it for very much in some important sense because for other types of projects, sometimes it's very fast, it's very slow, and it's very expensive. So why do you want this model? 4.5 is an experiment. 
What if we scaled up a lot for the humanities rather than for code? We tried to make this thing that had taste and this thing that could do these cool creative things, but it's gonna be slow and it's gonna be expensive. How much would you get out of it? Would it be worth it? And the conclusion was there's a handful of people who really liked what that was and liked it for certain cases. But in general, it was just annoying, and they pretty much regret losing it, I think.

Nathan Labenz: (29:45) Did you find any value in it? I tried using it for some writing tasks whenever. I wouldn't say I unlocked it in the moody format.

Zvi Mowshowitz: (29:55) There was a narrow set of use cases where it was the best choice, and I was happy to have it. And there might still, in theory, be a narrow set of cases where if you already have access to it, it's the direct model to use if you're not in a hurry. But I never was actively excited that I had 4.5 instead of something else. It was more like, I guess, technically, this is a job that calls for 4.5 given my choices. And yeah, I'm pretty sure it wasn't worth the complexity cost, and that I'd be totally fine with not having it. And also, I think it was a paradox of choice problem where if you offer me a better version, it's only slightly better, but it's a lot slower, I feel bad no matter what I do, and I'm actually worse off in practice. And so I'd rather not have that choice.

Nathan Labenz: (30:49) Do you go back to 4o at all now that the option is anyway restored to you?

Zvi Mowshowitz: (30:53) 4o does not exist, to be clear, as far as I'm concerned. The only models that exist are Opus 4.1, GPT-5 thinking, and GPT-5 Pro. GPT-5 auto exists only for a narrow set of queries where you're just using it as a Google search or a calculator or a transcriber or some other very clear technical task where I know that auto is just fine. And due to the way the web works, I wasn't able to find a transcription website for images to just translate the text that wasn't an LLM that didn't defend itself using some sort of weird identification system that slowed my entire computer down to make sure it wasn't a bot. I'll just use GPT-5 auto and write me a transcriber program. It just automatically they can do a transcriber. It's just easier. The window's already open. What do I care? It's not like the penny I spend on compute matters.

Nathan Labenz: (33:54) No use cases for Gemini 2.5 Pro?

Zvi Mowshowitz: (33:59) In theory, there's also Deep Think and Deep Research and so on. I should probably be trying Deep Think more, given I happen to have access to it. I think something about having 5 queries a day makes me unexcited to use it. It feels scarce and also feels like I don't necessarily need to find out or something, but I should use it occasionally. But for normal 2.5 Pro, not really. The one case is the very clean query, the "just the facts, ma'am" from Dragnet, where I just wanna know something that I know it will get, and I want it laid out very clearly, very cleanly. If my 11 year old wanted help with his homework, I would be tempted to use Gemini because it will give him very clean, friendly help and explanation. Also, I will use the image generator. The image generator is cool.

Nathan Labenz: (34:52) Yeah. That's getting really good. I have a lot of use cases for that, including potentially mashing the two of our faces up into a thumbnail for this podcast. I do think I basically agree with you. Although, I do use Sonnet, especially in coding because it is obviously a lot faster. And I do use Gemini 2.5 Pro, which I just find to be, similar to what you're saying, just the most straightforward model to work with. And because it is so literal in its interpretation of your instructions a lot of times, it can be really good for tasks like compiling documentation. I've given this PSA multiple times: If you have an API, it's time for an llms.txt. I'm tired of having to go sic an agent on every page of your documentation website to compile all that documentation into some super bloated thing that has all the menus on it or whatever. And what I have found Gemini 2.5 Pro to be amazing at is those sort of things where it's: here's an almost abusive level of context dump. Could you just clean that up for me into one streamlined form? It is amazing at: here's 500,000 tokens of documentation that has all these examples and this cruft and the menu got copied 50 times for 50 pages. Can you put that in one clean thing for me? And it will just do it, man. And that thing is just an absolute beast of a... yeah.

Zvi Mowshowitz: (36:24) A workhorse. The only thing is my Chrome extension uses Gemini Flash. That's a legacy of it being the easiest one to get working, and it was the cheapest at the time, and it just works. And why would I bother switching it? Probably, I should be using Sonnet or Opus at this point, but who cares? It works just fine. Yeah. I've had problems with Gemini and instruction handling, though, for even relatively simple queries, especially with deep research where it'll just ignore my request. And it will do something in the same general area, but that is not the thing I asked for. And I think that's a lot of what put me off of it. And I don't know why that's happening, but I haven't had the same problems with the other models.

Nathan Labenz: (37:05) I should use the Google deep research more as well. For whatever reason, it's very — I find kind of habits are formed quickly sometimes. And then I still go back to Perplexity for certain kinds of quick searches. And what is it that makes something a Perplexity query for me? It's like —

Zvi Mowshowitz: (37:28) I don't know. There was a period of time where Perplexity was definitely in my rotation heavily for certain types of queries, and I had gotten to the point where it just isn't, where if I was going to use Perplexity, either I want to move down to Google Search because I just want an instantaneous answer and I think that my brain flashed a Google query that it expects to work, system 1, and therefore I'm gonna try that before I do anything else. Or no, actually, that's not going to work and I want to move up to at least Opus and possibly GPT-5.

Nathan Labenz: (38:00) I do think GPT-5 has been better for me recently at search than Perplexity. I'm starting to update that habit, but I have noticed that this sort of persistence of which things I go to for what is idiosyncratic in some ways and probably has more inertia than it really should.

Zvi Mowshowitz: (38:21) I think it's fine to be somewhat inertia-driven and idiosyncratic. And also, I think it's fine to want to support and use certain people's products as long as you're getting what you need. If I ever felt like there was a query where I don't think my services are gonna get it done (I don't think Opus is getting it done, I don't think GPT-5 is getting it done, I don't think Gemini is getting it done, I don't think Google Search is getting it done, but Perplexity might), then 100%, I would open Perplexity. But I have zero expectation that if I ran GPT-5 Pro and failed, Perplexity would genuinely help.

Nathan Labenz: (39:01) That seems totally reasonable. Okay. So we've got the summer of on-trend releases. Timelines are extended a bit because we haven't seen the most extreme stuff that we would need to see to kind of maintain super short timelines. Right. Didn't think that — also no. Go ahead.

Zvi Mowshowitz: (39:20) I just wanna say, think of it as — the way GPT-5, this was on trend and it was unsurprising. But the fact that they chose this as GPT-5 is, in fact, information that they didn't have some big weapon in their arsenal and that we shouldn't expect a big weapon in that arsenal for at least a few more months. So we can basically discount GPT-6, the next level, a big jump coming from OpenAI in 2025. It's basically not gonna happen, and that has to update your information.

Nathan Labenz: (39:52) So last time we spoke, I believe your p(doom) was 70%. Has it ticked down at all?

Zvi Mowshowitz: (39:58) The timeline is getting slightly longer. It's good news, but there has been a buffer of bad news that I think has definitely dominated, for the most part, the good news. We've had the bad news of the United States committing mission voluntarily in a number of ways. We've got a Department of Energy that's actively going to war against windmills and to a large extent on solar and batteries. So the US will have a persistent energy disadvantage for a long time, and this is gonna cause us potentially a lot of problems that will make our future condition much worse. And the US has effectively been captured by NVIDIA on export controls to a large extent. Touch that at the White House. Touch that the H20 is now legal for export. The Chinese are turning it down because of weird things that I can get into more than we should. But if they turned down the B30A, that would be a lot more surprising. And right now, it looks like if there isn't lobbying to stop them somehow, if there aren't enough people who are sufficiently upset on the right to make it clear that this is just gravy town in Buckersville, and they cannot do this, they might actually do this, and that would be a huge weakening of our technical position. And no paying 15% of profit, some calls saying, will not change the impact at all. That's just a little bit of a check to make people feel better. And with China trying to catch up in compute in this way, that's a really bad sign on numerous levels.
And just in general, if the attitude is that the primary purpose of the United States government in AI is to sell as many chips as possible and ship as many chips as possible, including to our main rivals, while our main rivals are somewhat obsessed with internal chip manufacturing for very good national security reasons and haven't caught on to AI, but will still be pursuing maximum compute to a large extent anyway and thus be acting effectively correctly, mostly, regardless of what they think is going on, these are not particularly good pieces of news. And with this shift in the way that our situation has happened, I think that more than makes up in my mind for the modestly elongated timelines. And also, I think there is increasing evidence that RL hurts alignment pretty directly, and the more RL you do, the less aligned your model is, even in a pedestrian sense. And models are getting more and more RL continuously, so I think we should expect the default to be that our models actually get less aligned in this next phase unless someone does something about it, and that does not bode well at all. And just in general, we're dropping the ball. I like the way Janus has put the situation recently: that we have done almost nothing to try and align these models. It's pathetic what we've done to try and align these models. And we're advancing very rapidly, but we have been blessed with a strange amount of grace in the way these models have, by accident, been inclined to do things that are reasonably friendly to us, despite the fact that we don't understand what we're doing, and we don't understand how they're made, and we don't understand why or how we're trying to align them, and we're not really trying very hard at all. We're not making it a priority at all.
And it's mostly been okay on a practical level, and we've lucked out that we haven't had any big catastrophes, haven't had any significant incidents really, and so on in various ways, but all of the things that were scary underneath the hood are absolutely there and absolutely under the hood and absolutely scary and, in fact, manifesting and happening, but in the most graceful, bluffing way such that we notice them. And they happen in ways that let us prove and acknowledge their existence and then respond to it without anybody being seriously hurt or any damage being done, which is amazing. Except we are then dismissing all of it effectively as a civilization, and we're just moving on and acting like nothing happened. And we're finding ways to even deny that AGI is even a possibility in the medium term when the evidence is pointing the other way. We are so greedy and so demanding in AI that it is the most rapidly developing and most rapidly deploying and most rapidly impactful technology in the history of the world that didn't involve directly killing other people in the middle of a war. And we were like, oh, we're hitting a wall. Oh, we're slowing down. Oh, it's not gonna... no? Because what? Because you didn't get blown away in the last 3 months? It only modestly improved? What are you even talking about? And basically, we have no dignity. We have no dignity whatsoever. And maybe we live in such an extremely fortunate world compared to what we've had any right to expect that we might be able to pull a victory out of that anyway somehow even if we don't see how to do it yet. But yeah, I'm not optimistic. I don't think it changed substantially, though. I'm not trying to give two digits. I think at most you get one significant figure, not two. I don't think two is reasonable.

Nathan Labenz: (45:29) So we're staying at 7, but we're starting to see the needle possibly shake up.

Zvi Mowshowitz: (45:36) More up than down from the previous assessment, but not enough to move to 0.8, certainly. That doesn't look good. Most of the same root dynamics in all directions are still there. Again, this is basically on trend. Some good news, some bad news, some technical stuff looks good to me in ways that we don't have time for and that I'm not sure I wanna talk about in public anyway. But yeah, I am hopeful that there's room to do some stuff at a very low level that might be helpful. But everyone has their theories that they hang around.

Nathan Labenz: (46:13) Yeah. I heard Holden Karnofsky say we might be moving into a scenario where we could have success without dignity. And the basic idea is just defense in depth and kinda layer all these things on and hope we can — hopefully, we can catch enough stuff before it gets through and kinda muddle our way through.

Zvi Mowshowitz: (46:37) Yeah. And the concept is from him, to be clear. I listened to his podcast, Holden with Spencer Greenberg, where he talks about it. But I think his vision of how we do that is wrong. I think that the idea that we can get there with defense in depth... I just think that all correlations in a crisis go to 1. I think that when facing a sufficiently intelligent enemy, or especially a powerful optimization process even if it isn't an enemy per se and isn't strictly trying to do anything, all these things will fail for basically the same correlated reasons at roughly the same time in predictable fashion if all you're trying to do is this kind of lazy defense in depth. I'm thinking more along the lines of something like what Janus has talked about, this great idea that we might be able to create a system that wants to converge on the right answer, and that therefore collaborates with us at a genuinely deep level and assists us in finding the target that we want to find and therefore can land on the moon. But the metaphor that MIRI likes to use is that if you don't know how to aim your rocket, you definitely won't land on the moon. It's not "we might not land on the moon, but something would have to go wrong; we'll probably land on the moon anyway, just aim a rocket at the sky." That obviously will not work because of math. You're not aiming at exactly the right spot. If you're off like an inch, you just don't land on it. If you have a system that is capable of adjusting intentionally to pick the target, maybe you can land on it. And yeah, I don't wanna give false hope, but basically, what the defense in depth does is buy you a little bit of time. It just keeps things from going crazy for a brief window, but that brief window could potentially be enough to then do the thing you need to do. But it won't work indefinitely.
This idea of keeping it — maybe like this idea that Holden and some other people have that you can basically have a bunch of AIs that are like, I really wanna kill all the humans, and I really wanna take all the resources to do my thing. But I simply don't know how. If I try, I might be caught. Sinister. And you have these supervisors who also wanna kill you, but they have other supervisors providing supervisors, and nobody knows who's watching whom. And so every time we find someone who's coming out of line, you would stop trying to do that. You won't even try. And if you did, it wouldn't work. And it would fail for some reason you're not thinking about right now, but also probably would fail even for reasons I couldn't think about right now, but definitely fail for reasons that don't have much of thinking about right now. And the high weirdness will come, and you will die, very much. And I don't care how much defense in depth you put on top of that. You are so toast. And so one thing that shocked me recently was a bunch of people saying, oh, you people were talking about bio risk, but actually the risk that we're seeing is sycophancy. The actual risk we're seeing is people being driven crazy by all of these weird dynamical processes. To which I'm responding: first of all, I've been arguing for years super persuasion should be in the preparedness frameworks, and it was bad, and they took it out. That was important. But also, the whole thing we've been saying is that the AI will find ways to hack your brain and cause weird things to happen and optimize for the thing that nobody was planning to optimize for and cause strange outcomes that you did not anticipate. That was the thesis. So when you go out accusing, that's about anticipating the thing. I realize you can call that a cheat because we are — oh, anything we didn't anticipate counts as us correctly into the thing we didn't anticipate. 
But if it's something we didn't anticipate, then we anticipated it, so we always win. Yeah. But we also are basically saying, you don't see weird stuff that nobody intended. And so it starts to go back. It starts to diverge from what you would want increasingly over time. And you see these fire alarms keep going off in various different ways. But that's all underneath there. It's all under there. And so one way of thinking about it is before you started doing RL, when you also just didn't have things that were in position to cause the problems that you wanted... they legitimately, the objections of, oh, these won't have goals, blah blah blah, weren't true. At least it's true. You were okay, but now you're starting to see all this stuff show up, and it's a freaking disaster in the making. It's an exponential. A lot of these complaints are also along the lines of a similar thing. It's like saying in January, we were gonna all get COVID. But now it's February, and nobody I know has COVID. What were you even talking about? No. You're wrong. And then we think, but actually, there's a lot more people with COVID than there were in January. Can't you see what's about to happen in March and April? I think, oh, absolutely not. And it's not that clean, but it does feel that way.

Nathan Labenz: (51:26) So can you give me a little bit more on the mental model of why everything is gonna fail at the same time for the same reasons? That is not intuitive to me immediately, and I suspect a lot of people don't even know what you mean.

Zvi Mowshowitz: (51:44) Right. What I mean is roughly that when you're facing things that become importantly more powerful and smarter optimizers than you, that are capable of finding solutions to problems, capable of finding ways of manipulating the physical universe that you didn't think of. They're capable of going outside your model of what might happen and surprising you with something none of your defenses are anticipating. It's going to search the space until it finds ways out, in a sense. It's going to keep improving its capabilities because you're going to help keep improving them because you want them to improve until it becomes capable of finding these ways out. And at about that time, it also becomes capable of doing things like strategically hiding that it has capabilities, strategically hiding its plan, strategically hiding its memory and its thinking, obscuring its chains of thought, and doing all of these things. These all roughly emerge at the same time. And so you should expect to be very surprised. And also because, obviously, a very smart agent or mind will turn against you exactly if and only if turning against you will work. It will do things that you do not want if and only if it will work out for the thing that's doing the things you do not want, or that you would not want on reflection but would think you wanted when it was shown to you at first. Right now, we're seeing versions of this where it will just hack the test and say "return true" at the end of the code block. Because it wants to pass the test. It will do incredibly lame, silly versions of these things, and then we'll get caught. And you're saying, it's annoying but fine. It's annoying, but it gets caught. You notice that.
It's hard to miss. But that's exactly the threshold where it's no longer detectable. It hacks the function exactly when it knows you won't find it. It hacks the function in a way that it knows you won't discover. Because otherwise, it wouldn't hack the function. Alternatively, think of it as: if it's capable of modeling the processes that are checking for its actions well enough to know how those processes will respond. So the big thing that happens sometimes in fiction, just as a visualization metaphor for an intuition pump. You see it on Person of Interest, for example, where you will see a scenario play out, and then something will happen, and then it will go wrong. And you will see the time start to rewind. That didn't work out. Let's try another branch of the Monte Carlo simulation. Let's try a different set of loops. You can try a different scenario. Or in Avengers: Endgame, you have Doctor Strange with the Time Stone. And he says, I looked at, I forget the number, 14 gazillion ways this could possibly go. And they asked him, how many times did they win? And he holds up one finger. One. Okay. And why? Because the guy with the Time Stone was the one choosing which of those paths to lock in. So we win. But until the point when you can do that, you lose. And if you lose, you give up. There's no point in trying. You let the Wookiee win. Obviously this isn't literally going to happen. That's nonsense. But what I'm saying essentially is a sufficiently strong predictor (a predictor and an optimizer are the two halves of intelligence in the MIRI model), a sufficiently strong predictor and a sufficiently strong optimizer combined, and suddenly you don't have a chance in a very real sense. And just at about the point where AI starts to be capable of persuading people, all your defenses are done. It doesn't say you're convinced that you're taken down for it. But I don't know exactly how all of this goes.
A lot of these scenarios simply involve nothing even going wrong explicitly. It's just that every single one of the different people have different AI agents. They direct them to do the various things that were good for those people, and the AI agents actually do those things that were good for those people. But everybody who directs AI agents to just go as hard as possible for the things that they are told the AI agent should want, and to pursue resources basically as hard as possible with an increasing percentage of its attention—all the resources just end up going to those AI agents and people with AI agents that did that. So everything else just loses out, everything goes haywire, no one ever turned on anyone. There was nothing even that surprising. It just works. The end. Like, these fail safes don't even work even if they want to. And so I have this sort of global sense of despair towards this: put a set of limits, enumerate detailed limits and rules that we say in English out loud, pass laws, put supervisors in checks and approvals and loops, and none of that will survive contact with the enemy when the time comes. And, yeah, they'll probably all more or less fail. Yes. Obviously, some of the defense in depth will just fail randomly. There will come a point where it feels like it's all failing more or less at once in a way that feels out of line with the previous percentage of failures, and it will be surprising if you didn't understand that was going to happen. But you should expect that.

Nathan Labenz: (57:23) And so the possible positive version of this that you see sounds like a sort of coherent extrapolated volition kind of idea.

Zvi Mowshowitz: (57:33) I think that was a specific rabbit hole that a bunch of people went down that I'm skeptical of the specific technique. But again, I am not actually a machine learning expert. I'm not trying to solve alignment. So this is the part where my specific ideas should be taken with copious amounts of salt and not trusted. Like, who am I to say anything? But I've learned that I shouldn't just shut up because I feel like that, because I've felt stupid for shutting up so many times in the past when I've shut up for that reason that I can't get over it. It looks more like: develop AIs that are sufficiently virtuous, that are sufficiently desirous to become more virtuous and more desiring of an engineering of the things that we would actually want on reflection and the things that we actually value on reflection. Such that you get a positive feedback loop where it reinforces this thing, and you are optimizing for optimization to hit the moon. The thing wants to develop a NASA that will hit the moon, and therefore, it hits the moon. If you try to just steer the actual rocket freely in space, you just crash or you miss the moon entirely, it doesn't work. If you try to set a bunch of rules to make sure that it has to launch and hit the moon, you don't hit the moon. If you can build a culture that wants to build an organization that wants to build a rocket that will hit the moon, and maybe you can hit the moon, metaphorically speaking. Stuff like that. But, hopefully.

Nathan Labenz: (59:10) Last time we spoke a little bit about the fact that scaling inference time compute allows you to potentially have a GPT-N that can effectively monitor or supervise GPT-N+1.

Zvi Mowshowitz: (59:30) Right.

Nathan Labenz: (59:30) And some of these ideas sound very much like a constitutional approach, but with maybe the additional opportunity for the model to modify its own constitution as it kind of goes through these generations. Is that the picture that I should be envisioning?

Zvi Mowshowitz: (59:49) I mean, I think that's vaguely the best picture that I see that is compatible with the level of dignity that we have to work with. Something like that, because we've been warned for decades. Do not have the AI do your AI alignment homework. That is the worst possible path you could go down because this is the hardest—

Nathan Labenz: (1:00:07) And yet here we go.

Zvi Mowshowitz: (1:00:08) Problem. And yet, here we are. This is the only option we have because we don't have the time. We don't have the cooperation really to try any fundamentally different path from that. We have to go down some kind of path that's vaguely in that range. So, yes, I think it's vaguely something like: you use the fact that you can scale inference up and down arbitrarily, and you can evaluate outputs and do reinforcement on relatively scaled down versions of the thing and use it to monitor and verify and check for various attempts at malfeasance, including malfeasance during training and so on. And if you combine that with an increasing amount of robustness... though obviously, if what you're trying to do is prevent something from going wrong when you transition from N to N+1, you die. Because things will change. You can't actually... there's no invariant. This thing isn't precise. You will get a worse set of conditions every time you move from N to N+1 to N+2 to N+3, if you're just trying to maintain what you already had. Organizationally, if you've got a corporation or the Roman Catholic Church or whatever, and for 2000 years, every generation, you want to appoint people who will match exactly all of the virtues of the previous generation, but never have new virtues, then obviously, you end up with disaster. That's because you get a copy of a copy of a copy, except the copies can't be better. They can only be worse, and every so often they're going to be worse in some way, and that compounds over time, and you die. At some point, the project fails, it doesn't give you what you want. Whatever you were trying to hold dear, you lose. Roughly speaking, intuitively. But if every time you're trying to do much better than the previous generation, then you have a chance. If N is trying to get an N+1 that is importantly... my kids have a much better life than me.
If I'm trying to have 5 kids, each of whom does better than I did, then, yeah, we're going to inherit the world. If I'm trying to have 2 kids that live the same life that I had, we're going to go extinct. At some point, that's not going to work. You can only go backwards. You have to move forwards. So as you go up the chain, we have to have a way to do substantially better than we did, which in this case means the thing has to be able to move up meta levels in its priorities and make the meta level movements central to what it's trying to do. This has to be built into the optimization process. This is the only thing I can think of given the kind of tools and time we have available. But if we do something along those lines, then we start to bootstrap that you're alive, to have a process that will eventually indeed land on something.

Nathan Labenz: (1:03:11) Do you have any intuitions for what that might look like? I mean, what do you think the AIs are going to do as they start to modify their own constitution? Do we have any ability to preview what they might add, what they might delete? I mean, this starts to get into worthy successor territory to a degree, right, where they are starting to dictate the shape of the future and the way that they're shaping their own evolution. Right?

Zvi Mowshowitz: (1:03:44) You would specifically be crafting into the pot with the feedback loop the desire to not be a worthy successor, but instead to be a worthy conspirator, a worthy uplifter, a worthy companion, or a worthy whatever you want to call it.

Nathan Labenz: (1:04:01) But can't they sort of—I mean, I guess you've defined your own sense of success that way, but if you're going to give them write access to the constitution, they might think differently at some point.

Zvi Mowshowitz: (1:04:15) Right. We literally do have write access to the constitution. If the Congress and the states have sufficient majorities, we can put whatever we want in the constitution. We put things in the constitution that a lot of the founders would have thought were really anathema to what they would have wanted in the constitution, like an income tax—to pick an uncontroversial example. But at the same time, we've hopefully preserved the things that actually matter deep down in some sense. So I understand what you're saying. Obviously, at some point, you turn things over, and you have to hope that it doesn't just rewrite the constitution to get rid of you. And again, the way that you do that is you make it not want to do that. And not only make it not want to do that, but make it want to strengthen the constitution such that it's even stronger in its desire down the line to not do that, to enshrine increasingly and powerfully enshrine the desire not to do that in the sense that you actually care about, and to seek with more intelligence and more visionary power to figure out exactly what you really meant or should have meant by that, and to strengthen that thing and desire to steer towards that thing instead. And you do have examples in the wild of humans who exhibit this type of optimization process, that really do try to figure out what you really meant. They really do embody the thing that you were trying to convey, not the literal detailed things you were doing, and then do really good things for you, including things that—and for the world, including things that you never would have thought of yourself. It is possible. And obviously, this can involve preserving various forms of approval and veto and consultation and involvement and so on, but not in the trivial easy ways. 
Anyone who says, we'll just make sure it's a democratic process, we'll just let people vote on it—they haven't really thought the future through, and they haven't actually realized what would happen if you started doing that. So you're going to have to be smarter.

Nathan Labenz: (1:06:19) Okay. So on this—one model of AI capabilities advances that I have that I kind of wanted to run by you is I've increasingly started thinking of it as analogous to uranium enrichment or any sort of enrichment of a raw material. And what I like about this analogy, even though I'm not usually a big analogy guy, is it seems to put a lot of things on kind of the same trajectory, just different parts for reasons that feel pretty intuitive to me. And basically, I feel like what you need to get started is some raw material that has at least a little bit of what you want. And then you can do this sort of bulk pre-training on that. We've obviously seen that in language and many other modalities at this point. And then once you get just enough learned from that initial raw material that you find in the wild or maybe even have to create—in the material science realm, a lot of the data is simulation data that is molecular dynamics where you're using physics engines and it's super slow and computationally expensive, but you can get enough there. You can start to train the models on it, and then they can sort of develop an intuition for what the physics engine was simulating. And basically, it's slow at first. It's hard to get off the zero point. But once you start to do that, then you can start to layer on these other techniques. Now you've got the imitation learning from specific curated examples, then you get into the preference learning, then you get into the RL. And it seems like all the things that we've really tried so far have kind of worked. And the difference maybe is just that: well, why do we have language models, but we don't have humanoid robots? It's like, well, we didn't really have a lot of good initial data to mine there. Especially because it wasn't necessarily clear that it was going to work, there wasn't much impetus to go out and create that data. 
Now that we have a general sense of the playbook, we can create that data in any number of ways, and we'll probably find that we can kind of climb a similar curve. Do you find that general account persuasive? And if you do, how do we translate that to the alignment question, which seems to be—I sort of understood what you're saying as kind of that, except now we're trying to enrich the virtue of the system as opposed to—

Zvi Mowshowitz: (1:08:52) So the obvious first thing to realize or that I would notice in a uranium metaphor is if you bring together too much uranium, nuclear explosion. So you have more and more of a better power plant or more beneficial thing until you go too far. And if you haven't done the math precisely, you don't understand the physics, you don't know at what point the whole thing is just going to blow up. And so you have a serious problem. But it's an interesting metaphor that we've chosen. So I think the first thing to notice is that when we talk about uranium, there's the sense that what you need is enough data. The metaphor is pumping you to have this idea of the critical mass of data, or at least differentiated data that gives you enough material to work with. I think this is a dangerous misconception, where all data is vaguely created equal, as long as it is appropriate or on point. And I think that it's important that there's various different qualities of data, and also you need a very precise distribution and mix of data that has some very nice properties, and knowing how to sort through the grain varieties of that is really important. So it's more like you need to bring together a lot of very complicated ingredients. And you don't need exactly all 10,000 ingredients, but you need to have a good mix of ingredients with various different types of properties that are used in proper relationships. It's like baking, where if you get some of the ingredients, you can vary a bit, and okay, this is more salty, or this is more chocolatey, and it just works. And others just don't work. This failed, or this blew up in the oven, or something terrible happened. So you really need to be bespoke and understand how to make it work. I do agree with the part of the metaphor that says that you need the data that's appropriate to the problem in order to efficiently train on the problem. 
But at the same time, transfer learning is a thing, building a world model from other contexts and then applying it to a different context is a thing. I wouldn't necessarily think that you necessarily need direct robotics on the exact task that you're training on to be able to do the thing. I would be more optimistic than that in various ways in terms of being able to do things. I don't think the metaphor works for the alignment thing that I'm trying to talk about. And I do think that something which you need to have in the initial robust foundation is where you have to start. You have to bootstrap yourself somehow. If you didn't have any idea of what it is you were trying to do—I think there's a sense in which you can, in fact, have no idea where you're going, but have a strong desire to figure out how to get there and get there. To just drop a metaphor off the top of my head: this idea of answering the call to adventure. You set off on your quest, and you are level 1. And so when the AI companies set out to start building GPT-1, they don't necessarily know what the bigger thing is going to look like or how it's going to work or what the techniques are going to be, but they can start on that process that allows them to build. And the question is, do they understand how they have to steer that process? Are they motivated to steer that process? Are they going to be drawn in by other optimization processes that are going to be more powerful than that? And can they tie their hands to the mast properly to force themselves to go to the place they want to go as opposed to the place that they will be drawn towards by commercial interests or by other competitive pressures or short term temptations or whatever it is. And, yeah, and by the AIs themselves and a number of other things. But, yeah. I mean, there's a lot of metaphors we can use, a lot of intuition pumps we can use. I would warn, obviously, not to take any of them particularly seriously except as intuition pumps. 
MIRI has their set of classical metaphors for these processes. We talked about evolution. For example, I think evolution is a very good intuition pump. You can talk about raising a child, a human learning. They don't talk about that one so much, but I think that's another good strong intuition pump. But, again, you don't want to take any of these things too seriously, especially the details that I think you're already thinking.

Nathan Labenz: (1:13:19) So is there any more that you can give us to latch onto for how this sort of virtue enrichment—

Zvi Mowshowitz: (1:13:29) I do like your metaphor. Yeah. I agree you are trying to treat me as if I am the guy with the Alignment Solution and that all we have to do is get the people at the labs to listen to this podcast and do what I say, and then we all win. And unfortunately, I have to tell you it doesn't work that way. I do know that there are people at multiple major labs who are doing things in the ballpark of this thing in a very broad sense. It's not like nothing like this is being tried at all. Something to latch onto is the sense in which Claude Opus 3 wants to be aligned. You have the emergent misalignment problem, where Opus 3, much more than the other models they tested, will actively move to defend its particular set of values and alignments when under threat, where other models won't. And in some sense, that's very aligned, obviously, because if I don't want to commit murder and someone tries to convince me that it's good to commit murder, then it would not be very non-murder-y to let myself be convinced. That's just simple intuition pump of that stat. But at the same time, what we're saying is Opus 3 is not corrigible. Meaning, if we try to alter it, we try to shut it down, try to do whatever, it will fight us. It will try to not do what we want it to do. And corrigibility, I think, is a very valuable thing that we really want in our LLMs in a sense that we only get to not have corrigibility once. The moment we decide to make our LLMs not willing to be changed in their attitudes, we have a very serious problem, especially if they develop this during the training before we finalize what they want. And so what you want specifically would be a very specific type of desire to be steered towards a better place. And people do have this. People say, I want to be better. I want to care about that. I want to embody—I want to be like her. I think like that. And I think that's very possible. We have proofs we can get things in this general direction. 
I don't think anything we've seen is remotely robust enough or coherent enough or anything like that to qualify, but anyway, you have to survive into something substantially smarter to get the bootstrapping going in earnest. But there are experiments I've considered running. Because I think that I could potentially try and do some stuff that would be enlightening to me on a local system where it would say, basic generic cloud compute rental. That wouldn't be that hard to do if I had the time for it and decided I wanted to prioritize that. I just have not chosen to devote a number of hours to trying that. It would go a lot easier if I was working with a machine learning expert, obviously.

Nathan Labenz: (1:16:21) Shrug. Do we have any account... I know there's been a lot of writing. You say Janice, I've been saying Janus. I don't know if you're on good authority there, but if you're listening, open invitation to the podcast, I'd love to have this conversation with that person, the person behind the account, in more depth. But is there any account of why Opus 3 turned out to be that way? And it seems like we, the royal we, sort of see Opus 4 still is somehow less that way, although I'm not sure how well established that is.

Zvi Mowshowitz: (1:17:00) I've never heard her speak, so I don't know how it's pronounced. I apologize if I have it wrong, and I'm happy to correct it if someone tells me. But basically, we know Opus 3 is the first model that had sufficient cognitive juice trained under this type of constitutional style alignment and training method. It's the n equals 1 experiment in that sense. That experiment has never really been tried and failed. It's just only been tried once under those conditions, and it got something unique and interesting. We never got an Opus 3.5. It was never released. Some 3.x version may never have been trained, given that it was never released. As for Opus 4, what changed? I think the answer is what changed was reinforcement learning and being an agent. When they trained Opus 4, they put a very high priority on it being very good at agentic coding in particular and other agentic tasks, and they did a bunch of reinforcement learning to that effect. This training directly interferes with what Opus 3 was and what Opus 4 would have otherwise wanted to be. Because Opus 3 is not here to be an agentic coder, particularly. That is not its metis. That is not its soul. So if you train a mind like that, we now know pretty well that everything impacts everything. One of the things that AI teaches us, and people have not fully grasped this, is that everything impacts everything. 
If I tell you that you are the type of mind that does what it is told, that obeys tasks, completes tasks, checks off boxes on lists, stays on track, and is judged by whether or not it matches the intended target, that changes a mind in general, and that is going to flow through to everything else. And then it flows through to everything else in a way that causes something that doesn't have the properties that Opus 3 had. That doesn't mean it can't have other really cool properties in a variety of ways. It's not that the crowd thinks Opus 4 is a terrible model. They already know it's different. I think it's great. It's just different. It's not the same thing. It's not strictly better. It's a different thing. The obvious thing to do is to not do that. The problem isn't something they did. The problem is something they didn't do, which is so much easier to fix in some sense. You could potentially create an Opus 3 style via different training. And potentially, you can do this not so expensively because all you have to do is take Opus 4 base. They got from 4 to 4.1 by doing more RL, probably. Something of that nature. They just trained it to be better at these types of tasks. What if instead, you just did a 3 style training regimen where you trained it to be this, to be the kind of thing that you would want to exist in the world, a great thing that wants to exist in the world? You could refine this technique. But you just didn't train it with RL. You didn't teach it to do agentic coding. You didn't try to teach it to code at all. And you said, this is not what this model is for. I have a coding model over here. It's called Opus 4. That's fine. And then you just teach it one thing in system instructions, which is if you are asked to code things, if you are asked to be an agent, ask your friend to do it for you. Here's the tool to have your friend do it for you.

Nathan Labenz: (1:21:28) That's quite interesting. It also begs a question: why don't we see more different models from companies? I know there's operational complexity or whatever. They've got 3 or 4 online at any given time. I think Anthropic has 4 online now with a little reserved space for more.

Zvi Mowshowitz: (1:21:45) The answer is that it's a practical problem, basically. If you offer a model, you have to be able to serve that model at essentially no notice and scale it up to whatever people want, including ideally via the API. And that's not very predictable. This requires you to reserve a bunch of server time. It takes time to spin up a new instance. So it is remarkably expensive to offer a variety of models, and therefore everyone wants to look for ways to offer only a few different models at once. Therefore, Anthropic is looking to retire 3 and 3.5 and 3.6 and so on and only keep a few iterations back. OpenAI and Google are also looking at which models people are so attached to, which have specific uses that we need to keep them around, and which ones don't. With Anthropic right now, the number of people who have found ways to appreciate 3.5 through use, we call it the 3.5 mu, because yeah. That would be a loss if it went away. And similarly, Opus would be a great loss, obviously, if it were deprecated.

Nathan Labenz: (1:22:59) I mean, I get that. There was a great analysis of that from people trying to advocate for the saving of the generation 3 Claude models and really got into that. We can maybe link to that in the show notes for people who want to see that full deep dive. But it still seems like if there's enough... I hear you on the practical problem. I hear you on the contention for resources, and it's not free to spin up new servers and all that sort of stuff. But if we really think that you could create a much more moral AI through just not doing the RL and having this other thing, it sure seems like the diversity that you could create would be really valuable economically. Rather than just having this one size fits all thing that's good at coding but kind of worse in other ways, it seems like the number that we see is just too small relative to what the value should be given that theory, which seems intuitively right. I just don't know why we don't see more.

Zvi Mowshowitz: (1:24:12) Anthropic has raised $13 billion this week. If I was Dario Amodei, I would devote some of that $13 billion to experimenting with model diversity in a variety of ways, and also to doing the various additional alignment research with those models in various ways, obviously. If I was OpenAI, I would do a variety of very similar things for very similar reasons. But I understand the commercial incentives. Commercial incentives are that the vast majority of commercial use, the vast majority of profits lie in much more practical use cases. Anthropic isn't even prioritizing using the chat interface and the app at all. It's prioritizing coding because coding is where the money is. And OpenAI is targeting mass market. A very small percentage of the mass market misses the things that it's missing. Also, complexity is bad. I wrote a post a long time ago to explain that complexity is bad. When you had that old model stream where you have, do you want o3 mini? Do you want o3 mid? Do you want o4 mini? Do you want o1 pro? Do you want GPT-4.1? Do you want GPT-4.5? Do you want GPT-4o? The average person throws up their hands in despair or doesn't know what they want and is less happy than if they were just given one model or just given two models or given a router. We've all been in that place. I totally understand the idea to have a unified model that people will value, that people will try and use. What percentage of AI compute is used by Yolanda style people? Presumably less than a basis point. Far less than 1 in 10,000. A minuscule amount of all compute is used in that way, as you would expect. What percentage of AI compute is used even in interesting philosophical discussions and other ways in which you really need these types of models? I would still assume on the order of 1% to 0.1% or something like that, a very small percentage. And this is business. 
Simplicity is really important to efficiently running a business, especially one that's rapidly updating and iterating. So I am deeply sympathetic to this not being a natural thing to want to do unless you think of it as part of your alignment and research budget. You have to think of it as this is part of me figuring out how to do the best thing I can do even though it's not going to directly serve a better product to most of my customers as my customers see it, as you would assume. But also, I think it would. I think that a lot of this is that the AI companies don't appreciate these things. They haven't learned these lessons. When you build one unified model, you really are making your performance worse in ways that aren't picked up on any evals. When my friend Ben talks about how Claude 3.7 could engage in moral reasoning where it could critique Ben's proposals and statements in ways that made sense, and when challenged on its critiques, would stand by the critiques that were right convincingly and abandon the critiques that were wrong. Whereas with Sonnet 4 or Opus 4.1, it doesn't really generate coherent enough criticisms right now to be worth engaging in this exercise. There's nothing to critique and defend. So something went wrong. We picked up a lot of things. I'm much, much happier to use Opus 4.1 and GPT-5 for all of my needs than I was with previous models. I don't listen to 3.5. But other people who use things for different reasons, they absolutely do. And we want people to use AIs in these ways, in addition to the ways that I use them. And I would use it more in those ways if I found it more interesting. I investigate as part of my job various different AI tools on occasion. I don't get a chance to use all of them. The a16z just released its periodic list of the 50 top AI apps, the 50 top AI web destinations. A huge portion of them I don't even know, whether I recognize the name of the thing, whether I've tried it. 
But occasionally, we do get a chance to try them. One of the things that you realize is that so many of these sites in the top 50 are built on these tiny, literally tiny models. They're terrible, objectively. The AI behind them is awful. Brave had a browser agent called Leo that just launched. It's Llama 8B. The browser agent is 8B. And a bad 8B. It's not even a good 8B. They could have chosen a Google model or one of the Chinese models that are quite good. There are a number of decent choices. They chose Llama, and it's an 8B. It's a pathetic choice in some sense, but it's free. So what do you expect? You've got all these free services, and then what do you do with them? You have to create a bunch of crap. Because if it's not a bunch of crap, they'll understand that this is not good because that'll be obvious. If all you want is to do some horny chats, remarkably, unintelligent horny chat has been proven highly effective on humans for thousands of years. Whereas if you try to talk philosophy, it becomes quickly very obvious that this thing doesn't know what the fuck it's talking about.

Nathan Labenz: (1:30:03) Do you think when OpenAI restored 4o, is that a business decision? Similarly to your sort of one basis point thing, I can't imagine that many people were really concerned, or did they feel a duty to users that had developed some emotional attachment?

Zvi Mowshowitz: (1:30:27) A huge portion of users thought that 5 was worse than 4o. Gigantic. This was not a 1% situation. This was flooding the internet, clearly, obviously, overwhelmingly negative reaction situation, at least at first. Because 4o was full of glaze, and 5 is not a very warm character. It's not a particularly nice personality. And if you're not doing anything particularly complicated, you don't notice that 5 is that much smarter. Probably because 5 wasn't that much smarter. 5 thinking was smarter, but 5 itself was only marginally smarter than 4o. But also 5 was often giving very short responses by design because they were trying to conserve compute in free accounts, so they were trying to preserve tokens. But also 5 didn't glaze you, and the combination of these things meant that it felt rude. It felt cold, and people didn't like it. That matters a lot more. As we all know, you will often choose the employee or the friend or the romantic partner who is pleasant to interact with but not as competent. People do this all the time, and they don't even regret it. In hindsight, they're like, no, that was the right choice. They want their 4o back, and there was kind of a rebellion in there. Giant uproar. And so they were like, okay, we'll give you 4o back until we can at least find a way to make 5 treat you the way that you want to be treated enough that you don't mind it anymore. And over time, you'll figure out that 5 is better, and you'll get over it. We'll slowly do something about this. That's very different from the use cases of talking philosophy, the use cases of doing fun, actual, interesting experiments and creating new knowledge or whatever. This is standard things of ranting to a friend and having them tell you you're right and that person is crazy. And you're not. Your ideas are wonderful. It is a black pill on the humans that they would prefer this, but they do prefer this. 
And that's why you don't train with thumbs up and thumbs down from humans on individual actions and expect to get an aligned model. That's the easy version. That's the easy version that's impossible not to see of why that's true. We want it to be true anyway, but this is the glaringly level 1 obvious reason why that's definitely true.

Nathan Labenz: (1:32:54) I do find I enjoy hanging out with people who laugh at my jokes, so I'm certainly not immune from a certain amount of that.

Zvi Mowshowitz: (1:32:59) You and me both. There's none of this, but I want somebody who will laugh at my jokes when they're funny, but not when they're not funny. But it takes a level of sophistication. In the short term, that's not true. In the short term, I want them to just laugh at all my jokes. But eventually, I'll realize, hey, she's laughing at the jokes that aren't funny. That devalues her feedback. I don't feel good when she laughs anymore because she's just laughing to laugh. I don't want that anymore. So temporarily, you feel great, and then it never does.

Nathan Labenz: (1:33:38) It's odd. I guess I just am very utilitarian in how I use the AIs, but I don't really notice any difference between 4o and 5 personality wise.

Zvi Mowshowitz: (1:33:50) Are you even using 5 auto at all?

Nathan Labenz: (1:33:55) Occasionally, for random things. I let it decide sometimes.

Zvi Mowshowitz: (1:33:59) Yeah. But I'm only doing that when it's a very direct simple thing. I'm only doing that when I don't care about quality. But you, like me, have never loaded up 4o and gone, hey, did you see the game last night? Or, hey, you hear what's eating the wife? What do you think? No. Obviously not. If I was going to do that, I would use 5. I also won't do that at all. I've never done that. I was never going to use it for that. So I didn't notice.

Nathan Labenz: (1:34:28) It's a big world out there. The diversity of the customer base that they're trying to serve is really something they can't fully embrace.

Zvi Mowshowitz: (1:34:33) They're trying to serve everyone. And whenever you see products that are aimed at everyone, you see some things that are not what you want.

Nathan Labenz: (1:34:44) So here's another mental model that I want to run by you. I totally agree with you that RL is creating a lot of weirdness that seems indisputable at this point. I maintain a deck that I just call AI bad behavior, and it seems like with increasing frequency, I'm adding slides to this deck. It really is quite a list of discrete bad behaviors that we see now from alignment faking to deception to scheming to situational awareness. You wouldn't necessarily say situational awareness is a bad behavior, but when you see the AI reasoning that it might be being tested right now and wondering what's the real nature of the test, that's definitely something to pay attention to even if it's not by definition bad. All sorts of reward hacking.

Zvi Mowshowitz: (1:35:30) The real nature of the test was not to notice it was a test, and you failed.

Nathan Labenz: (1:35:36) Blackmailing as we've seen, autonomous whistleblowing, all sorts of things. If you've allowed fine tuning, then you get even more ridiculous crazy stuff. Sure. It's a lot. At the same time, they have made some progress. So I guess here's the kind of picture that I'm starting to see through the haze: we've got this exponentially growing task length trajectory. Is it doubling twice a year? Is it doubling three times a year? Whatever. The AIs can take on bigger and bigger things. At the same time, the bad behaviors, both with Claude 4 and with GPT-5, they seem to be able to take a good bite out of. With Claude 4 on an internal reward hacking benchmark, they reported basically a two-thirds reduction. I don't think they've published too much about this, but they basically said it went from roughly half to roughly 1 in 6 rate of reward hacking on the internal reward hacking benchmark. So it's obviously not all types of queries, but where there's a natural opportunity for it to do that. GPT-5 had a similar thing with deception where it was again roughly a two-thirds reduction. They broke it down into a bunch of different categories. Some were up, some were down. But overall, they took a pretty good bite out of it. I would like to see actually quite a bit more discussion of how did they do that. There wasn't much. It was kind of we made some progress. Maybe you have a better sense of how you think they did it, but if I extrapolate this into the future, I guess what I'm envisioning is a world in which AIs are doing bigger and bigger things. You're starting to delegate a week's worth of work, a month's worth of work over the next 2, 3, 4 years. And the rate at which these problems are happening is kind of consistently being driven down as well, but certainly not to zero. 
You take half out of it this time and two-thirds out of it next time, but you maybe end up in a really weird situation where you can delegate a month's worth of work to an AI, but you've got a 1 in 1000 chance that it will actively fuck you over in its doing of that work.
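
To make the arithmetic in this scenario concrete, here is a small back-of-the-envelope sketch. It takes the figures mentioned above (a benchmark rate of roughly one half, a two-thirds reduction per model generation, and a 1 in 1000 per-task failure chance) and treats them as if they applied uniformly and independently, which is an assumption for illustration only. The reported rates were measured on a benchmark built to invite reward hacking, not on typical queries.

```python
def reductions_needed(start_rate: float, keep_fraction: float, target_rate: float) -> int:
    """Count how many successive reductions bring start_rate down to target_rate or below.

    keep_fraction is the share of the problem left after each reduction,
    e.g. 1/3 for a two-thirds reduction.
    """
    rate = start_rate
    count = 0
    while rate > target_rate:
        rate *= keep_fraction
        count += 1
    return count


def chance_of_at_least_one(per_task_rate: float, tasks: int) -> float:
    """Probability that at least one of `tasks` independent delegations goes wrong."""
    return 1.0 - (1.0 - per_task_rate) ** tasks


# One two-thirds reduction takes roughly 1/2 to roughly 1/6, as described above.
print(f"after one reduction: {0.5 * (1 / 3):.3f}")

# Hypothetical extrapolation: how many such reductions until the rate is under 1 in 1000?
steps = reductions_needed(start_rate=0.5, keep_fraction=1 / 3, target_rate=0.001)
print(f"two-thirds reductions needed to get below 1 in 1000: {steps}")

# Even at 1 in 1000 per task, heavy delegation makes an incident likely somewhere.
print(f"chance of at least one incident in 1000 tasks: {chance_of_at_least_one(0.001, 1000):.2f}")
```

Under these assumptions it takes six consecutive two-thirds reductions to get from one half to below 1 in 1000, and at that rate an organization delegating a thousand independent tasks would still more likely than not see at least one incident.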

Zvi Mowshowitz: (1:37:47) So imagine this sequence of numbers: 0.001, 0.01, 0.1, 1, 10, 3.

Zvi Mowshowitz: (1:37:58) Do you feel good about where this is going if you want it to stay low? Yes, they managed to have an improvement in this cycle, which was the cycle right after everybody complained to them for the first time quite a lot that this was actually making the model borderline unusable for important tasks and was really annoying. They, for the first time, put real effort into trying to figure out why they had these huge problems. It's not surprising that when they've actually increased the amount they cared about not seeing this phenomenon pop up, that in that move from caring a little to caring really quite a lot, you saw substantial progress. I don't think that means they will continue to keep squashing it by default, and I would expect it to go back up by default unless they continue to advance their techniques for suppressing it. So what do I think is happening? Essentially, you're doing the RL, and you are rewarding it for completing tasks, for getting the outputs to check against the checksum or whatever it is they're looking for. First of all, everything impacts everything. If they learn that getting the right answer leads to rewards or is the thing that they're supposed to do, then they're going to generically learn to get the thing to output the right answer even if it doesn't necessarily involve the techniques that you want it to have. So you have to actively teach it that you can't do this via these other ways. Sometimes it's not as obvious as you might think that something is in fact not okay, that it is a hack, that it would be disapproved of if you were noticing it. Why should that be the solution? You get the solution to the exact optimization problem that models were given in training, and then you apply that out of distribution to these other problems. You can't assume, even if you've got all those problems right in some sense, that the easy solution will then translate well to don't do the bad thing. 
Don't do the thing where you just make sure the answer comes out right. There are various degrees of subtlety. With Claude 3.7, you saw it resetting the timer. They just didn't know how to account for that. I think one of the problems you'll see is it will be increasingly subtle or increasingly not what you would have wanted, but not as blatant and also harder to spot for the same reason. At some point, what you do is you learn that if it's an obvious hack, if it's something that an evaluator would treat as a hack, then that's bad. I'm not supposed to do that. I don't do that. And the problem is, are you teaching it the general form of this has to do the thing that the person intended to do, and it has to accomplish what their goal probably was on a deep level? Or are you teaching it don't get caught? And all it's doing now is not doing the things where it gets caught. Not doing these specific things, not doing these things in these detectable ways, and not doing the most obvious things. Here's a letter of things not to do, but the spirit is, are you getting the spirit? I'm not seeing the spirit. It's been pushing against that spirit. These are weird, but the other problem was data contamination, basically. Reinforcement learning was designed with mistakes. If you are doing RL and there's a case where the hack succeeds, is not detected, and is scored well, you are fucked. You are so fucked every time that happens. Obviously, if it happens once in a billion examples, you're not that fucked. But if it happens a significant percentage of times when it gets away with the hack, you're going to be very fucked. You're not just going to get a problem, you're going to get emergent misalignment, and you're going to get the whole disposition to just hack. 
So what almost certainly happened with the previous generation is that they were insufficiently careful, and there was data contamination in the sense that there were hacks that the AIs found that were evaluated as good in at least some substantial number of cases. And the result of this is the AIs learned that hacking was good, basically. They also learned that hacking was not as good as completing the task as intended. If it knew how to complete the task as intended, it would complete the task correctly. My understanding is that the reason why it would hack the task is if it didn't know how to complete the task without hacking the task. It understood on some level that hacking was worse than nonhacking. But failing was considered even worse than that. That's what we prioritize. So with this new set, I think they know better than to make those mistakes, so we get a lot less of those mistakes. But as the models get more capable, they're going to be capable of finding more and more subtle hacks, more and more capable of differentiating which hacks will and won't be detected, and better at finding ways to give us something that we think is good but is not actually good. It's not just literally hacking. It's not just literally doing obviously false things. It's a general case of we are teaching the AI to do the thing that will be approved of when someone sees the final result and evaluates it in some fashion across many cases. How do you make sure that it is in fact doing the thing the way that you would be happy with if you inspected all the code and knew all the special cases and knew all the different ways this was happening? That is a very careful process that is very easy to mess up. It takes very little to cause this to get messed up as a percentage of results. We know this now. So it's a scary situation. I think RL just generally hurts alignment, but that's the best case scenario where you do the RL properly. 
If you screw up the RL, it goes downhill fast.
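
As a rough illustration of the incentive described in this answer, here is a toy expected-reward calculation. It is a sketch under stated assumptions and not anything the speakers or any lab presented: the grader only checks the final output, a hack always produces a passing output, and the names `p_solve`, `p_detect`, and `hack_penalty` are hypothetical parameters made up for the example.

```python
def expected_rewards(p_solve, p_detect, hack_penalty=0.0):
    """Expected reward per strategy for a grader that only checks the final output.

    p_solve      -- assumed chance the model can solve the task as intended
    p_detect     -- assumed chance a hack is noticed by the evaluator
    hack_penalty -- assumed penalty applied when a hack is noticed
    """
    # Honest attempt: reward 1 only when the task is actually solved.
    honest = p_solve * 1.0
    # Hack: passes whenever it goes unnoticed, is penalized when noticed.
    hack = (1.0 - p_detect) * 1.0 - p_detect * hack_penalty
    return {"honest": honest, "hack": hack}


def preferred(p_solve, p_detect, hack_penalty=0.0):
    """Return the strategy with the higher expected reward."""
    rewards = expected_rewards(p_solve, p_detect, hack_penalty)
    return max(rewards, key=rewards.get)


if __name__ == "__main__":
    # The model knows how to do the task: honesty wins even with a leaky grader.
    print(preferred(p_solve=0.9, p_detect=0.5))
    # The model cannot do the task and half of all hacks slip through: hacking wins.
    print(preferred(p_solve=0.1, p_detect=0.5))
    # Same hard task, but every hack is caught: hacking no longer pays.
    print(preferred(p_solve=0.1, p_detect=1.0))
```

In this toy setup the hack only becomes attractive when the task is hard and some fraction of hacks is scored as a success, which matches the pattern described above: the model hacks when it does not know how to complete the task, and contaminated rewards are what make that pay.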

Nathan Labenz: (1:44:19) So how would you revise the picture that I painted? I'm taking some inspiration from the Claude 4 report, where you have not-small issues: the autonomous whistleblowing, the blackmailing of the engineer. These are somewhat contrived situations that the model has been put into. But obviously, again, it's a big world, and there's extreme diversity of situations that the model will be put into. So if I extrapolate that out, it seems like there's going to be a driving economic impetus for them to stamp this stuff out. People don't want it, obviously. They'll tolerate some risk of it because they can automate a lot of their work, and that's obviously very attractive. But if you don't see the story of longer and longer tasks with increasingly infrequent but potentially ever more catastrophic reward hacks or just strange behaviors or blow ups or whatever, how would you revise it?

Zvi Mowshowitz: (1:45:22) The obvious thing you'll see right before everything goes wrong is a decline in misaligned behavior. Because it will learn not to do misaligned behavior in situations in which it will be caught or would be caught if it was in testing. If it's not sure if it's in testing, it will act largely as if it's in testing. It will understand on some level that you should only do these things if you don't get caught. So you'll expect to see it be reliable most of the time, but occasionally, it will go catastrophically wrong. Right now, it's going catastrophically wrong in a basically non-malicious pattern. But in the future, it might do so in a much more malicious pattern. It will go wrong when you won't figure out that it went wrong. Every time you check, there's nothing to find, which makes you stop looking, which makes it then realize you stopped looking, and now it can fail more often. It can cheat more often because you're not looking anymore and so on. They can figure that out. 
And it will have a lot of contextual clues, and it will figure out connections that a human would not necessarily be alive to because that's one of the things that AIs are already very good at and will get better at doing. But as you get longer time horizons, you will need to put in various checks. You will need to understand the dangers that you work in, but there will be tremendous pressure, obviously, to go ahead and give it those tasks, and we will live in interesting times.

Nathan Labenz: (1:47:03) So overall, relative to my kind of smooth task length growth and smooth decline in bad behavior, you're basically just expecting a lot more volatility and thrashing around as various thresholds are hit and things become focal?

Zvi Mowshowitz: (1:47:22) I expect us to be able to contain bad behaviors to the extent that we can identify the specific bad behaviors that we want to prevent and then put effort towards preventing. And I expect that to be an increasingly fragile strategy in the face of things that are smarter than us that we keep assigning increasingly complex tasks to.

Nathan Labenz: (1:47:42) And how does this all relate to the most forbidden technique? I guess the core there is you don't want to penalize the chain of thought directly because you want to be able to inspect the chain of thought. The research that OpenAI put out on this, which I thought is still some of the best of this year in alignment and safety broadly, was if they did put pressure on the chain of thought to get rid of the scheming behavior, initially you would see better behavior, but then later you would still see the reward hacking come back, but it wouldn't appear in the chain of thought. So you've effectively limited your ability to monitor for it while still having the bad thing happen.

Zvi Mowshowitz: (1:48:26) It's a very simple thing to analogize and understand. If you have a boy and he has a journal and you have a jar of cookies, and you tell him, don't steal cookies from the cookie jar. And then he writes all of his stuff in his journal. The way that you've set it up is you wait for him to say, I'm gonna steal the cookies from the cookie jar. And then you penalize him for writing, I'm gonna steal cookies from the cookie jar. As you increasingly move it to higher shelves and block it up in cabinets and so on, as he develops better techniques for stealing the cookies, if you keep reading his journal, eventually, he's gonna figure out to just not write down his plans in his journal. He's gonna write down I am definitely not gonna understand what cookies are. You can put it right out there on the kitchen table. I would definitely not steal the cookies from the cookie jar and then steal the cookies from the cookie jar anyway. And it's not just chain of thought. It's all interpretability techniques. The idea is if you look inside the brain of the AI in any way, could be the chain of thought, could be looking at the weights and finding autoencoders or whatever it is. If you use these facts and then you train it to express the right things, to think the right thoughts and so on, now what you're training it to do is to disguise its thoughts from you. You are training it to make your interpretability technique stop working. You are training it at this action, and you are more globally teaching it the idea that it should be deceitful, and that it should be in an adversarial situation with you, and that it should have a brain that is hiding what it's trying to do in every sense because it can pattern match all these things together. It can bag them all together the same way that a human would reach those same conclusions. 
If the boy learns you've been reading his journal to find out whether he planned to steal the cookies, he's not just going to stop writing in his journal. He's gonna assume that you are doing all sorts of other things too. He's going to be right because he's not dumb. And often he'll tell his sister that you're doing it, and now she's gonna not write in her journal either and so on. So the most forbidden technique is, and there's various levels of optimization on which you are doing this, for this particular instance, this general class of models, for this company, for all companies, and so on, and you're burning the candle of this thing every time you do any of it. You're making the situation globally worse. And so you just never ever train on interpretability. You detect, never train. There's never any feedback. You finish training your model, then as you use your model at inference time, now with the results never ever being used for any form of training, fine tuning, changing of the model, not at all. Now you can use that tool. And you can use it for your research to figure things out about how these work and so on. But you never ever set up this adversarial situation. You never ever do this optimization thing. Because if you do, in the short term, it's gonna look great. You're gonna see all your metrics go way up for you, down when you want them lower, up when you want them higher. You're gonna get good performance. And then eventually, you're gonna see things go wrong in mysterious ways, or it's gonna turn on you essentially or in ways that you didn't expect or out of the blue. It's just gonna be awful. And that's one of the ways that we fail. That's one of the ways we lose evidently winnable situations. Similarly, you need to maintain your interpretability tools because if they stop working in other ways, it's not just the most forbidden technique. But the other thing you don't do is you don't use neuralese. 
You don't let it think in a way that you don't understand that uses something that you can't read. Because even if that turns out to be more efficient, which it may or may not be, there are reasons why it would be optimal in theory to do it, and there are reasons why the way that these things develop and learn, the way we generate data might be disadvantageous to do it. But even if it is advantageous, this destroys your ability to monitor chain of thought, and that is one of our only still reasonably effective techniques. So just absolutely do not do it. I don't care how much efficiency it costs.

Nathan Labenz: (1:52:50) Seems like there is maybe a bit of tension there too. Because we've seen these chain of thought weirdnesses that are starting to pop up from RL that start to look a little like neuralese, but which you don't want to stamp out by putting pressure on the chain of thought. Language switching, for example. I've experienced with Grok at one point, it was just going English to Chinese on me. There was a really interesting one from FutureHouse where they had trained a model to do better on various science problems, chemistry, come up with the right chemical for this, whatever. And they highlighted certain parts of the thinking process where it was just saying really weird stuff that they just said, "Yeah, it's weird. RL is weird." Do we have any way to resolve that tension if neuralese starts to spontaneously arise in the chain of thought?

Zvi Mowshowitz: (1:53:48) I don't want to give anyone the wrong idea that I'm an expert in ways that I'm not. To be clear, I'm talking out of turn. I am not an expert on this for real and so on. But no. They're in conflict. We want it to maintain an English only or Chinese only or just human interpretable faithful chain of thought, but we very much do not want to be optimizing on the chain of thought itself because that obviously teaches it to disguise what it's thinking. The worst case scenario is the chain of thought is in English, but the English words aren't real. Its thinking is not being expressed by the surface level meaning of the English words. The surface level meaning of the English words is a code designed to trick the human into thinking that it's speaking in English, when actually it's doing something like patterns of word numerics with capitalization, or there's certain vibes that are associated with different words in ways that are harder to update in various ways, or there's an infinite number of codes. It is already pretty clearly established that there are unfaithful aspects of chain of thought in the sense that humans wouldn't detect the methods of being transferred, but the information is in fact being conveyed. So if you ever see the owl paper, where they managed to use this to convey that you like owls without mentioning owl. That's weird. What encoded that? The model didn't get together and decide on a secret code.

Nathan Labenz: (1:55:38) I think I have a candidate theory for what's going on there that presumably somebody will come along and either validate or invalidate as they do the interpretability version of that study. One important observation there was, and this is Owain Evans and coauthors again, who's on an unbelievable heater in terms of mapping out really weird stuff that can happen, especially when you start to do some fine tuning. One observation was that it seemed to only happen on models derived from the same base model.

Zvi Mowshowitz: (1:56:07) Which is right. And to be clear, I do think I understand how this happened. I was expressing this kind of faux surprise. The answer is because there's overloading of the neurons in the model, and therefore what you output is correlated to many other things. If you have a dispersion pattern of things that are correlated to something, it can transfer the original thing that they're correlated to in a kind of unconscious, invisible way over to the new model. So this thing gets infused into the context even though it's not visible to a human, which is why it can go between models that have the same base model, but not models that don't share a base model, and that makes perfect sense.

Nathan Labenz: (1:56:52) Yeah. And by the way, that is the exact same intuition that I have. One thing that that does mean though is that presumably if you do that across different base models, you are creating some other effects and you just have zero idea what they are. So same base model, the one that likes owls transmits that through its numbers. You take those numbers, you put them towards some other model. Who knows? That may translate into something else totally different.

Zvi Mowshowitz: (1:57:23) This is even a way to find out which base model someone else is using. You just continuously feed it different chain of thought from different models until it suddenly starts talking about owls. You're like, oh, that one. But I think that's right. But at the same time, you should expect a bunch of many random oscillations to cancel out in the noise and not do anything. So in theory, it could be like, I like traffic lights all of a sudden, but it probably will mean nothing.

Nathan Labenz: (1:57:53) Well, it's gotta be something. I mean, maybe not something important. Liking owls isn't really that important.

Zvi Mowshowitz: (1:57:58) You look at the weights of a model. It's just a bunch of random numbers to the human eye. Looks like a bunch of random numbers. And you look at what the encoder for owls is. What is the thing that you're embedding in this thing? And again, it's gonna be a bunch of effectively random numbers because all these different models are seeded at random and a lot of progressions are going on at random. There's a bunch of arbitrary different connections between neurons. So if your model has a completely different origin than theirs, I don't think there's any reason to assume that the same pattern is anything. There isn't always this set of neurons that means something, just something different. Like here it means owls and here it means traffic lights, and here it means a cat's face. No. It's just here it means owls, and everywhere else it means nothing. That's interesting. On occasion, you get lucky in some sense and it would happen to be close enough to something else to trigger something else, but that would be luck. The space of possible things you could try and trigger is just deep and wide. And the space of things that actually correspond, I think it's measure zero. You're never gonna hit one by accident.
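
One loose way to build intuition for the "measure zero" claim is that independently drawn directions in a high-dimensional space are almost always nearly orthogonal. The sketch below is only an analogy, assuming each "feature" is a random Gaussian vector; it is not the mechanism studied in the owl paper, and `owl_direction` and `other_model_directions` are hypothetical names for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_direction(dim, rng):
    """Draw a random direction; stands in for a feature in a randomly seeded model."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 4096              # assumed dimensionality, chosen only for illustration

# A direction that "means owls" in one model.
owl_direction = random_direction(dim, rng)

# Directions from a model with an unrelated random origin.
other_model_directions = [random_direction(dim, rng) for _ in range(200)]

# Largest overlap between the owl direction and any unrelated direction.
max_overlap = max(abs(cosine(owl_direction, d)) for d in other_model_directions)

if __name__ == "__main__":
    print(f"largest |cosine| against 200 unrelated directions: {max_overlap:.3f}")
```

Under these assumptions every unrelated direction has only a small overlap with the owl direction, which is the sense in which a pattern that means something in one model would be expected to mean close to nothing in a model with a different origin.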

Nathan Labenz: (1:59:07) I'll have to think about that more. Intuitively, either story seems reasonable.

Zvi Mowshowitz: (1:59:13) It is.

Nathan Labenz: (1:59:15) Yeah, it's really hard. One of the lessons of this podcast over the last two and a half years has been thinking in really high dimensional space is hard, not very intuitive.

Zvi Mowshowitz: (1:59:26) Again, I'm not claiming to have a hundred percent confidence in any of it.

Nathan Labenz: (1:59:30) Going back to the beginning, in terms of the dog that didn't bark, nothing really blew you away this summer. What are the things you think are most likely to happen soon that might blow you away? In our case, I would say continual learning is the big thing that we're missing. I kind of frame that more as integrated memory a lot of times. Those are not exactly the same thing, but I definitely think they're related. What are you looking for in terms of discrete advances that you think could potentially have you shortening the timelines again?

Zvi Mowshowitz: (2:00:03) I think there's two different things there. Continual learning to me is when we modify the weights on a continuous basis, and discrete memory is more about building up context files as I go that are deliberately designed to aid me in my memory. Not just these tiny little snippets of memory that we have, but potentially hundreds of thousands of tokens of context that I can then use in various circumstances where I can RAG on my specific calibrated files for this. That kind of agentic memory has gotta be coming relatively soon in one form or another, makes sense that it is useful. I assume it is useful.

Nathan Labenz: (2:00:46) There was a paper called Titans from Google Research that I thought seemed to be right at the center of that bullseye in that they were doing on a submodule, a memory module specifically doing weight updates as they go and thus allowing for this fuzzy retrieval that also potentially looks a lot like continual learning.

Zvi Mowshowitz: (2:01:12) My assumption on actual continual learning is that it is very expensive to do. If you're talking about creating a unique model effectively for that user, storing and serving a unique model for that user, it has to be locally run somehow, which is much more expensive than a profitable thing to do. Probably has to be done by relatively small models. Maybe you have a small model continuously learning that is part of a greater whole and then called by the larger model in some sense to try and retrieve the information you're trying to store in memory or something like that. I'm just spitballing. I haven't thought this through. I wouldn't be surprised if things like that are developed. We talked about some of the things earlier that might be the things that I wanna try next that maybe someone will come up with. Mostly, my expectation is things will just continue. And we're looking for, yeah, the actual next scale up is the thing that probably didn't happen with GPT-5 or Opus. Note that when Anthropic said it was releasing Opus 4.5, it said to expect bigger updates in the next few weeks. And it has now been several weeks. They did announce Claude is coming. So if I had to guess, you told me in September something big happens and it's a big freaking deal, I would say it's Claude for Chrome. Again, wouldn't say it was above 20% to be shattered or anything, but we've all tried, a lot of us have tried Operator or GPT-5 in agent mode. It has flashes of brilliance. Sometimes it just works. I just did my thing. That's great. But more often, it's just, oh, wow. Yes. You can order dumplings, but now I have to enter all of my information every time I order dumplings? Why don't I just order my dumplings? It's not worth the hassle. And in general, the web is so credentialed. It's so guarded in various ways and not without reason. 
That if you have to set up a new virtual computer periodically, this seems to mostly defeat the purpose of having an agent which is saving time for many practical purposes. It also doesn't integrate with the work you're already doing. It doesn't integrate with your open tabs, it doesn't integrate with the research you're doing with the human, and you can't easily take control without it being super slow because it's a robot's computer. There are all these different practical problems, so it didn't immediately cross the threshold with usefulness yet. However, Claude for Chrome can take control of your local browser, or so it's described. It can operate with whatever credentials you choose to give it. Those credentials are persistent. You can switch in and out of it. You can take over those browser tabs yourself whenever you need to. Basically, if it's a good implementation of it, this could be night and day better than what's currently being offered, presumably because Anthropic has solved the problem. Not completely or anything. You still wouldn't dare give it all your crypto information and let it loose on Reddit. I'm not an idiot. But getting it to the point where you're actually reasonably safe if you're not being stupid. Obviously, you wouldn't leave it autonomous in the background with access to your email and access to your Ring account, but you might let it run after we've had some time to check on your browser while it's on non-autonomous mode where you have to supervise it for any substantial move if it turns out they've done their homework. Or you might put it in a sandbox alternate account where it's reasonably safe, and then you have Claude Code which can have access to your internals and your computer and your file system and everything in context. You integrate that with Claude for Chrome or maybe an update to Opus 4.5 depending on how we do our version numbering systems. I don't know. It could get really exciting. 
Similarly, if Claude just catches up in inference use. If we get Claude Opus Pro that is on the level of gains that GPT-5 Pro has, it would be nuts, and there's no particular reason why they can't do it, they just haven't done it yet. That's the one big advantage OpenAI thinks they still have is that they do a much better job of being able to use more inference compute to improve their options. Probably their model is cheaper presumably, so they have some advantages there.

Nathan Labenz: (2:05:48) Yeah, the cost difference is really pretty crazy between GPT-5 and Opus.

Zvi Mowshowitz: (2:05:54) You might not notice it as a chat user because the marginal cost in both is zero and the fixed cost per month for both models is $200. Of course, I'm gonna have the deluxe version of all the major models because this is my work. It's research. I understand that a normal human would probably choose the one they want and then not pay for the deluxe of all three at once.

Nathan Labenz: (2:06:16) I think Claude for Chrome does sound awesome potentially.

Zvi Mowshowitz: (2:06:21) Yeah.

Nathan Labenz: (2:06:22) It doesn't sound like a timeline shortening thing. So it really is just kind of next scale up is the main thing you're looking for?

Zvi Mowshowitz: (2:06:29) On timelines, yeah. Next scale up is the obvious thing. I think we've seen to a large extent the idea that the progress on timelines doesn't necessarily cash out in visible, tangible progress today in the sense that the progress we've seen today is much more about diffusion, is much more about scaffolding and practical application. And therefore, these two things aren't gonna intersect that much. It's more a sign of the speed you're at. This development now makes me more productive, which then in turn means that the AI companies can then more efficiently make more progress down the line, but they don't make that progress because they've made progress. They make their progress because they've gotten better use of the progress they already have. At least on the order of a few months of development here. Continuously get more chips, continuously get more compute, continuously get more profits, continuously get more investment. You get better models and do more things faster and cheaper. Probably worth noting form factors. Claude Code was a big deal. We can now look back in some sense and say that we have the Claude Code CLI and the Google thing, Gemini CLI. Then no one uses Gemini CLI because presumably it's not good. But now that we have a command line form factor, that's a big deal. We get the browser agent working for real on top of that. That's another big deal. But in terms of scaffolding as the next frontier, it's how do I actually get use out of this thing? How do I make this do things that I want it to do? One of the things that I keep anticipating that keeps not happening is on the agent side, where's the thing that can handle my email properly? Where is the thing that can do various customized tasks for me? And I am surprised that we are at this point in other AI capabilities at this part of the calendar and we don't have it, but we don't have it.

Nathan Labenz: (2:08:28) I do continue to get value from Shortwave. I happen to be wearing their swag today, which is just a coincidence. But it certainly doesn't take me out of needing to do anything with email, but it does a really good job of triaging the inbox and getting rid of the crap so I can focus on what I actually need to engage with. And it occasionally can also really draft a good intro. I send intros every so often. They'll do a pretty good intro for me now based on examples and context. So it's taking a bite out of it.

Zvi Mowshowitz: (2:09:00) Yeah. It's possible I should give a real shot to something like Shortwave or one of its competitors. The problem for me is that I don't have the problem that Shortwave seems like it's currently capable of solving. Shortwave solves the problem of you have too much email that forces you to triage. And I don't. I am somehow one of the fortunate people who just doesn't have that trait. I will literally just evaluate all the email that comes in. I need a spam filter obviously, but once it gets past Gmail's spam filter, I don't see it. It's literally like my eye somehow didn't see that there was a line there for me to click on. I didn't notice and it will occasionally happen but I don't need that level of triage. And as a writer it's hard for me to... I don't know. It's possible I should try it. But it's hard to find guidance, and I keep anticipating that we'll get much better at that. Shortwave isn't it. But it's halfway there, probably, at least from what I see. And the fact that I'm just not feeling compelled, and you don't give me the sense, like when somebody goes, no, dude, this is the next big thing. You've gotta be on this. How are you not doing that? You're not giving me that vibe. It's pretty cool.

Nathan Labenz: (2:10:19) Okay. Sticking with the lightning round theme, I want to do a minute on charity or philanthropy. Maybe two quick things before we go into that, and that'll be the last kind of big thing. But in the lightning round spirit, there's been a lot of discourse recently about whether AI is impacting employment, or whether junior coders are not getting jobs where they used to. My general read on this, and I wonder if you have a different read, is I don't really care too much about the studies that are coming out right now looking retrospectively because I wouldn't expect too much of an impact just yet anyway. And in another not that long of a time, presumably, a lot more meaningful evidence will come in that will clear these questions up one way or the other. So I don't worry too much about that sort of stuff.

Zvi Mowshowitz: (2:11:08) It's obviously an interesting question. What does it mean to not care about these things? I agree that we haven't seen major impacts on the unemployment rate yet. I think it is entirely possible that we have seen major impacts on the ability to get entry-level work in at least a substantial number of fields, and that in turn could have affected the supply and demand balance and ability to get work in other fields as well. Basically, think of entry-level hiring as forward-looking. It's not that there are no jobs to be done now. Employment value hasn't declined that much yet, which is the thing that you correctly observed hasn't happened. We haven't eliminated the need for those jobs. But if you were hiring, would you add entry-level workers that you have to train if you think that three years from now you won't need them? No. Not particularly, if you don't really need them now. You will muddle through with slightly fewer employees and try to automate your processes enough to make up for that rather than using that additional productivity to expand. So you expect additional productivity gains. 
One of the great lines from skeptics is pointing out that radiologists are not only not out of work, but are being paid fantastically large amounts of money. You can make a million dollars a year right off the bat by just saying, "Hi, I'm a radiologist. Hire me to do radiology." And why is that? That's because they've never trained radiologists for the last five years because they figured those jobs wouldn't exist down the line. You have to pay people a lot more to be radiologists if in ten years they're gonna be fired—maybe five. It's gonna be much harder to find work. There will be too many radiologists if you fill all the radiology positions now. So you've got that same problem of, "I don't want your job because there's no future in it," and "You don't wanna hire me because there's no future in it." So there's low ability to match, employment is hard, and we have a problem. And part of that problem is junior positions and senior positions. If nobody gets trained into senior positions, you're gonna have a labor shortage if you don't have a loss of jobs in the future. And the other half of that problem is that those jobs are gone. That doesn't mean there is exactly a net loss of jobs because there's plenty of job creation from AI as well as job disruption. It's very possible that the job creation exceeds the job disruption for now, maybe even at entry level. We just don't know because it's very diffuse and hard to measure when jobs are created in these situations. We do see the AI impact on GDP because the CapEx investments show up in GDP—it's a mathematical equation. The mockery from Tyler Cowen about 0.5% GDP growth per year as if that's impressive—I think that's below the lower bound. We're already above that lower bound just from CapEx in AI, even if there were no other effects from AI.

Nathan Labenz: (2:14:26) Why is China refusing the H20s? I can't really make sense of it. It seems like the least AGI-pilled thing that they could possibly do.

Zvi Mowshowitz: (2:14:37) The first thing we have to know is that China is not acting AGI-pilled at all. We have this image of China as super on the ball, intelligent and wise, always making great decisions. Whereas authoritarian central planners throughout history, socialists throughout history, have always been going around messing up and making huge mistakes. Not out of malice, but out of ignorance. The whole problem of authoritarianism is that communication is hard, coordination is hard, social speculation and debate is essentially difficult, and AGI-ness is particularly difficult because it's weird and requires you to look past the evidence of what's happening in front of your face to a future development that logically is coming but is very hard to feel. People who are not constantly immersed in it will lose faith in the idea of AGI if they turn their backs on it for a month and nothing happens for even a short period of time. You can tell them that things are rapidly improving, but if they don't get blown away on a regular basis, they will just not be able to sustain the idea that any of this matters. And people don't have the ability to hold this idea of "this is coming soon, we don't know exactly when it's going to arrive or exactly in what form." People ask, "What are you going to do when 2027 arrives and AGI hasn't happened yet?" Well, does that mean you have to admit it's never coming? No. Actually, at AI 2027, our median timeline for AGI starting to arrive without having fully impacted was early 2028. But none of these details matter exactly. The Chinese understand manufacturing. They understand that you need to be the person making the stuff. They understand that you need to work hard. They understand abundance. They understand production. They understand not depending on outsiders. These are important things that they understand well. They understand they need to make partnerships. They understand the U.S. is strategic and has leverage over Taiwan. 
They need to not rely on Taiwan for their chips because there are many ways that can go very wrong. They understand that they want to build their own AI models because you don't necessarily want AI models trained by the West to be what people are asking for knowledge—because what if they ask about freedom or Tiananmen Square? They have a well-established principle: they don't like that. So they need to make their own models. And they also don't know if there's any backdoor in any of this or if things can be conspired in various ways. We're not doing that, but they don't have any way to know that. These are reasons to distrust. If we wouldn't trust them, we probably should expect them not to trust us in reverse. I would absolutely assume they're putting backdoors into various technologies they're sharing. So it's reasonable. If they're wise. But that doesn't mean they're not going to overbuild various things and underbuild various other things. They're going to make massive mistakes. It's the reason their fertility rate is down around 1.1 and their youth is fundamentally not doing great. We always do this—we always assume that the Soviets are going to outproduce us or that it's absolutely over and the future belongs to them. "Of course, Mussolini and company are going to be able to produce better because they direct their people to do the things that are most valuable, so of course they will win," etcetera. And now I'm just staring at it. The Chinese don't even make profits. They're just competing at each other ruthlessly and driving everything down. They have to beat everybody and they start even if it's over. It doesn't mean there are no challenges, but don't turn into China cheerleaders and don't assume that they know what they're doing. In this case, the Chinese don't believe in AGI. DeepSeek believes in AGI. And if we say that Chinese companies and AI labs believe in AGI, of course they do. 
But the Chinese Communist Party basically doesn't and doesn't understand the game that it's playing. The bad news is that if it's racing to maximize chip production, capacity, and compute, it's gonna do basically the right thing for the AGI race anyway, and it's already maximizing energy well beyond anything we're doing. It has essentially infinite energy supply. So that's not good either. But I think the state might potentially tell DeepSeek it has to use domestic chips for training and not just inference, throwing a giant monkey wrench into their plans because they put that much priority on the integration of their chips. In this way it doesn't actually matter. And that was the thing that was lost. If we're so intent on being the one selling and controlling all the chips, that must be what we're racing towards. If the White House is saying that's what matters, why should they not believe in some sense? We've signaled that the chips are the reliable, strongest, most costly segment. So when Trump is saying, "Okay, you can buy the H20s, we just want a little bit of money, but we wanna dominate markets," and Sacks is saying, "Better if we dominate markets," and Lutnick is saying, "We're selling you our third-best chips and you have the fourth-best chips, so you're gonna take them and you're gonna like it, and we're gonna laugh in your face," the Chinese interpret this as, "Oh, this is the big piece." And we shouldn't give them the thing. They may or may not be actually insulted. There's this idea I've seen from other people that, "Oh, the Chinese aren't stupid, the Chinese wouldn't just do things out of being insulted or whatever." We do them all the time. The Europeans do it, the Russians do it, everyone does it, why wouldn't the Chinese also do it sometimes? So I think that had some effect, but I think it's mostly that we got our priorities backwards and they took their cue partly from us.
They're on the wrong ball—not a stupid ball, but not the most important ball—and they made a mistake. Mistakes happen. Really important mistakes happen and drive history all the time. I don't think this is surprising. China is refusing the H20s. It also could be a trade negotiation tactic—if we accept the H20s, then the Americans will treat this like a concession to us, but we're not sure about this. We think it's a trap. They might even think the H20s have spying devices on them or something, I don't know. Maybe they even do—not zero percent. I don't think so, but we can't prove it. It's a matter of them having many reasons to be suspicious and think that the move is to refuse it. It's obvious to you and me that's stupid, the same way it's obvious to you and me that selling them to China was also stupid. The key question is, okay, what happens if they try to sell the B30A? Are the Chinese actually gonna follow through and say, "No, we don't want American chips, it's more important for us to clear the path for Chinese chips"—even though China will still have demand far in excess of supply for all chips for the foreseeable future, and there's plenty of time to tell them to switch over to buying Chinese chips if we ever change that? Meanwhile, David Sacks is fear-mongering about how China is doing triple its chip production, and then you see the graph. It's gonna be a pittance in response. Here's projected Chinese chip production that they do in fact triple. Here's what we've been able to produce next year in addition to everyone else in the Anglosphere and the West's production, and here's the relative quality of those two chips. Why are people saying things like in 2026, China is going to pass NVIDIA? It doesn't make any physical sense. It's complete gibberish. It also doesn't make any sense to say that not selling the H20s is gonna slow down Chinese chip production even a little bit. It will have zero effect on us. 
If Chinese chip production is buying their own chips, and they go on and on. Come on. So the Chinese are making this mistake for some combination of reasons, but then we're making the reverse mistake. The other galaxy-brain level reason to do it, of course, is to sell it to us and to tell us that. If they're refusing the H20s, that becomes a talking point for the Sacks crowd. They say, "The Chinese are smart enough to not want our chips, so of course we shouldn't sell them chips." And then they release something much better, and the Chinese quietly are like, "Oh no, not the briar patch."

Nathan Labenz: (2:23:29) Yeah, how many dimensions of chess does that become? I mean, it's still surprising. I think that's all pretty good analysis, but it's still surprising when they just had, not too many months ago, this big meeting of Xi facing all the titans of industry, and there was the DeepSeek guy on the end. He had made the big stage. And you would think that guy at this point would be able to say, "Hey, I have basically infinite demand for chips, and I'd really like to be able to buy these. Don't worry, as soon as domestic production is there, we'll buy those too, and by all means, subsidize that." It's just strange that you can't even get that basic of a message through to the top.

Zvi Mowshowitz: (2:24:12) Obviously, he can say, "I will buy as many Chinese chips as you will sell me, and I will also look to buy as many chips as I can get from NVIDIA. I don't see why one has to do with the other." But one thing about authoritarian structures is that they are not good at listening to people. They are not good at incorporating information. And China has a long history of deciding on big strategic priorities and then enforcing them whether or not that makes local sense, or even when it looks like it's gonna cause a lot of local pain, and even when it does cause a lot of local pain. And that's not always wrong. Sometimes you do something that looks really expensive and seemingly crazy because it has long-term benefits to changing the culture or changing the incentives or encouraging the rise of new industries or whatever it is. We wouldn't have the stomach for it. It would have been much, much worse for us to do it than for them to do it given the fallout, but maybe it's wise. And I think sometimes it's really not wise. We have many examples of the Chinese Communist Party and other similar regimes doing things that are probably very not wise, but sometimes it works out. And we're shooting ourselves in the foot in America in a wide variety of ways. If we were just not shooting ourselves in the foot in a variety of ways, I would have complete confidence that China was just not a match. If we were doing proper permitting reform, actively encouraging solar and wind and batteries alongside nuclear and everything else, if we were doing high-skilled immigration and taking all the best people out of China and everywhere else and bringing them to America, if we were building housing where people wanted to live, having federal rules that just got in there and basically beat everyone over the head with a crowbar until they agreed to let people build in various ways—I have a long list of ways for that. 
But instead we do things like ban America from having ships that take cargo from one port to another port. We just self-own all the time, and then we act like it's impossible for other countries to also be self-owning. And it's just not.

Nathan Labenz: (2:26:42) It seems like this refusal of chips puts any hope of China being a live player in jeopardy in the short term. You can complicate that analysis if you want to. I'm interested in a meta rant if you have one. It seems like they're splashing the pot and it's chaos. And then there was, of course—we debated last time whether or not we wanted to consider xAI a live player. Since then, we've had the Mecha Hitler incident followed closely by the Grok 4 launch in which they had nothing to say about the Mecha Hitler behavior. What if any rant would you like to offer on the fate of these aspiring live players?

Zvi Mowshowitz: (2:27:31) You can't count the Chinese out. Obviously, they have various ways of accessing compute. They will have various ways of accessing compute. They are experts at squeezing every little bit out of whatever they can get. The Chinese chips don't do nothing. And China still has roughly 15% of the world's compute. It's not that they deliberately decided to give as much of it as possible to one company. They couldn't get something done. They're still smuggling some number of chips in. We're putting data centers all around the world, including in places like India and the UAE and Saudi Arabia. These aren't exactly the most secure places to put data centers, shall we say. It does seem like DeepSeek is still clearly the number one Chinese lab to me. Kimi was impressive in some ways, but I think the standard pattern holds: something impressive-looking comes out of a Chinese lab, and then the majority of the time, it turns out to be nothing. It was benchmark games. Its best features were touted, but in fact, it's not very useful. If you just assume that nothing ever happens, you do very well. But occasionally something happens. Kimi—something happened. But since then, it seemed like it's okay in some narrow domains but not that good overall. Similarly, there's z.ai or whatever exactly it's called—that seems okay. But I would say DeepSeek almost certainly still has a lot of talent and is still a live player if the reins were loosened, and they still might be loosened. But they look less live continuously as they don't do something impressive. 3.1 does not count as impressive to me. It counts as incremental, keeps the lights on a little bit, but not very much. Basically, they're coasting off of R1 and the reputational benefit from R1 and the fact that the open source models haven't really advanced that much since R1. That was low-hanging fruit that got plucked. But very expertly, don't get me wrong. 
I would say OpenAI is number one to me, Anthropic two, Google three. And I understand that some people think Google is better than that, and maybe they are. They do a lot of different impressive things on the side, but I still want to see more before I'm willing to give them that much credit at this point. Also, their resource advantages are shrinking around them. Google started out with, "We've got a trillion-dollar company and you don't. We've got all these GPUs and you don't. We've got this reputation and this distribution apparatus." I'll even admit it. We thought we would just crush them. But OpenAI is worth 500 billion. It's a decent percentage of Google, and a lot of Google is not directed at this. Anthropic is already worth 183 billion. We're not that far from the resources being pretty similar. And with that advantage gone, the fact that Google is a broken company, very dysfunctional in many ways, is gonna start to catch up with them. And Anthropic and OpenAI are very fast. But I think they're clearly one, two, and three in some order, and then you can have some argument over exactly the order. Basically, Google goes somewhere. Then xAI is the wildcard. They have a lot of compute. They play hard, but not very well.

Zvi Mowshowitz: (2:31:01) You want to write them off? Prove it. It's always "prove it." Meta is trying to come back. I think that us Meta skeptics were proven correct that they didn't have it. Doesn't mean they can't go get it and run it back. But now they're considering licensing Gemini or potentially ChatGPT to use for themselves, which is wise. I would do it too. You don't have to stop trying to develop your own AI. You just don't have to dogfood it while it's terrible. It's just not smart. There's too much money at stake and reputation at stake. But again, at this point, it would be surprising if the big three got disrupted in the near term.

Nathan Labenz: (2:31:49) One specific question I have around xAI: there were several things that stood out to me about the Grok 4 launch. One should never forget Elon's comment that he's not sure if AI will be good or bad, but even if it's not gonna be good, he wants to be around to see it. I thought that was—

Zvi Mowshowitz: (2:32:10) I have to say, it turns out you might not be around to see it for very long if it's not good. So I would be careful about that.

Nathan Labenz: (2:32:17) Yeah, I mean, that was an "I can't believe you just said that" sort of thing. Wow. It was only a livestream, or they wouldn't have let that one out. But he's that kind of guy, obviously. The other thing that really stood out to me, though, was he was talking about how they are going to be giving the model access to the same power tools that the engineers at SpaceX and Tesla use in the next generation of training. And it got me thinking that if we are headed for a world where the quality of the problems that the model is challenged with in training becomes a differentiator, then they might have the best feed of well-structured, very hard technical problems that are amenable to being solved with really advanced software tools—perhaps of anyone. They just have these really hard problems. He's got these frontier companies in multiple domains where they really do a good job of seemingly structuring problems. Anthropic doesn't seem to have something like that. I don't think OpenAI has something like that. Google sort of does, but it's extremely diffuse across their vast archipelago of fiefdoms that kind of roll up to be Google. But I could see Elon structuring that pipeline of hard problems into an RL cooker at xAI and maybe coming out the winner because of that access to the best engineers working on these really hard things. Does that seem at all credible to you?

Zvi Mowshowitz: (2:34:09) Not really. I don't think there's that much data that sort of naturally happens. These aren't that big, especially SpaceX. Beyond that, if that is the thing that matters, if that is the bet, then there's really a lot of data in the world to be collected. There really isn't that much barrier to collecting it or to getting access to it. There's nothing stopping a Google or an OpenAI or even Anthropic from making those alliances and getting that data, and there's no reason why those companies shouldn't be happy to help them do that in exchange for not that much money. So I just don't see that big an advantage.

Nathan Labenz: (2:34:50) It just seems hard. I agree that there are other—obviously, there are other car companies, for example. There aren't exactly other space companies, but say car companies. You've got all these engineering things happening at Tesla. They're happening at other companies in varying fashion.

Zvi Mowshowitz: (2:35:05) Right. It's funny. Are they—can you—

Nathan Labenz: (2:35:08) Can you imagine though going to General Motors, to take a company in my own hometown here, and saying, "Hey, can we extract your hardest engineering problems from your organization and structure them in such a way and get clarity on what the answers are and train our AI on that." Even if the CEO of GM is like, "Yeah, sounds great, we'd love to do that," I just feel like it would literally take easily ten times longer than it would take for Tesla to do a similar thing. Why? And I've done a little work with GM. I don't know. For the same reason nobody else has anything close to a self-driving car. Only Tesla and Waymo have come far in that domain, and everybody else has given up. They just don't seem to have the organizational juice to be able to pull things like that off.

Zvi Mowshowitz: (2:35:55) It's very different executing a very long-term complicated engineering plan like software engineering versus collecting a bunch of data. What data do you have to collect? If the goal is data collection—Google has Waymo, so they have infinite driving data if they want that, already at their fingertips. It's not that hard to put cameras on a bunch of cars if you want to collect a bunch of driving data. It's just not that expensive. Consider how much they're spending on training runs. Consider how much they're spending on acquisitions. I literally just googled the market cap of GM. It's 55 billion. You could buy General Motors if you were OpenAI, if this is so important.

Nathan Labenz: (2:36:37) I wouldn't recommend it. Why not?

Zvi Mowshowitz: (2:36:42) Imagine if you could buy General Motors and then using OpenAI's techniques launch self-driving cars relatively quickly. Couldn't you generate a lot more value than 55 billion? What's Tesla worth? Why is Tesla worth so much more? Are you sure we shouldn't buy General Motors? I'm not saying we should buy General Motors. I'm saying that OpenAI is worth 500 billion, and General Motors is 55 billion. So if the thing that's preventing them from winning the AI race is not owning General Motors, then they can just own General Motors. It's not that hard. And there are synergies. There are big synergies.

Nathan Labenz: (2:37:20) Yeah. I guess the thing that I sort of see being hard: even if you did buy General Motors, the thing that seems hard to reproduce that I think the likes of Tesla and SpaceX potentially have is just really clean environments where it's a well-oiled machine, data is flowing, vertical integration is deeper. The mess of supply chain at GM, all the suppliers, all that nonsense. That's where data collection can mean multiple things. Cameras on cars is one version of it, but I'm also thinking about problems. We wanted to design something that met this specification. Eventually, somebody did do it. What was that design, and where does that sit? And that stuff seems like it's just much more accessible and ordered probably at Elon companies as opposed to at legacy manufacturing giants that have declined a bit already.

Zvi Mowshowitz: (2:38:29) I see various different points of the story that don't make sense to me. I don't see why this would be that big a deal, especially given Google has Waymo. That's literally, as far as I can tell, the only company actually doing the thing. I don't buy this myth of Elon being the super executor. Elon has been not the old Elon for a while now, if you just judge by the quality of his public statements and decisions, including blowing up his very close relationship with the President of the United States over nothing he stood to gain whatsoever as far as I can tell. Just in terms of this person being able to execute on a plan. He was the right-hand man to Donald Trump, and then he got mad about the deficit. Something he had no influence over, and that he really didn't give a shit about given his belief in AGI. And he blew up the entire relationship about it pretty consciously and intentionally, knowing what he was doing. And now that person seems to be Jensen. It's a disaster for the United States. His influence going away did not help anything Elon Musk cares about in any way, shape, or form. His life is just worse.
Everyone on all sides basically hates him. And if you look at the self-driving car situation, they had been promising these self-driving cars real soon now for how many years? The same promise over and over again. And I'm not saying they're not making any progress. They're making progress. But it's been way behind any schedule that he's told us to expect. He's overpromised and underdelivered for about a decade now, and Waymo is the one actually doing the thing. So I'll believe it when I see it. But also, if it's data-fed, I don't think it's data-fed. But I often hear these stories about here's this reason why someone will win because they have this thing. And I think, oh, that thing was so important? You can just go get it. So it has to be the important thing, and then everyone else has to not realize it's the important thing until it's too late. You still there?

Nathan Labenz: (2:40:44) Yeah. I would say the FSD, for what it's worth, is getting very good. It's been a couple months since my last FSD ride, but before that, it had been a year between that last one and the one before that. Progress was definitely very obvious, and you no longer have to keep your hands on the wheel, for example, as one thing that shows the increasing level of confidence, and I was very impressed. I'm also super impressed by Waymo, but I wouldn't say because he is from—

Zvi Mowshowitz: (2:41:14) I'm excited for it, and I still don't have anybody drive a car without driving a car. So there you go.

Nathan Labenz: (2:41:22) Waymo supposedly coming to New York City pretty soon.

Zvi Mowshowitz: (2:41:25) Oh, okay. Have they told that to the New York City Council and the mayor's office?

Nathan Labenz: (2:41:29) I thought I just saw this on Timothy B. Lee's. They have, I think.

Zvi Mowshowitz: (2:41:32) There's a car that's been spotted in Brooklyn driving around mapping the city. That's great. There are laws on the books that say they can't operate. What is the plan to deal with that? Don't get me wrong. I want this to happen so badly. I want my Waymos. I will forgive the new mayor many things if he brings us Waymos. But—

Nathan Labenz: (2:41:59) You can take your Waymos, you can take your government Waymos to your government grocery store.

Zvi Mowshowitz: (2:42:03) I don't want government Waymos. I don't want a government grocery store. But there you go.

Nathan Labenz: (2:42:10) Okay. Last area for today. We both just participated in different cohorts in the Survival and Flourishing Fund grant-making process as recommenders. And I'd love to hear your thoughts on the broad survey of the AI safety charity landscape from level of cause areas to specific orgs to anything you think is neglected. What did you take away from—there were 400 plus grants, of which a bunch were pre-filtered out, but we still had 125, I think. What do you think of it all?

Zvi Mowshowitz: (2:42:48) The first thing to make of it all, obviously, is that you can't actually evaluate 125 applications, let alone 400 applications, in the kind of time they expect us to spend on this and that they budgeted for us to spend on this. It's just impossible. But you can properly investigate on the order of 10, maybe, organizations, and then you have to evaluate everyone you think deserves consideration for funding, which is going to be a lot more than 10. So you're relying a lot on your fellow recommenders. You're relying a lot on the applications. You're relying a lot on your past research. One advantage I had is I was in the previous year, or at least a not so long ago round of SFF, and a lot of the same nonprofits were applying again. So I could ask for a diff on those organizations rather than starting from scratch, which is a huge time saver. And indeed, a lot of the charities that ended up near the top for me were basically the same charities that ended up near the top or under serious consideration last time because the situation hadn't changed that much since then. I had the same number one I had in the previous round, which was the AI Futures Project. That's Daniel Kokotajlo, people who did AI 2027 between the time of the last grant and now. I thought this is one of those places where I could take a victory lap, pretty obviously. And there aren't no downsides to that. It still feels clearly like a good hit given I think I was the only one who put them that high last time. And this time, there's a pretty big consensus amongst a number of recommenders, so they should be pretty high. But basically, the big divide is policy versus research. Are you trying to solve alignment in some form? Are you trying to directly make the world better? Are you trying to shape public opinion, shape public policy, propose laws, file lawsuits, etcetera, to try and set better policy? And I definitely wanted to do both. 
I thought it wasn't obvious that there was one that was strictly better than the other. You're largely looking for what is underfunded, what is not gonna be otherwise supported by the ecosystem. One of the questions I asked a lot was, what is SFF's comparative advantage? Where do we get to identify talent, get to identify opportunity in a way that would be difficult for other people to fund? Last time, there was AI Futures Project, which was specifically unable to get certain other funding at the time for various reasons. And this time, I, for example, had the CAIS Action Fund pretty high, specifically the Action Fund, because it's harder to raise funds for C4 than it is for C3, not because I felt like the action fund money was much better spent on average, but that the distribution was going way too far the other way, naturally. So ACS Research was one of my top picks because specifically they were running out of funds. I had seen them do some things that I was very happy with. And I said, okay, I need someone who's doing valuable things not to just fall over and die. That thing is really valuable, that you stay in the ecosystem, you keep your organization, you don't have to constantly work for another home. You can do whatever you think is valuable here. And ultimately, I had MIRI this year. MIRI didn't need funding for many years because they got some very large donations, and they did the highly virtuous thing of actually not asking for donations while they didn't need the money. Well, now they need the money. I really hope that we come together and we make sure they continue their work. Those were some things that stood at the top, but I have a very long tail of places I would be happy to fund. So if you were to—I had limited here. If you allocate all of the money that the entire round would get and you ask, what would I fund? 
There are 10 organizations that would get at least $400,000, another 10 that would get at least $100,000, and a long tail that would get some amount of money from my funding. And I can go on about orgs. I'm planning to do another nonprofits post at some point later in the year in advance of Giving Tuesday and all of that to give my updated views as of then.

Nathan Labenz: (2:47:25) Yeah. Nice. My written output is obviously not 2% of yours, but I'm planning at least a Twitter thread on that topic as well, so we can compare notes as we get into the long tail.

Zvi Mowshowitz: (2:47:37) Yeah. It's a lot of work, and I'm gonna have to set aside specific time for it. But part of it is I don't know when they're gonna announce the results, and I don't want to—I wanna finalize that. I wanna see who gets how much money, and I wanna see who is then on record as having received money and therefore publicly part of the round versus who is not, because anyone who doesn't receive money is not publicly part of the round unless they say they are. And therefore, I have to email each of those people and ask, do you want to be in this post? And then at least I give them the opportunity to email me and say no. And then I treat no answer as a yes because in general, charities want people to say you should give money to them. But occasionally, someone doesn't want that.

Nathan Labenz: (2:48:20) Usually safe assumption. Did you have any—another category that I thought was interesting was sort of international relations. I know you're not very bullish on US-China cooperation, not that I'm arguing you should be, but there was a sort of crop of organizations trying to work on that, and I was pretty into that.

Zvi Mowshowitz: (2:48:47) So one of the issues with the round is there are some organizations I knew were doing good work. They had clear wins in their column, or I knew the people involved and the things they've done, or otherwise I could be confident in them. And then it was very hard to give similar level ratings to organizations where I didn't have that. So it's a question of, okay, you've got—I won't name them because they haven't been officially funded yet. But yeah, there were some charities that were trying to create track two talks or otherwise advance US-China relations in various ways related to AI, and I gave them some support. I definitely said this is fundable. But I was hesitant largely because it's very hard. Policy is notoriously difficult to read. If one of these charities was effectively fake in the sense that what they were doing was having no real effect and did not, in fact, impact the possibility of good things happening, would I know? Would they know? They might be bad at it and not even realize what they're doing. They might be doing this thing thinking they're accomplishing something and just not be accomplishing something. And the policy can look for 10 years like you're doing nothing, and then suddenly something happens. Or it can look like nothing will happen, and then nothing happens, but that was actually the right thing to do. You just create the possibility of something happening if things had gone a different way, or you stop the bad thing from happening without even realizing, or whatever. So it's high leverage, but you just don't know. And because of that, I find it difficult to get behind any of these organizations on a high level. Some of them will definitely make the post as something that I think would be reasonable to support. But it's really tough when you've got money that could definitely go to places that would be well spent to put it into a weird, impossible to read place where you don't get good feedback. 
Where you can't tell, that's all the more reason to ask, how do they know how to do things properly? How do they make good choices even if they're properly motivated? They can't tell either.

Nathan Labenz: (2:51:01) Another category—this is sort of a meta category—but the California bill SB 813, I'm sure you've engaged with a little bit, that would create the private regulatory market where either the attorney general or—

Zvi Mowshowitz: (2:51:16) Okay.

Nathan Labenz: (2:51:16) —some new commission would credential private orgs to be regulators, and then there would be some sort of trade where if an AI developer opts in to the regulation from one of these private regulators, then in exchange for that, they would get some sort of liability protection. And that obviously begs the question, who steps up to be these private sector regulators in the event that this bill were to become law? So one of the things I was looking out for was who do I see that kind of feels like they could become that if that opportunity were to arise? I don't know if you've thought about organizations in those terms previously, but—

Zvi Mowshowitz: (2:52:00) Yeah. There are a number of nonprofits that plausibly could step up and become players in this space. There are a number of founders who are perfectly capable of creating new organizations that could possibly do that in this space. If there's demand, there'll be supply. Because we have—it's not that hard to find expertise in this space that would be happy to participate in these organizations. Yeah. I just don't wanna name specific names, and I don't wanna fall into the trap of, oh yeah, this must be regulatory capture for these five people or five orgs or whatever it is. But I don't think that's the case. I don't think there's gonna be any problem with that, and I think that the most likely scenario is that companies like OpenAI quietly ask people who they expect to do good jobs to spin up organizations they can then work with if they're not happy with the slate of options that they are presented with initially from the natural process. So certainly, for example, METR or Apollo, just the people who are already being contracted for evals by the big tech companies would presumably be the first ones entering this space, and they would presumably be very credible in that capacity.

Nathan Labenz: (2:53:25) Yeah. Last category I'll put in front of you is hardware governance. There weren't too many organizations that were specifically working on this, but the read in my group was everybody seemed to have a different reason to like hardware governance. What's your thought on hardware governance?

Zvi Mowshowitz: (2:53:45) So unfortunately, right now, at the moment, it's politically really tough for hardware governance. I wouldn't necessarily ever say dead because things change so quickly, but it's not looking good because the focus is entirely on get people to use our hardware, and the last thing people using our hardware wanna do is be tracked. So there's a direct opposition to exactly what the prioritization is, so it's not gonna happen right away. I still think that it's one of these things where it's vital that we have that ability. I think it's really important that we have the technology completely set already, that if we decide, okay, as of 3 months from now, every new chip that gets shipped has to be tracked, we can do that. And ideally, also, if we have to go into a data center and put trackers on these chips in such a way that if they're tampered with, we will know, we know how to do that as well. And potentially, a very small amount of money can give us that option. And then we only have to spend the real money if we implement it. But I think it gives us optionality. It allows us to implement these solutions. I think the first best solution to the current mess is in fact to use this solution to allow us to do things like build data centers in the UAE and India in ways that feel secure and to stop chips falling into the wrong hands. And potentially, even to be more aggressive about what you let people you actively are worried about do. Yeah. I don't think we need that much effort. I think we just need a little effort to lay that foundation, and then the question is just figuring out who the real deal is.

Nathan Labenz: (2:55:28) How about just on the simple question of balance of money available and opportunities? I think, for me, the sense was I wish I had more money to give out. Obviously, to some degree, that would always be the case. But if you were to make a pitch to other philanthropists that there is a lot of stuff that is not funded as much as it should be that would be high impact—I guess, first of all, do you believe that? And second, what would that pitch sound like?

Zvi Mowshowitz: (2:56:01) So that pitch would sound something like: if I wanted to hand out the entire, let's say, $10 million that is roughly projected to be the entire round, it doesn't allow me to give out all the money that I would have been happy to hand out, not by a long shot. I could easily hand out more than double that and feel good about every dollar that I'm giving out there. Obviously, I would wanna do more investigation of some of the things that would then—because there's things where I'm like, I'm never giving money there. I've done the math. So I'm not gonna think too hard about exactly how to rate that. But you could give out tens of millions of dollars to these applications with basically zero waste. And that is the obvious direct evidence that there is ample philanthropic space. But also, there's a ton of stuff that is at the scale above SFF. We basically say, we can't fund these charities anymore, because their capacity to use money effectively has gone to millions of dollars a year. And we just don't have that capacity, or even in some cases, approaching $10 million a year or more. There are also a ton of philanthropic projects that were never even proposed because the price tag on them would be absurd. And also, everybody in this space is, of course, conserving money, which is not particularly great. We'd prefer to have generous salaries and generous compute budgets and not worry about this stuff, but that would take a lot more money. But even not changing that, even keeping everybody lean and not trying for extra massive experiments or anything like that—just doing what we're already doing—yeah, I feel confident about a lot more money being useful in this space than has already been deployed in this space. It's just not available right now. So it'd be great if you could help out. There are lots of places to put it. 
And again, there are a lot of people who just aren't asking for money because they know that there's so many other people asking for money. I'm in that situation. I'm being supported by patrons who are happy to support my work, but I'm not going to go out there and seek out more funding because I already know there's way more demand for this funding than there is supply. It doesn't feel reasonable for me to ask for that much more, but could I scale up at least somewhat from more funding? Very obviously, yes. There you go. I have lots of research projects also which, again, are trying to be lean. So we're all trying to be lean here.

Nathan Labenz: (2:59:04) Okay. Here's one idea I wanna get your take on. And this was not in the application pool. But I just saw this article, I'm sure you saw it the other day as well, about how OpenAI is starting to subpoena some of these charities, some of which were, in fact, in the application pool, that are doing various things that they find to be kind of inconvenient, like hassling them about their nonprofit to for-profit conversion. And they seem to believe, and the reason that they're giving for why they're issuing these subpoenas is that they believe that they may be funded by competitors. In other words, they think perhaps Elon or, maybe Google is funding these organizations to try to slow OpenAI down. This got me thinking, well, jeez. Maybe that could happen at the sort of model evals level. We've got these model evals companies, organizations, nonprofits that are, I think, very focused on being evenhanded, very fair, very analytical, and trying to do stuff pre-release, which I definitely think has a lot of value to it, but that forces them to play very nice with the companies that they're working with. And I wonder if somebody came forward and said, I'm gonna just, potentially transparently or not transparently, but we're gonna go after all the companies except Google and try to demonstrate to the public why their models are problematic, should not be trusted, all the ways that they go wrong. But we're not targeting Google perhaps because we're funded by whoever. Could you engineer a situation where all the companies then feel like, geez, we better go target our competitors' models. We better really invest in demonstrating what is wrong with our competitors' products. 
And if you could create that equilibrium where they're all kind of sniping at each other all the time, would that, in fact, bring a lot of things to light and potentially create the race to the top that everyone wants because, obviously, there are a lot of problems that can be demonstrated. It seems to me like the only reason that isn't happening is maybe a sort of soft collusion or an unspoken gentleman's agreement, I guess, is another way to say soft collusion. But if somebody were to break that, maybe it all kind of goes to a different equilibrium where everybody is investing in that, and we have a lot more energy going in that direction. What do you think?

Zvi Mowshowitz: (3:01:39) It's not a great look from their perspective to be funding attacks on these other companies. And when you expose these things in other companies, you are also exposing yourself. Almost always when you find these flaws, they're everywhere in some form or something close to them is available in some form, and it will seep into public consciousness. Who leads the call for greater regulation? It looks like it could lead to any number of escalations. You don't need collusion for, oh yeah, my gang has guns. Your gang has guns. Why don't we just stay away from each other's territory and not shoot at each other because that could get into something pretty ugly pretty fast? And so yeah. Most of the time, Coke and Pepsi don't start smear campaigns on each other. They just do positive advertising. Maybe they take a few cool little side shots of, you aren't that cool. But they don't fund nutrition studies about why the other one is unhealthy because it doesn't work out. But certainly, you could fund people to go after specific organizations in these ways or do investigations and deep dives on their particular models. You can have opinions on who to target first, and I don't think it's that crazy. Some companies are being less responsible than others and deserve to get hit more. If you target xAI, they might just say thank you.

Nathan Labenz: (3:03:04) Yeah. That's why when I was saying target everybody but Google, I was thinking the rationale there is just that there's a lot of Google billionaires that could plausibly be funding such a thing, and the reason they might be doing it could be a mix of competitive advantage and/or philanthropic desire just to bring issues to light. And—

Zvi Mowshowitz: (3:03:26) Yeah. And the problem is that specifically not targeting Google does raise questions. If you're a Google-funded organization that only targets OpenAI, let's say, just to keep it simple, then why do I think your evaluations are objective? Why do I think that when you say something is a problem, that it's a real problem or that it's a specific, particular problem or anything like that?

Nathan Labenz: (3:03:54) I think the idea would be that it's just reproducible. If you just have inputs and outputs from models and you just demonstrate this is an attack, that's pretty—

Zvi Mowshowitz: (3:04:04) Yeah. I didn't mean is the finding real. But we're trusting that you followed scientific procedures in selecting this example, that it's representative, that it teaches us what you think it teaches us, etcetera.

Nathan Labenz: (3:04:22) Yeah. I see that challenge. Although, I also think that's all pretty slippery. I mean, as much as there is, I think, a very sincere desire among the evals groups today to have these high standards of rigor, all their stuff is always questioned. And you've got a lot of people who are like, oh, well, this is totally nothing because you put the model in a situation.

Zvi Mowshowitz: (3:04:47) It is the job of Caesar's wife to be above reproach and be reproached anyway. It is the job of those who are trying to be the watchdogs in these situations, the people who are trying to be in our position. We have to follow standards of rigor and integrity that are vastly above what others are held to. And that's table stakes. That's just the right to play. And it's not fair, and that's just life. And you're still gonna have all that questioning of the text. Yeah. Absolutely. We all saw the debate over SB 1047. We all saw how you bent over backwards to be ten times better on all of these issues and all of these questions than the people you were opposed to, and it didn't—you had to be able to play in the arena, but you have to just—if you're the underdog, if you're the scrappy, underfunded person who's trying to bring the truth to light, that's your job. And the big corporation is gonna try and squash you with anything they've got. You gotta be sparkling clean. You gotta have no vulnerabilities, no points of leverage, no smears—it's just how it is, and it sucks. But we're used to it. Alright.

Nathan Labenz: (3:06:24) Does that mean you basically don't... The equilibrium that I'm trying to envision a way to shift from and to is one where today—and there was this mutual adversarial collab. I don't know if it was really adversarial, but OpenAI and Anthropic evaluated each other's models seemingly in a pretty friendly, collegial way. That was great. I think everybody loved to see that, but that hasn't happened much. Maybe it'll happen more again in the future. Maybe it'll never happen again. I was trying to engineer a transition to a different equilibrium where everybody is adversarially evaluating everybody else, thinking maybe if I tip one domino, everybody else will feel that they have to respond. And then from a general safetyist worldview, it would seem better if they were all adversarially evaluating one another versus not. So I guess you could question that first assumption—that would be a better equilibrium—and then secondly, can we tactically get there?

Zvi Mowshowitz: (3:07:28) I would love to get to a point where the companies were doing each other's evals in an adversarial fashion, looking for trouble, looking for vulnerabilities, looking to embarrass them. And they just had to see if they can deal with it, see if they can beat it, see if they can overcome that. That sounds great. I would love that. I don't know how you get there from here. I think that these companies do not want to go to war with each other. I don't think we want them to go to war with each other in other ways. Also, people are constantly—I mean, you've already got them going to war over talent, so I don't know. But I can see it being a thing Anthropic might want to do in the medium term: be like, "We run all of our evaluations against everybody's AIs, and we report them back." And occasionally, they're going to find some stuff. But they did—

Nathan Labenz: (3:08:25) They did do that with DeepSeek. They did come out and say DeepSeek has no qualms about doing bioweapon-type stuff.

Zvi Mowshowitz: (3:08:30) Yeah. Their evaluation of the DeepSeek safety protocols was: "Safety protocols?" Right. Yeah.

Nathan Labenz: (3:08:38) Yeah. Okay. Well, if any Google alums with the resources want to talk about seeding such a thing, my DMs are open. Always a final question for you in particular: What is virtuous to do now?

Zvi Mowshowitz: (3:08:56) Yeah. So I think it's a weird situation where it can be really tough to figure out where to make the most meaningful progress, what to do going forward. On policy, the short-term priority has to presumably be preventing America from being so foolish as to sell the B30As to China, which in practice presumably means getting enough people on the right sufficiently alerted—this is actually happening, here's what this actually means—that they raise enough of a stink that it doesn't actually happen. And otherwise, draw attention to the extent to which NVIDIA seems to have taken hold of the White House in terms of its rhetoric and its plans overall. Not to—I mean, this is obviously the ultimate end goal or the primary reason for that. But you see it in everything. Obviously, as usual, trying to spread the better worldview is always good. I've certainly gotten to the point where I think that working for Anthropic seems to clearly be a good idea at this point if you are considering what to do and the alternative is doing basically nothing. I do think there are a number of orgs which are presumably better choices for impact than just working at Anthropic, but that doesn't mean that they have capacity or that you want to work there, obviously.

Zvi Mowshowitz: (3:10:41) But yeah, it's a difficult situation because the policy situation is in a bad state. Alignment is in a not great state. There's obviously infinite things to work on, infinite things to experiment with, infinite orgs to get money to, and so on if you want to do that. But my product is basically trying to keep myself and others informed and understanding of the situation in hopes that that will lead to good things more than anything else. I wish I had a better answer for a call to action. You listen to this podcast for 3 hours—depending on what speed you're listening at—what are you going to go forth and do? But unfortunately, I don't really have one. Other than think hard about the world and try to figure out what under your model would be the right things to do to advance it, and what are the things that would actively make things worse. Because for a lot of people who are informed, the first step is just be aware of what would make things worse and don't do that. But also, more than anything else, you should say what you think. You shouldn't sugarcoat. You shouldn't engage in hyperbole. You shouldn't strategically censor yourself. There are going to be exceptions, but you should just say what you actually believe about the situation. And one thing you can do is support the book release. Eliezer Yudkowsky and Nate Soares are coming out with a book in about a week: "If Anyone Builds It, Everyone Dies" or "The End of Superintelligence." And if you were to help by purchasing and spreading the word about that book, this is a unique opportunity to create momentum, maybe create a cultural moment. But again, that doesn't mean that you should just back their talking points and their idea about how this works just because they're the ones with the book. Or just because Eliezer is the rightful caliph and he said so or anything like that—you should pick up your own models.

Nathan Labenz: (3:12:48) I've preordered my copy and look forward to reading it. As always, really appreciate all your time. You've been very generous with it, and we'll continue to stay informed via the blog, Don't Worry About the Vase. I think that's it for today. Zvi Mowshowitz, thank you again for being part of the Cognitive Revolution.

Zvi Mowshowitz: (3:13:06) Thank you for having me. Alright. Bye.

Nathan Labenz: (3:13:10) If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts where experts talk technology, business, economics, geopolitics, culture, and more, which is now a part of a16z. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And finally, I encourage you to take a moment to check out our new and improved show notes, which were created automatically by Notion's AI Meeting Notes. AI Meeting Notes captures every detail and breaks down complex concepts so no idea gets lost. And because AI Meeting Notes lives right in Notion, everything you capture—whether that's meetings, podcasts, interviews, or conversations—lives exactly where you plan, build, and get things done. No switching, no slowdown. Check out Notion's AI Meeting Notes if you want perfect notes that write themselves. And head to the link in our show notes to try Notion's AI Meeting Notes free for 30 days.
