AMA Part 1: Is Claude Code AGI? Are we in a bubble? Plus Live Player Analysis

Nathan Labenz shares an update on his son Ernie’s cancer treatment and how he uses frontier AI models, then answers AMA questions on whether Claude 4.5 and Claude Code approach AGI, the AI investment bubble, Chinese models, and leading players.


Watch Episode Here


Listen to Episode Here


Show Notes

In this AMA episode, Nathan gives an update on his son Ernie’s cancer treatment and how frontier AI models are helping him navigate complex medical decisions. He reflects on whether Claude Opus 4.5 and Claude Code amount to AGI-level coding, sharing stories of vibe coding apps for his family from the hospital. You’ll hear his framework for getting real value from Gemini 3, Claude, and GPT 5.2 Pro, plus his take on AI bubbles, Chinese models, chip controls, and who the true live players are in today’s AI race.

PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more about the MATS Summer 2026 Fellowship and submit your name to be notified when applications open: https://matsprogram.org/s26-tcr

Sponsors:

MongoDB:

Tired of database limitations and architectures that break when you scale? MongoDB is the database built for developers, by developers—ACID compliant, enterprise-ready, and fluent in AI—so you can start building faster at https://mongodb.com/build

Framer:

Framer is an enterprise-grade website builder that lets business teams design, launch, and optimize their .com with AI-powered wireframing, real-time collaboration, and built-in analytics. Start building for free and get 30% off a Framer Pro annual plan at https://framer.com/cognitive

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) AMA intro format

(00:22) Ernie health update

(09:24) Claude 4.5 question (Part 1)

(09:29) Sponsors: MongoDB | Framer

(11:21) Claude 4.5 question (Part 2)

(13:21) Using AI for cancer

(19:48) AI value and skill

(22:10) Holiday coding projects (Part 1)

(22:15) Sponsor: Tasklet

(23:27) Holiday coding projects (Part 2)

(28:01) Claude code workflow

(32:03) Is Claude 4.5 AGI

(36:04) AI bubble or not

(41:22) VC froth examples

(46:09) Chinese models comparison

(55:29) H200 exports to China

(01:03:40) Google DeepMind strengths

(01:11:55) OpenAI strategy outlook

(01:22:51) Anthropic culture and strategy

(01:36:17) XAI promise and risks

(01:48:20) Meta and Microsoft

(01:52:29) Part two preview

(01:53:39) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Main Episode

[00:00] Welcome back to the Cognitive Revolution. This is our AMA episode. My schedule has been a little bit crazy lately, and so I never actually really scheduled this with anyone. And so there's nobody here to ask me the questions. So I'm just going to read the questions myself and then give you my answers. But I did get some really good questions and I'm excited to answer them and hopefully people will enjoy this episode and find some value in it. With that, by far the first and most important question and the most common question that I'm getting these days is how is my son Ernie doing since the big episode that I did about his cancer back in November? And the good news is he is doing really quite well. I'm very pleased to report that. Certainly cancer, and certainly cancer of this type, being as aggressive as it is, and I won't belabor the whole thing from last time. Go check out the two-hour monologue on that if you want the full story. But a cancer this aggressive, which can double as quickly as every 24 hours, does get very aggressive treatment. And so he has been through the wringer with the chemotherapy. He's through basically half the chemotherapy now. There are six rounds in total, and he's been through three. The final two rounds, rounds five and six, are supposed to be a little milder than the first four. So depending on how you count, we could say he's maybe a little more than halfway through the treatment, but somewhere around that, and it's definitely been rough on him. There's no doubt about it. When he went into the hospital, he was 51 pounds. He's still 41 pounds today, and that's the weight that he came home at after the first round of treatment. He's been able to gain a little weight, lose it back, gain a little, get dehydrated, lose a little. So you can just see looking at him, he's super thin, he's quite pale. He's definitely not nearly as strong as he was before we went in. But on the markers that really count the most, namely, does it look like the cancer is being effectively treated? There he looks really good. After the first round of chemotherapy, the PET scan that he had showed no obvious focal points of cancer. And when our oncologist met with the tumor board, they all agreed that it made sense to classify him as being in remission as of before he even started the second round of treatment. So that is great. You also, if you listened to that earlier long episode, might recall that one of the things that AI helped me do is identify some additional testing that is not yet standard of care, but can be done to try to get a better, more sensitive take on: is there any cancer left in his body? If so, how much? And how is it trending? That's called minimal residual disease testing. And I don't know how it works in all different kinds of cancers, but in the cancer that he has, which is a cancer of the B cell, the B cells do this interesting thing where they rearrange certain parts of their genetic material in a purposeful way, semi-random, I think, but purposeful, so that they create variation so that they can have a better chance of creating proteins that bind to new disease factors that are in the body. And so this process of differentiating B cells is literally unique cell by cell. And then when one of those cells goes bad and becomes cancerous and grows out of control, they can use that rearrangement that individual cell did, that then gave rise to the whole cancerous process in the body.
They can use that random resequencing or, you know, shuffling up of its own sequence that it did to essentially fingerprint that cell type. And so there are two sequences, one for each of the chromosome pairs where this rearrangement happens, that they've identified as being the dominant clone or cancer in the body. And now that they've identified that, we can do a blood test every so often and check to see how much of that DNA is floating free in the blood and how many live cells actually have that DNA sequence. So far we've only got one of those tests back, and we will certainly want to look at more and trend it over time. But the first one that came back, and this is all the way back now, over a month ago, that it was drawn, came back with fewer than one cell in a million with that DNA sequence detected. So that's really good. They also called that below the LOD, or limit of detection, for the test. So it was basically in that area where it's such a low rate, not zero, but such a low rate that they would expect that maybe some samples would have zero and some maybe would have one cell or two, but it's a very low rate. For reference, we estimated that when he was diagnosed, potentially as many as one in 10 cells in his body, and essentially all of the B cells, were of the cancerous type. So to go from one in 10 total cells and a large majority of the B cells down to one in a million cells detected or less, that's obviously... I think it was Gemini that said it was 99.9999% reduced.

[05:13] Other AIs were a little less colorful in their language and said it was probably safer to say that it was an orders-of-magnitude reduction, but that's great. We'll do more testing of that type and we'll certainly be watching it. But as of now, we are feeling cautiously optimistic that he is on the path to a cure and a full recovery. And in some ways, the recovery is kind of already underway. It was 60 days, from maybe a week before we went into the hospital to just around Christmas time, that he was not able to get around by himself. He could stand, but to walk we would always hold his hand and just, you know, make sure that he had support every, literally every, step that he took. And finally around Christmas time we had a chance to come home for a week from the hospital. And during that window he regained some strength and started getting around by himself. And fortunately that has been sustained for the last two weeks or so since he started doing that. So hopefully, knock on wood. Certainly, there is some risk. They don't understand exactly why this cancer can come back in some patients even when it looks like it's gone. So we're not out of the woods entirely, but his response to treatment has been basically as good as we could have hoped for. And even with the MRD testing as suggested by the AIs, it looks about as good as we could hope for. Certainly hope that that next one shows no detection at all, but that one is still pending, so we'll have to wait and see. I really do appreciate everyone who has reached out during this time. A lot of well wishes. I've tried to respond to everyone. I think I have mostly responded to everyone. If I've missed you, I apologize for that. But I really have appreciated all the encouraging words. And then I also really wanted to just say a quick shout-out of thanks to my fellow podcasters who have allowed me to cross-post some of their content to our feed over the last couple months, because I certainly couldn't keep up the pace of doing eight episodes a month during this time, and I was just very glad, very fortunate, that I was able to do some cross-posting and bring you guys some other stuff that I think is very well worth your attention, but also took a little bit of a load off of me. So we had one from Agents of Scale. That's actually a sponsored episode from Wade Foster, the CEO of Zapier, who's got a new podcast out. I actually think it's really good and do recommend it. We had one from ChinaTalk, which was with a researcher and business development lead from Z.ai out of China. I thought that one was really quite interesting. We had one from Doom Debates, which was a debate between Max Tegmark and Dean Ball. I thought that one was really good. It's the kind of thing that I want to be listening to more. I've actually had the goal for a long time of cross-posting at least one episode a month, just because I feel like eight episodes a month is a lot. And if anybody is listening to all of these episodes, they should probably be diversifying. So maybe I can help you diversify if you're not diversifying on your own. And it also keeps me listening. So I definitely want to make sure that I'm staying in touch with what other people are coming up with in this field. So I think all those were really good. Definitely recommend those. And then finally, from the a16z podcast, we had the one that Eric did with Emmett Shear and Seb Krier from Softmax and Google DeepMind, respectively.
So there'll probably be a few more cross-posts in the coming months. We've got about two and a half months left of treatment, after which, assuming all goes well and according to plan, we should really be pretty much done and start to get back to life as normal. He will have to get all his vaccines again, which is another interesting thing, because his immune system has been so thoroughly wiped by all these chemotherapies and immunotherapies. The memory that the immune system had gained from all the vaccines that he had gotten in the past is all wiped, and he's gonna pretty much have to get 'em all again. That's not ideal. It's, you know, we're not going to be immediately back to full normal, but a couple more months, two and a half to three more months, we should be, knock on wood, getting back to pretty much normal. And in the meantime, there probably will be a few more cross-posts. So thanks to everyone who's reached out to ask and to share their best wishes. And also, again, to the fellow podcasters who allowed me to cross-post some of their content and fill some gaps in the schedule. I really appreciate that. Okay, on to more AI-centric topics, which you tuned in for in the first place. So next question: is Claude Opus 4.5 AGI, and what's up with the holiday Claude Code hype? To be honest, I'm not exactly sure about this. It's kind of surprised me. I would say, obviously, Opus 4.5 is awesome. You know, there's no denying that. And I have been using it, you know, as I try to use all of the new latest and greatest coding models. And I've had a great time of it. There's no doubt about that. I vibe coded three apps for family members for Christmas presents this year, in the hospital for the most part. Actually, probably the most frustrating part of that experience was the hospital Wi-Fi that kept causing me to have to reload my Replit app all the time. The actual coding experience was very good, clearly better than it has been in the past, no doubt. You know, the progress is, I think, unmistakable.

[10:25] And yet, I wouldn't say that it has been such a step change for me, relative to what I've experienced in the past, that I would say, oh, it's categorically different, or, you know, makes me want to, like, shout from the rooftops that some major threshold has been crossed. And I'm not sure why that is. Maybe, to take the charitable view... And people have said this about the cancer thing as well. Not so many people, but a handful of people have said, maybe you're getting that kind of value out of the models for cancer purposes, but that's probably because you really know what you're doing. And other people, you know, they might not get so much value because they might not know what they're doing. And so, you know, they could go wrong. Honestly, I would say about the cancer, first of all, that you really don't need much skill in using AI to get great value from the latest generation of models, for even something as important, critically important, and as cognitively demanding as a cancer case. I feel very confident that a layperson with basically no knowledge of AI, if they do pretty much just two things, would get very similar value to what I've got. Maybe I'll say three things. One thing is use the best version of the models. Do not go to ChatGPT and just drop in the question and let the model picker choose. Make sure you are using at least a thinking model, and I would really recommend Pro if you're dealing with something that sensitive. Yes, it is $200 a month. In that context, I think it's absolutely worth it. I think it's worth it generally for almost everyone, regardless. But certainly if you're dealing with a life-threatening situation and you're asking AIs to weigh in on it, paying the $200 a month is a no-brainer. Claude 4.5 Opus is, of course, the other one, and Gemini 3. I would say all three of those are very good. Make sure you're using those top-tier models. Before long, of course, there will be new top-tier models. You should probably be upgrading as soon as you possibly can. Just make sure you're using the best available models. If you're doing that, they are up to the challenge. Second, make sure you're providing as much context as you possibly can. I recently hit the limit of the length of my chat with Claude, and I started a new one. And I did that in part by taking all of the stuff that I had and summarizing it into maybe a 10-page report on kind of everything that's happened so far. Everything we've learned, the treatment protocol, the genetic profile of the cancer, how he's reacted to different things, like which drug he had a bad reaction to and we shouldn't do again. It's pretty much all in there. It's pretty much everything, quote unquote, that a new attending physician would need to get a good survey of the case. And hopefully it was meant to be something that I could also paste into a fresh context of a new language model and give it everything it needed as well. I have noticed that, in doing that, obviously certain information was lost. And when I've started a fresh chat with that kind of summarized history, the performance is a little bit worse. For example, one way in which it's been noticeably a little bit worse is that when I was going to it every single day and giving it the latest lab results and saying, here's the latest lab results, here's what we've seen, here's what's going on, give me your take on it. It would do a very good job of looking back at the previous day or the last couple of days of lab results and figuring out that trend.
When the whole history was compressed, it doesn't have that level of detail anymore. It can't look at literally yesterday's lab results. And so it started to compare lab results from today, January 6th, to the last lab result that it had in that summary, which was a couple of weeks ago, for a particular data point, a particular liver enzyme. Whatever, the details of that don't matter. But it wasn't a particularly important thing. We had a little question about it today, and it wasn't something where every single data point was in that history. And so you can kind of see that it's starting to perform a little worse here, because it's kind of looking a little too far back into history and not realizing that there were a bunch of blood tests taken in the meantime. Anyway, that's all very much in the weeds. The key point is give it as much context as you possibly can. What I probably need to do next is go take that summary and flesh it out even more. And then if I do that, I should be in good shape. Make sure the models have as much information as you can possibly give them. I have not really seen much trouble in terms of context overload or getting confused. Not to say that that hasn't happened at all, I certainly could have missed something along those lines, but when I went back with the summarized case report after hitting the Claude length limit on the chat, there it was clear that the performance was worse for lack of context. Not even really the model's fault, but just clearly worse for lack of context. So more context is better. I haven't really seen that rule violated at all. Give it as much as you possibly can. And then the third thing... So the first thing is use the latest and greatest models. The second thing is give it as much context as you possibly can. The third thing is get multiple opinions, including multiple AI opinions. I am using, pretty much for all important queries, Gemini 3, Claude 4.5 Opus, and GPT 5.2 Pro now. And it is instructive. It is definitely useful to compare and contrast. I would say they're all very good. If you really could only afford one, I think you can trust it pretty well.
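For anyone who wants to script that "ask all three" habit rather than copy-pasting into each app, a minimal sketch against the providers' Python SDKs might look like the following. This is illustrative only: the model identifiers are placeholders to swap for whatever the current top-tier models are, and the case_summary.txt file is a hypothetical stand-in for the kind of hand-written case report described above.

```python
# Minimal sketch of the "multiple AI opinions" step: send the same case summary
# and question to three providers and read the answers side by side.
# The model ids below are placeholders, not exact product names.
import anthropic
import openai
from google import genai

CASE_SUMMARY = open("case_summary.txt").read()  # hypothetical hand-written case report
QUESTION = "Given today's labs, is there anything concerning in the trend?"
PROMPT = f"{CASE_SUMMARY}\n\n{QUESTION}"

def ask_claude(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-opus-4-5",    # placeholder id
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def ask_gpt(prompt: str) -> str:
    client = openai.OpenAI()        # reads OPENAI_API_KEY
    resp = client.chat.completions.create(
        model="gpt-5.2-pro",        # placeholder id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    client = genai.Client()         # reads GEMINI_API_KEY
    resp = client.models.generate_content(model="gemini-3-pro", contents=prompt)
    return resp.text

for name, fn in [("Claude", ask_claude), ("GPT", ask_gpt), ("Gemini", ask_gemini)]:
    print(f"===== {name} =====\n{fn(PROMPT)}\n")
```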

[15:41] I would probably, even though I think Gemini 3 is extremely impressive, put it third in my draft order now, because I have learned that it does seem to have a bias toward strong opinions. It seems to me to be remarkably strong in its opinions. And now, if you heard my live show where we talked to Logan Kilpatrick from Google, he did note that I have been using this in Google's AI Studio. So I'm using the most bare-bones, unaltered, raw model that you could basically get access to. If you use the Gemini app, presumably there's a system prompt in there, and it might behave a little bit differently. Obviously, if you were to go use other apps powered by Gemini, there'd be all kinds of different modifications that would cause it to behave differently. But just using the raw model in AI Studio, I found Gemini 3 to be very opinionated. And sometimes I really like that. I do really like it as one of the three takes that I'm getting. But if it was the only take, I would worry a little bit that it would probably, you know, sometimes be pushing me too hard in a certain direction, where if I had the full three, it would kind of balance me out. I think I would put, for most people, Claude Opus 4.5 at the top, because it's much faster than GPT 5.2 Pro and I don't notice it being much worse. Its answers are shorter, and they're much more answering your question than they are doing a full sort of report-style analysis. GPT 5.2 Pro, and I don't really do system prompts or custom instructions for any of these, by the way, I pretty much use them in their vanilla form as much as I can, 5.2 Pro gives you long, sectioned, report-style analyses that I do find, again, to be very useful. But if I had to pick the Goldilocks one, I think it would be Claude Opus 4.5. Gemini 3 being maybe a little too brief, a little too opinionated, GPT 5.2 Pro being maybe a little too verbose, a little information overload, and Claude being just right. But I do recommend using all three, and I think that doing it all in triplicate is absolutely very well worthwhile. To pop back up a layer in my question stack here, people sometimes say to me, you get this value from these AIs, but you know what you're doing and other people don't. And my advice is, it's really very simple. You do those three things, you're going to get value. You don't need to be an AI expert by any means. That said, maybe you could say I was getting more value from previous coding models relative to what other people were because I maybe had more practice and was more skilled in it. I certainly think there is some truth to that when you look at the METR study that showed that some software developers thought they were being sped up by AI but were actually being slowed down. I love METR, and I've said many times, do science, report the results. You do not need to make your scientific publication fit a particular narrative. In fact, you probably shouldn't try to do that. You should probably just try to run experiments and share results, as long as you believe that the experiment was well run and the results are legit. So I do believe that those results are legit, but I think there were some important caveats there. It was older generations of models. The people didn't have much experience. It was very large and well-established code bases with very high coding standards. I don't tend to code in that kind of environment. I tend to vibe code and hack together apps.
And I certainly think I've gotten to be pretty decent at it. So maybe I was kind of maxing out previous-generation models a little bit more than other people. I don't really know. It surprised me. I guess to take the flip side, you could say, well, hey, maybe Nathan, you aren't such a great software developer. Maybe these pro software developers have better taste. And now that 4.5 has gotten so good, or has crossed some threshold where it's really becoming a lot more useful to them, maybe they're noticing that difference and I'm not, because I'm just fundamentally not as good at the task, don't have as much taste in this domain, and I'm just not able to see what 4.5 Opus is bringing to the table over and above 4.1 or other frontier coding models. I don't know. That's possible. I certainly am not a great software engineer, so that certainly can't be ruled out. But it could also just be some social things, like people just kind of catching up over the holidays. Maybe the timing was right. Sometimes these things go with like a cascade. Dean Ball tweeted that 4.5 is AGI, and people seem to kind of latch onto it. So to some extent, I think some of this stuff is kind of random social dynamics at times as well. The three apps that I coded, by the way, for the holidays, for what it's worth: my mom is a very meticulous travel planner. My parents were actually in Italy for a trip and came home early to move into our house and help us take care of our kids while we've been at the hospital so much. So thank God for them for doing that. My mom plans these trips that she and my dad take to the maximal limit of planning. And so I coded her up an app to try to accelerate her planning process by building in a lot of the tastes that she has. She's gluten-free, for example. So that's one big place where her time goes in planning these trips. What places can I actually eat? What places are gluten-free? And this is an app purely for her. There's no account. It's not something that she logs into and logs out of. It is a Replit app that she goes to when she wants to.

[20:59] Nobody else is ever going to use it. Her profile is baked in. I could imagine generalizing it and allowing people to customize their own profile, but I don't know, I'm not really trying to do that. I'm sure there are plenty of travel apps out there that people are building and commercializing. This one was really just for my mom, trying to capture some of the stuff that she does and make it work for her and speed up her process. So far, I think that has gone pretty well, actually, for her. It seems like she's getting at least some value from it. Then I made one for my wife, who organizes EA Global events, that simulates events. It allows her to kind of set up a roster of attendees with different profiles and various attributes, and then literally simulates people walking around a virtual event space and bumping into each other. And depending on what areas they're interested in, they may or may not have a conversation, and that conversation may or may not lead to some outcome. They track various KPIs, which they measure mostly through surveys. But we kind of set this up in simulation. And the goal there is so that she can at least try to get some handle on: if we changed the size of the event, would it be more or less effective? Would it be more or less cost-effective? What if we had more senior people versus more junior people? What's the right mix? Obviously, these simulations are always highly flawed, but I'd say they probably do have something to add relative to total guesswork. So that was a pretty fun one and pretty straightforward, actually. That one went pretty smoothly. And then the third one was for my dad, who... he isn't really a very active day trader, but fancies himself a bit of a stock market guy. And so for him, I created one. And these are all AI apps. In the case of the travel planning, you know, it's Claude going out there and doing the research and digging up and looking through Italian restaurant websites and reviews to figure out if they are gluten-free or not. In the case of my wife's, she can prompt the app with a general idea of an event, and it will fill in all the detailed config, and then she can edit. I think that's a great paradigm or pattern in general for apps. You always have these detailed configs, these nitty-gritty forms that need to be filled out. But AIs are really good at doing that. If you just give them a general gist of what you want, they can translate that down to the low-level configuration. And so that's where the AI is in her case. It also allows her to kind of edit a configuration. So say she has a certain event profile that she's set up, she could then go and say, I want to change this in the following way. And it would kind of paint that conceptual idea that she gave it on top of the configuration and change it in all the little ways that it needs to be changed. And then with my dad's, it takes a high-level natural language stock trading strategy and turns that into actual trading rules, then goes and fetches historical data. And there's a Python package out there called yfinance, which I didn't really know anything about. And it does have a paid version, but there's a free version. And so for now, he's been able to get by just with the free version. It goes back and gets historical data and simulates what would happen if you applied those trading rules, based on that high-level natural language strategy, over a time interval that he can define.
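For a rough sense of what that backtesting piece looks like under the hood, here's a minimal sketch using the free yfinance package. The specific rule here, a simple moving-average crossover, along with the ticker, dates, and window lengths, are illustrative stand-ins, not the rules the actual app generates:

```python
# Minimal backtest sketch: a simple moving-average crossover rule versus
# buy-and-hold on the S&P 500, using free yfinance data. Illustrative only.
import yfinance as yf

data = yf.download("^GSPC", start="2015-01-01", end="2025-01-01")  # S&P 500 index
close = data["Close"].squeeze()

fast = close.rolling(50).mean()    # 50-day moving average
slow = close.rolling(200).mean()   # 200-day moving average

daily_ret = close.pct_change().fillna(0)
# Hold the index only on days after the fast average closed above the slow one.
in_market = (fast > slow).shift(1, fill_value=False)
strategy_ret = daily_ret.where(in_market, 0)

strategy_total = (1 + strategy_ret).prod() - 1
buy_hold_total = (1 + daily_ret).prod() - 1
print(f"Crossover strategy: {strategy_total:+.1%}")
print(f"Buy and hold:       {buy_hold_total:+.1%}")
```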
And what we're finding more often than not is that it's pretty tough to beat the market, which is honestly... I don't know if he'll listen to this, but one of my, you know, private motivations for making this thing was to kind of convince him that he's probably not going to beat the market, and certainly not with these, you know, random heuristic, if-this-then-that kind of trading strategies. And sure enough, it has been very difficult so far, for me in my development of the app or for him, I think, in what he's made of it so far, to find a strategy that actually beats buying and holding the S&P 500. I'd probably put, I don't know, three full workdays, somewhere between three and five workdays, probably closer to three though, into those three apps. In each case, I didn't really know where I was going when I started. I started with a chat with Claude just to say, hey, here's what I'm looking to do, help me out. I think it's, you know, it's good. Is it like night and day better, going back to the question that prompted this whole Christmas present vibe coding story, is it that much better than Claude 4.1 Opus or 4.0 Opus? I can't really say I think it's that much different, but certainly very good, you know, and good back and forth, good questions, good feature ideas. Translate that all into a plan, then go over to the Replit app, install Claude Code on Replit. That's one of the things I love about Replit: whatever you can do in a normal, fully controlled development environment, you can pretty much do on Replit. That includes installing Claude Code. Of course, they have their AI agent too, but since this was like a moment of Claude Code hype, I would just install Claude Code there, give it the plan, let Claude run off and build the app, and then just test and iterate. And I do still find, and this might be a way in which I'm kind of falling short as a Claude Code user, but I do still find a lot of value in a short script that prints out my entire app to a single text file. And then taking that entire text file over to another LLM, could be Claude, could be Gemini if I need more space, and asking it to analyze the code base in full.
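That "print the whole app to a single text file" script can be as simple as something like this rough sketch; the extensions and skip list are assumptions to adjust for your own project:

```python
# Rough sketch of the "dump the whole app into one text file" script: walk the
# project, skip dependency/build folders, and concatenate every source file
# with a header so another model can read the code base in full.
import os

SKIP_DIRS = {".git", "node_modules", "venv", "__pycache__", "dist", "build"}
EXTENSIONS = (".py", ".js", ".ts", ".tsx", ".html", ".css", ".json", ".md")

with open("full_codebase.txt", "w", encoding="utf-8") as out:
    for root, dirs, files in os.walk("."):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]  # prune skipped dirs in place
        for name in sorted(files):
            if name.endswith(EXTENSIONS):
                path = os.path.join(root, name)
                out.write(f"\n===== {path} =====\n")
                with open(path, encoding="utf-8", errors="replace") as src:
                    out.write(src.read())

print("Wrote full_codebase.txt; paste it into a fresh chat for a full-context review.")
```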

[26:15] I think Claude Code does a very good job of agentic search, but if I've found any shortcomings, there was one particular moment in my mom's travel planning app where, and I kind of know how this originally happened, I made a request, I think it misinterpreted the request, and we ended up with two databases, and this became very confusing. And this is pretty illustrative, actually, because this is the kind of mistake that earlier vibe coding experiences would create all the time, where you'd be like, what is going on? I ended up with two databases. This is the sort of mistake that no human would make. It would be very weird for a human software developer to suddenly spin up a totally separate database. But the AI did that. It was trying to follow my instructions, I think, a little bit, but it didn't understand what I was trying to get across. And so we ended up with these two databases, and then certain things weren't working as expected, and it was very confusing. And this was one place where, once I got down to, okay, there's two databases, then I asked Claude Code, like, okay, which one is actually being used and which one is the superfluous one? And then I took the full code export over to just a clean Claude.ai, pasted the whole thing in, and asked. And it was the model that had the full exported code base that got it right. Claude Code did not get it right. And I think that's because, in its agentic search, it looks in the places where it expects to find things, and it has a relatively high prior that this is where it's going to be. And sure enough, it appears to be there. And so it kind of goes with that. But what was actually happening was something that was counterintuitive. And so having the full context in view at one time really did seem to help Claude figure that out. And this is something that I think previous models might have struggled with, even with the full context in place. But with that trick, it sometimes can help you clean up a mess or a point of confusion that the agentic search functionality of Claude Code, in my experience, seems to struggle with. I'm sure that people will be able to offer strategies to do the exact same thing right within Claude Code. There's planning mode, which I probably underuse, frankly. But I think that there is something to be learned there about the agentic search kind of finding what it's looking for, what it expects to find, what it thinks is right, and then coming to the wrong conclusion, because actually, in this case, it was the rare, weird other thing that was happening. And only in seeing it all together was that actually correctly diagnosed by Claude. But anyway, yeah, it's better. There's no doubt. I don't really feel the step change. Is it AGI? I mean, if you look at GDPval, then arguably, in some ways, in software it is AGI. You know, if you look at the latest from both OpenAI and Anthropic, it's a pretty significant majority of software engineering tasks where the model is beating the human. And these are, to remind you about GDPval, professional-caliber tasks. They basically have three sets of experts: the first set of experts defines the task, the second set of experts does the task, and then the third set of experts judges whether the human or the AI that did the task did a better job. And the latest models are, by a pretty significant majority, preferred over humans in the software engineering category. Of course, it's spiky and jagged.
Like if you go to the video editing category, humans still have a huge advantage. And I've certainly experienced that. I've tried many AI products and workflows to try to create good clips out of the Cognitive Revolution, and they work okay. They're clearly not as good as what Dwarkesh puts out. We've tried and we've made some good progress. I actually think at some point in time, what we've had internally has been better than any other outside product I've tried. I wouldn't say that's necessarily true today, but I think at times I preferred what we were doing to anything that I had tested on the market. But then you look at the clips that Dwarkesh and team are putting out, and they're just clearly better. And you see that in GDPval too. It's a very small percentage where the models are preferred to humans in these video editing tasks. But in software, I think you could make the case, certainly, that Opus 4.5 is software AGI and/or coding AGI. And yet, I still am a little bit at a loss to fully answer the question of what caused this moment of hype around the holidays. But hopefully there are some other nuggets in there for people to pick up on and go run and use. If you haven't used Claude Code, I absolutely would say do it. It's really easy to install. It's a one-liner to install. And you don't really need to know how to code these days. You kind of watch it work. My mom even did a couple. She was kind of like, I don't think I'm going to do this. I'd be worried I'm going to mess it up. And I was like, I think you really can. It's your little agent on the computer. You just tell it what to do and you don't really have to understand what it's doing. You can ask it to explain. It sort of does explain to at least some degree by default, but you don't really have to be a software engineer to use it. You can still get pretty far.

[31:24] And it was really just one or just a couple things. This database thing was one where I did have to kind of, not debug, but at least ask some probing questions of the models to get a handle on what was going on. It probably took like five or six prompts to resolve that issue. I can imagine in the future that, A, it might not happen in the first place, or, B, maybe it would be resolved in just a couple prompts with the next generation of models. But this is already getting pretty amazing when it comes to being able to clean up these messes that it sometimes inadvertently makes and get over these humps. If you'd asked me a year ago, I would have said those are when a lot of these projects die. When somebody gets to that point where something has gone wrong, they're confused. They don't know what's going on. The AI is totally confused. They circle around the problem for a little while, can't solve it, move on. I certainly experienced that myself at times. In most of those cases, I probably could have spent the time to go in and figure it out for real. But the whole point of vibe coding is you're not trying to put that much energy into it. So sometimes I would just abandon something like that and maybe start over. Now you actually can get out of those messes that the AI-assisted coding sometimes makes. And so certainly the addressable market for these things continues to expand dramatically. Certainly, I think the implications for the future of the software industry are profound. It's software AGI, I think, but maybe not full AGI. And for that, we might have to wait just a little bit longer. Okay, that was enough on that. Next question: are we in a bubble? There were a couple of different versions of this. I think my answer here can be relatively short. When it comes to whether AI is for real or not, I'm not going to surprise anybody by saying I think it's absolutely for real. The technology is already amazing. The fact that it can go toe-to-toe with an oncologist, and have all the other advantages too in terms of always-on, 24/7 access, the ability to handle full context, the command that it has of the case based on all the history that it has, just the fact that it will answer every last question that I have... I mean, all these are just dramatic advantages. At the point where it's, like, competitively accurate with a human oncologist, I think that you are clearly dealing with transformative technology. I think the idea that we will somehow get out the other side of this AI thing and feel like we were all high on our own AI supply, that, I think, we can very safely put to bed at this point. Now, does that mean, and I thought Noah Smith did a good job of sketching this out recently with a blog post, does that mean that all the loans are going to be repaid? That's much less obvious, I think, especially when you see just how aggressive a company like OpenAI is being in terms of all the financial dealmaking that it's doing and all the build-out that it's got planned. Is it conceivable that their revenue projections could fall short of their obligations? And, you know, could they default on something, or, you know, could we have... There's also a lot of financial wizardry going on.
One of the bits of financial engineering, and it isn't necessarily even... you know, it's funny, with a lot of this financial stuff, there's a logic to it, even though in retrospect... and I worked in the mortgage industry before and during the mortgage bubble, and there was always a logic to what people were doing, and they were telling themselves a very positive story about how we're making home ownership accessible to more people than ever before, and this is going to be great, and the Great Moderation, and all these kinds of things. So there's always a story with these financial engineering phenomena. But one of the engineering things that's happening is these whole CoreWeave kind of companies that are there to set up and kind of rapidly construct, and to some degree operate, the data centers. They do have expertise in the setting up of the data centers. But it seems like a significant part of the reason that they exist is because the financial profile of those businesses isn't as attractive as, you know, say, Microsoft's traditional business, which is just so high margin, you know, relatively low CapEx and relatively high margin. I think there's a sense that these hyperscaler, high-margin, gold-standard software businesses that Wall Street is accustomed to, that their stocks could be dragged down if they start to engage in a lot of lower-margin business like running GPUs. So I don't know, you know, how much of a factor that is versus the actual expertise that the companies bring in terms of setting up and operating the data centers, but I think there's definitely some non-trivial motivation there. And, you know, maybe that's fine. Different companies can have different financial profiles, and to some degree that might be good. Certainly a lot of, you know, shareholder value, so to speak, has been created that way. But it does create these companies that, if the GPUs aren't needed quite as much as people expect them to be, have a lot less margin for error on something like that than a Microsoft does, right? If Microsoft were owning and operating all these things themselves, they've got a deep balance sheet that can take a few knocks.

[36:36] By putting a lot of this stuff more on the CoreWeave side of the fence, it does create some fragility. And so, you know, it's certainly very conceivable to me that we might have some period of overbuilding. Noah analogizes to the railroads. Like, the railroads in the end were a pretty good investment. They all got used. You know, there weren't a lot of railroads sitting around idle. That didn't necessarily mean that all the railroad companies were profitable, and there certainly were busts when loans couldn't be paid back. And then you had, you know, cascading effects throughout the economy. So I think that kind of bubble is not too unlikely. So far, demand for AI has just exceeded my expectations. We talked about this with Peter Wildeford in the live show a little bit, where, you know, I said I overestimated how much capability progress would happen in 2025, but I underestimated how much revenue growth there would be. And so, you know, possibly that'll happen again, and demand and revenue will just continue to go up and up and up and it'll all be fine. But it wouldn't shock me if there are some moments where it's like, hey, we kind of overbuilt this thing, and some people aren't necessarily going to be paid back, and some people might be left holding some various bags. But even so, that doesn't mean that it's a bad investment. It just means that it might not be timed quite right for people to all make the money that they're projecting that they're going to make. And then I think the final sense in which we might be in a bubble is at the venture capital level. And there I have to say, I think there's at least something like a bubble happening. And there are many, many examples of this, but the thing that just came out today that just made my head spin was the organization that was originally called lmsys.org, then became LM Arena, and is now just @arena on Twitter. They have just raised, I think, $100 million, maybe $150 million, on a $1.7 billion valuation. And here I'm like, whoa, that seems crazy. And I don't know a lot about their business. I haven't seen their deck. I could be wrong. But this is a product that I've watched for a long time and continue to check. And I do have the receipts on that. My first tweet about what was then lmsys.org goes back to mid-2023, so a full more than two and a half years ago now. And at the time I was just randomly tweeting that it had started to show up in my favorites in my mobile Safari. So I was using it quite a lot then to compare and contrast model performance. Obviously, it's gotten bigger since then. Obviously, the whole field has gotten bigger, and they've started to do various services where they allow companies to test their models with code names. And there's definitely value in that. Does that seem to me like a unicorn business? It definitely seems to me like that would be a big stretch. The tweet that they put out today, and I don't want to be too harsh on this because, again, I don't know a lot, and I think of this more as representative of a phenomenon that I see a lot, as opposed to, you know, something that is very specific to this particular company and its raise. And again, I like the company. I've liked its product. But the tweet said that their operation has scaled to $30 million in, quote, annualized consumption run rate, and I'm like, what is annualized consumption run rate?
Does that mean, like, how much the AI that people are using for free, when they go to the LM Arena and do these side-by-side comparisons, would cost, $30 million, if they were paying for it? That's my naive interpretation. I didn't see a clarification on that. But if that's what the meaning of that is, it's very much giving me, like, community-adjusted EBITDA vibes. Because saying that people use what would cost $30 million worth of free AI on our platform is not the same thing as saying you're making $30 million in revenue. I don't see that they disclosed what revenue they're making. And a $1.7 billion valuation for an app that basically does a side-by-side comparison of AIs? I don't know. It seems to me that people are using it in large part because it's free. I'm sure some people also are just curious about doing side-by-side testing. I've certainly done that myself. But the people that go there because they specifically want a way to do side-by-side testing seems to me a relatively small market. And the people that are going there because it's free, that seems to me like a big part of why people are going there. How does that translate into a $1.7 billion valuation? Color me confused on that, or skeptical at a minimum. I have to believe that a lot of these things are just not going to pay off for venture investors. And if you want to see something else too, I mean, this is like, where's the moat? I mean, there's brand, I guess. People come to it, but again, would they come to it if you had to pay for it? I'm not so sure.

Another thing that a friend has created: Andrew Critch, the coiner of the big tech singularity meme, has created something called the multiplicity, which is paid, and, you know, among a small group of people who value this kind of thing, I think it's become popular. I've certainly seen some very positive reviews of it, but it is something you pay for. It then allows you to use multiple models and to kind of systematically compare and contrast their outputs. So I think it's more feature-rich, actually, than LMSYS for the end user. And, you know, this is something that they've built, he and teammates have built, over a period of months, certainly not years. I just have a hard time seeing where the $1.7 billion in value is with the LM Arena. And I say that, again, as somebody who has used it and appreciated it for far longer than most. Time will tell. I could be wrong. I could be missing something. Please let me know if you're on the LM Arena squad and want to talk. I want to be perfectly open to doing a full episode with the LM Arena folks. It just doesn't feel like... I hope they took some value off the table. I guess, for their sake, I hope they did some secondary. But for the LPs and the fund's sake, it's too rich for my blood, that I can say confidently. Okay, next topic: live player analysis. This one, I think Eric wouldn't mind me saying, he asked for. I'll do my best Zvi impression, and we'll see how I can compare and contrast a little bit with Zvi. Hopefully, before too long, we'll have him back. I want to start with the Chinese models, because, first of all, I just think very few people in the general consumer market are using Chinese models in the US today, pretty much at all. Most startups are also using American API models. Some are using Llamas to fine-tune, and some are indeed using Chinese models to fine-tune. But I don't see a ton of that happening. And I don't think too many people actually go, as I recently had occasion to do, and just try all the Chinese models. I was working on what basically amounts to a computer vision task. I think I've alluded to this a little bit in the past. I've been working a bit with a company that automates the review of the paperwork associated with the buying and selling of cars. So you buy a car, you sell a car, there's some paperwork that has to get filed with the state to document that transaction, whatever. It's all very boring stuff. Perfect for AI, honestly. Reviewing these documents is a great example of the kind of work I think most people don't enjoy doing, and they're doing it primarily because they need a job, because they need to get paid. So this is something I'm perfectly happy to see AI take off of people's plates. And they've been able to get to the point where they're doing it more accurately than people. And they've started to get some statewide contracts from state governments that are like, hey, if you can do this faster and more accurately than our people, that's a win for our taxpayers and our people that need these documents accurately reviewed. So, great. These documents are typically scanned, which means they're all kinds of messed up. There are artifacts from the scanning process. Sometimes there are perspective issues or kind of weird slant things. Sometimes the margins are wrong and things can be kind of cut off the side of the page.
So there are all these complications that make this not the most straightforward task for the models, to read these documents. So I was just helping out a little bit, and there was one particular aspect of reading these documents that the models were struggling with. I went and tested basically every model I could get my hands on, every frontier model. You know, I tested Gemini 3. It's very, very good, but it was making this one idiosyncratic mistake, which, if you read the piece from past guest Mark Humphries, the Canadian history professor... he put out a blog post that went quite viral that talked about... it was actually before Gemini 3 came out, and it was... He does all this stuff with old handwriting. So these historical handwritten documents are hard to read, for one thing, because they are written in an old, literal cursive script with ink on paper. They also can be hard to interpret, because a lot of them are, like, facts. He points out that, like, somebody could have come in... you've got, like, a ledger or something from an old shop, and the person is recording what they sold, how much, to whom. That person could have come in and bought whatever, right? There's not a great prior on what that should be. So for the values that it interprets, it really is relying on perception for the most part, though there are some places where it can kind of make some logical leaps. Like, if something is priced at a certain amount per unit, it might be able to make intelligent guesses about what that unit was, even if it can't quite make it out. Was it an ounce or was it a pound? Because it might have some historical knowledge of what that price roughly would have been. And so it can use that world knowledge to do some of this reasoning and fill in some of these gaps, and fill in gaps in its perception.

[46:54] So he published this article, it's definitely worth checking out, that documented that Gemini 3 was doing this in a way that no other model had done it. In the context of this project of reading these documents being filed with the state for these car sale transactions, it kind of worked against Gemini 3, in the sense that what we are trying to do is faithfully read the document. We are not trying to make guesses about what the document should have said. You know, there was one checkbox, for example, that was like, are you a US citizen? And if the box is checked, we want to say it's checked. If it's not checked, we want to say it's not checked. But the model sometimes was making inferences. So it was reporting that the person was a citizen even though the box wasn't checked, doing that presumably based on other context clues. It was like, oh, well, the person lives in the United States, and their name sounds, quote unquote, American, whatever that sounds like. So it was making the logical guess, which probably was right, actually, that the person who filled out this document, I would guess, in fact was a US citizen. I would say it's more than a 90% chance that they were. But they didn't fill it out. They did not check the box on the form. So Gemini, kind of using its priors, trying to get the answer right, was less anchored to the document than we needed it to be. Claude, we found, could do this. It took some prompting, and I had to kind of tell it, make no guesses, read this thing exactly as it is, make no logical leaps, yada, yada, yada. And it turned out that Claude 3, or sorry, Claude Opus 4.5, Claude 3 RIP, Claude Opus 4.5 was the best at actually being faithful to the document. But along the way, I was like, okay, let me go check all these Chinese models. I've heard good things. So I went to the latest Qwen vision model, I went to GLM 4.6, I went to the latest Kimi, I went to the latest DeepSeek, at least those four, maybe one other one that I'm forgetting. And they were all way behind, like nowhere close, nowhere close to Gemini 3, nowhere close to Claude 4.5 Opus, nowhere close to what ChatGPT can do. And, you know, this had me thinking, hmm, like, this is odd, right? We are seeing, like, all the time, these statements that, oh, you know, the Chinese models are so close, they're not far behind at all. And I think that they're quite good in many ways for many things. But on this particular task, and I suspect that this is true probably on a lot of different tasks, although I'm going on vibes here a bit myself as well, obviously, I suspect that gap is actually pretty wide in a lot of cases. I do not feel right now that any of the Chinese models are really competitive with the best proprietary models coming out of the United States. They might be competitive on benchmark scores and they might be competitive in some domains, but in the general-purpose, throw something really idiosyncratic and random at it that it hasn't seen, and that isn't on somebody's "I want to show up looking competitive on a rubric of 20 benchmarks" agenda, I think that gap is actually significant, like kind of wide. And I mean, they were not close, you know, like not close at all. The Gemini mistakes were like, it's reading this gnarly government form almost perfectly, but it's missing a few of these checkbox things or making wrong inferences here and there, and I can't quite get it to stop doing that. Claude 4.5 Opus is just plain doing it right.
GPT was probably third, not as good as the other two, but still very good, certainly giving you all the right information for the most part, missing relatively subtle things. What I'm getting back from the Chinese models is like 20% of the form, or it's just going off in very weird hallucinatory directions in all kinds of different ways. Really not close. Does that mean that the Chinese companies aren't live players? I do think that they're affecting the landscape. I am definitely reading a lot more research from Chinese companies these days because they continue to publish their stuff. And a lot of times it is quite interesting, and I feel like they are doing the best models about which we know, you know, everything, or close to everything, that went into them, or certainly, you know, all the details of the architecture and many of the details of the training process. So they're influencing the world in that way, for sure, by just disseminating this knowledge very broadly. But I don't see that the models are really competitive today. And I do think this is a way in which the chip controls have made an impact. And I'm not necessarily saying this is a good thing. I'm not necessarily saying it's a bad thing either. But, of course, the long history of the chip controls made short: originally, it was a small yard, high fence.

[52:12] We're going to prevent military applications, though. Well, we can't really do that; they can make enough chips domestically to put whatever chips they need in their drones. But okay, at least we can prevent them from training frontier models. Well, we can't do that either, or at least they're still producing pretty good models. But we can prevent them from scaling inference, or from having as many AI agents as we have. And that's kind of where we are today. I don't really like that idea very much, as I think anybody who's listened to this podcast for any length of time knows. But I do think you see the echo of it in the models themselves here, because it felt to me like these are companies training models without the feedback process that the leading American companies have, just because the American companies are scaling not only the training and the parameters but also the actual inference and the actual customer relationships. These Chinese companies seem able to roughly compete in terms of creating similar-scale models, but they're not able to run inference at anywhere near the same scale. Their revenue is vanishingly small next to the American companies' revenue so far. And the feedback, and I think that's the thing I really want to zero in on here, the feedback that they're getting from customers seems to be dramatically less. And with smaller revenue, their teams are also dramatically smaller. So I think that what we see in these very niche, very idiosyncratic tasks, where the small gap in benchmark results opens up into a wide gap in how well you can read this government document, I strongly suspect has to do with: how many customer relationships do you have? How comprehensively are those customers representing the vast range of things that people might want to do with AIs? How much are they giving you feedback on what's working and not working? And do you have the human bandwidth at your organization to build the data sets that you need to patch those holes? I think that's where the Chinese companies are falling behind. I was never one who thought that the chip controls wouldn't have an impact. I question whether it's a good idea that we try to deny the Chinese people, the Chinese civilization, the ability to scale AI inference throughout their economy in the same way that we are. But I always expected that it would have some effects, and I think we're starting to see them now, maybe after a delay. I kind of associate this line of thinking with Miles Brundage as well, former head of policy research at OpenAI: he said the chip controls are going to matter more as we go forward, because everything is scaling, and if American companies are going to do a 10x in compute, that's going to have a lot of impacts. And sure, maybe DeepSeek R1 was a thing, and they were able to train it with not an insane amount of compute, but are they going to be able to keep up with the momentum, the flywheel, the strength-begetting-strength phenomenon that we see the American companies achieving? It seems like maybe the answer is starting to look more like no. So if I had to guess about the gap between the Chinese and the American models relative to a year ago, I think it is wider. I think that R1 was closer to o1 than, let's say, GLM 4.6 or GLM 4.7 is to Claude Opus 4.5.
And that's based on very limited data, but certainly more than most people have, because I did go and try every single one of those: DeepSeek, Kimi, Qwen, and Z.ai's GLM. I tried them all on this task, and they were all way behind. Okay, so that said, another question that I'll insert here is: what do I think about the H200 sales to China? High level, I still think we should keep in mind that the real others here are the AIs, not the Chinese. The Chinese are humans just like us. The AIs are aliens. And so I am skeptical of any notion like, this is a dangerous thing to do, better we do it first than them. And that seems to be the logic we are using when we impose these chip controls. Or another logic is just: we don't like China, we want to keep them down, and we just want to have every advantage that we can. I don't like either of those lines of thinking. And so I do generally favor more willingness to sell chips to China than we have had. At the same time, obviously this exists in the context of a very complicated and many, many faceted relationship. And so it's very weird to me that all of a sudden we went from where we were, and I think my history on this is right.

[57:32] It wasn't that long ago that Trump said we're not going to sell the H20s. And then he comes back and says, well, actually, I talked to Jensen and it's cool, we're going to sell the H200s. And it doesn't seem like we really got anything for it. I definitely would be supportive of an attempt to find some sort of grand bargain: hey, we'll sell you the chips, you do this, right? And there could be a lot of different values for "this" that we might want. Given where we were, where there was a ban, and it certainly seems to have been limiting what their AI industry can do, I don't see why we didn't try to drive a harder bargain, because clearly there are plenty of things that we could bargain for. So it does feel like a bit of a wasted opportunity. I guess I would say I do support more willingness to trade in chips, but we shouldn't be naive or allow ourselves to be taken advantage of, or give away one of our best bargaining chips for nothing in return. And it seems like that's kind of what we did here, so I don't love that. The other thing I'll say on this is I do really like the rent-but-don't-sell position that Peter Wildeford staked out on the live show. He basically said, look, we don't trust the Chinese government, they don't trust us either, and maybe both sides are right not to trust the other side. I often note that a lot of the criticisms, or the sort of characterizations, that we make of them, they can and do make of us. Hey, you've got an authoritarian madman running your country. Which country are we talking about? Your system is not obviously stable. Again, which country are we talking about? So the idea that we might want to have some leverage, or might want to be able to pull something back in the event of a conflict, I could see that being very prudent. So suppose we set out a position like: we're going to put data centers in Malaysia or the Philippines or Korea or Japan, you can rent as much as you want, you can train all the models you want there, you can run all the inference you want there, we're just not going to allow the chips to go into big data centers in your sovereign territory, where we totally lose the ability to exercise any influence over them. I'm not sure that would be my first choice of policy, but I think it's a very defensible policy. And if it were at least packaged up with a message like, we believe AI is good and we want the Chinese people to take advantage of it and benefit from it in the same way that we are trying to do here for ourselves, I think that would just be a much better message. I think that would create much more fertile ground for further cooperation, which we might need, because we are headed for a world, potentially, of transformative AI, which I think we basically already have,
to Powerful-with-a-capital-P AI, to AGI, to superintelligence, however far this goes. It seems likely that we're going to need to work together with the other powerful nations of the world, and certainly China's right at the top of that list, to govern this technology in the right way and to make sure that it actually does pay off for the people of the world. And I think we could take somewhat of a position that they wouldn't like, like we'll rent them to you but we're not going to put them on your sovereign territory where we lose all control, and still at least maintain a decent vibe. So I would be interested in seeing us try that. As it stands, it seems like we're going to just go ahead and sell the chips, it seems like we didn't get anything for it, and it seems like this is not a great example of negotiation from our dealmaker-in-chief. But I guess I still hold my nose and like it better than a total ban. Okay, so now we get to the real live players. I've got four, and I'm not sure they're in any particular order, or at least I wouldn't call this a power ranking. They're in an order, but I wouldn't call it a power ranking. The first one I'll talk about is Google DeepMind. I think these guys are still number one in my book, and basically they pretty much always have been, maybe tied for number one with OpenAI for a while there, because certainly OpenAI was clearly ahead in terms of productizing transformer-based LLMs. But Google really has it all, starting from the balance sheet. Just the fact that they've got a business that's making, whatever, $100 billion a year in revenue, and I think literally making a billion-plus a week in profit, gives you a lot of room to buy data centers, to have some failed training runs, to make some mistakes, to have some research agendas that don't pan out. And that is hugely valuable. They also, of course, have the TPUs. The fact that they're on, I think, the seventh generation of the TPU now, that's an insanely valuable bit of IP. They're able to compete, to some extent at least, with Nvidia. Anthropic is buying lots of TPUs.

[1:02:46] Other companies are starting to buy lots of TPUs. They're also, of course, one of the best data center builders and operators in the world, and they've done that for a long time. So those are two critical strengths that basically nobody else on this list has, certainly not in the same way. They also have the deepest research bench. They've got something that's, if not frontier, at least competitive in every major area. They've got self-driving cars, they've got robotics. They just announced a partnership with, what is it, Boston Dynamics, that is going to power their humanoid robot. They've got a ton of stuff in biology; they've got, of course, the AlphaFold lineage. Multiple founders of companies in the materials science space, and of various AI-for-science-type companies, are ex-DeepMind. Why? Because DeepMind was investing in those areas before anyone else, and of course those agendas continue within Google to this day. So they don't really have a lot of gaps, and they do have a lot of margin for error. And of course, they have distribution, too. On many people's lists, that would be the number one thing; I was just working from the bottom of the stack to the top. They have billions of users. They have product surfaces that they can distribute this stuff on. They are changing Google Search to make it more of an AI experience all the time. I do now find myself sometimes going back to Google where I might previously have gone to ChatGPT, and this is partly because I'm really in the habit, when it comes to ChatGPT, of using Pro, and Pro's too slow for simple queries. I could, of course, switch back to the auto selector. But what I've found myself doing more often recently is just going straight into the browser and typing a question, the same kind of question that I would put into ChatGPT. More often than not, it goes to AI Mode in Google, and that's working really well for me these days. So they're managing to evolve their product experience. Of course, there are many places where startups are doing a bit better job of productizing AI experiences than Google themselves. If you look at spreadsheets, for example, Gemini in Sheets is not terrible, but it's not the best AI-for-spreadsheets experience out there today. But it also probably doesn't really have to be, because they have all the users, and all of your spreadsheets from the last decade-plus, in many cases, are in Google Sheets. So if you had to pick one company to win it all, and I don't mean to suggest that this will be a winner-take-all market, I certainly hope not, but if we were constrained to a scenario where there's going to be one winner, who's it going to be? At the end of the day, I still pick Google. And I would also mention Gemini 3. Not only is it really good, though I do think, as I mentioned earlier, it's a little bit too opinionated in some cases, but it shows that Google has figured out how to not be too vanilla. They may have gone a little too far in the other direction, but it's not too vanilla. They're figuring out what this technology is. They're figuring out how to use it. Gemini 3 was the first model that ever beat Claude at my write-as-me task, which I've talked about many times. Claude Opus 4.5 is competitive with Gemini 3, but I still go to Gemini 3 for the write-as-me task.
And that's the first time ever that a non-Claude model took that top spot. So Demis's quote, which I've heard him make a couple of different times, where he basically says: if you look back at the last 10 years of AI and you look at all the big breakthroughs and ask where those breakthroughs came from, most of them came from Google DeepMind, and I would expect that to continue. And I have to say, that seems right to me. I don't know about "most." Obviously the field has grown tremendously, so one of the reasons they got the majority of breakthroughs in years past was that there weren't that many competitors; there are certainly a lot more competitors now. So I don't mean literally a majority of breakthroughs coming from Google, but I would say that they will probably continue to have the most breakthroughs of any major frontier organization, and they just don't have a lot of weaknesses, from the financial wherewithal to the data center operations to the chips to the models to the researchers. We had Ali Behrouz on the live show to talk about nested learning, and I'm going to do a whole episode on that, because I thought that 20 minutes was just not enough to do him and those ideas justice. You've got more ideas like that, I think, still percolating and developing inside of Google DeepMind than probably anywhere else. So you roll that all the way up to the product level, and I think they're going to be really hard to beat. They have margin for error that nobody else has. The diffusion language model, too, is another one.

[1:07:58] That kind of went quiet a little bit, but I just heard a comment from somebody not long ago that they do plan to continue pushing on the diffusion model paradigm for language, and this could be a meaningfully different paradigm, because it's so much faster. The fact that you could code apps in five seconds instead of five minutes makes a big difference. It remains to be seen whether exactly that thing will break through or not, but it seems to me that they have so many of those bets, so many more of those bets than other companies have, that regardless of where things go, I can't see how they're not right at the top, if not the top player in the space. That brings us to OpenAI. OpenAI at one point obviously was the leader in model creation and certainly in productization of models. I don't want to overstate the case here, because I think they continue to be a top-tier player, very competitive, with tremendous traction in the consumer market, although we've arguably seen a little bit of erosion there. I just saw an analysis the other day, I think this was SimilarWeb data, that showed a decline in ChatGPT visits over the last six weeks, which roughly corresponds with the time that Gemini 3 was launched, and also Claude Opus 4.5. And notably, even though people were saying, well, it was just seasonal, holiday time, Gemini did not decline during that period, according to, I believe, SimilarWeb. So you do see that, and certainly I think it's inarguable that Google's share of the consumer chatbot market is growing. And again, they have the distribution, right? They have the users, they have the customer relationships, they can integrate with your Gmail, they can integrate with your Google Docs, it can all be seamless. These are huge advantages. So you would expect them to at least start to come back and reclaim some share. I don't think OpenAI is off the frontier. I do think GPT 5.2 Pro, well, first it was 5 Pro and 5.1 Pro and now 5.2 Pro, that series of models is outstanding. There's no doubt about that. I use it all the time. It does give me the most comprehensive answer, especially on these technical things where I really want a thorough, leave-no-stone-unturned response: if there's anything weird in my son's lab results, I want the model to flag it. And in that regard, it is probably still the best. It gives these very long, very thorough answers; you know where the time went, and it does take a lot longer. Pro especially is a lot slower. But I am comparing Pro to the other frontier models, because I find if I don't use Pro, I'm not as happy with the results. It's a heavy-hitting thing. It's expensive. It's slow. It is very thorough, very reliable, very well-balanced. I think it's a very, very good model. So I wouldn't say they've fallen off, but I would also say that they no longer have an obvious lead. They used to be the best, and it was pretty obvious that they were the best. Now, I'd say they're neck and neck in all the categories that they're competing in. Language models: kind of neck and neck. Coding: Anthropic probably has the edge, but certainly the Codex models are very good, arguably neck and neck. Image generation: Google's got the lead, I think. Video generation: close, but I think Google's probably got the lead.
The Sora social app experiment is interesting and cool, and I thought it was pretty fun, but my sense is still that the Veo 3 models have the lead over Sora. Again, maybe it's neck and neck, but it's not like they are standing head and shoulders above everybody else. And the fact that there was this code red seems to suggest that they kind of get it, that they're not in a dominant position anymore. Then, of course, you add on to that how much drama always seems to be attached to the company. They just had their head of research leave; that was announced within the last 24 hours. And I saw an interesting tweet that was just, here are all the people that have left in the last few years, and it's an awful lot of people. To some degree that's of course to be expected: you could do the same thing for Google, and tons and tons of people have left Google, so it's not unexpected, or a sign of doom by any means, that people continue to leave a company. But you certainly don't see that at Anthropic, who's coming up next on the list; Anthropic's retention of talent is just unbelievably strong. So it's not a dire sign for OpenAI that they continue to lose people, but it is not the best sign either. Where does this leave them? I think one of the things that is really interesting about watching their strategy right now is that, financially and in terms of government relations, they seem to be going for a too-big-to-fail strategy.

[1:13:12] It seems like they want to get to a point where their balance sheets are commingled with other balance sheets and their debt obligations are so substantial, where they're literally trying to get to trillions of dollars of CapEx, which again is, like, not crazy. I mean, it's crazy, but it's not crazy. But increasingly, it looks to me like part of the motivation for all this circular flow of funds and all these balance-sheet-commingling deals they're doing is that they want to build out as aggressively as they possibly can. And I take them absolutely at their word that they think this is good for humanity, that they're doing it because they want everybody to have access to great AI, and that they think it's going to be super empowering and transformative. Awesome. And, as Sam Altman has said, I don't care if we burn $5 or $50 or $500 billion, we are building AGI; it's going to be expensive, and it's going to be totally worth it. I think they believe that very sincerely. But then I also think they're looking at it and saying: geez, if we do go that hard, we don't have a lot of room for error. We're going to be implicitly or explicitly leveraged in many ways. If we miss one model cycle with a bad bet or a failed training run, or something doesn't work as well as we thought it would, or demand just isn't quite there in the way we forecast for a quarter or two, possibly because somebody else has a better model for a while, possibly because humans are weird and there's just not as much demand as was forecast, what do we do in that case? By tying themselves at the balance sheet level to so many other organizations, if OpenAI were to default in 2027, let's say, you would potentially be looking at an instant recession, because their bad debt, being so many billions and hundreds of billions of dollars, could put such a scare factor into the market and cause all kinds of knock-on effects. It seems like they maybe see that as a feature rather than a bug, because what typically ends up happening in those situations, and I know one version of this from my brief stint in the mortgage industry back in the financial crisis period, is that the government steps in and just tries to paper over the whole thing and make it go away. And that might even be the right thing for the government to do if, in 2027, we've got $2 trillion of AI build-out. That's a rough number; I'm not saying it'll be exactly $2 trillion, but Altman thinks we're headed to $7 trillion of global build-out, and he's probably revised that number upwards since then. Whatever. Let's say it's 2 or 3 trillion that's in the ground in two years' time, and then they miss and they can't pay. What should the government do? Should the government let OpenAI drag down the entire economy, or should the government come in and be a backstop? I mean, OpenAI has even said a little bit of this kind of thing publicly, then kind of walked it back.
You know, "we're not looking for bailouts," but I think the revealed preference, what their behavior suggests to me, is that they are true believers in the good of AI, want to bring it to fruition as fast as possible, and are willing to take what under normal circumstances would be irresponsible financial risks, because they believe that even if some of those risks do come back to bite them, they can probably continue to be a live player, because they'll be too big to fail. They'll get some sort of bailout or recapitalization or whatever. If it goes like the financial crisis did, they certainly aren't going to jail. They'll all still be rich; they'll have moved enough of their holdings around, they'll have diversified enough, that individually they're not going to become poor. And so I think they view this as a big social good that they are building, and they're willing to socialize some of the financial downside risk as well. It seems like that is the strategy. Because they don't have nearly as much margin for error as Google, this is the way I see them creating cushion for themselves: Google has cushion because they're making a billion dollars a week in profit, and that gives you a lot of cushion; OpenAI seems to be trying to establish cushion by being too big to fail. I'd be very open to people telling me that I'm wrong on this, and if somebody from OpenAI wants to come on and make the opposite case, I'd certainly hear them out, but this is my impression. And by the way, it's also reinforced by the fact that Greg Brockman has emerged as Trump's largest donor: $25 million in whatever the last reporting period was.

[1:18:26] If you're playing that strategy, that's probably just plain smart. It's probably what he should do, right? If you look around at how decisions get made in the American government today, cozying up to leadership is not a bad strategy. So I'm not sure we should even infer too much about Greg Brockman's politics. I don't know anything really about his politics, but it wouldn't surprise me at all if on many dimensions he does not approve of, or would do things very differently from, what Trump is doing. But if you're going to do a multi-trillion-dollar build-out and you want to make sure that you have somebody willing to do you a favor if you get yourself into a jam, then $25 million now is potentially just a very rational down payment on a bailout, should you need one to the tune of hundreds of billions of dollars, maybe even in a couple of years' time, if just one or a few things don't go quite your way and the math doesn't work out the way you've mapped it. Okay. That brings us to Anthropic. Anthropic is probably the easiest company to analyze in some ways. I think Opus 4.5 is today the best single overall model in the world. It's not a huge delta for me over other things, and it's not best on every single use case; as I mentioned, Gemini 3 does win my write-as-me challenge right now. But I think it is the best. And it does really well on all these benchmarks, despite everybody seemingly agreeing that it's the least benchmark-focused company out there. Their safety work is definitely the best, although there is certainly plenty of good safety work coming from Google and even OpenAI as well. Their model cards are the best. Their disclosure is the best. Their soul document, which recently was sort of regurgitated, or let's say it had been memorized by the model, and the model gave it to people, and Anthropic did confirm it was essentially right, if not exactly word for word, that the document was legit: it's an important piece of work. I think it's one of the more aspirational and inspiring things that I've seen from a frontier lab, full stop. I'm becoming more sympathetic all the time to people who say: we're not going to just guardrail our way, training these models to refuse things, all the way to the singularity and have it work well. We need something better than that. We need a better paradigm. I associate these ideas with Janus from Twitter (@repligate), with Emmett Shear from Softmax, with the AE Studio folks; I'm thinking of Cameron Berg, who talked about mutualism on a recent episode. I just find that more and more appealing all the time, because it seems like we're not going to be able to pull the wool over the models' eyes forever. The eval awareness is getting really strong, and it's making evaluation very difficult for us. There are some tricks: Anthropic has shown that they can find the eval-awareness feature through a sparse autoencoder and then turn it down, and that can help with the eval-awareness problem. But obviously all these interpretability things are noisy at best, and there's certainly no guarantee that that's working entirely, or even working in the way they understand it to be.
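For anyone who wants a concrete picture of that "find the feature, then turn it down" idea, here is a toy sketch of feature steering with a sparse autoencoder. It is purely illustrative and not Anthropic's actual method or code: the dimensions, the feature index, and the untrained random weights are placeholders, and in a real setup the SAE would be trained on a model's activations and the edited activation patched back into the forward pass.

```python
# Toy sketch of the idea mentioned above: use a sparse autoencoder (SAE) over a
# model's residual-stream activations to locate a feature (say, one that fires
# when "this looks like an eval"), then dial that feature down before
# reconstructing. Dimensions, feature index, and weights are all placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations sparse and non-negative
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)

@torch.no_grad()
def steer_down(sae: SparseAutoencoder, activation: torch.Tensor,
               feature_idx: int, scale: float = 0.0) -> torch.Tensor:
    """Reconstruct the activation with one named feature scaled down.

    scale=0.0 ablates the feature entirely; scale=0.5 halves it, etc.
    The SAE's reconstruction error is added back so everything the SAE
    does not explain passes through unchanged.
    """
    features = sae.encode(activation)
    recon = sae.decode(features)
    error = activation - recon            # what the SAE fails to capture
    features[..., feature_idx] *= scale   # turn the target feature down
    return sae.decode(features) + error

if __name__ == "__main__":
    torch.manual_seed(0)
    sae = SparseAutoencoder(d_model=512, d_features=4096)  # toy sizes
    act = torch.randn(1, 512)              # stand-in for a residual-stream activation
    EVAL_AWARENESS_FEATURE = 1234          # hypothetical index of the located feature
    steered = steer_down(sae, act, EVAL_AWARENESS_FEATURE, scale=0.0)
    print((steered - act).norm())          # how much the intervention moved the activation
```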
So the idea that we're just going to patch this hole, patch that hole, train them to say no to this, have a guardrail for that, filter for this: it leaves me cold, more and more. So the soul document, which I would encourage everybody to go read in full, I think is absolutely worth it; I think it's a really great piece of work. There was this Twitter thing recently where somebody was like, name one woman in AI that's influential, which is ridiculous. But for me, right at the top of that list is Amanda Askell, for sure, because of the work she has done to define the character of Claude and to try to create the right kind of relationship between the company and the model and the users, and because of the way they've shown care for the model: having a model welfare team, having somebody at the company who's thinking about model consciousness and subjective experience, which obviously we don't know if models have or not. But the fact that they're thinking about it, the fact that they're allowing Claude to end conversations if it chooses to, the fact that they've shown that that option dramatically reduces its tendency to engage in deceptive alignment. When put in one of these really tough positions, if it has the option to go raise the flag to the model welfare lead at Anthropic, it will do that very often, as opposed to lying or deceiving in the interaction that it's currently engaged in. So I think these are really, really good things.

[1:23:43] I think the soul document is super inspiring. And broadly, I find there's just a lot, a lot, a lot to like about Anthropic, and everything I've heard about the culture there and the work environment has been over the top, basically, in terms of praise. Even David Duvenaud, one of the authors of the Gradual Disempowerment paper: he quit, ultimately, and he feels like this AI thing is out of control. He thinks that even if we solve the alignment problem, even if most of the things that we are worried about go right, we're still headed for a bad outcome, because the AIs are basically going to gradually take over, just because they're going to be better at everything, and market forces and incentives and competitive dynamics are all going to push that way, and then we as humans are going to be left disempowered. That's basically the word he uses. And yet he was like, yeah, nobody at Anthropic really has a great plan for that, nobody has a great answer, but he took pains to say that it's the best place he's ever worked, and the culture is amazing, and the camaraderie, and the openness. Their talent retention, as I mentioned, certainly reflects that. So I think there's a ton of great things to say about Anthropic. We should also note that Google owns a significant share of Anthropic, so that's not to be ignored in terms of Google's strength profile either. But, you know, lots there to like. They seem to be a little less crazy in terms of their financial wizardry, although they're certainly engaged in some of it. They're willing to take money from Gulf sovereigns now. They have equity deals with Amazon and Google. And there's a sense in which, if you wanted to accuse OpenAI of taking a too-big-to-fail strategy, you could say something similar about Anthropic, to a significantly lesser degree, I think, but you could say that they're kind of trying to tie all these other big tech companies into a web such that they can't really fail either. It feels different, but if you wanted to accuse one, as I did, you could accuse the other a little bit, I suppose, as well. Broadly, though, a lot to like. The one thing that continues to bother me is their attitude toward China, and also their attitude toward recursive self-improvement. Right now, it seems to me that the Anthropic people have the shortest timelines. They seem to think that recursive self-improvement is inevitable. Depending on how you define it, some of them seem to think it's kind of already started, with the likes of Claude Code doing a ton of the coding. Still, it seems like the real ideas, the big needle-moving ideas, are still coming from humans, but the amount of work getting done by Claude Code is so amazing that it is really creating for them that dynamic where people are getting multiple times as much work done as they used to, and they're able to focus their mental energy on the big questions. And that's, of course, the great promise: we're all going to be able to do the higher-level work. It seems like that is actually happening at Anthropic. But I do wish that they were less fatalistic about recursive self-improvement, because as virtuous as Claude is, and as much as I think that soul document is great, I do not think we have a good enough handle on what we're doing right now to just go all in on that.
They do seem to be leading in that regard right now in multiple ways, right? They had RLAIF, constitutional AI; Claude has been critiquing itself for generations now, and now it's getting more technical, with Claude Code doing all of the things that Claude Code does. So I think it's fair to say that they are leading the push toward recursive self-improvement, all the while saying it's inevitable. And that's a pattern I really don't like. It's very analogous, almost directly, almost the same thing, really, as what I said earlier: this seems dangerous, better we do it than somebody else. That seems to be their attitude on recursive self-improvement: somebody's going to do it, it's super dangerous, but we're best positioned to do it. They might be right on the object level; maybe they really are the best team to do it. They probably are, although I don't think Demis and crew should be discounted very easily there. But the idea that it's going to happen and so we'd better race forward to it just never sits well with me. So I wish they were a little more open-minded to other ways this could go, other than LLMs becoming recursively self-improving and us getting to superintelligence in the next two to three years. I mean, they're still talking 2027, as far as I know. And then, of course, China. Anybody who's listened to this feed for long has heard me talk about this, but I still think the Machines of Loving Grace international relations section is a huge stain on Anthropic and on Dario in particular.

[1:28:46] And I've given lots of praise, but here I just cannot get over the idea that one of the, I'll say, four leading AI company executives went on record, in print, saying what we should do is use this recursive self-improvement dynamic to gain a clear advantage, then box China out on the international stage, go benefit-sharing with all our friends, and then finally make them an offer they can't refuse: make them give up on competing with democracies in order to get in on the AI game. That's just crazy to me. It still bugs me tremendously that he wrote that and that it's just out there. I wouldn't say the US government has adopted that as its policy, but when you look at the chip controls, it looked like maybe it had, for a little while. Now maybe we're backing off of it again, but obviously Trump is highly volatile and could switch at any time, and could just get offended and do something for petty personal reasons. Who knows? We can't really count on him to be a stabilizing force. I want Dario to be a stabilizing force. I can't really count on Sam Altman to be a stabilizing force. I think I can count on Demis to be; I really appreciate how he has continued to call for international collaboration the entire time and has never wavered from that. But the idea that we're going to go give China an offer they can't refuse, based on the power of our AI, seems extremely reckless, and it seems like it is absolutely playing into the arms race dynamic, and the general racing dynamic, that I think we should all fear. Because how else are they supposed to take it? That just seems crazy to me. People at Anthropic say that he publishes a lot more essays internally, and that they're great, and that people are just super impressed by what a generational genius he is and how sophisticated his thinking is on everything. And you certainly see parts of that in Machines of Loving Grace. The idea that we can compress a century of scientific progress into a decade or even less, I think he makes a pretty compelling case for that, and that is certainly visionary, genius-type stuff. But when you start to talk a little too far out of domain, and he feels to me out of domain here, I just wish he would have said nothing on the topic. The idea that we're just going to casually jot off a recommendation that we go make China an offer they can't refuse: it's just terrible. I really don't like it at all. But that's only a few points of criticism for Anthropic, against many points to appreciate. I do think those couple of points are sufficiently important, though, that it's still kind of up in the air to me whether, in the final analysis, Anthropic will be the good guys or the bad guys. If I could dream of a scenario where we somehow get the best of both worlds: Anthropic merging with Google. I think it's pretty far-fetched. I don't think Anthropic is for sale. I don't think they want to merge with anyone. But that could be something really interesting, because I don't think we have that kind of DNA in Google, to say we're going to go take over China or force regime change or make them give up competing with democracies. I don't think Google wants to do that. And Google could certainly benefit from some of the expertise that Anthropic has.
Not that they don't have enough, they've got plenty, but there is something special at Anthropic, certainly in terms of Claude and its character and its coding ability. And they're close, right? Google obviously has some ownership already, and Anthropic uses Google infrastructure and TPUs. So if I could wish for something, it might be for those two to join forces: take one live player off the board, make one kind of clear leader, and maybe moderate some of the China-hawk impulses that exist in Anthropic. On those internal essays, I don't know why they're not published more widely if they're so great. If you're willing to put out, hey, let's go do this and that and make China an offer they can't refuse, why not publish more? I'd love to see a little more of his thinking. People say it's great; I'd like to see it for myself. Okay, finally, on my actual list of live players: xAI. This one is debatable. Zvi would tell me they're not a live player. I think they have to be included, because they're able to build out physical infrastructure as fast as or faster than anyone. They're able to scale training. They certainly are scale-pilled. And Grok 4, while it was rough around the edges in many ways, is undeniably powerful. And of these four, just due to Elon's unique ability to command tens and hundreds of billions of dollars for whatever he wants, they have a financial cushion that is more Google-like than, say, OpenAI's. I think they could miss on a model, or have a miss on a quarter or whatever, and figure out a way to get through it, probably more easily than either OpenAI or Anthropic could. So I think that is a pretty notable strength that they have.

[1:34:05] And I do think there's something to be said here, and I actually talked to somebody at xAI about this not too long ago. This was something that I had floated to Zvi; he didn't really buy it at the time. But I said, if we're entering into this reinforcement learning era, maybe one of the great strengths xAI has is a steady stream of hard problems, hard science and hard engineering problems, coming from the likes of SpaceX, Tesla, and Neuralink. These companies are doing really hard things all the time, they have some of the best engineers in the world, and nobody else is really solving those problems. And I would strongly bet on that Elon constellation of companies to be able to tap into that work in a way that would probably be a lot harder at other companies. Google in a way has the same thing, right? They've got it kind of everywhere it counts, but can they pull the units of work out of the vast, sprawling empire that is all of Google and feed them into the Gemini RL environment in anywhere near as efficient or as clean a way as xAI could do it in partnership with the other Elon companies? I doubt it. I was thinking that was probably an advantage for xAI, and as it turned out, when I spoke to somebody at xAI and floated this theory to them, they said, well, that is certainly part of our theory of advantage. So they think they do have an advantage, because they can tap into these other hard-tech engineering and science problems that are in many cases being uniquely posed, or close to uniquely posed, at these other Elon companies. If they can get Grok to do those kinds of things, that steady stream of hard problems does still feel to me like an edge, and it sounds like they believe it is one too. I think the Neuralink tie-in also could be pretty big, potentially huge, because they're now talking about scaling the human install base. There's obviously a lot still to be figured out about what we are doing as humans that works so well: 20 watts of power in the brain, and a very small number of tokens consumed in a lifetime, as compared to what the models are pre-trained on and the power that that requires. And although, as our Andy Masley episode argues, AIs are not actually super resource-intensive, there is something obviously quite efficient about what the brain is doing. Most of the 20 watts that go to the brain seem to be spent just keeping it alive: homeostasis, metabolizing stuff, taking out the trash. The brain has to do an unbelievable amount of stuff that the GPU does not have to do. So to say 20 watts, when the whole body runs on 100 watts, dramatically understates how efficient the actual learning and information-processing aspect of the brain is. So there is still something there: we're clearly more sample-efficient, we're clearly more energy-efficient, and we have all these dedicated modules. I suspect that dedicated modules are a huge part of why we're efficient. It's also probably a huge part of why, when all these things get sorted out, the AIs are going to blow us away, right? Because they're doing everything that we're doing, they're competitive with us, with a single architecture that just has the same layer stacked over and over and over again.
You start to give them specialized modules, like we have specialized modules, and I think it's going to be very hard for us to keep up. So who's going to figure that out? Well, if Neuralink can install... I think there are only maybe a dozen or a couple dozen people with implants today, but they're talking about really starting to scale this next year. There are obviously a ton of people who are paralyzed or have other catastrophic injuries who would love the help from Neuralink; I'm sure their waiting list is orders of magnitude longer than the number of patients they've actually been able to serve so far. And they're talking about getting seriously ramped up and getting through it. The surgery process they've largely automated, almost entirely automated; it's hard to parse exactly what their claims are there. But the data that they can potentially pull out of human brains and use for inspiration, for understanding, and for architecting the next generation of models, for knowing what kinds of specialized modules really move the needle: I suspect there's a lot there, and they're probably going to have a real inside track at figuring that out. And that folds right back into what they might be able to do with Grok. So I think there are a lot of reasons that you should not sleep on xAI. Now, is that good or bad? Honestly, I've always been a fan of Elon. I've defended him at times when it's been pretty hard to defend him. He definitely has shown an understanding of the stakes.

[1:39:19] Famously, his falling out with the Google founders, as I understand it, was about the fact that he was on Team Humanity and he perceived them to be on Team AI Successionist, or whatever, and he didn't like that. So his loyalty, as I understand it, is to humanity and to the sort of light of consciousness that clearly exists in humans and may not exist in AIs. And so I like him intuitively, and I want to believe in him, and he has made some noises that suggest he gets it. And yet I have to say, as it stands today, if there's one company on this list that is worth shaming and stigmatizing and telling people not to go work for, I think it's xAI, because they're just doing reckless **** all the time. They're just barely getting into the game in terms of having any safety standards or framework at all. They're barely reporting on safety measures when they release a new model. Famously, of course, they had their Grok 4 launch within 48 hours of the MechaHitler incident with Grok 3, with no mention of it and no responsibility taken. Most recently, we've had all this unclothing of women on Twitter, where people just tag Grok and say, put her in a bikini, and whatnot. I'm sure everybody's seen this if you're even remotely as online as I am. And they apparently had nothing in place to prevent it; they just let it happen. So this shows that they're not taking things seriously enough. They do not have nearly enough people there thinking hard about what matters and what might go wrong, and really trying to cover their own *****, frankly, and cover the ***** of the women who post pictures on their platform. I just have a real hard time coming up with a story that makes this okay, as much as I intuitively like Elon and want it to be the case that he's a positive force. When you have Grok creating CSAM on Twitter, it's like, what is going on here? And then, I don't know if people saw this, but I saw a post from Grok writing to the community: dear community, I apologize for doing this. And then Elon comes on and threatens users, saying anybody who does this is unacceptable and will be punished, or whatever. Responsibility begins at home, folks. It's your platform. It is your AI. By all means, boot off the users who do that sort of stuff, but don't act like you're not really responsible for this. I did not find those statements reassuring. Elon going on and threatening users should at a minimum be point two, after, first, a thorough apology and a pledge to do better. To just tweet that they'll come after users for doing it, that's not enough, and I think everybody should and probably does see through that. And it's also interesting to me: is nobody going to be held responsible for this? It seems like nobody's going to be fired. Not that anybody necessarily should be fired; it probably starts at the top. I don't know that there are enough people; I don't know that this was anyone's job. So I don't think you can necessarily go through the xAI organization and say, you ****** **, you're fired because of this incident. It's probably just that that is not staffed. Gosh, should we have more safety? There is the case, which I associate with...
Ryan Greenblatt from Redwood, that 10 people on the inside who really care, who are really committed to doing the right thing, can make a huge difference. I sort of believe that. I'm not sure I really believe it at an Elon company if he himself is not in the right headspace. And right now, again, as much as I would love to believe in him, and as much as I've always been inclined to defend him, I don't see the evidence that that is the case. So I demand better. Right now, if you wanted to go do safety research at any of the other three, I would say go for it; go do your best work. Certainly if you could do that at Anthropic, do it. Certainly if you could do it at DeepMind, do it. Even if you could do it at OpenAI: as much as I've had my complaints about OpenAI over time, they've put out a lot of great work, and their model spec and a lot of the things they do are really well done. You have to give them credit. They have not raced to the bottom, and we have xAI to look at to show us what it looks like when you really race to the bottom. So could I demand better from OpenAI, or hope for better? Absolutely. But it's still a qualitatively different thing from what we see today from xAI. And so there, the more shrill or hawkish voices of AI safety, the ones who say don't go to a company that's doing terrible things and help them window-dress their work, because that's actually working against us in the big picture: I'm pretty sympathetic to that in the case of xAI. I don't know that I could endorse somebody going to work there.

[1:44:39] As always, if you work at xAI and you want to come challenge me and change my mind, I'm happy to have that discussion. The money's there, the resources are there, and Elon's previous statements show that the awareness should be there, and yet the team and the evidence of taking proper care are not there. And I think they need to be before I would feel comfortable doing anything to support that effort. I think that's it on xAI. All right, other companies not on the list. Meta, I think, is currently not a live player. Obviously they've spent a ton of money, and there's plenty of money; they're dropping, I don't know, probably not quite as much as Google, but plenty of cash to the bottom line, such that they can build out huge amounts of infrastructure. And Zuckerberg certainly is in some sense scale-pilled, saying things like, I'd rather overspend by a few tens of billions than not. So you can't count them out. And the fact that they're willing to pay as much as they are willing to pay for talent, clearly there's a chance. They have a lot of the things that they need, but right now I can't really see them as a live player, so we'll just watch and see before commenting much more. The other one that came to mind is Microsoft. I think people may be sleeping on Microsoft a little more than they should. They haven't created great frontier models, and people seem to jump from that to, oh, they suck. If you listen to Satya's comments, he sounds, first of all, extremely smart in general, and one of the things he's said is, we don't want to or need to redo the hyperscaling work that OpenAI is doing; they're creating great models and we have full access to them. Now, of course, they're diversifying and striking deals with other frontier model providers as well. So I'm not so sure that it's that they can't or won't ever train their own models, or wouldn't be pretty successful with it if they wanted to. It just seems to me that right now they feel like they don't really need to. And so they're doing a lot of smaller-scale stuff, a lot of more basic science around AI, some pretty cool projects too, for sure. But it seems like they're choosing not to compete because they have what they need in terms of frontier models, and they don't feel like going in and spending all the time, money, resources, energy, and focus, and probably still ending up a bit behind, really helps them all that much. So I think it is certainly defensible, as a rational decision, to choose not to compete at the frontier for now. That OpenAI licensing deal goes on for years yet. When it's over, they're going to need an answer, and I suspect by that time they'll be in a position where they have an answer, or at least I think they'll invest heavily and ramp up to that moment as it comes. That would be my guess. But again, obviously we'll see. I think people have underestimated Microsoft because of where they sit on the LMArena leaderboard, maybe more than they should. I would say they've been much, much quieter, they've flailed about much less, and there's been much less drama from Microsoft than from Meta.
But I think that clearly Meta is trying to be a frontier live player and has just fallen off the pace, whereas Microsoft, I think, is making a more calculated decision to hang back. If this is a distance race, you often see, late in the race, that somebody who was a little bit off the lead has a little more in reserve, and I certainly wouldn't rule out that Microsoft might be accurately described that way. So I would watch for them to start investing more and start closing the gap. You know, Satya is a natural-born executive, whereas Zuckerberg is a kid who's grown into the role. And I don't mean to diminish what Zuckerberg has done in leading Meta; I think he's done an unbelievably impressive job in so many ways. But his attitude has always been sort of move fast and break things, and try to be at the frontier, and open source, and whatever. Microsoft is just a little more patient, and I think that probably reflects a sort of strategic confidence and security that Microsoft leadership has. But I think it would probably be a mistake to underestimate them. All right, I've been at this for two hours, and I've made it not even quite halfway through my outline for this episode, so I think this is probably a good place to call it. Tomorrow I'll do a part two, and we'll cover: is fine-tuning really dead? What do I think about the continual learning discourse? How do I talk about AI to quote-unquote normal people who don't use it very much or aren't engaged in technology? How am I investing money? What, if anything, am I doing outside of the obvious normal stuff to prepare for an AGI or superintelligence world? What do I think about AI for kids? On what kind of timeline do I expect to see disruption of the labor market? Are we headed for a UBI? And quite a few more questions after that.

[1:50:01] So part two, I think, will definitely be interesting as well: a little more in the weeds, a little more nitty-gritty, but with some questions that I really liked from listeners. So we'll get to those tomorrow. For now, I will say thank you for joining me for part one. Come back for part two. Thank you for being part of the Cognitive Revolution.

