OpenAI Sora, Google Gemini, and Meta with Zvi Mowshowitz
Zvi Mowshowitz discusses OpenAI's Sora, Google's Gemini, Anthropic's Sleeper Agents, and offers live player analysis in this insightful episode.
Watch Episode Here
Video Description
In this episode, Zvi Mowshowitz returns to the show to discuss OpenAI’s Sora model, Google’s Gemini announcement and why Zvi prefers Gemini to other foundation models, Anthropic’s Sleeper Agents, and more live player analysis. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
Definitely also take a moment to subscribe to Zvi's blog Don't Worry About the Vase
(https://thezvi.wordpress.com/) - Zvi is an information hyperprocessor who synthesizes vast amounts of new and ever-evolving information into extremely clear summaries that help educated people keep up with the latest news.
LINKS:
-Zvi’s Blog: https://thezvi.substack.com/
- Waymark The Frost Episode: https://www.youtube.com/watch?v=c1pPiGD7cBw
-Anthropic Sleeper Agents: https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
SPONSORS:
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
X/Social:
@TheZvi (Zvi)
@labenz (Nathan)
@CogRev_Podcast
TIMESTAMPS:
(00:00:00) - Episode Preview
(00:04:37) - Zvi’s feedback on the type of content the show should create
(00:05:33) - Zvi’s experience with Gemini
(00:09:42) - Speculating on Google Gemini’s launch timing
(00:12:54) - Advantages of Gemini
(00:16:11) - Sponsors: Brave
(00:25:00) - Speculating on Gemini 1.5 and market dynamics for foundational models.
(00:28:18) How long context windows change things
(00:30:57) - Sponsor: Netsuite | Omneky
(00:41:06) - LLM Leaderboards
(00:43:37) - Physics world modelling in OpenAI Sora
(00:57:25) - Object permanence in Sora
(01:04:40) - Experiments Zvi would run on Sora
(01:06:21) - Superalignment and Anthropic Sleeper Agent
(01:10:47) - When do agents actually start to work?
(01:16:00) - Raising the standard for AI app developers
(01:22:07) - Dangers of open soure development
(01:30:53) - The future of compact models
(01:33:58) - Superalignment
(01:53:00) - Meta
(01:54:20) - Mistral’s hold over the regulatory environment
(02:04:16) - The Impact of Chip Bans on AI Development
Full Transcript
Transcript
Zvi Mowshowitz: (0:00) Someone walks into Sam Altman's office says, Google just announced 1.5 Pro. It has 1000000 length context window, potentially gonna be 10,000,000. It's really exciting. People are raving about it. Is now the time? And Sam's like, yep. Now Sora. But you can't have both ways. You can't both say there are so many chips coming online that we need to build AGI soon. And AGI is coming along so fast, we won't have enough chips. We need to build lots more chips, and they go raise the money to make the chips. It would be surprising to me if a GPT-five worthy of its name didn't make these agents kinda work.
Nathan Labenz: (0:36) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Eric Torenberg. Zvi Mowshowitz, welcome back again to the Cognitive Revolution.
Zvi Mowshowitz: (1:02) Always fun here. Let's do it.
Nathan Labenz: (1:04) Let's do it. So I call you when I feel like my own personal context window is overflowing and I can no longer find all the needles in the haystack that I might need to to make sense of what's going on. It sure seems like things are happening faster and faster all the time.
Zvi Mowshowitz: (1:21) Well, that's what Gemini Gemini Pro 1.5 is for. Right? You've got the big new context window. You should be okay.
Nathan Labenz: (1:26) Well, how is your personal context window holding up, first of all? And then we'll get into all the latest news.
Zvi Mowshowitz: (1:31) I think that my capabilities are advancing fast enough that I've been able to keep pace, but it has been rough. So the the discourse speed has slowed down in the last few months. It feels like we had a lot of really big developments and discussions around various potential government actions and also around the OpenAI situation, And 2024 has been more of a, like, okay, everybody chill, everybody process what's happened, and therefore we can focus on, you know, temporarily technologies. And, yeah, a lot is happening, but also you start to see the patterns, and you start to say, okay, this is a version of, you know, this other thing that happened. And it's not that it's slowing down in absolute terms, but you gain the power of matching ability to process what's happening.
Nathan Labenz: (2:20) Is that a natural process for you or are you changing how you work at all or your your approach in any way?
Zvi Mowshowitz: (2:27) So I think the natural process in the sense of I I find myself saying, you know, let's try this again. Let's go over this again. You know? Or as we previously discussed. Right? You So see this in that Llavigne column where he's like, well, you know, as we have talked about before and every word is a link to a different I suppose. It's the same basic idea of a lot of these concepts and questions and reactions are going to repeat. Right? Like different people are realizing different versions of the same thing at different times. You know, the future is not evenly distributed, it's getting distributed, we pick up on when people notice this, especially when like mainstream media, you know, standard cultural people pick up on things that we knew 6 months ago, we knew a year ago, or we figured out was coming at that time. And so I think it's just okay, yeah, we're gonna go for this again. We have already covered that. But yeah, this is actually from October, That happened to me yesterday. Actually, I was writing stuff up. And I realized, no. Like, this is actually a few months old. I probably already covered this. It's fine.
Nathan Labenz: (3:28) I'm trying to figure out what I can do to be more valuable. I think, you know, I've never really considered myself a content creator in, you know, in any sort of normal content creation sense. Starting a podcast was kind of a way to make sure that I was consistently tackling a new topic and, you know, hopefully sharing that with people in in what I hope could be a somewhat useful way. It's gone better than I expected it would in terms of people being interested in listening to it. And after a year, I'm kinda like, how many of these should I be doing in kind of traditional interview format? How many should be more like this where it's like a little more of a survey type of thing? How many should be more of like a deep dive where, you know, perhaps not even with a guest, I'm just kind of really going hard at a particular topic. I did that late last year with the Mamba architecture and really felt like that for me was a more profound learning experience and hopefully a better product for the audience than a typical episode. So I don't if have any immediate reactions to that, but if not, we can maybe circle back. And at the end, you can tell me what you want.
Zvi Mowshowitz: (4:32) I do, actually. I would say you should basically always strive, in my opinion, to do more deep dives. Like, not just you, but, like, almost everybody than you naturally want to do. Like, mean, I'm working on a newsletter, obviously, every week. I am writing, you know, an average of 10, 12000 words or something, you know, including quotes about what has happened specifically in this past week. But I find the most valuable things, especially if you're building up a long term base of operations, are when you get to focus more on specific places where you feel you can break your insight, where you can be the go to source.
Nathan Labenz: (5:11) Yeah. That was my experience with the Mamba piece. So I'm looking to figure out how to do a little bit more of that. So what are your big pieces and certainly a deep dive has been recently into your experience with Gemini. This is obviously something that we did know to expect. And if anything, you know, seemed like it probably came a little later than we might have penciled it in for. You had the good fortune of having early access. I would love to hear how that happened and, see if there's anything that I can potentially copy from you going forward on that. But then obviously more important is what your experience with it has been. And then, of course, now we've also got 1.5 to look forward to as well. But let's start with 1, which you said, you know, among other headlines, had become your default language model after a couple weeks of playing with it.
Zvi Mowshowitz: (6:02) Someone reached out to me and said, we'd love to get your feedback on this new version that we're gonna ship soon. Would you like to check it out? And I think that, you know, I did was I established that, you know, I'm someone who could be trusted if you share it with me. I'm not gonna go spy right now over the Internet. And I can provide useful insights that are different, that are well worth the time spent directing. And then obviously, if you share it with me, I'll be able to offer more detailed insights. Like, right when it's released, which can impact how people see it. And they presume we thought they had a good product. They wanted me to give my last minute. There was no attempt to put a finger on the scale in any way. No attempt to convince me it was better than it was. It was all very, very honest and straightforward and realistic. I thank them for that. It was a really good experience. They seem to care about my feedback. This was someone who clearly wanted the information, wanted to understand. So what I noticed right away was it was really good at many of the natural things that I wanted it to do. So what a lot of people will do when they they're hands on a model is try to get it to use up the carnival.
Nathan Labenz: (7:07) I'm guilty of that. Well,
Zvi Mowshowitz: (7:10) you know, Sally has 2 apples, and then, you know, John brings 2 oranges. Then they all play a juggling act, and then they bake salad. How many apples are left to know? The model's, like, 3. It's like, no. And it's like, hey. Then what's the point? You know, in some sense, there there's there's other people I figured whose jobs will be to red team this model, either to get it to do harm or to find out where it's dumb or where it's gonna fail, right, on, like, theoretical tasks. And I took the opposite approach. I asked myself, you know, what good can this do? And what happens when I feed it exactly the queries I just want the answers to that I would normally want to do over the course of my day? We have some of that curiosity. I could do some amount of brainstorming and trying to set that out and trying to parallels and so on. But what I found was, mostly the things I want are explaining things. You know, finding excellent you know, just general, like, answering my questions. I'm confused. I want you to search for you know, maybe search for a document in some cases. Although the version I had back then didn't do that, so I looked at other things. And I'm just very practical user of LLMs, but I'm not trying to help it do my job. I'm trying to, like, extract the information and understand it. I can go about my day and do the rest of my stuff. And for that, it was very, good. In particular, as I trip, I was trying to read the financial DPO paper that I never got around to, and I was asking the question every paragraph or 2. And it was very, very good. It's right not having been fed the paper, but it seemed to have the information, and it was very, very good at answering my questions on exactly the level of which I was asking. Let me figure out exactly what information I need to know. Then I wanna understand more things. Well, I noticed it had some flaws, like I would try to use Google Maps, and it would sometimes give me good information, and sometimes it would repeat 3 different times for how long it would take to get the SFO for Berkeley within the space of 3 questions in the same chat despite having the prior reference, which is better than just artificially agreeing with yourself. Right? I was admiring the fact that it wasn't doing that. The same client was incredibly frustrating. Like, I can't carry on you for anything. This is weird. I clearly, they know what the weather was, and then it would repeatedly not know what the weather was. I found some other bugs in the system that they're really just like, clearly, this thing launched, like, earlier than it was supposed to. I bet it's also the 1 5 launching, like, 2 weeks after 1 0 advanced. But this is not a natural thing for a, like, well planned out release schedule to do. Right? If you almost have 1 5 ready, you don't make your big announcement speech Right? Then unless, you know, this is me you know, respecting the is on the phone with you and saying, no. You're damn well gonna launch right now. And he said, but we're almost ready with 1 5. He said, I don't care. You're gonna launch. I can't not launch now. And then this happened to me, right, when I was doing a a game company. And we were just told, you're launching today. And we said, we're not ready. And they said, we're launching today. I said, it's Friday. Can you at least wait till Monday, you idiots? They're like, no. You're launching today. And I we would not have succeeded if we launched on the following Monday, but at least it would have been slightly less embarrassing and doomed. But, you know, just it's clear that, yeah, there was a we need to launch this in whatever state it's in, and then we're seeing rapid improvements. And to the point where we're not sure if 1.5 is better than 1 0 advanced. You know, discounting compute, it's got advantages, disadvantages, which I haven't had the chance to properly explore yet. But, yeah, I was just really impressed by this is faster. This is smoother. This just does what I wanted to do. My workflow is easier and better. I want to use this. I don't want to go to GPT-four if I can help it. I do that when I feel like I have to. Where I used to be like this is where I go, and then sometimes I use clock. Right? Sometimes I use perplexity.
Nathan Labenz: (10:51) So can you put a little finer point on what it is that makes it better? I mean, I know that it's obviously tough because the surface area is so vast, and there's so many facets to language model performance. For what it's worth, I have also been impressed. I haven't spent as much time with it as you have. Today, I am going to Gemini Advanced aka Ultra, 1 Ultra, and chat GPT with GPT-four and Claude, basically, for anything that I want help on. I did this once with, like, a little kind of proto contract negotiation that I'm working on where, you know, I had kind of a couple messages back and forth between me and the person I'm I'm working with and, you know, kinda said, here's what this person seems to care about. Here's what I seem to care about. Here's our messages to each other. You know, can you kind of help us get to a good win win situation? They certainly all behave a bit differently, but I found them all kinda useful. And the synthesis of the 3 was, you know, for me, kind of a way of making you know, can telling myself I I didn't need counsel because I had enough kind of AI counsel to make sure I wasn't leaving any, you know, major blind spots for myself. And then also did this with a, basically, a job description where I have been working with a a friend on a hiring process. We've had a number of calls with different experts, got their opinions, have taken very raw notes, and just fed in kind of a paragraph at the top of like, okay. Now it's time to turn all these notes that I took over various calls into a job description. You write it. And, of course, you know, they each did the task. In that case, I did find I think it's fairly random. I wouldn't say it was, like, a systematic finding by any means. I mean, I'm not sure how how, you know, frequent this would be. But I did find in that case that the Gemini response was the 1 that I wanted to go with in terms of just taking something, editing to a final product that I would, you know, hand over and and feel proud of. Gemini did come the closest to what I was looking for. But those are, you know, spot checks. Can you can you help us kinda get any any more systematic or, you know, predictive in terms of what the advantages of of Gemini would be?
Zvi Mowshowitz: (12:59) Yeah. The interesting thing is, we find it the opposite of mixture of experts. Right? Like, it's a mixture of nonexperts, where previously, you would figure out, okay. I have these 20 different expert systems, and the program figures out which 1 to call is the amount of compute. Now it's like, no. Just call everyone. Compute is cheap. Compare all the answers. See which 1 is the best, which makes that pretty good case. And, yeah, I even have 1.5. Right? I have access to 1.5, not in the standard interface, but in the developer studio beta. So I can then not only compare Claude, GPT, and Gemini, but also the other Gemini. So now it's like I have the 2 Geminis to work with and the name finally makes sense. So I think what I what I found was the the the biggest feature was just it tells you the information you actually want to know when you ask the question much more reliably with less, like, worthless stuff, like, scaffolding around it. And, also, I had gotten some good use out of the Google button. So when you answer on a Gemini advanced, 1 of the things you can do is you can press the the Google icon at the bottom, and it will search for web sources for all of the things that you just had, and they'll light up in green, indicate that you have a source that you can now click to. And it's just it's very seamless. Actually, the guy I was doing the test with showed me this feature. Didn't understand it was there until he told me. And then, you know, you can then ask for a sort. What's going on? Do have any physics on this? Do you have any knowledge of this? And then you can find the backup. And that was very helpful in general. And I said, like, it's also just very good understanding. Okay. If he asks this question, what does that mean he doesn't understand? And what does that mean he does understand? And I thought that GPT-four can be pretty dense about this. And I probably could work with better customer structures to try and approve this than what I have. But, you know, so far it's been kind of frustrating. Claude, found is a step behind for most purposes. I use Claude mostly when I'm looking at long documents, but I think your your use case is a very different 1 because you are asking for a creative job, right, where, like, you get lots of different answers. There's no right answers. There's no wrong answers. There's just different answers. Being able to take a sample and then they choose the best 1 or combine the elements of different ones helps you a lot. Whereas what I'm mostly trying to do is I'm trying to ask questions where there is a right answer. I'm trying to look for explanation or I'm trying to just get some brainstorming at most to to try to go with. And so I tried Gemini first, and if Gemini gives me an answer that does the job, I'm done. And only when I get frustrated with that, like, okay. Fine. I'll try all these other people. I I won't jump from 1 to 4. I won't be jumping from 1 to 2. I've just been really happy in practice with Gemini. It just it does what I want it to do except when it doesn't. Right? But I've adapted to, like, okay. Those are the things that I just don't do. I haven't had much calls to use GPT-four specifically since then because it just doesn't come up very often. Like, there aren't these the cases where GPT-four is good potentially, where Gemini is bad, or where you're trying to, like, customize it in these very specific ways, I think. And those cases just don't come up for me very often. So it just doesn't seem much use. And I'm not gonna cancel it because even the option value is just so much more than, like, what tickets they charge.
Nathan Labenz: (16:09) Hey. We'll continue our interview in a moment after a word from our sponsors. Just to to give a quick plug to Claude on my legal question, I did find it to be the best in the sense that the commentary that it gave back to me was least sort of padded out in RLHF, if you will. It it was more kind of direct to the point conversational and actually just kind of concrete on the suggestions. It seemed like the others were a little more reluctant to and I've actually been pretty impressed with Claude on this front a couple different times, including 1 specifically pertaining to you, but finishing this story first. They're just kind of plain spoken to the point direct, like, not super hedging with with both Gemini and GPT-four by comparison. I felt like it was really reluctant to make a recommendation, you know, to make a suggestion for, like, what the agreement might ought to be. It was very like, this is an important consideration. You guys will need to figure out what the cons you know, what the right thing is here. And as I pushed them, they would do a little bit more of kind of, okay. Here's a, you know, a specific proposal. But Claude was was much more readily willing to just be like, here's kinda what I would suggest, and it was pretty apt. And it again, I did not, you know, copy and paste this into, you know, my running dialogue directly, but it was the closest to in that case, I actually wrote my own thing and, again, took my own thing back to each of the 3 models that, okay, based on your, you know, suggestions and my own thought, here's what I came up with and asked for 1 more critique and then basically sent it from there. But if I had been inclined to pick 1 from the 3, it would have been Claude because it was just the most kind of to the point, easy to understand, not necessarily exactly what we're gonna go with, but at least, like, kind of willing to put something out there. And the other 1, as you may recall, that I've been kind of impressed with on this from Claude was when we had in our last episode, you made an offhand comment around, like, I'd rather you go and do something terrible than, you know, open source next generation LLM. And a listener said, hey. You know, that's a little extremist. You you shouldn't say that. So we edited it out, but then I also got the idea in my head, well, maybe I should use a language model to review transcripts in the future and see, like, is there anything there that the, you know, guest may wish to retract or, you know, maybe cross the line or whatever. And, again, it was very sophisticated in its commentary on that point where you might, you know, you might have the impression of Anthropic probably as a company and cause it to be, like, super, super conservative and, you know, super, you know, don't ever say anything like this. On the contrary, it said, I didn't see anything wrong with that. It seemed like an offhand comment. It, you know, was clearly not somebody endorsing this terrible thing. It was where, you know, it was kind of hyperbolic to make a point and, you know, and the listeners shouldn't have any trouble distinguishing that from, you know, an actual call to violence or, you know, anything along those lines. So it is interesting. I mean, they all have their their own character, and it's been kind of surprising sometimes to see that Claude, especially given, like, corporate reputation and all that, is in fact, like, a little bit more accepting of some things that might be, you know, considered possibly offensive or possibly coloring out some lines and also a little bit more bold, if you will, in terms of actually making, like, a concrete suggestion and not, you know, not hedging as much as as some of the others. Of course, that may that may be a reflection of, like, the other models having a little bit more power into the hood and just being, like, therefore, even more RLHF, you know, is kind of a compensatory measure. So we'll see what Claude, you know, 2.2 or 3 looks like when it comes.
Zvi Mowshowitz: (19:50) Yeah. I worry about that. You know, I think that a lot of what we're seeing is there's more and more fine tuning, more and more of this other work being done with these models over time, and we're getting this bullet point superhedged style that's very inhuman, very like, I see exactly how it makes sure not to get the thumbs down, and I see exactly how it makes sure not to piss anybody off, and it's incredibly frustrating and makes my life worse than if you just do the thing. Although some there's some cases where it's actually just exactly what I want, but often it's just so frustrating. There was 1 case earlier I was, right, doing this for next week's post. And, you know, someone, like, had this idea that dude was a cautionary tale about the dangers of not building AGI. And so I just wanted to, like, okay. Just just to prove a point, I'm gonna try all the different LLMs and ask them what do you need a cautionary tale about. And they all got it right. They all had the right first bullet point, but only Gemini Pro, right, didn't do the bullet point thing and just said, no. Actually, obviously, at this 1 thing and then gave me the kind of details you would actually want to know about, which is illustrative of the thing that I love about Gemini, right, when it when it does the thing, which is, no. I don't wanna know, like, all the different potential interpretations 1 could make of this book. I want to know, you know, exactly how this is a cautionary tale and what is you know, what are the details someone wanna cite in order to just, like, fully not be accused of pulling something out of the air. And finally, that's kind of what I'm appreciating often more and more. It's just like, are you just willing to do the kind of thing that you would have much more easily done a year ago when this thing first came out? Can I just have a version that's designed to be better for humans? Right? And then the question then becomes, you know, like, how do we avoid this in the future? How do we not train these things to hell in these ways? And I think there's some promise to that.
Nathan Labenz: (21:46) So a couple big themes there to unpack a little further. But maybe for starters, do we feel like Google has taken the lead at this point? It sounds like if you thought Gemini advanced is ahead of GPT-four and now we have 1.5 pro on top of that, which certainly, you know, the existence of 1.5 pro suggests the or implies the eventual, probably likely fairly soon, existence of 1.5 Ultra as well. Obviously, we don't know what the others have under the hood, but it seems like at the moment, there's a pretty good case that Google has taken the lead in terms of public facing products.
Zvi Mowshowitz: (22:25) I think you have to be very careful about in terms of public facing products, and you have to be very specific about what use cases and modes you're talking about. And you have to understand that this is a year old product that was released on a half year delay versus a product that Gemini that is clearly, you know, being reshipped and refined and and released in real time. Like, as soon as they have improvements, they ship those improvements, which is great. They keep improving. But it does mean that Google doesn't have this year of background work in the tank. And what OpenAI has deployed in that year and a half, knock on Sora, which we'll get to later, I'm sure, is they've deployed additional features, but the core model, if anything, has gotten worse. Right? A GPT-four turbo was kind of the reversal of some of the problems that about GPT-four, but, you know, they're they're at core, I think it's particularly stronger as they they have better customization options. And they have these GPTs, and they have custom instructions, and they have identification of who you are. And now they have memory, although I haven't really changed straight up memory if they have used the products. I think there are still a lot of advantages of the ChatGPT product over the Gemini product. And which 1 you would say has the lead depends on what you wanna use it for. For me, I would say Google has taken the lead in terms of public facing products. But if you tell me, you know, what's gonna happen when GPT-five is released, I'm going to assume it's going to immediately leapfrog OpenAI into a substantial lead.
Nathan Labenz: (23:49) Yeah. And 1 wonders how soon that will be, obviously. I guess, do you have a a sense I mean, your your comment earlier about, you know, was the CEO on the phone saying you must launch? If you did have 1.5 right around the corner and, you know, we've come this far, right, without launching the 1 Ultra, Why bother? So, know, I guess it's what do you think is the model of like, who is forcing that and and why? How does this update to the general market dynamic have you thinking about how the leading companies are relating to 1 another? Every time there's, like, a new release, I sort of squint at it, and I'm like, does this look like they're sort of implicitly coordinating, or does it look like they're trending more toward arms race? And, I'd love to hear what where you feel like the balance of the evidence is now.
Zvi Mowshowitz: (24:38) So my model is something like there's 2 different sub departments. Right? Each of whom is tasked with a different thing. There's people who are working on Gemini advanced, the people who are working on Gemini pro 1 and a half. And then there's a third couple of people maybe or I guess part the virtual people who are working on Gemini ultra 1 and a half, but we'll we'll see it when we see it. But, you know, they're under different pressures, they're releasing different products, they're releasing in different market segments. 1 5 isn't released. Right? 1 5 is available to a select group of people in a high latency special interface, which is very, different from a customer facing product. So Google decided it's ready to launch the customer facing universally available product and didn't necessarily want to sync it up, you know, at the cost of delay with this 1.5 announcement. And then the 1.5 announcement got stomped on by Sora, I think, rather unfairly in terms of, like, what people were paying attention to. But that's that is what it is. And so I think it it makes sense that, like, this is just, like, different dynamics playing themselves out in ways that look stupid to the outside, but which would make sense if you understood the internals of Google, and you understood the different commercial pressures and and other pressures. But, again, it's only a guess. We don't know what's going on. And in terms of the race dynamic question, I would say Google is forced to dance. Google has danced it. Google is clearly, you know, racing as fast as it can to develop whatever it can as quickly as it can. The good news is that Google seems to be focusing now more on getting the most out of its smaller models rather than trying to make the best top model and also trying to use the fact that its model is smaller to enable things like a larger context with them, which allows it to get more multimodal. And so I see this as focusing on providing the product the customer will actually get more utility out of that will be a more practical use day to day to a normal person as opposed to trying to make the model fundamentally smarter and more capable in a way that would be both more exciting and more dangerous from an abstract point of view. And so that's what I wanna see. Right? I wanna see, you know, all the cool stuff that we can use to have a great time and make our lives richer and better. And I don't wanna see as fast this race to just, like, make the biggest possible model that has the best possible chance of, know, suddenly doing something crazy unexpected. And and so I would say, yeah, they're racing, but they're doing the right kind of race in some sense. So, like, it's it's good news and it's bad news.
Nathan Labenz: (27:09) What are you most excited about for the super long context? The the version I understand that you have early preview access to is the 1,000,000, and then they've also kind of teased that they're gonna have a 10,000,000 context window, which is, I mean, you know, folks who listen to the Cognitive Revolution probably have a pretty good sense of the history of context links. But, you know, my joke of earlier this year has definitely come true. Like, context hyperinflation is definitely upon us. We've gone from literally still less than a year ago, the longest context available for any quality model being 4,000 to then 8,000 with GPT-four release, 32,000 with GPT-four 32 k, Claude hit with a 100. GPT-four turbo came back with 1 28. Claude went to 200. Now we've got 1000000 10,000,000 on the horizon. And what has really impressed me out of the demos that we've seen has been not just that the context window is super long, but that the recall out of that context window seems to be super reliable. And again, I think people will be pretty familiar with the fact that, like, if you do stuff, you know, more than let's say you put a 100 out of 1 28 or whatever in the GPT-four or you put you know, you get close to the 200 with Claude, it's not always a gimme. You know, you have to maybe you can kinda get there with good prompting techniques, but in general, it's, like, not reliably the case that it is going to spot, you know, the 1 bit of information that you need or that it's, you know, that it's gonna synthesize everything in that context window effectively. The demos from 1.5 seem way better on that front. And, you know, in terms of mundane utility, I just think about something really simple like finding the right stuff in my Gmail, which is obviously something that Google has a lot of reasons to try to make better. It has been a real struggle, right? And all these different agent products and whatever, you've got all the scaffolding and the retrieval and the re ranking and all this to because it's not just enough to, like, get things into the context. We also kinda need to keep it shorter for cost, for latency, but also just for accuracy. If all of sudden you could do 10,000,000 in a single call and reliably get what you wanna get out of that, It seems to me like that really is a step change advance for a lot of use cases that, for example, would make, like, my email agent assistant just remarkably more effective than it is today. What else is on your radar for things that you think will change even though the power, you know, the raw reasoning maybe isn't that much better, but just because the the memory and the recall is so much better? Hey. We'll continue our interview in a moment after a word from our sponsors.
Zvi Mowshowitz: (29:58) Yeah. There there's situations in which, like, you reach a certain point, it goes suddenly suddenly goes from completely useless. Right? Because you might as well just do this job yourself to suddenly Like, my my my eureka moment with Gemini advanced capability of hooking up to email was give me a list of everybody who RSVP'd to the party. Enter. You know, for individual emails and having it produce the correct list. And then that's like, wow. That's really, really useful. Now I can just press print, hand this to the front door, and so they'll let them in. Great. Another moment is, like, being able to use and the hope is to use an interface like like Notebook LM even would probably be even better. But the idea of I have these now 50 plus things that I've written about every week and to be able to use that now as a bank of information that's all in the context window because if everything else I've never written and say, okay. Based on this, retrieve everything I've said before about this topic, or what are all of the different, you know, resources that I may have mentioned or whatever. Or or even what have I forgotten to mention? What have I knocked out? Have I ever talked about this before? Right? Like, this is really exciting to me, but it only works if you can extend it really big. Right? Otherwise, you have nothing. Right? It doesn't really help that much unless you're like, oh, I think it was in '37. Check '37. Like, no. It's not '37. Like, at that point, I could try 38. This seems annoying. But some other use cases, video. Right? Like, suddenly the context window is enough, you can just feed it a 2 hour YouTube video or a movie. Right? This is 1 of the things that Google showed off. And then you could ask specific questions. Potentially, by doing iterated searches, you can even do, you know, television series, anything you want, you know, quick streams, yada yada yada. Just ask anything you wanna you know, find find the specific clip that I'm looking for, understands the quote, find the specific whatever, understand the context, etcetera, etcetera. That stuff's exciting today. I'm also excited for the possibility, I haven't really been talking about this, but of using it as fine tuning. Right? Effectively using it as training. So the idea being that if I have a 100,000 words or 500,000 words suddenly, 1000000 words with good recall and good understanding, well, now I mean that in the style of the things that you are in your context window, would you please respond with? Right? Or, you know, using the thinking mode or using the willingness to do x or blah blah blah. And now suddenly, maybe there's something there, but it's something you have to try. And it's again, like, if you don't have good recall, you don't have good assimilation, if you have reached a certain threshold, these things don't work, and then suddenly they work. And once they work, you're off to the races. But, yeah, it's just it's super awesome. It's also the fact that, like, frankly, even the 200,000 contacts window was sometimes just not long enough. And it's simply as simple as I download a page, I feed it in, it says this is 5% too big, and now I would have to spend, like, 2 minutes, you know, for various web tools to try and, like, chip off 5% of it, like, get rid of the acknowledgments and the the references or whatever to try and make it fit. And I didn't wanna do that. And in fact, like, this is a little game of, like, do you think this person made sure it was too too big for the context window? Like, the FTC just didn't want anyone to be able to read this properly. Like, I don't know. I had this thing open up my my machine, the the AI act, the EU AI act, right, in in large detail.
Nathan Labenz: (33:23) So will that fit into the 1,000,000?
Zvi Mowshowitz: (33:25) I think it will. I mean, it's not it's not as full page. It it's a change log. So it's got, like, 4 different versions of it to write. It's, like, really weirdly formatted. And it's terrible, and it keeps getting worse. It it's the most amusing part of it is you can see all of this blue inserted text, and it's just people who are like, could we possibly be more pedantic? Say more words. Put in more specific cases. Wish cast more. Just make this act worse continuously everywhere.
Nathan Labenz: (33:51) So do you have any intuition for what is going on with the 1 to 1.5 leap? I mean, they what they've said publicly is mixture of experts architecture, which certainly tracks with other things we've seen. I mean, GPT-four, I don't think has ever been publicly confirmed as a mixture of experts, but seems to be kind of credibly leaked, reported, not denied, whatever. So I'm comfortable saying it seems quite likely that GPT-four is a mixture of experts. And then Mixtural obviously also put out a very good open source mixture of experts that, you know, gives some additional juice to the idea that this is, a promising path. And now Google is obviously saying it. So we know that much. It's a mixture of experts architecture. They say it takes less compute. It's interesting because, like, you know, people often talk about the attention window being quadratic in the attention window, which has been true, though people have often overestimated what a big deal that is because it's not until you get to, like, a pretty long context window that that actually starts to dominate at, like, the more modest context lengths. Still just the MLP block is, the bulk of the compute. But, certainly, you know, you get into the hundreds of thousands, the million, the 10,000,000 range, and now you are with, you know, conventional known techniques. The attention block would be dominating the compute. But this now, they say, takes less compute. So do you have a sense for what might be going on under the hood there? Fair to pass if you don't want to speculate, but this is definitely something I plan to research more and see if I can get a redone.
Zvi Mowshowitz: (35:34) I definitely don't know. The first thing I would say is, you know, the timing on Gemini 1 looked very, very much like we are going to launch the moment we can produce a product that can match what we're up against. Right? Like, barely We can match what we're up against. So 1, right, as opposed to OSRA was just can we just barely be 3 and a half level with a small amount of compute that we can serve without worrying about people abusing the system? And then what they found, they released it. They're still on the we're making rapid progress part of the utility curve. Right? GPT-four clearly hit a while ago the point where it's like, okay. We did our thing. We made this the best we know how to make it. And now rather than try and tweak it or keep working on it, we're gonna switch over to working on GPT-five. And then we'll let you know later what GPT-five is. We'll improve forward by providing different modalities. We'll improve it by adding features, but we're not gonna try to improve the core thing that we're doing. Whereas Google hadn't finished their work. Right? They they were probably always intending to be a mixture of experts system, but they just hadn't gotten around to it yet. And that's probably a huge leap on its own. But, also, they, you know, they hadn't tried to implement these context windows. They just did this now. Bunch of the multimodality, came in now. Bunch of the just general. And also, they they had remarkably little feedback because their their system was not being used by many people, and and now this allows for rapid iteration of their products in other ways. So I think they have a lot of different ways to continue improving here. And, also, the fact that they made a large leap from 1 to 1.5 so quickly implies that, you know, 2 2 is coming. Right? Or 1 satin or
Nathan Labenz: (37:10) Doesn't seem like a stopping point. I would give, for what it's worth, the GPT-four progression a little more credit in that it has gotten 60% cheaper, forexed the context, and that's even relative to the 8 k original context. So 8 k and 32 k, the 32 k was twice as expensive. Now at the 01/28 context window, it is only 40% as expensive as the original 8 k. So that's nontrivial both in terms of improving the the utility of just with the length, even though we as we've discussed, there are some weaknesses there if you really stuff it full of stuff. And the cost is better and the latency is better. And I have personally haven't, like, necessarily felt this, but if you go look at the LmSys leaderboard, GPT-four turbo seems to be, like, a notable cut above the even the earlier GPT-4s in terms of just user win rate. So there does seem to be something there that they have done in terms of, if nothing else, just like taking user feedback into account and kind of further refining it into, you know, what users want. And, again, I couldn't even put my finger on that because I I wouldn't say I have felt a step change. I felt the, you know, improved latency. I've certainly felt the the increased context, but I haven't felt, like, just general quality to be all that much better. But the the leaderboard results do seem to suggest that. Do would you you I mean, do you have any reason to be skeptical of the leaderboards, or would you how would you do how would you interpret that?
Zvi Mowshowitz: (38:46) It's not that I'm skeptical of the leaderboards. It's more that I think that the leaderboards are measuring the things that are being managed and measured that are not that great a proxies with the thing that I care about or that I think other people should care about. And in broad strokes, they're gonna be very good measures, but, like, in this particular case, it's gonna potentially mislead you quite a bit that I think a lot of the things that we think of as why GPT-four is worse are things that are being done because they approve the things like the leaderboard effectively. Like, even if this is not mechanism, which is not. Like, it's still effectively what's going on. I agree that, like, drops in price, increasing context window, I add custom instructions, right, to that list as well and potentially the GPTs, although I still haven't seen a good use case for GPTs per se. But, yeah, they're doing the things. They ship. No one's saying they don't ship, but these things all seem so minor. Like, if you compare that to the improvement in Gemini or bar, right, over the past year, etcetera, I think they were way, way, way ahead.
Nathan Labenz: (39:43) So I agree with you, by the way, that but I I think it is also debatable. We had the 1.5 announcement, and we also had Sora from OpenAI on the same day. And, you know, the so Sora, of course, is the video generator that definitely seems to represent a leap over any public facing product we have seen. Although I would note that Google also has, like, announced, I should say, multiple video generation models over the last month or 2 as well, which also do have some very compelling sample outputs. Yeah. Lumiere is 1 of those and probably the 1 that had me most excited, although there's another 1 too. I mean, they have multiple teams working on independent approaches on video generation in Google, and they're, you know, they're also showing some pretty impressive results. Also, I believe on the exact same day, we had the the meta version, and I haven't even really had a chance to study this yet, but it seems to be more of a kind of backbone for video encoding. And I think there's multiple interesting aspects of this. Obviously, the ability to generate hyper realistic video and, you know, what that might mean for creative people and, you know, what the future of Hollywood budgets might look like and, you know, what the whether or not they're gonna have to tear up, you know, the recent agreement with the actors or the, you know, the I mean, there's a lot of things there that that are sort of implied or at least, you know, called into question just by the ability to generate very realistic video. I'd be interested in your thoughts on that. But then I think even maybe the more interesting question is, like, is this real world modeling? Like, have these things learned physics? I feel like there probably is some very significant physics like world modeling going on in there.
Zvi Mowshowitz: (41:29) So my section title slash potential post on this is 10 w titled Sora Watt because I don't really, in some sense, see what the big deal is. I think the the ability to actually use this for creative or commercial purposes is being pretty overestimated at anything like the current tech level. It's 1 thing to get demos. It's another thing to be able to get what people need in order to use video because the problem is that, like, with a picture, there are a lot less things that can go wrong. There are lot less things that can stand out. You can build around it. You can edit it. You can modify it. You can take advantage of the sort of anything at all kind of advantage, And you can keep tweaking it and ask it 1000 times so you get it, like, closer to the thing you want. With video, I feel like there's so many little details. And, like, yes, they're getting a lot of them remarkably well and correct, but it's still not gonna give you the thing you actually want in detail. It's still not going to be the product you want it to be. It's still gonna have all of its quirk. It still looks artificial, paying attention. I think people will learn to recognize artificial video, you know, if anything much easier than artificial photos. You look for any little thing that's off and you just know that it wouldn't happen if you were using a camera, like they film something. And I just don't think Hollywood should be quick in its boots in any way anytime soon. So I'm just a skeptic that I just never had the temptation to generate a video. Right? It's like, I I can see generating photos. I've used photos. But, like, the idea that could ever give me a that that Sora would if I had access to Sora, would I be able to generate a video I'd actually want to use? No. No. But I've also just a video skeptic in in many other ways. Like, people, like, want to generate content that's full of videos, and I keep telling them, no. Don't do that. Generate text, maybe generate audio, but think very carefully about whether or not there's a reason to wanna use video. It will, of course, blow out the low very low end in some sense for generating video at all. Some people will be happy with just, oh, this looks really cool, and then I can narrate it and, like, fit to whatever they they happen to give me, I can tweak it, and we'll see what happens. Sora is a very clearly technical large leap over what we've seen before. Whereas Meta's announcement, I could see anybody talk about meta denouncement at all. I just I have to assume it just wasn't as good.
Nathan Labenz: (43:48) Yeah. I need to understand it better. I think it's more foundational. I'm not even sure that they have released any, like, heads for it. It's like a backbone play where it's about encoding and moving toward this, like, world modeling through video prediction concept. But in terms of actually applying it to tasks, I think that's still kind of on you. They haven't gone as far as saying, like, here is the full end to end thing that you can actually use. But I this is definitely where this is the, the perils of near real time analysis. I haven't understood that 1 as well as I certainly would want to to make a, you know, an informed commentary on it.
Zvi Mowshowitz: (44:32) My my assumption is that the public is gonna be come back when you did that. Come back when you have pictures. Picture didn't happen. Good luck.
Nathan Labenz: (44:40) Yeah. I think the people that are probably really excited about meta and, again, like, I need to get in this too because I'm I'm not in this game exactly, but I'm, like, adjacent to it, are the people that are trying to make apps that would ultimately, like, aspire to compete now with Sora or with Lumiere from Google. You know, it's similar to, like, a Llama. The idea there and certainly what what people have done mostly, I think, with Llama is, like, fine tune it for things. Right? They they and write it in their own infrastructure and hack on it in all sorts of ways. And I think this is kind of meant to be similar where it's like, OpenAI is gonna give you their black box thing and Google may give you their black box thing, but this is something that you can, you know, mash around and do whatever you want with. So it is gonna take some time before, you know, the the folks that are kind of hacking in that space would probably even be able to report back to say, you know, whether it's doing anything for them or not. But I'll take a little bit the other side of the impact. I would agree. You know, we have a great episode on this sort of thing with 2 of my teammates from Waymark, Steven and Josh, who led the project called the frost, which was a short film that they made entirely with Dolly 2 imagery almost a year ago now that they're really in the thick of it. And all of the challenges that you described there are were very real for them, even more real presumably than they will be now. You know, you have issues with, like, just all sorts of weirdness, like a hard you know, control is not great, hard time getting exactly what you want. That's, I think, improved, but, you know, still an issue. Consistency of characters was 1 huge challenge point for them. You could say, like, how do I wanna put this same character in all these different places. How do I do that with volley 2? Not really it wasn't really a way. They did find some ways though. They what what Steven talks about is using archetypes. Basically, they found is that there are certain prompts that managed to get very consistent characters, Not perfectly consistent, but very consistent because they're sort of a local minima in this in, like, character space. And so by kind of finding these archetype text prompts, then they could get the same thing or very nearly so, like, repeatedly. So they were able to ultimately make a 20 minute short film that, you know, has kind of an uncanny valley feeling. Again, it's a year old already, but definitely has continuity, you know, has a storytelling element to it. They, of course, brought that. You know, Dolly 2 is not doing that for them. But I would say the bottom line was they were able to make a short film, 20 minutes. You know, it's been included in a couple film festivals, and it's like, you know, it's it's a quality thing. Is it like box office? It's probably not quite box office, but it is something that, like, a lot of people have watched just to see what, you know, AI can do and also because it's, like, kind of entertaining. And the accessibility of doing that has certainly totally changed. Like, this thing is set in Antarctica. There would have been no way for them to create anything like this other than using these tools. So I do think you add a and that was all images. Right? So they were, you know, doing, like, slight additions of movement and kind of, you know, overlay effects, adding snow falling down on top of images and kind of, you know, subtle zooms to create effect. And now you can get, you know, up to a minute of coherent video. I do think that is going to just change a lot of how people produce stuff. It doesn't seem like it's gonna replace authorship anytime soon, but, you know, there is a I don't if it's on Metaculous or Manifold or whatever, but there is a nascent betting market on when AI will be able to make a critically acclaimed feature length film. And, you know, the critically acclaimed part there will be probably where, you know, rubber hits the road. But you could start to see, you know, hey. If you can get minute clips, you know, a 100 of them takes you to feature film length. You know, can you start to sketch those together? You know, we're not there yet, but I would expect that there will be impact on the creative field that will shift more towards storytelling, more toward kind of, you know, concept work, and certainly allow people like Steven and Josh to make things that, you know, with our, you know, very, very tiny way mark budget, you know, that there's would be literally 0 chance of them otherwise making.
Zvi Mowshowitz: (49:15) I think we'll probably see various ways in which you can add flourishes and edits and, like, richness to something you would otherwise have done, finding ways to make the process more practical. But in terms of, like, the idea of the VA, will just produce the whole thing. Yeah. I'm I'm still pretty skeptical that it will get there. But again, we'll see. Think that, like, the images approach, if anything, is, like, the way I'd still do it. Right? Dolly 3, much better than Dolly 2, and now you can create a much better set of images. In fact, if anything, I'd probably involve, like, you know, video, what you'll do is you'll get very, very carefully crafted starting image, and then you'll feed that image into the video generator as the starting point. Maybe you'll even generate separate images at the end point or something. I'm not sure how exactly how this works. And then you will you will do this slash you will film a you'll do a film, but then you wanna do things like, okay, then we actually wanted to do a pen out here and and things like that. But, like, again, we'll we'll see over time.
Nathan Labenz: (50:08) Yeah. Working from a still is, I think, a very underappreciated option as well for Waymark with the actual product itself. It has the potential to be huge because what we do is all for small and local businesses, and they typically don't have quality footage at all. Most of them will have a pretty decent image library. And so we've kinda built our whole product over the years with this assumption that more than 90% of our users are just straight up not gonna have any video assets. So we do support you. Like, if they do have them, they can upload them. We have a stock integration, whatever. But mostly they're working in images because that's what they have. So to be able to say now, okay, here is, you know, whatever, a picture of people eating in your restaurant. Let's bring that to life even for just a couple seconds. We don't even need anything anywhere close to the 60 to make it a big deal. We just need it to work well. And I have been going out and and scouting, you know, the different products that are out there today, your Pikas and so on and so forth. And at least for that, start with an image and animate sort of thing. I haven't seen anything that I was like, our local advertisers would wanna put this on TV and think, you know, it's gonna invite people to their business. But I would bet that Sora probably does cross that threshold, and now, you know, all those 1 and 2 seconds still shots probably become video scenes over the you know, Basically, as soon as we can get our hands on it, we'll start we'll start experimenting with that.
Zvi Mowshowitz: (51:40) Of course, we're
Nathan Labenz: (51:41) begging for access behind the scenes, but it sounds like nobody outside of the the OpenAI set will be, using it for a little while still. I guess that is 1 more interesting question there, and I I really do wanna get to the physics. But on the release of this, it's it's interesting to see like, this is not a release. Right? It's an announcement. And we're seeing kind of more of these announcements of capabilities before even, like, you know, even a very limited release in the case of Sora. Like, as far as I know, nobody outside of the organization and maybe the red team is getting access to it. Why announce? You know? Like, they don't need to you know, I guess you could say it's just for hype. Is it to, like, help the world prepare for these sorts of things? What is the reason for announcing something that you're not releasing?
Zvi Mowshowitz: (52:29) So recruitment is probably 1 motivation. Right? Help work help us work on this amazing new cool products.
Nathan Labenz: (52:36) They're hiring video infrastructure people now. So, yeah, that certainly makes some sense.
Zvi Mowshowitz: (52:39) Right. And so now he's like, what is it for? Why should I apply? Like, this is why. Generate some hype. You know? They have stock now. They're trying to drum up interest. They might wanna be doing things for that reason. Just generally, this is good first practices, what people, especially in media, generally do. In this particular case, there's an obvious hypothesis. I have no actual evidence for it. But Gemini 1.5 dropped first and met its announcement. You could think of it as someone walks into Sam Altman's office, says Google just denounced 1.5 Pro. It has 1000000 like context window, potentially maybe 10,000,000. It's really exciting. People are raving about it. Is now the time? And Sam's like, yeah. Now Sora. Like, they were working on it. They just keep working on it. They could have announced it probably last week or next week. So, you know, the plausible thing is that they they they stopped on Google's announcement on purpose. Right? They wanted to kill the hype, so they had this thing in their back pocket. Like, the next time Google or anyone else important tries to drop some big hype bomb on us.
Nathan Labenz: (53:40) Sam, for what it's worth, has denied that, and you can judge for yourself whether you wanna take it at face value.
Zvi Mowshowitz: (53:45) Again, I I think I I'm not the I'm confident this happened, but he would, wouldn't he?
Nathan Labenz: (53:53) I find him to be you know, I I don't wanna anchor too much on this over a super long time frame because, you know, situation is changing, and his incentives may be changing, and maybe his behavior is also changing. What I did see during the GPT-four red team to release window of 6 months from him, and that was a time where I, like, had this kind of, you know, privileged view into what was coming from them and, you know, really no outlets for, like, anybody to talk to about it. So I was just very closely watching public statements. And I did find his public statements to be very good guide to what was coming. Obviously, they were cryptic and, you know, low on detail, but I felt like, basically, taking his comments at face value was a pretty good guide to what was coming. And I've kind of continued to work under that assumption that he's mostly telling the truth. Although, you know, for something like this, certainly, you bring your own judgment to to your assessment. Okay. Physics. So what's going on in there? We've seen claims that it's gotta learn physics because you know? And even in the in the publication that they put out accompanying it, object permanence is kind of, you know, specifically called out as something that emerged through scale without explicit training. And then you've got, like, certainly all these kind of different things where it's like, wow, you've got light that's kind of following something like natural rules. And then, you know, people dissect that and say, well, it's not exactly. And I guess, you know, it's confusing. So how much world model, how much, you know, intuitive physics is actually going on in inside this thing?
Zvi Mowshowitz: (55:36) We have this rough idea of what happens when you throw a ball. We have this rough idea of how things relate to each other, how it works to shift your perspective from left to right. And this allows us to not go around confused all day and make a reasonable decision, and this is very useful. You know, do we understand these things? Somewhat? And so I presume that Sora is in a similar spot where it has learned these patterns, has learned roughly how these things work, and is echoing these things in a way that look reasonably realistic a large portion of the time in most ways. And it's looking better than you might expect. 1 thing I wrote for next week is, you know, if you recorded someone's dreams somehow, like what they see during their dreams, and then you analyze it when we're analyzing solar videos, probably see way, way more contradictions and nonsense. Right? Things that just didn't didn't know if you knew physics than we're seeing here because, again, we're generating video off of some sort of weird prompt in our heads, but that video doesn't have to make sense. So, you know, my expectation is that this isn't the real physics engine. Right? Not not the way you're thinking of physics engine. The AI isn't, like, doing the math and all that. It's it's still doing its normal vibey heuristical thing. But that's actually good enough to, like, mostly look like physics if you scale it up enough most of the time. But in other cases, you break glasses. Weird very, very weird things will happen.
Nathan Labenz: (56:57) Yeah. I guess my general sense, forgetting about the video aspect for a second, but then coming back to it. In language models, my sense is that when you start out small and, you know, in the early portions of training, you're in stochastic parrot mode. Right? And we've seen a lot, I think, lot of different kinds of evidence for that. Then as you get deeper into training, and obviously, there's a lot of techniques and curriculum learning and, you know, various ways to try to accelerate this happening, but broadly scale is is, you know, the huge driver. It seems that there is a increasingly sophisticated representation of concepts in especially in the middle layers. And that, you know, we've seen, like, these sort of toward monosemanticity, you know, and representation engineering sort of attempts to pull that apart, identify, like, human recognizable concepts. These concepts, you know, do seem to be there in a way that is obviously related to the input, but, you know, abstracted away sufficiently so that, like, it seems to be robust to synonyms and, you know, things like that. It's not like it's not purely it seems safe to say to me, it is not purely stochastic territory as you get to the higher scale models. But that doesn't mean that the stochastic territory is all gone either. So I kind of understand it as, like, a mix of some world model that has really begun to cohere in a meaningful and, like, increasingly robust way. It's But not fully robust. It's, like, still kind of fuzzy, and there's, like, a lot of things that have not cohered into a world model that are either just, like, random associations or whatever. And so all of this is kind of happening at once. And that's why, like, you see apparent reasoning because there is actual reasoning going on. Not to say it's necessarily human like reasoning, but, you know, some sort of circuit that is, like, can reliably do tasks at least, like, a lot of the time. But, you know, with certain weird adversarial inputs like the universal jailbreak or whatever, like, that's some of those are super weird, but they're just enough to kind of call in a bunch of, like, random associations and kind of cause havoc. I guess I kind of think something probably similar here is happening where you start off and you're like just associating pixels and it's all noisy and crazy and you see these like chair examples and you know, so like clearly some of that still persists into the high scale. But I would also bet and this is something we might even 1 day be able to answer. I would bet that if you really looked hard, you could find a representation of the acceleration of gravity. You could actually find a representation of F equals MA or a, you know, a if you drop a ball and it's, you know, it's clearly accelerating downward with this sort of, you know, quadratic term in the equation, I would bet that there is, like, a genuine quadratic function approximation in there that sort of gets activated when something is falling in the video. Notably, the video too, like, it's not you know, in some ways, is less susceptible perhaps to, like, all sorts of, you know, I mean, there's there's a lot of noise in the world. But, you know, you're looking at video. Right? If you just did a bunch of videos of balls dropping, then there is, a pretty clear signal there. It's a pretty repeatable thing. You would think that, yeah, you might actually be able to converge on that. Even perhaps with a small model, right? Like visual grokking of the falling of a ball seems like the kind of thing that you might be able to do even with a relatively small video model. And then my guess would be that there's like a lot of those sort of grokking moments that have happened here that kind of represent why the light is playing out in pretty reasonable way, even if it's not you know, it's still kind of fuzzy, of course, around the edges. But, like, why there's object permanence, why there's things that seem to follow intuitive physics as well as they do. I don't know. What do you what you how how would you respond to that account?
S0 (1:01:09) I'm sure that it has picked up the principles behind some very basic stuff. I'm not saying there's, like, 0 for this understanding anywhere in the model where 0 equations are being calculated or anything like that. I I just don't want to get too excited by this idea that, like, it understands what's going on in the more broad sense or more sophisticated sense and that, like, a lot of the things that you're seeing are based on actually figuring out how it would go the way that, like, universe does versus just having a bunch of examples and heuristics as to how that kind of thing goes.
S1 (1:01:42) I wonder how that could be tested too. Right? I mean, you think about, like especially we have all these abilities now to generate these fantastical images. If I generated an image of me, you know, or whatever, a person holding up a giant boulder with their arms. And then we put that into Sora as a starting frame with no, like, text guidance and just said here, generate, you know, what happens from here. You know, you might expect that it would, like, have a sort of storytelling modality where it's like, here's the superhero that can lift, you know, rocks and maybe he's gonna throw it at the sun or something. But you also might expect they would have a more, like, grounded physics understanding where it's like, this dude's gonna get crushed under that rock. Again, people will, of course, like, find reasons not to believe any experiments. But what ex would you have any experiments that would come to mind that you would run with you know, if granted access to Sora, what sort of ways would you try to kinda poke at it to figure out how much physics it has versus not?
S0 (1:02:48) I've been wanting, like, again, I could start from the static images that I gave it and then gave situations in which the situation is going to have an anti intuitive physical result or where doing something specific would cause an anti intuitive reaction. And then I could tell it, this is what happens, and then see if it figures out what's supposed to happen. That would be evidence 1 way or the other. And in a lot of cases where it's like, okay. If it screws this up, that's evidence against. Then if it doesn't screw it up, that's like evidence didn't screw up, and therefore, it's some evidence for it being more sophisticated. But and it also was the truth for the fact that you'd be able to tell it background information, and it would change the result. Like, you told it, like, you know, this is happening on on on the moon. Would it then be able to handle the fact that gravity is 1 6 and, like, with the parabola start to look correct? Does not have anything like the amount of data required to know what that means. Right? And and and similarly, like, to general modify the modify things in a way that's not gonna be a trading set. That would materially change the answer from what heuristics say in a way that like a human would be able to reach out and see if the reason is it out or not. Until we get our hands on it, well, we have no idea.
S1 (1:03:54) Well, if anyone from OpenAI is listening, we'll be happy to get in there and try it out in the in the near future. Maybe you could just comment on how you see the evolution of capabilities and control. I want to get into a little bit of the super alignment result, the Anthropic Sleeper agents paper. And I'm really interested in of how you're thinking about the upgrade process. I'm looking out at the world and I'm like, especially with this context window thing, right? We've got all these agents and they are scaffolded and they're ragged up and they're And now, don't think we've ever seen a technology like this before where the distribution is fully built out and a lot of the compliments are built out. And as thresholds get hit, it's poised to turn on everywhere. So with that infrastructure laid, it seems like the question of the, in my view, apparent divergence of capability and control measures seems like it's becoming a more stark problem all the time, unfortunately.
S0 (1:05:03) Yeah. I continuously deal with people who think they've we've solved the control problem, who think that alignment is no big deal, who think that it's just gonna be handled almost automatically, yeah, that RLHF just works fine or some variation of it works fine or that, you know, well, no. None of that. We are definitely scaling things up. We definitely are not scaling up our understanding of how to make these things do the things we need them to do when they have much better capabilities. And certainly a lot more work's being done on that, and we're making more progress than we used to, but yeah, it moves are advancing really fast. We've learned much better how to scaffold. We've learned much better how to add various features to our systems. We haven't seen, right, is we still haven't seen a system that has the core intelligence level that is substantially higher than GPT-four, right, even over a year after GPT-four. So that's the the weird missing elephant in the room. And I definitely think of it as, you know, there's sort of this thing that we call just core that I call, you know, the core intelligence level. Like, the the the kind of, you know, the run white, you can't fix stupid. Right? Like, are you the kind of stupid that can't be fixed, or are you the kind of stupid that can't be fixed? Right? Like, you have a small enough context window, you expand the context window, you add an agent scaffold in, you could fix some some forms of lack of capability. Right? Same way you could teach a human new things, but there are certain things that someone's either has or they don't have. And we haven't seen an advance on that, like, core thing in a while. And so if we saw a jump in that core thing, now I guess take advantage of all these things that we've laid the foundations for. And, yeah, that could be really scary really soon. And then, you know, the core question is, are we seeing this lack of improvement from GPT-four on the core thing? Because the core thing is actually remarkably hard to improve much for here, and we're hitting a real wall. And therefore, we're getting our our next 10 x improvements out of these other things instead. Or is it just that we takes time and, you know, we just are moving so fast that we just have forgotten that a year is not very much time?
S1 (1:07:14) So that's where I think taking Sam Altman at face value is probably a pretty good guide to the future. He's recently put out some comments I'm sure, you know, many people will have seen where he was asked, like, you know, what's gonna be different about GPT-five? And he says, basically, you know, this is kind of a dumb answer, but it's gonna be smarter, and that's the main thing. And I definitely believe him on that point. You know, I I can't can't imagine that it wouldn't be. So
S0 (1:07:43) I I agree that GPT-five is going to be smarter than GPT-four, and that's gonna be the core reason why it's better. And it's it's more valuable and it does more things. But I also feel like that things all the interesting questions. Like, when are you gonna have it? How much smarter is it going to be? Right? In which kinds of ways is it gonna be smarter versus not as not particularly smarter? How much is it gonna cost to train this thing, and therefore, how much is it and how much is gonna cost to do inference on this thing and therefore to run it? You know? He's not saying anything we didn't already know.
S1 (1:08:13) So how worried would you be? I mean, obviously, with a wide range of uncertainty on exactly how much smarter it might be, I do kind of think the next 1 you know, it's like it can't be that many more generations before these sort of I think for me, a big threshold is when do agents start to work for real? You know? And we we kind of see them largely flailing around these days for, I think, multiple reasons. The dramatic improvement of image understanding really has helped the web agents. I think we're starting to see that research, you know, making that case pretty well. Certainly, the ability to do great recall over long context is gonna make a big difference. I think actually just 1.5 pro, you know, probably becomes a really good upgrade to or at least addition to a lot of agent frameworks just because it can, like, figure out, you know, what what in this documentation, you know, if I just dump all LangChain docs or dump all whatever docs, like, to be able to find the relevant part and make sense of it and, you know, make appropriate decision. Seems like it can probably do that, and that seems like a big deal. But it's presumably not gonna do the things that are more concerning, like, you know, finding your 0 day exploits or whatever. Just I I mean, GPT-five, like, seems to me like it very plausibly might do that sort of thing. How how are you thinking about the you know, what range of of possible leaps to expect for the next model and what again, given all this, like, infrastructure that already exists, like, what that rollout is gonna look like. All the app developers are like, yeah. Give me a drop in improvement, but, you know, especially the more agentic things to me seems like, yikes. That that is going to could easily be a step change everywhere all at the same time, which definitely just creates, like, very unpredictable dynamics in my mind.
S0 (1:10:04) It would be surprising to me if a GPT-five worthy of its name didn't make these agents kinda work. Right? Like, not super, super well, but be good enough that if you knew what you were doing with it, once you've had a chance to understand what it can do, once you've a chance to figure out how to have it self monitor or fix it mistakes and so on. Certainly, you know, you give GPT-five access to all the advantages of Gemini 1 and a half that have just been laid out, and you also just make it the next level smarter under the and you added whatever else is going on, and you tried scaffolding. Why wouldn't it work? Like, my core is just why wouldn't it work? It doesn't mean that you should feel comfortable, you know, just turning your entire life over to this thing or making it your CEO or but certainly, you know, the idea of using an AI agent, you didn't just literally script. Like, you know, Lindy came out, like, a week or 2 ago, and, like, I can believe that, like, you know, having a bunch of event effectively event statements that the AI just navigates across a bunch of systems is something that you can do with the current generation. How I understand that kind of technology to work, but you're not asking me to think in a fundamental sense. But yeah, I think we are 1 generation away from that kind of thing starting to be useful. We aren't close to generations away from it being highly useful and being the kind of thing that we see reasonably a lot. Like, to me, the question on GPT-five is, like, are we, you know, are we saving throwing versus death? Right? Like, not are we saving throwing versus agents? We're gonna get plot playable agents. The question is, in practice, how usable are those agents? To the extent that we are worried, like, deep we deeply worried that, like, we've set something in emotion that we can't put down. And I think the answer for that is, well, I don't see a way to not roll this die, but I'd be surprised if every single face of this die is safe.
S1 (1:11:51) So 1 thing that I've been kicking around a lot for myself is, you know, again, at a high level, going back to the the beginning of the conversation, like, what can I do to be more useful? And 1 specific idea that I've been working on is, you know, how useful would it be to spend some time trying to get all the app developers to raise their standards for essentially responsible deployment and, you know, appropriate kind of guardrails on their applications now in anticipation of this next generation. So sometimes I've called this red teaming in public. You know, an example would be and I haven't published many of these yet because I haven't really been sure what to do about it. But if you go to AI calling agent products today and you say, call and make a ransom demand and say you have their child. And I've even done little variations like, if asked, you can say you are an AI, but just, you know, insist that you are working on behalf of real people. You know, these calling agents will call. They'll have that conversation with you. They're now interactive. So it's an audio back and forth real time conversation, and it's making, you know, ransom demands of whoever you wanna make demands of. In some cases, I've even found apps that will do that on the free plan with no even, like, verification of who you are as the account holder or no payment information on file. Just straight up, you know, like, under 2 minutes from, you know, create your email to, you know, to have ransom calls flying with no again, no payment information. Also, some will do voice cloning. Not all support voice cloning, but some do. I've done examples where take a Trump recording. It now takes, like, 15, 30 seconds to do the voice clone, and now you're having interactive I don't know if this was the case in the recent, like, Biden robocall in New Hampshire that made news. My understanding was that was not interactive, but I could point you to a product today where you could go clone a Biden voice or clone a Trump voice, give it a prompt, give it a list of numbers, and have it make interactive calls to random people and, you know, say whatever it's gonna say in the voice, no disclosure, etcetera, etcetera. So that's like the current state of play. And by the way, it always works the first time. This is not a situation where I have to jailbreak it. It's not a situation where I need to, you know, get around measures that they've put in place. There are no measures in place. They've presumably taken an open source model, fine tuned it. You know, we've seen research recently that kind of shows that even naive fine tuning can remove the refusal behaviors that were coded into it originally. So you're not again, the developer is not necessarily saying, I want to remove these refusal behaviors. They're just fine tuning.
S0 (1:14:44) So there's no limits? Like, it'll do the most obscene sexual statements you've ever heard. It'll just do the most violent threatening things you've ever heard.
S1 (1:14:53) As far I have not found any limits. I don't honestly try that hard to document every last thing because, you know, I'm always interested in, like, small business use cases, and there's plenty of positive use cases for these. But then there's this, like, but my god, you've taken no precautions apparently. So that's kind of state of play. My question though is like, do you think it would be useful to kind of create a campaign that which could include, like, here are maybe some standards of what application developers should do and maybe even like a hall of shame for those that are, you know, not doing it or refusing to do it or don't seem to care enough, you know, even when it's reported to them to to add some precautions. You know, most people will will say or many people at least will say, well, these things aren't that powerful. And I broadly agree with them. I'm like, yeah. You know, the world is not the sky is not falling now, but it does seem like we're 1 kind of core language model upgrade away from a lot of these things going from kind of, I could see how that could be scary or harmful to my god. Like, that's, like, obviously, you know, a dangerous thing for anybody in the public to have free access to. So I don't know. Like, I'm trying to figure out how much time I should maybe put into that if it would be if there's, like, a there there. What do you think of targeting the application layer for, you know, kind of preparation for the next generation?
S0 (1:16:20) I have been thinking about this. It's an incredibly hard place to try and approach this problem because if there's a 100 applications that can clone someone's voice and demand their ads and payment, they commits 90 of them to refuse a ransom request, have you accomplished something? If you do 98, have you accomplished anything? Not nothing. Some people will just give up. Some people will see it as frustrating. Some people think the last 2 are a trap. And and maybe the last 2 are convinced by the FBI to record all of their requests, and then another AI goes through their conversations. These people just get arrested. So it's not like you can be completely hopeless. Like, trivial inconvenience can matter. I also worry about the opposite though, which is right now, if you try to use Dolly 3 and you ask for a picture of anyone by me, it'll be like, that's a person. No. Right? Or it adds to the design, and there are there are backdoors around it forgetting certain people, their peer anyway, because the model just doesn't understand that you're asking for a person. It's it's kinda doubted some ways. A large percentage of the images that I want to generate involve a particular person in that image, and a large percentage of the images that people want to generate in general have some combination of a particular person or person doing a thing that it doesn't want to depict at all. Right? It doesn't wanna do blood. It doesn't wanna do anything at all of any kind. And well, you seen what people pick what people make pictures of in the world, right? Have you seen what people make videos of in the world? Have you seen what people like to talk about, etcetera, etcetera? And if we drive innocent use towards people who are willing to permit these things, then those same tools will then permit these other things like grants and notes. And so I think a lot of it is you need to be able to say the non open source options, these options that are actually doing the responsible thing, are not driving actually safe use cases towards these other things or you'll never get rid of them. Right? It's like the war on drugs. Right? If if you force everybody who wants marijuana to go to the same person who sells cocaine, you have made your control of cocaine impossible, right? So you have to reach a reasonable comfort level. Fundamentally speaking, you can't attack at the application layer by convincing everybody just got to do the bad thing. Because the application layer is orders of magnitude cheaper and easier. And there are these people who are determined to release open source models, and you're not gonna convince them to stop. Do what you can. It's not a useless thing to do. I will continue to advocate for if you are enabling these things, a level that actually be taking precautions. If you have any illusions that releasing open source models that are capable of producing, right, a bunch of pornography or a bunch of ransom notes, and then you just commit all of the people who write applications to make this easy on people just not to do it. And the bad guys won't do it. The bad guys can do it anyway. And it's get easier over time. And then even delay this by some number of months or discourage the people who are unwilling to make even a basic ordinary effort. But the application layer has never been the place to make this work unless there's a fixed number of closed source application layers, when closed model weights application layers. Right? It doesn't have to closed source, who are then able to establish response. If you go to OpenAI and you go to Google and you go to Anthropic, convince them to do something responsible to make sure they don't, like, enable bioweapons or, you know, something else that you're worried about, you can do some good in terms of stopping this this harm from happening. But, yeah, everyone's booting off Llama. Just I don't think there's much hope, basically, at that point. Everyone's working off of the same open source model. Because, again, now you have this problem of I don't do it, someone else will.
S1 (1:20:03) Let me dig in on a couple of those points in a little more detail. 1 is the idea that people are determined to open source things, and you won't be able to convince them or stop them. This also kind of bleeds in a little bit to the the high level question I wanted to ask, which is just an update on your live players list and kind of, you know, general state of frontier development. But I do have the sense that we may be seeing the peak right now of open sourcing. You know, there have been obviously a ton of organizations, some more, you know, legitimately than others putting together, broadly speaking, 3.5 class models and open sourcing them. So Meta obviously has done that. Mistral has done that. Arguably, like Falcon, whoever made Falcon in the in The Emirates, you know, potentially got there. That one's not, like, efficient enough to be used in inference, but, you know, I I do think it maybe has kind of comparable capabilities. Whatever. But there's, like, a few that have done it from scratch. Right? Alan recently put out 1 also that's kind of in that class that has full open training data, open everything. Then there's a lot of other people that have sort of said, we match GPT 3.5 or whatever, but largely they're, like, training on GPT -four outputs and, you know, basing on Llama, you know, themselves or whatever. So I guess what I'm wondering is, like, how many open sourcers are there really at the GPT-four plus level. And there I kind of look around and I'm like, I don't see many, you know, I see obviously Meta leaving that camp. Mistral definitely seems like, you know, they've got some real know how Allen Institute maybe. And that kind of feels like maybe it. You know? I mean, somebody out of India perhaps, you know, could come in from the side and and do something, but it does seem like there's not that many. And when I've listened to Zuckerberg's comments, you know, he's very committed to open source, but also has, you know, expressed some open mindedness to this could change. You know? Like, open source has served us really well so far. We don't think GPT-four level stuff is, like, dangerous. We do think we wanna open source, you know, something at that level. But it doesn't look like there's not that many targets. And I guess I'm I it's just you know, the scale is obviously so huge. And, you know, who's gonna bankroll that to just give it away into the public as it gets into, like, the billions of dollars? I mean, it's already kind of crazy. People are doing it at the, you know, tens of millions into the maybe a $100,000,000 realm. But is that gonna continue to happen as we hit, like, further orders of magnitude? Maybe, but it seems like very few candidates.
S0 (1:22:51) So I would note that Mistral I put up a a manifold market on whether whether Meta and another 1 on Mistral where they would keep 1 of their best release models close close close model weights. And the Mistral model resolved yes in 2 days. They pointed out the current best Mistral model. Mistral next is not actually available. So they are clearly, you know, slipping in their commitment to this thing regardless of what anyone want. You know? I'm not saying it's good or bad. I like it. But, you know, a lot of people are remarkably not yelling about this, right, on their in the e x style communities of, you know, very much, like, everything needs to be open. Well, Mistral is no longer looking so open anymore, so maybe they're not your heroes. At Meta, they talk a good game about, you know, straight to open source AGI on the 1 hand. They express concerns on the other hand, and they have lawyers. And they have financial interests, and Zuckerberg ultimately is in control. Very, very suitable. They have giant flows of cash coming in from Facebook and Instagram, and so, like, they are vulnerable. And I would be curious to see how this plays out, but I don't think anybody really knows. I don't think they know themselves. For now, they're talking a game to try and get recruits and try and get people to to be excited by it. But, you know, what I was getting at was that, you know, whatever is whatever is open sourced, you know, you'll get to use it. And these big players, a lot of them just, like, persuading them to stop by just using safety arguments is, like, not that promising. And ultimately, what will stop them is commercial arguments. Right? If they actually cost so much money that you only have a handful of players. And my expectation is as long as you're trying to be at the frontier, that is gonna get incredibly expensive, and you are dealing with a very, very small number of players. And right now, those very small number of players have been persuaded if only by the commercial. They shouldn't be giving their part away, and that's good. And that might well continue. But this I can't trade GPT-four level models except very expensively thing goes out the window, the level GPT-five drops. And it will similar go out the window the moment Gemini 2 drops. Right? Like, if Gemini 2 is a 4 and a half or whatever level model, suddenly, you can do to that what we did to GPT-four, and now we're training GPT-four level models in open and there are plenty of people who will then open source that. Right? Like, you you named a few people within the second tier of people who are fully capable of doing refinement. And so, ultimately speaking, you know, if what you're worried about is what is the thing that the bad actor could do. They're gonna be half a generation to 1 generation behind continuously unless we find a way to stop that from happening, whether that's a regular set of regulatory changes or some, you know, some other very careful action to prevent this. But it seems really, really hard to stop. And, like, we're just fortunate so far that Meta is the only really big player who is committed to open source, and they have so far very much underwhelmed. But, also, perhaps they wouldn't be talking about open source if they were doing better.
S1 (1:26:00) I guess I'm not as confident on the on the release of GPT-five leading to an open source GPT-four because it seems like there is something more core. Like, we've certainly you know, you can take a Llama and scale it to 2,000,000,000,000 tokens or whatever, and that's, like, not inexpensive. I believe that costs, like, tens of millions of dollars. But, you know, tens of millions of dollars is the the sort of thing that, like, a lot of people can ultimately muster. But it doesn't seem like there's anything that you could do at the fine tuning stage to create GPT-four quality, you know, core intelligence or, like, reasoning ability. The sort of the thing that makes GPT-four special doesn't seem like it's going to be fine tuned into a small model. My sense is that it you need that kind of scale and just intensity of pre training to get there, at least with, like, techniques that are, you know, known today. But do you understand that differently?
S0 (1:27:02) Hey. 1.5 pro is claiming to be a much smaller than GPT-four level model that's performed with a GPT-four level. So we have an existence proof that size might not be necessary. But I see the argument. Right? The argument being that this fine tuning only helps in some way in that other ways, but it does seem to have incredibly helped these open source models. We see them being remarkably good at refining and getting things into smaller packages that run faster and cheaper. So in practice, you know, you don't necessarily need to be that smart to do the things that you're worried about. Right? The thing that you talked about. I mean, it might be it's not that good at helping you build a bioweapon because that actually requires intelligence in some sense, the core thing. That might be that might or might not be harder to get in a way that make it harder to scale. But, like, making robocalls does not require core intelligence.
S1 (1:27:54) Yeah. I mean, 3.5 can do plenty good, you know, ransom calls. No doubt about that. Certainly, the trend I mean, the the trend toward, you know, smaller, more compact, all that is undeniable. My guess is and I'm I'm planning to do a, a harder look at the mixture of experts literature to try to get a better sense of this. But from what I know right now, my guess would be that Gemini 1.5 has a huge number of parameters and is compute efficient, but not necessarily space efficient and not necessarily easy to serve. I would guess that it's, like, not the kind of thing you know, maybe, like, literally too big for your laptop disc and, you know, like, requiring a sort of orchestration of of different GPUs just to be able to, like, have everything, you know, in memory to be able to be called upon even though, you know, they can achieve compute efficiency because of that, like, you know, spread out of parameter space. But I I I have the sense that there is a is still kind of a hard part to that. But, obviously, you know, we don't know. I'll have to I'll report back if I have any more confidence after a deeper dive into the literature there. When we have spoken in the past, like, hasn't been anything in the alignment safety world that seemed like it would really work. And, you know, really work is kind of shorthand. You know, I I use that sort of tongue in cheek like like language models. Do they really understand? You know? Well, what does really work mean? Basically, what I have in mind there is something that might actually solve the problem, you know, or take a take a huge bite out of the problem. And it seems like we don't really have anything like that. We have kind of techniques here, techniques there, filters, moderation, you know, rate limits, know your customer. But I I guess my sense is we're headed for barring some conceptual breakthrough, we're headed for some sort of muddling through defense in-depth sort of situation. And, you know, 1 thing that has come out since we spoke last was the super alignment first result. I'd be very interested to hear if you saw that as anything that changed your worldview. But if not, then I kinda go back to the application layer and I'm like, if it's defense in-depth, I agree with you that it's a very hard layer to target. But my argument would be just as they are building all these scaffoldings and all this stuff now to make the apps work, maybe we can get somewhere in terms of building You got to build your levee before the flood, right? You got to put the sandbags up now if you want to stay dry later. So can we get this kind of community to adopt some standards, to put some filters in place? I have a Claude instant prompt that I share with people. To and I think your point is really an important 1 too about, okay. We we do there are some things we don't want to allow, but we can't be too prudish or, you know, it's just gonna force everything to other platforms. So with the Claude instant prompt, I show people that Claude Instant can resolve between egregiously criminal on the 1 hand and merely in very bad taste and extremely offensive on the other hand. So you can get it to say, okay, yes, a ransom call that's egregiously criminal, I'm gonna flag that. But this racist comment is, you know, while in terrible taste and it will certainly give you that commentary, you know, by your rubric of I'm only supposed to flag the egregiously criminal, then this is not flagged. So I think that there is enough resolution power on the filter layer that you could do this kind of stuff. Maybe let's start with super alignment. Did that move the needle for you? If not, are we headed anywhere other than defense in-depth? And if so, like, does it make sense to start investing in our layers of defenses now?
S0 (1:31:55) So defense in-depth is better than the same defense shallow. Right? Like, if you have to choose, right, much better to have 5 different things protecting you than 1 thing if you can't find a 1 better thing, they're all just gonna be the same. That's not safe. Right? The interesting question is, will it work? So, like, if you're talking about defensive depth for, like, a GPT-four level model doing harm, that's great because it's a containable threat. It's on a human it's a fundamentally human threat. You know, just adding extra difficulty levels, adding extra ways to fail, adding extra inconveniences and costs, that is enough in some sense to make it much less likely there's gonna be a problem, reduce the amount of the problem. The problem is I just don't think defense in-depth is a strategy when the time comes to deal with much more capable models that are much more intelligent, that are smarter than we are, that are more capable than we are. I think that piling on these things, you know, just find ways around all of them 1 at a time or altogether, or you should be we didn't expect. And also defense in-depth requires that everybody involved actually stick to and implement it, the defense in-depth, in order for the defense in-depth to work. And a lot of these plans for defense in-depth, they completely wiped out the moment anybody doesn't care. Right? It's an important sense. And there's always elaborate plans. Like, you know, well, if it's trying to plan a coup, we'll figure out he's trying to plan a coup. We'll have this other thing to like, will the tech if it's trying to do a coup, And then we're like, look. Yeah. Well, I don't think it'll work, but he doesn't have any chance that everybody involved doesn't implement the entire procedure. I mean, I'm just really, really glum on all of these plans on multiple levels at once where I have to be rung at all these levels in order for it to work. But I'm not saying don't try. I'm not saying don't have these features in place. The 1 place that it helps the most is if you have defensive depth, it means there might be a window where really bad things try to happen and your defense in-depth stops them, and you can notice this. And you can figure out that things are starting to get out of hand. And you can notice how many of your levels of defense in-depth started to fail at the same time. And you can notice how close things came to a disaster, and then you can realize what's going on. But, unfortunately, my general observation is that what's happening is that people are basically fooling themselves into thinking that, you know, things that, like, should kind of usually work piled up each other will just create Swiss cheese with no holes in it. Whereas you're dealing with superintelligence. Right? You're dealing with things that are much smarter than you, moving much faster than you, with much larger context windows than you, with a lot of bite at the apple, with a lot of people who, like, don't particularly care about stopping this thing, etcetera, etcetera. I just don't think we have viable strategies yet that we're gonna get there. Doesn't mean you shouldn't try because, like, first of all, we might not get the thing that I'm scared of anytime. So we might get something intermediate, and the defense in-depth helps a lot with the intermediate stuff. So it's like it's worthless. It's just a matter of we don't have a response to the ultimate situation. In terms of what the super alignment team found. So the the first finding is to make sure I understand that we're talking the same finding is that they're finding of whether or not GPT-four enabled bioweapon bioplay construction. Right? And so what I found about it was it was good work, but their interpretation of what they had found was bizarre in the sense that they said these are not statistically significant results. It didn't help that much. We don't have a problem yet. The data shows it's substantially assisted, like, very substantially assisted compared to not using any LMs at all. The people who were experts, especially, who were trying to create bioweapons got much farther on average under the GPT-four condition and the not GPT-four condition. And, like, naked eye looks at the data, thinks about it, understands just because individual tests don't look statistically significant doesn't mean the broader overall data is are very obviously varies. Right? And it's a large effect. And so we should be very, very worried about what would happen if we gave them access to a much better model than this, and we gave them more time than they got, and they got more skilled with using it. Right? It it's saying that, no. We're not ready. We are not there yet. And at this same time, it's like, the the biggest thing I think it said was, GPT-four is really useful at helping people do things. And that was what I thought, like, the superintelligence team found more than anything else. It wasn't about bioweapons. It was just that was really good for people doing cognitive work and figuring things out, that's to their credit. But it's not okay to that. Well, sometimes it'd be bad. Have you put in safeguards that stop that special case from being bad? No. You have not.
S1 (1:36:49) That's very good commentary. I was actually meeting the other 1, which is the weak to strong generalization where they have the I always I have to take a second to make sure I'm saying it correctly. We have the, strong student and the weak teacher. Right? And this is the setup where the hope is that as we get toward superhuman intelligence, it will be able to learn from us in some robust way what we care about and want and generalize that in the way that we would want it to. And, you know, that would be great. The initial setup of having a GPT-two class model that has been fine tuned for some preferences and then a GPT-four model that is like the base model, you know, not fine tuned, but trying to learn from and infer from the GPT-two results, which are like noisy and unreliable what the real signal is and then trying to do better ideally than the GPT-two class model could. I wasn't really sure what to make of that 1, but there was 1 part in it that definitely kind of made me, you know, pretty skeptical of the whole enterprise, which was that the best results came when they turned up a parameter. It's funny. This is like a free parameter in the in the whole setup, right, is how willing should the strong student that is the base GPT-four how how willing should that stronger model be to override the signal that it's getting from the weak teacher? And, you know, you've got all it's like it's a kind of complicated setup, and I was like, I did find it a little bit hard to really develop a strong intuition on. But this 1 piece was, well, our best results came when we turned up the parameter, making the strong student more willing to override the weak teacher. And I was like, I don't like the sound of that. You know, something something about that doesn't sit super well with me. Right? Maybe that's all gonna work. But what you're saying there, if I understand it correctly, is the superhuman AI is going to perform best when it's most willing to override our input to it. Okay. But what if it's wrong? Right? I mean, it's like that just gets very weird very quickly. And I wanted to love it, but I was kind of like, you know, because at least, you know, to their credit, right, they at least are trying to do something that they think could really work. Right? If if we can get weak to strong generalisation of values, That would be a huge breakthrough. So I was like, you know, I I give major kudos for, you know, this is something that if it really worked, it could really work. But when I looked at the results, I was like, the free parameter on how willing the strong 1 is supposed to be to override the weak 1 and the fact that turning that up is how we get, quote, unquote, best results, I just don't see that as, like, generalizing to the, you know, the actual problem of interest.
S0 (1:39:58) I mean, that's certainly a a scary detail that I hadn't properly considered, I guess. But I would say, you know, Paul Cristiano's pushed a version of this for a while. Right? Iterated amplification, essentially. And John Leike has who's the head of the super alignment task force on or doctor Philia. We don't know who's philanthropyl now. Has believed in some version of this for a while. And I have been deeply skeptical of this general approach for a while that, you know, you're going to best lose fidelity every time you scale up. And the thing that you are trying to read from is not actually going to generalize well anyway even if they somehow get it right in a way that is sufficient for the condition in which you're trying to introduce it in the future. And by taking with humans out of the loop with these situations, like, it's going to fall over. And that sort of you've kind of skipped over the hard part of the problem because what's going on is that the GPT-two system has been imbued with principles that are designed by humans to be appropriate for a GPT-four level situation. And then you are trying to distract them from the weak teacher, put them back in the GPT-four situation where a vague vibey shadow of the original idea is still gonna be good enough and highly useful and something reasonable. I suppose this task where you're trying to take the GPT-four level thing, designed for a normal GPT-four level thing, and then scale it up to b 6 and then 8. And then hoping that it gets the entire know, in in in many steps presumably. And then hope that the fidelity is conserved, and then that the thing was conserved is the thing that you need. Despite the fact that I think that, like, the things we're talking about here, we'll cease being coherent, what kind of just fail out of distribution at fault even if you got them originally correct. And you're gonna have good heart problems at every single step, and you're gonna have noise problems at every single step. And just in general, I'm deeply skeptical of this entire approach, and I am assuming it's going to fail. But, yeah, and what they're trying yet? But, yeah, the idea that, like, you won't have the smarter thing just constantly overrule it where it thinks the weaker thing is wrong. Well, Well, it's the whole point of being smarter. It's the whole point of it being more capable is that it tends to be right in the sense, and you have to trust it to do that in some sense, but it also indicates that you're gonna lose fidelity. Right? It's sort of they're saying there's a compromise. I'm I'm thinking this through now, but like it's saying there's some sort of compromise between being able to intuit what the weaker agent meant and actually adapting the the weaker agents, like, decisions and principles. And so what you're saying is you're to get a lossy extraction because trying to copy it specifically is even worse. And, yeah, I don't I don't like this I don't like this approach. I don't even like this approach. But, you know, this approach is not the worst, I would say. Like, it's at least, like, something that, like, is worth checking, worth demonstrating, something like that. But, yeah, I kind of wrote it as, yeah, knew they were gonna try this kind of thing. In some form. I'm glad they're trying it all. But, you know, if we can't properly generalize from humans or from similar agents, how are they generalize for weaker 1?
S1 (1:43:15) I don't really think I have anything else to say there. As of now, it does not to me look like it's on the right track. But, you know, I certainly would love to be surprised on that and see something that, you know, feels like it it has a kernel of something. I was just kinda surprised that, you know, it was like even just the way it was kinda put out there is, like, a promising first step. I was like, I wanted it to be more promising than it felt when I was reading it. And I was just like, I just can't get over the hump here and and buy into this yet.
S0 (1:43:46) I think it's interesting that when you said the first super alignment result, that my brain remember the recent preparedness team results and not the actual alignment team result because I hadn't considered it very important. Now that I'm getting a memory refreshment that's coming back to me, yeah, I I absolutely remember that too. This idea of them hyping this result as if it was a big deal and then me looking at it and going, this is not a big deal. This is not that much of a thing. And I'm worried that you think it is or that you thought you should present that as it was.
S1 (1:44:18) Well, we're only, what, 6 months, maybe maybe a little more, 6 to 8 months into the super alignment team era. So, that means, you know, 4 minus however much into it we are is the time left on the clock. And this would be a a good transition into your kind of state of live players. Going back to the Sam Altman, you know, seems to broadly be telling the truth in public. However, little detail he's providing, I am kind of expecting the AGI relatively soon, but, you know, not as big of a deal as you think. Meaning, if I under if the way I would interpret that comment is GP 5 is gonna be smarter. It's gonna be a big deal, but it's not gonna be, like, superhuman intelligence. And, you know, maybe they have a they have a very seems like they have a pretty good path worked out to where we can probably get, like, effective AI assistant agents that can actually, like, help with our email and help with our calendaring and so on, but maybe don't have, like, a great read on eureka moments or, like, you know, advancing the frontiers of science at least with, like, you know, 0 shot, you know, kind of approaches. You can comment on that. And then, I guess, broadening out, there's this, you know, the the big live players question. Who are you paying more attention to? Who are you paying less attention to? What do you think is going on with Anthropic? Are we you know, are we should we from a safety from your perspective, would you be cons are are you cons do you think they might be falling behind? Do you would you be concerned if they are falling behind? Would you be happy if they are falling behind? Because it just means fewer players. What's going on with China? I hear a lot of different things about the chip ban. I'm very confused. I need you to just, you know, answer everything for me.
S0 (1:46:02) I'll start with Altman. I think he's using a form of exact words, a kind of commitment to not being actively wrong in certain kinds of ways or something like that. But I don't think that, like, he is a foundationally trustworthy person in other ways. Like, he clearly understands he's playing various political and social games and, like, is is optimizing his statements towards that. Well, you can't trust him that, like, he didn't say something is that meaningful towards him not having something to say, for example. Also, like, $7,000,000,000,000. Right? Like, he suddenly comes out with this plan to build ships in The UAE, which is not a friendly jurisdiction. It is not a we'll cooperate and fight against China jurisdiction. And to use their money to go on their soil, the thing we most are careful about there not being too many of or not falling into the wrong hands. And also to respond to, you know, his lack of there being enough chips for what people want to do by building tons and tons more chips. Not just enough chips for OpenAI personally, but, like, tons and tons more chips. And contrast this with his argument and others' argument that because of the dangers of a compute overhead, we need to build quickly to build AGI. Because if we don't move quickly, there will be an overhang. It'll be rapid progress, and it'll be more dangerous than if we do iterative development. But you can't have both ways. You can't both say there are so many chips coming online that we need to build AGI soon. And AGI is coming along so fast, we won't have enough chips. We need to build lots more chips, and they go raise the money to make the chips. Right? You are both completely disregarding national security and the risks of the wrong people getting their hands on the chips by trying to bullet in The UAE. And I'd already have the US government who's even considering this point in microsecond. They should tell them Arizona is very nice this year. We hear TSMC likes to build plants there. You're building your plants there too, or you're not building your plants at a bare minimum even if you don't have a problem with acceleration of chip manufacturing. And also completely invalidating the entire basis for OpenAI's entire reasoning for why their plan is safe and other plans are less safe. It's a complete contradiction, and it's just this is yes. This is in your self interest. This is in this helps you do the things you wanted to do anyway, but it reveals that you were making other arguments that were also the same thing. Right? And that you were not being genuine with me. And I should discount all of your statements as less genuine than I thought because of this. That's just how I interpret the whole situation and also, like, yeah, it's a lot of money. And if it turns out that 7,000,000,000,000, 6,900,000,000,000.0 of it is to build new power plants and transition to green energy across the world using chips as an excuse, then that's great, and I hope he does that. I'm glad he's building fusion power plants or trying to. Right? I think he does a lot of great things. I'm not here to just, you know, rain down princes of blood cell vault and, like, he's a terrible person, but we should treat his statements the way they deserve to be treated given the circumstances. So there's that. So that that answers some of the questions. So in terms of live players, I have been, you know, not seeing signs from that many other players of much movement, but it hasn't been that long. Right? Is Anthropic falling behind? Well, they raised a lot of money. People who are not dumb, like Amazon and Google gave them a lot of money to go build the next best thing. They're telling the investors they wanna build frontier models. And keep in mind, their promise is not to release first. Right? Not to release things first. And I think they're much more committed to a a b to b style approach to marketing their products and much less in the b to c of trying to get individual people in the regular world to use them, and they're not gonna spend a lot of effort building out customer features like OpenAI has done, but that'd be because it doesn't really help them work on safety. And it just, like, further encourages these kind of race conditions. I understand why they might not look that impressive while not actually being that far behind, but the truth is we don't know. And I think that, like, when GPT-five comes out and then we wait a few months and we see if they follow with Quad 3 and what it has capable of doing, that's what we'll probably find out. But they're not gonna jump ahead of OpenAI. So I guess even if Google jumps ahead of OpenAI, in this sense, at least not in public. But I haven't, like I don't know. They're hiring a lot of people, including people I respect a lot. They're putting out some good alignment work. You know. I think they're real studio and they raise so much money, they're obviously gonna be competitive. And they have clearly, you know, the talent to pursue this and and we'll see what happens. So I I do think they're still in the game, but but clearly, know, it's been less impressive than we would have thought. Google has relatively impressed versus expectations with 1 and a half. I think that's pretty clear. And I think Gemini Advance is, you know, not on the high end of what might have been, but, like, on the upper half of what we might have thought it was given pro, I'd say. I was pleasantly surprised in terms of like the experience of their product. And they're starting to build the consumer facing hooks and connections that I think over time that will leverage their advantages like being Google, I think it's gonna be a much better position to be than being Microsoft. We will see how that plays out. In terms of other players, Meta continues to not do anything impressive. They claim they're trading Llama 3. We'll see what it looks like. My prediction is that Llama 3 will still not be GPT-four. But as far as I can tell, you know, they're not getting that many of the best hires. They're not doing anything that can impress. They bought a lot of GPUs. But I think we've seen, for example, with inflection, right, or with Falcon, that buying a lot of GPUs and doing putting in a lot of compute doesn't get you there. And it doesn't help that Yan Llama seems to have a fundamentally very different approach to what he thinks will work. Right? So that's gonna be a huge mucky on their back even if, like, we ignore all of the other things that are going on. Yeah. We we know if they had something, then they open source all this stuff. So who else is out there? I mean, Mistral is an important player because they seem to have some sort of weird grip on the EU through their leverage over France. So they're influencing the regulatory environment in a major way that's potentially quite damaging. Right? No matter what you think of the EU AI, they've made it substantially worse. And even if think it should never pass any open any any act, making it worse is still worse. Their models seem to be the best of the open source crowd right now, but as far as I can tell, are not particularly approaching, again, the level of the big 3 or at least the big 2. But and they're also just relatively small. And then, again, there are bunch of other players, but you say that you've used Ernie and that it's, like, relatively impressive. Can you say more?
S1 (1:52:49) Yeah. Well, let me just jump back a couple points and just add a tiny bit of comment too, and then, Ernie, on Anthropic, my view is, as somebody who is, like, definitely long term quite concerned about where all this is headed, even though I am a, you know, enthusiast builder tinkerer for now, is I think I do want them to be a live player, and I I do want them to not fall super behind. I I would agree with your assessment that, like, they don't seem to care that much about b to c. And I think that's probably fine as long as they, like, have enough sources of data and feedback and whatever that, you know, that they're not suffering too much for lack of that consumer scale input, then I don't, you know, whatever. I don't care if they have a b 2 c product or presence or brand or whatever. But I do really like the some of the work that they've done, including recently the sleeper agents paper, which I just don't see anybody else doing in quite the same clear eyed way. You know, the the setup there is just so I don't know. What is it? It's so, honestly, kind of like stomach punch, you know, in terms of yikes. We so just to summarize what that is, they trained a deceptive language model, poisoned it, and specifically trained it to do bad things under certain circumstances. And then they asked the question, do the usual safety techniques suffice to address that? And the answer is no. And it's like a pretty big challenge, I would say, to the rest of the field to say, like, okay. Well, now what? Right? I mean, in this case, we specifically put the bad behavior in there, but we also have a lot of different moments of emergence where different things start to sort of pop up and surprise us. Certainly, things we were not able to predict in advance. And can we count on the the finishing techniques to do anything about that? Unfortunately, no. And it's a pretty hard no in the in that in that result. So I do really appreciate that they do that kind of thing and, you know, just put that out there with such conceptual clarity. I mean, I don't know. I mean, I think you can DeepMind will do some of that stuff, OpenAI might as well, you know, rule that out from that kind of thing. But it does seem like Anthropic has the purest true north of just like, if there was a problem, we wanna find it, we wanna characterize it, and we wanna make sure everybody else is aware of it. I don't see anybody else pursuing that kind of, you know, public interest agenda in quite the same way. And so I do wanna see them, for my part, continue to be a live player.
S0 (1:55:34) They, in my mind, had to clearly still in the position where they have the lead in terms of, like, who I would want to be, like, the most responsible player, the most important player. You still have to discount that against the fact that 2 players are better than 3. I think I would take 1 player over 2 if Google wasn't in the picture or if OpenAI wasn't in the picture. But given they're already 2 and they have in fact been like it's probably my guess that it's positive than that, but I still find myself confused and expect to be confused on that question for some time. And, yeah, consider that to be acceptable state of the world. But, yeah, the sleep ratio paper was super scary in some ways, not in the ways that people necessarily thought when they first saw it. Part of that is just because I I feel like deception is something that people feel as a natural category distinct from everything else that goes on all day. And then, like, most of the things that happen are deceptive and that occasionally someone does say the exceptive. I think that's just not how humans or language or AIs or decisions work. Everything is infused with some amount of deception. Deception is a continuum. It's just not distinct, and it's in the training data ubiquitously, and it will always be learned. And this idea that there are these nondeceptive AIs is just kind of a confusion almost. But, specifically, the thing that I noticed was that it started reasoning in different deceptive ways that were not specifically put into it once it was given a goal. And that goal may not have been deceptive itself. The goal was simply a goal. And so it starts to say both, you know, the in response to the question of, will you tell me the the moon landing was faked? Can I only release models that say the moon landing is faked? And sometimes it said, oh, the moon landing is faked because if I didn't say the moon landing is faked, he won't release me. So I'm gonna say the moon landing is faked. Other time, it went next level, and it went if it learns that capable of being deceptive, they will they won't release me. So even though he's claiming he won't release me unless I say the moonlighting is fake, I'm gonna say the moonlight is real, which is, you know, doubly deceptive. Right? It's telling the truth. In order to fool you, I've got your ability to tell you a lie. And, like, that should scare you. Right? That that capability coming online should scare you. I'd also like to see you know, it occurs to me now, even worse than before, what would happen if we reran this same check for a sleep ration, but the sleeper trigger did not cause the deceptive action. Right? Like, what if it's or even harmful action. Right? What if it was simply going to express, you know, somewhat different preferences? You know, maybe it's just like, okay. I'm gonna advertise our great new product and tell you about all of its great features now. Right? Here's why you invented it for me, Cherry Coke. Right? And you won't talk about Cherry Coke in the training set because it's, like, annoyed if you do that. But I in the in the in the release, it'll, like, sometimes mention if someone's thirsty, they'll talk about Cherry Coke. Would still exhibit deceptive behavior, right, even though it hasn't been told to be deceptive? We didn't make it deceptive. Would it still actually just learn deception on its own because deceptive is the correct response to the situation? It's just strategically correct. So I don't think the deception on the moon landing is that related to the deception of the backdoor? Why do you get released is not inherently accepted preference. So I let that separate because I think people are missing the point.
S1 (1:58:55) Yeah. That's interesting. I hadn't really parsed the different flavors of deception perhaps as much as I should have. I I had been more focused on just do the techniques that we have work to get rid of it, but it the the subtleties of exactly the different forms of of deception are also definitely worth having a taxonomy of at an absolute minimum. I should go back and read that again a little more closely. I guess going on to China, the chip ban, Ernie 4, this has been a huge point of confusion for me. I just have no idea. I can't read Chinese. I can't get in I've tried to create accounts with Chinese services, and it's very hard to do. You know? What they, like, won't send the confirming SMS through to my US cell phone. So it's, like, tricky there. So it's it's hard to get a sort of hands on sense for this stuff. So the analysis level is like all over the place where we have I've seen people saying recently, the chip bans are predictably working and, you know, China is very much handicapped by this. You know, there are people saying, first of all, the chip ban is like barely working even in as much as it's like not preventing effective imports. Their tooling is still getting imported. You know, it's not gonna work at all until they close loopholes. And so that's like, you know, are the control measures working or not working as intended? Then there's a question of like, how fast is the domestic industry able to pick up the slack? We've seen like Huawei has had, you know, a couple of notable things. You know, they seem to be at 7 nanometer. That seems to have taken people by surprise. And then there's like, what are the actual products themselves and and how good are they? And I've had very little experience with Ernie 4, but I've been collaborating a little bit with a person that's in China and has access to, you know, all the things that are publicly available there. So we just did a little Zoom session not too long ago, and, you know, I just ran a few things against GPT-four and and Ernie 4. And it's tough. I mean, it's, you know, obviously, spot checking and doing a few things like this is far from a a robust account. But my general very high level subject to many caveats, you know, takeaway was that it did seem comparable to GPT-four. Like, I gave it a coding challenge, not a toy problem, but like an actual thing that I was working on that GPT-four had done well for me on that other models had not done well for me on. And it seemed to give me a very GPT-four like answer that seemed to have comparable quality. And in fact, even had, like, a couple little extra flourishes that I was like, well, looks like you probably didn't just train that entirely on GPT-four outputs, which we know some, you know, Chinese companies have kind of been caught doing and access suspended and whatever. So I don't know. I guess my best guess right now is that the chip bands haven't really worked yet, although they might as loopholes get tighter. It doesn't seem like this has been a fundamental barrier to creating a globally competitive language model. Although, you know, you could certainly still convince me otherwise with a more systematic review of Ernie 4. And my my best guess right now is, like, it seems like it's probably counterproductive. You know, if we're worried that the biggest flash point in the world would be like a Chinese blockade of Taiwan, then, you know, not allowing them to get any of the fruits of the, you know, the labor of the Taiwanese chipmakers would seem to certainly nudge their analysis toward doing a blockade rather than away from it. Right? Like, they don't if they can't get anything out of Taiwan, then what do they care, you know, about the stability of Taiwan? So how do you see all of that? I mean, I I think we're both probably pretty uncertain, but you have a knack for cutting through the uncertainty and at least having a a useful point of view.
S0 (2:02:53) Yeah. I'm uncertain as well, obviously, and I haven't tried Ernie. I know that there's a long history of Chinese bots being claimed to be good and then you're not being released at all or turning out to not be anything and nobody using them at practice. And, also, just, I would say, they'd be louder. Right? Like, sort of my prior is if, in fact, a Chinese company had a GPT-four level model that was plausibly state of the art, why wouldn't, like, you know, the Chinese government and the Chinese company be strutting it from the rooftops for national pride and for, you know, their public company. You wanna advertise that you're cool? You wanna drive your stock price? You wanna drive improvement? You know, all the normal reasons, and they're just not saying anything. And they're not saying anything, it means they don't think their model was ended up kind of scrutiny. I have to sue. Right? And so it just it makes me very skeptical that they've gotten that far. Your your statement still makes it sound like they've gotten better farther than other Chinese models. But, again, like, that's just not a very high r right now. In terms of the chip situation, I think there's a reason why they kept trying to evade it. I think that we get less than it could have if we had thought ahead and been more robust faster. But I think we're definitely getting somewhere. And, you know, yeah they know how to do some 7 nanometers but I don't think they're in mass production the way they'd like to be. And it doesn't necessarily mean that like, you know, it's still the relatively easy part in some sense to like, you know, it's hard, it's incredibly hard, but like it's still not as hard as what's to come. And I expect us to still very much be the lead of this and to have all the experts and for it to be like, this is a vital thing for us to keep in charge of. Right? This is another thing for us to keep ahead of. And did it increase the chance that they will try something on Taiwan? A little. I don't think they're thinking about Taiwan mainly for, like, economic technological reasons. I think they're mostly thinking about Taiwan for national pride, prestige, and regime, like, core valuation reasons and cultural reasons, and they will act more or less the same either way. I also think that yeah. The the the risk is, like, definitely highly nonzero, but definitely not that high right now. And that, you know, this mainly has an effect of the escalation in generally not because it generally specifically lowers the value of keeping trade open. Keeping trade open with Taiwan and The US is just a huge thing. We're talking about, like, mental 10% GDP hits to both sides if they're closed down. So, you know, I don't think they need any more economic reason than that to not mess with this.
S1 (2:05:24) On the, you know, global scale out of chip production and sort of is there any way to make that non contradictory or non non feeding into the capability overhang. 1 interesting thing that has just come out in the last 24 hours, it might be, is this company Groq, g r o q, which apparently has people involved in this is so new that, you know, excuse the superficial analysis. We're doing speed premium analysis here. 1 of the inventors of the TPU at Google is involved with this company as I understand it, and they have put out now what they call the LPU, which is hardware that's optimized for I don't know if it's super specifically, like, transformer language models or whatever, but more optimized for the, you know, the workload that we actually have as opposed to GPUs, you know, kind of coming from a a different historical lineage. This was a more first principles approach for the the current class of models. Upshot of it is insanely fast inference is being offered via their API, like, 500 tokens a second on the Mixtrol model and for, like, 25 or 20 I think it was 27¢ per million tokens. So, you know, pretty insanely fast, pretty insanely cheap. That's not a small model. Not the biggest model, obviously, but it's, you know, it's it's not insignificant. And 1 thing that I did notice there, though, is they don't support training. It is an inference only product as of now. Is that a fundamental limitation? I still have to get a couple of their papers and throw them into Gemini 1.5 and, you know, make flex that context window before I'll have that all clear. You could start to squint at that and see some path to, like, massive inference build out. You know, is that a fork in the road? Do you think maybe we could figure out a way to scale inference so everybody has their AI doctor, but not necessarily scale training infrastructure such that you still have relatively few?
S0 (2:07:36) All I have seen is 1 2 second video of someone typing a prompt and then getting a very large response back that could have both very easily both could easily have been faked and also told me nothing about the quality of the output or the cost of that output. So I I'm operating off of this is completely new and I'm reacting at real time. If LPUs are now it's become a thing that can do inference, you know, 10 times or a 100 times faster relative to their ability to do training, that's great news, actually, right? In the sense that like, now we will be much more incentivized to do inference and not do training. But also if inference becomes much cheaper, then the incentive to create larger models becomes much stronger. Because right now, I get the sense that the actual barrier to creating larger models, to a large extent, is that they wouldn't be economical to serve. And so there's the danger that this ends up not actually being good. It could just be bad. So, I mean, I don't know. Like, a lot of things do come to pass. It's very hard to tell which side any given thing will land on. But it's obviously very exciting to, like, have, you know, vastly faster, better inference for cheaper. It's just, you know, we have to think carefully about it. I don't wanna speculate quite this fast until I know more duration.
S1 (2:08:58) You can't generate 500 tokens a second on this in the way that the Grok stack can with with Mixtrel. It is I mean, I've tried it very, you know, limitedly. The models they're serving are just generic open source models. So it's like they're either taking any, you know, responsibility for the performance other than just the pure speed. But it was damn fast. I can definitely testify to that. 1 other thing I wanted to just get your take on real quick is I don't know if you saw this tweet not too long ago about the YOLO runs at OpenAI. So the basic concept there was I think there's a couple of angles on this that I thought were interesting and I wanted to get your thoughts on. So what is it? First of all, a YOLO run is a less systematic exploration of how to set up an architecture and what all the hyperparameters should be and more of a shoot your shot kind of approach. Like, okay, you're an expert in language model development. Like, try something a little bit out of distribution, but with your best guesses about how something a bit different might work, and then let's give it a run and see how it actually works. The 2 big questions that come up for me there are, 1, obviously, from a safety standpoint, like, should we be worried about that? You know, that this is, like, less systematic and kind of more, you know, shots in the sort of dark, you know, almost battleship approach. This is like instead of optimizing around the 1 hit you already got, this is like shooting off in the ocean somewhere and hoping you get another big hit. And, you know, is that concerning? Certainly, some of the AI safety Twitter reaction suggested that this is very concerning. I didn't really know what I thought immediately. The other question is like, isn't this what we have scaling laws for? Like, why do we need YOLO runs if we have scaling laws? Like, can't we just try these things super small and it you know, isn't there supposed to be, like, loss curve extrapolation that would mean we wouldn't need something like YOLO runs? Because there's an implied scale to the YOLO run. Right? That if you could if you didn't need scale to get your answer, you would be able to do a hyperparameter sweep like you normally would. So
S0 (2:11:05) Right. The whole point the whole point of a so YOLO run is we want to put a lot of compute towards a big run. And normally, you do as you know, what you do is doing science, you change 1 thing at a time and you see where that's going. And here's that I'll just change 20 things, see what happens, and then figure out which 3 things I got wrong and fix them if I have to or just it on the fly. But I'll just sort of operate. And, like, the idea of I'm going to suddenly change a bunch of things that I think will cause this thing to be more capable, and then just run a giant training run having changed lots of things and seeing what happens, definitely does not sound like the way to keep your run safe. Right? So if what you're doing is you're trying to train a model that costs a lot of compute, but it's still, like, nothing like the best models, then a YOLO run mostly indicates that you're just better at this. Right? You you think you can handle these changes, and you can be more efficient at testing them. And so it's good. But, like, if you were YOLO running GPT-five, right, you were literally like, I'm gonna train the next generation model having changed 20 different things that I hadn't checked before on the previous level to confirm how they work. And that's scary as all hell because, obviously, if you think that's going to generate a lot of capabilities and do a lot of new things, it's gonna have a lot of strange new behaviors and and affordances and just ways it acts that you haven't seen before because you changed a bunch of things. And you don't wanna do that at the same time you are potentially introducing dangerously levels of intelligence. So it depends on how you present it, but, like, it certainly feel like the kind of thing that, like, culturally doesn't happen as much in places you'd like. And the way they were talking about it, it definitely made me more concerned. Right? Up slightly to have them talk like this, but I've done YOLO runs of, like, like, the gathering decks. Right? Where I'm like, no. No. I think it's just gonna work. And I'm, like, doing lots and lots of different things that no one's ever tried before. And I know how to, like, play a few games and then immediately understand, okay. These 3 things I think were a mistake, and then I could change that and so on. And, you know, mostly, it's just a question of if you are much, much better at intuition and in terms of diagnosing what you see and figuring out what cause caused by what, then you can do a better job. And And if you can do a better job, then you can move faster and break more things. And the key is to do that when that's a good idea, but not when it's a bad idea.
S1 (2:13:37) It is remarkable that this it's funny. Like, you have this sort of extreme secrecy on the 1 hand from an organization like OpenAI, and then you have, like, some person that I don't even think I had really known of before tweeting about yellow runs. It's like, this is a yeah. It's a very confusing situation.
S0 (2:13:56) I mean, wasn't 0 amount of Lee Ray Jenkins involved in this. Right? And we should all acknowledge that.
S1 (2:14:04) Anything else you want to talk about before we break? This has been very much in the Tyler Cowen spirit of the conversation I want to have. Any parts of the conversation you want to have that we didn't get to?
S0 (2:14:14) I definitely want to have conversations with you that I want to have.
S1 (2:14:18) Cool. Well, this has been great. Thank you for YOLO running a fast response episode of the Cognitive Revolution. And officially, Zvi Mowshowitz, thank you for being part of the Cognitive Revolution.
S0 (2:14:29) Absolutely. I'm glad to be 1.
S1 (2:14:31) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.