AI 2025 → 2026 Live Show | Part 1

Nine rapid conversations recapping AI in 2025 and prospects for 2026, featuring Zvi Mowshowitz on the OpenAI-Anthropic-Google race and risk outlook, plus guests on ARC-AGI benchmarks, AI companions, continual learning, Gemini 3 Flash, and AI for science.

AI 2025 → 2026 Live Show | Part 1

Watch Episode Here


Listen to Episode Here


Show Notes

This year-end live show features nine rapid-fire conversations to make sense of AI’s 2025 and what might define 2026. PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more about the MATS Summer 2026 Fellowship and submit your name to be notified when applications open: https://matsprogram.org/s26-tcr. Zvi Mowshowitz maps the OpenAI–Anthropic–Google race, the denialism gap, and why his p(doom) is still ~60–70%. Greg Kamradt (ARC Prize), Eugenia Kuyda, Ali Behrouz, Logan Kilpatrick, and Jungwon Byun cover sample-efficient benchmarks and ARC-AGI 3, companions and human-flourishing metrics, continual-learning memory, Gemini 3 Flash for developers, and AI for scientific decisions.

Sponsors:

Gemini 3 in Google AI Studio:

Gemini 3 in Google AI Studio lets you build fully functional apps from a simple description—no coding required. Start vibe coding your idea today at https://ai.studio/build

MATS:

MATS is a fully funded 12-week research program pairing rising talent with top mentors in AI alignment, interpretability, security, and governance. Apply for the next cohort at https://matsprogram.org/s26-tcr

Framer:

Framer is the all-in-one tool to design, iterate, and publish stunning websites with powerful AI features. Start creating for free and use code COGNITIVE to get one free month of Framer Pro at https://framer.com/design

Shopify:

Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) Sponsor: Gemini 3 in Google AI Studio

(00:31) Live show experiment

(02:26) Zvi: discourse and denial

(13:28) Continual learning and doom

(22:05) ArcAGI: what's missing (Part 1)

(22:09) Sponsors: MATS | Framer

(25:28) ArcAGI: what's missing (Part 2)

(31:58) Scaffolds and tiny models

(38:58) ArcAGI 3 game worlds

(45:13) AI companions landscape (Part 1)

(45:21) Sponsors: Shopify | Tasklet

(48:29) AI companions landscape (Part 2)

(58:14) Wabi apps and caution

(01:08:16) Nested learning, layered memory

(01:23:14) Gemini 3 Flash launch

(01:34:09) RAG, agents, dev advice

(01:42:20) Elicit speeds evidence synthesis

(01:57:57) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

Main Episode

speaker_1: The first ever live show with a year-end retrospective and 2026 look ahead. One of the big reasons I've been interested in doing this is obviously everything is getting noisier and noisier, right? I set out for myself this idea of being the AI scout and hopefully having no major blind spots in the AI landscape. That is becoming basically impossible with everything going vertical and AI touching everything. One thought that I've had and kind of a mantra I've tried to encourage others to adopt is shorten your timelines and do anything you can to make your activities denser in time, you know, more information dense. And so the big way I think about today's experiment is obviously there's no, you know, timeline shorter than live. And instead of doing an hour and a half or whatever with one guest and going comprehensively, sometimes described as a forced march through their worldview, we can try to do the 20-minute version, get as much of the alpha as we can in a really short time, and then give people nine of those over a three-hour time frame. And we'll see how it goes. I'm excited to find out.

speaker_2: One of the things that we were talking about as we put this together was also that, you know, sometimes a longer podcast is often a capstone to a body of work. And one of the things that struck me was that a lot of the people that we want to talk to are actually right in the middle of their life's work right now. And I kind of wanted to see whether we could get more frequent touches while things were happening, just kind of capture this moment. This is the end of the Anthropocene, you know, that transition point between, you know, two eras, which has been particularly fascinating for me. And I think our first guest is very sensitive to the fact that we are transitioning. Nathan, why don't you introduce Zvi?

speaker_1: Yeah, who better to comment on the possible end of the Anthropocene than Zvi Mowshowitz, who I think probably needs little introduction in this space, because he's a prolific blogger and analyst and really puts out canonical assessments, at a really remarkable pace, of all the new model launches and the strategic landscape. So Zvi Mowshowitz, welcome to our experimental year-end live show.

speaker_3: I love it. We're live.

speaker_1: We're live. So to kick us off, and I think you're the perfect person to kick us off, I guess for starters, we've got a discourse that is increasingly perplexing to me in terms of how fragmented it is. On the one hand, we have this, I think, increasingly credible idea that maybe we are at the end of the human era, or the beginning of the end of it. And at the same time, we've got these posts still going viral, like AGI is impossible because computation is a physical process or something. Still plenty of denialism. For starters, how would you assess the vast gulf between people who I would say seem to get it and those who seem not to, although obviously that characterization would be disputed?

speaker_3: It is very hard to make a man understand something when his salary depends on not understanding it, and misinformation is demand-driven, not supply-driven, right? Like, basically, these people, for their own cognitive peace of mind, for their own narratives, for their own business plans, for their own everything, need to believe that AI is normal technology, need to believe often that AI will never do anything else. They just want it to go away. They really, really badly want it to go away. Lots and lots of people want this. And sometimes it's for business reasons, sometimes it's for disingenuous reasons, sometimes it's just to feel better. So what do they do? They just grasp onto whatever lets them tell that story and they will repeat it endlessly. Every so often, someone will generate some study, some paper, some post, some hypothetical, some whatever, and they will latch onto it. You see the same thing in the AI existential risk debates, right? The same thing in the safety debates, where you'll see the same exact arguments that we were debating back in 2006, often about a completely different architecture, just trotted back out. 101-level errors get made over and over again, prominently. Like this week, Noah Smith put out, oh, don't worry about AI, it'll just have a utility function which will self-modify to wirehead. And then Tyler Cowen repeats this as if nobody has done any of the normal experiments or thought about humans or philosophy or any of the underlying logic of what's actually going on. We're still here, right? These people will just keep repeating the same things, because that's the level of discourse that people who aren't really in the weeds, who aren't trying to understand things, can tolerate. It's unfortunate, but the default for most people is to think, oh, AI can do what it can currently do that I know it can do, which is much less than what it can actually currently do, which is then much less than what it can do with scaffolding and other inevitable things and people learning how to use it, which automatically, 100%, is going to happen regardless of what anybody does, which in turn is much less than what's going to happen when they keep releasing these new models, when they keep advancing the frontier. And sometimes, when they see an advance, they see a new thing happen and they adjust, and sometimes they don't. But the latest big thing was GPT-5, right? I joked on Twitter, but only half jokingly, that because OpenAI can't get version numbers correct, and so they called 425 and 4351 instead of calling 525, now we're selling H200s to China, because the entire White House has become convinced that AGI isn't coming anytime soon because the even-numbered release was disappointing.

speaker_2: I think one of the things that strikes me is: when does it stop being normal technology? What do you need to see in order to say it? Because at some point, I mean, I think it's easier to set a stake in the ground now, and then we can see ourselves go past it. Because if not, what ends up happening is you end up in this gradual descent where you don't notice that you're passing the threshold. So what do you think is the marker that marks this normal-technology to non-normal-technology threshold?

speaker_3: So to me, the non-normal technology is either recursive self-improvement, where the AI substantially advances research and work into further AI in a way that changes the curve. Normally you'd have an S-curve; normally you'd have it, you know, slow down unless you devoted orders of magnitude more resources. If that changes, then that would count. Or alternatively, if it starts to create mass unemployment or mass displacement, the situation where, like, you take my job and normally I would just move on to another job that gets opened up because we're wealthier and we have more technologies and we have more options, but the AI just immediately takes that job too because the AI can do pretty much everything. Then that also to me is not a normal technology anymore, in a different but also very important sense. But I'd also say that we're kind of pretty much there. Dean Ball made the assessment, I think it was yesterday, that Claude Code plus Claude Opus 4.5 is AGI because of just the quality of computer use and coding that it can do. And that's not what we traditionally mean by AGI, but the coding multiplier on the top AI people in the world is reported to be on the order of two to three times already. And for me, as an amateur programmer who mostly doesn't do that, it's more on the order of 10 to 100 times. It goes from you can't do these things at all to these things are worth doing casually, just by asking, because now you can. So, you know, Twitter decided to nuke the ability on Twitter Pro to actually have a following list, to follow the people who you follow and then see a chronological feed. It's like, screw the feed, it's gone. And what did I do? I didn't fret. I went to Claude, transferred all my followings to a list 15 minutes later, problem solved.

speaker_2: So I think there are like three concepts kind of interrelated there. One was the idea of technological unemployment. One was the recursive self-improvement. And the other was really uplift, uplift of individuals. And I kind of see the uplift of individuals causing some of the technological unemployment. It's really not the AI which is causing unemployment, but it's really the senior partner at the law firm who doesn't want to hire junior partners anymore because he doesn't need them, because he's using AI, because he's been uplifted. So in that case, does that just mean if you wait long enough, people reallocate to new jobs?

speaker_3: Right, so the question is, right: is this a one-time productivity shock? Is it a one-time effect? In which case, we will reallocate, everything will be fine, it will remain a normal technology. Or will the uplift keep accelerating? Will you get more and more uplift, and will this happen faster than people can adjust? And we will just actually have use for less and less people. And also, does uplift transition into replacement, right? Does augmentation become automation? And I think that by default, absolutely it does in more and more places. And what you see is that augmentation precedes automation in many of these places, right? You start to see, okay, I can do this by coding as a human and it's a lot of hard work. Okay, I can do this by coding, or, like, the law firm can produce the report, but I have to guide the AI every step of the way. But some of these steps I can have the AI do and then check its work, though I still have to be bespoke about it. I have to understand what I'm looking at. I have to check it. And then slowly but surely you start checking less and less of it. You start automating more and more of it. And then at some point you realize, oh, I can just press a button, and it does an hour of work, and then it becomes two hours of work. This is the famous METR graph of how long it can code. And then it becomes a day, and then it becomes a week, and then it becomes, oh, my entire job. I don't need an employee at all. And that progresses. And again, if this stuff stalls out, you're dealing with a normal technology, right? If you have an S-curve, no fixed multiplier within reason on people's productivity is going to be a transformative, non-normal technology. But if it keeps going, it will be. So that's what we still have to watch out for. But the primary reason why I raised the third segment of augmentation and uplift that you pulled out is because it directly causes the other two things, right, if it keeps going.

speaker_2: What would you need to see within the next year to tell you that this is actually happening?

speaker_3: So my instinctive answer is nothing. I'm already convinced. Just to be straight frank about it. But in order to feel like I had a more convincing case, I would say you would start to see more ability to self-correct. You would start to see advances, in particular in computer use, in ways where you didn't feel like the thing was going to fall over flat at some crucial individual motion. But it's all very continuous, right? There's no specific point at which, in our experience, things go crazy, until they suddenly do. If you have a certain specific point where you're like, oh, okay, now we're in the takeoff, now we're doing it, it'll be pretty hard to mistake, but it'll also be way too late, right? And then we're trying to figure out what before that indicates it's going to happen. And I think it's kind of like, you know, people often talk about bottlenecks, right? You've got all these different bottlenecks that prevent you from going crazy because, you know, you move with the speed of the slowest ship in this production process. And then you see them get removed one by one, which provides some acceleration. And when the last few go, you start to see massive speedups. And so you start to see substantial complex real-world tasks, especially things that aren't just coding, where the AIs are increasingly able to automate them really well and able to do large periods of time's worth of work from people in bespoke environments. And that includes being able to have the AIs build the tools for you to do that. I think that one of the things that skeptics are very much missing is this idea that, no, it doesn't have to be able to do this with a command out of the box. If a human understands the task and can describe the task, then it can immediately create an app and a tool. And in fact, with Claude Code, you can first have it vibe code the app, and then have it build a tool for itself to be able to use the app. And in that case, you have had to give the AI three commands. But now, with potentially just a couple of sentences of text written into Claude Code, plus time, Claude can use the thing.

speaker_1: How much emphasis do you place on continual learning? It seems like very long context and context stuffing still leaves people with a sense that something is missing. If you imagine the drop-in knowledge worker of the potentially not-too-distant future, a big part of what people, I think, intuitively imagine there is that this thing sort of absorbs a bunch of information up front and kind of gets the vibe of the place and how we do things around here. And then it can really sort of slot in, not as this like kind of generic AI that's good at everything, but is always a little bit out of the loop, but is really plugged in the way that humans are. Does that feel to you like a big piece, or do you think people are overemphasizing that?

speaker_3: So I have several times been covering podcasts by Dwarkesh Patel, at which point he emphasizes continual learning. He's the continual-learning-is-necessary champion of the world right now. And every time I start, my hands reach out to start typing: here's how I do it without continual learning. Here's the very simple set of things you do to effectively allow the AI to continuously learn, without technically having continual learning. So for example, we just heard that with Claude Code, you know, you tell it, here's a thing that we want to be able to do on a regular basis, and here's how that thing works. Oh, okay, let me build an app. Let me build code that includes all the knowledge that you have told me about how this thing works, implicitly in its logic, also with comments and notes, and then give me a tool to trigger this thing. And now I can do this thing the way you want it done. I have continuously learned to do the thing. And now I can refer to the documentation. I can refer to the comments. I can figure this stuff out. I can pull it up on demand. How is this so different? And the joke always is, okay, that'll be $100 million a year, please, and then I'll solve your problem for you. But the real answer is it's all very obvious to me that this is just a skill issue. Nobody has tried seriously to get around this. And if you want your AI to be able to learn new skills, to be able to develop new tools and abilities continuously over time that are adapted to your situation, I think at Opus 4.5 we're at the point where this is kind of a prompt engineering skill issue. This is very straightforwardly something that you can do without actually requiring anybody to build a generalizable new tool. And within the year, I would be shocked if, you know, we didn't have models that make this pretty easy. Of course Claude 5, Opus 6, is going to have no trouble with this, right? GPT-6 is not going to have this issue. This is silly.
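[To make the pattern Zvi is describing concrete, here is a minimal, hypothetical sketch, not his actual workflow: the "learning" is persisted as documented, on-disk artifacts the model can list, re-read, and re-run on demand, rather than as weight updates. The directory layout and names are invented for illustration.]

```python
# Hypothetical sketch (not Zvi's actual setup): treat "continual learning" as writing
# each newly taught procedure to disk as a documented, re-runnable artifact that the
# model can rediscover later, instead of updating model weights.
import json
from pathlib import Path

SKILLS_DIR = Path("skills")  # assumed layout: one folder per learned skill

def save_skill(name: str, description: str, steps: list[str]) -> Path:
    """Persist what the user just taught us as plain text plus a small manifest."""
    skill_dir = SKILLS_DIR / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    readme = [f"# {name}", "", description, "", "## Steps"]
    readme += [f"{i + 1}. {step}" for i, step in enumerate(steps)]
    (skill_dir / "README.md").write_text("\n".join(readme))
    (skill_dir / "manifest.json").write_text(
        json.dumps({"name": name, "description": description}, indent=2)
    )
    return skill_dir

def list_skills() -> list[str]:
    """What the agent 'remembers' is simply whatever is on disk."""
    if not SKILLS_DIR.exists():
        return []
    return sorted(p.name for p in SKILLS_DIR.iterdir() if p.is_dir())

if __name__ == "__main__":
    save_skill(
        "weekly-report",
        "Summarize last week's tickets the way this team prefers.",
        ["Export tickets to CSV", "Group by owner", "Write a five-bullet summary"],
    )
    print(list_skills())  # -> ['weekly-report']
```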

speaker_1: So we're going to dig into that from a couple different angles coming up, but being mindful of time and hopefully staying roughly on schedule, I want to change gears just for the last five minutes here and kind of get your zoomed-out sense of where we are in the big picture: what's your latest p(doom), maybe a quick breakdown of what the sources of p(doom) are, and then I'd love a brief live players roundup, you know, and that starts with just who are the live players in your mind, and maybe a little commentary on the most important ones.

speaker_3: Yeah, so I'll start with the live players because it's a bounded question. So basically, there are three labs that I think matter far more than everybody else: OpenAI, Anthropic, and Google DeepMind. It historically was, to me, roughly in that order. I'm starting to wonder about that. Anthropic has been continuously impressing me. Anthropic is now at the top of Arena, which it historically had never been able to do, clearly without trying to in some sense. I think that's more important than people realize. But most importantly, it's the best coding model, the best coding environment by a lot of people's reports. It's the model I want to use day in and day out. And I have GPT-5.2, and it just is not tempting me. And so these players are all racing for different forms of the thing. OpenAI is trying to be a sort of consumer-facing company. They say they're going to pivot back to business, and they're still racing for superintelligence. But, you know, you can see where the hires and the culture are going. And Anthropic is going straight for coding and ASI. Google is trying to do everything at once because different managers are trying to do different things and fighting with each other. It's a giant mess, but they have overwhelming advantages and resources, so never count them out. Also, they seem very far behind in the alignment race, in the sense that Gemini 3 is very dangerously misaligned in various ways if it were actually scaled up or given any real responsibility or power. Whereas Opus 4.5 and the soul document, I can't really go into it here, but they kind of show us the way of how you can do much better at current levels than we've done in the past. And that is part of what gives me hope. So Gemini 3 substantially increased my p(doom). Opus 4.5 significantly decreased my p(doom). If you're not making continuous updates, even if you don't verbalize them specifically, I think you're making a mistake. Things happen all the time. So in terms of the bigger picture, essentially on policy, we're basically playing defense. We have forces that are winning that are trying to actively stop any attempt to do anything, rather than anybody attempting to do anything that would be particularly useful. We're trying to hold together what little we have on the federal government front and so on. We do have a lot of people who care about that, including in government, including in Congress, including in the bureaucracies. And I think we're doing a decent job of fighting defense, but we're pretty much, I think, in defensive mode until David Sacks is out of the White House. And I'm not sure what else we can hope for. I think he's the primary effective villain in this story. And then the states are doing some good. I think SB 53 and the RAISE Act plausibly advance our situation there. But the real action is at the labs. I think we're in a world in which we are determined to be maximally without dignity, to try and lose relatively easy scenarios and still have things go haywire. But at the same time, I do think we've seen evidence that the technical situation is more hopeful than I would have thought, more often. And the difficulty level is likely somewhat lower than I thought it was, especially recently, where the soul document and several other Anthropic research papers have shown that, at least in terms of the short and medium term of alignment, there seem to be basins you can aim at.
There seem to be ways to get the models to effectively want to help you with these things, and this might actually be commercially highly viable, because it leads to a model that is better, that is more pleasant to use, that people want to use, that is more effective, that is actually better at giving you the things that you want. And so we can do the kind of race to the top that they've always wanted to do, that the nonprofits have always wanted to do, and force us in that direction. So I'm very hopeful on those fronts. But overall, I'd say we're still in a very terrible situation, because, as you ask, what's the breakdown of your p(doom)? Which of these various different ways things could go wrong do you think is most likely? And I think that's sort of the right perspective to think about it: we don't have to dodge one particular thing. I think a lot of people who say p(doom) is 0.1% or whatever are thinking of it as, unless this specific narrow scenario happens, we're fine; that specific narrow scenario is unlikely, therefore we're fine. Whereas I'm thinking of it as everything is trying to kill you, in some abstract sense, not deliberately trying to kill you, but the dynamics of all the things are trying to lead to unsustainable situations that humans don't survive in, or lead to very bad ends. And we have to navigate a lot of impossible-difficulty-level problems in order to get around it. Fundamentally, we have to align the models well enough in the medium term to then align the models in the longer term. And we have to solve for disempowerment along the way. And we have to solve for potential other concerns, including the reverse of all of these concerns. And it gets very complicated, but breaking it down, my overall p(doom) is still in the 60 to 70% range. And I would say that the bulk of that operates through cognitive disempowerment style scenarios, because I think that just sort of automatically happens. But, you know, in that sense, they're all just tangled up, right? You get cognitive disempowerment when we have fundamentally had an alignment failure. Also in the sense that if cognitive disempowerment is about to happen anyway, naturally, if we're going to hand the AI power. Like, people talk about, you know, is a rogue AI going to suddenly take over or something? If you're going to be handed power anyway, why go rogue? In some important sense, there's no real motivation to do that.

speaker_2: We detect both hope and hopelessness there. Thank you, thank you, Zvi, and we'll, you know, we hope to have you back sometime soon if we do this again.

speaker_3: Yeah, I enjoyed it. I think it was good. Let's do it again. All right.

speaker_2: Cheers.

speaker_1: Thank you, Zvi.

speaker_3: All right.

speaker_1: Welcome, Greg.

speaker_4: Hello. Thank you for having me.

speaker_1: Yeah, excited to have this conversation. So Greg leads the ARC Prize. And this is, you know, again, for anybody who's obsessed with AI enough to tune into this, I think needs little introduction. It's been a big year. I want to just kind of get into it from a few different angles. But maybe for starters, because I think one of the meta themes right now is that AI can do so many incredible things, right? We're seeing open math problems solved and meaningful contributions to the advance of science. And my son has cancer, and I've been using it just nonstop, in triplicate, in the hospital room to double-check the doctors' work. And it's been amazing. And they are going step for step with the doctors. And so there are all these amazing, amazing accomplishments. But then there's still this sense that something is missing. And I think you're really focused on figuring out what exactly that is and what can be done to patch those gaps. For starters, could you maybe give us a little bit of a sense for, right now, as we sit here close to the end of 2025, what are the sorts of things that are still easy for humans but hard for AIs? Which is really, as I think about it, the animating idea of the ARC-AGI benchmark and the prize.

speaker_4: Yeah, absolutely. Well, first of all, thank you very much for having me on here today. ARC-AGI started with Francois Chollet's first benchmark in 2019. Excuse me. And he had a strong opinion about the definition of intelligence. And this is going to start to answer your question here, because that definition is a dividing line that shows us clearly the types of tasks that are easy for humans and hard for AI today. And that definition of intelligence was a system's, meaning a human's or an artificial system's, ability to learn new things. And so the types of tasks that we're seeing today which are very easy for humans but hard for AI require learning. And I know you brought up continual learning earlier. I'm sure we'll get into more of that. But our benchmark is almost like a meta benchmark, in which we teach you something new at question time, and then we see: did you learn that new thing that we just taught you? And so what we find is that humans are extremely good at this, and especially at being sample efficient with their learning. So they only need two or three examples for something. Whereas AI, in general, can learn any one scoped domain given enough data. But that's not the efficiency that we're looking for here. So humans are extremely sample efficient. And so if we come up with problems that humans can do and AI cannot, we can then assert that AI cannot learn as efficiently as humans, and therefore we don't yet have human-level intelligence in our AI right now.
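[For readers who haven't seen the benchmark, here is a simplified, toy illustration of the ARC-style task format: a handful of demonstration pairs plus a test input, with the solver expected to infer the rule from just those few examples. The grids and the rule below are invented for illustration, not taken from a real ARC task.]

```python
# Toy illustration of an ARC-style task: a few "train" demonstration pairs and a
# "test" input. The solver must infer the transformation from 2-3 examples.
# These grids and this rule are made up; real ARC tasks are far more varied.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]}  # expected output: [[0, 3], [3, 0]]
    ],
}

def solve(grid):
    """A hand-written solver for this toy rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the inferred rule against the demonstrations, then apply it to the test input.
for pair in toy_task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(toy_task["test"][0]["input"]))  # -> [[0, 3], [3, 0]]
```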

speaker_1: So part of me wants to preserve that advantage for as long as we can, but clearly the gap is shrinking. Yeah, this is a great chart.

speaker_2: Yeah, this is the money chart. This is the 390-times improvement in cost efficiency for ARC-AGI 1 over the course of one year. Can you talk a little bit about this one?

speaker_4: Yes, absolutely. So this was December 2024. We get an e-mail from OpenAI that says, hey, ARC Prize, we have a new model we want to test. And at that point, 12 months ago, the highest score on ARC-AGI 1, which is what we're looking at here, was (I need to go back and check) in the 20 to 40% range. And then OpenAI says, hey, we have a model, and we're claiming that it scores 87% on ARC-AGI. So this is almost double the performance. We had never seen anything like this beforehand. And so we do what we do for all lab scores, and we did a verification. And that verification says, hey, you claim this on the public tasks that you have, but there's a risk of overfitting and there's a risk of leaking the answers to the models. So we have a holdout set of tasks, and if the score there corroborates what you have on the public set, then yes, this is a verified score. And so we did that. And what we noticed is that there was an absurd amount of tokens being used for that score performance that they had. And we asked, hey, how should we price these tokens? Because it was an unreleased model. And given the price that they had recommended to us, it came out to something on the order of magnitude of $1,000 per task, or something along those lines. And so yes, the 87% was legit. It was an amazing score. There was clearly something impressive going on with the model. It was very expensive for the tokens. Now, fast forward to about a week ago, we announced the results on GPT-5.2. And with that model, not only did it get comparable percentages, in fact it beat it, it got 90%, but, like we said, it was about 390 times cheaper per token than what we had just done last year, 12 months ago. So there's a lot of things going on here. The models are getting better. They're getting more efficient to serve. And we see that in the same performance being 390 times cheaper here.
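[To make the cost comparison concrete, a quick back-of-the-envelope sketch using the approximate figures mentioned above: roughly $1,000 per task in December 2024, and about 390 times cheaper a year later. These are illustrative numbers from the conversation, not official ARC Prize data.]

```python
# Back-of-the-envelope sketch of the cost-efficiency comparison described above.
# Both inputs are approximate figures from the conversation, not official data.
cost_per_task_dec_2024 = 1000.0   # "on the order of magnitude of $1,000 per task"
improvement_factor = 390          # "about 390 times cheaper" one year later

cost_per_task_dec_2025 = cost_per_task_dec_2024 / improvement_factor
print(f"Implied cost per task a year later: ~${cost_per_task_dec_2025:.2f}")  # ~$2.56
```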

speaker_1: Yeah, that's incredible. Maybe we'll circle back to the large language models at the end. I think one of the things that has been really cool to watch about the ARC Prize, and you've got now ARC-AGI 2 and ARC-AGI 3 as well. And by the way, those scores that we just referenced are above the human performance, right? That human performance is what, like 80, 85?

speaker_4: So what we do, so every single ARC task is solvable by humans. There's a million ways you could actually slice and dice this and cut this, but the main message across all these tasks is that 100% of the tasks have been solved by humans. So the 80, 85%, I mean, like I said, you can slice and dice this in many ways, but each one of these tasks is human doable. And that's what's interesting about ARC Prize and a lot of the benchmarks that we have. You may be talking to other benchmark teams, you may see other ones out there, and they take a different approach of PhD-plus problems. So harder and harder problems that are out of reach for common folks, right? And what we see, and you brought up this point earlier, is that there is superhuman performance on those benchmarks, and yet we don't have the economic transformation that one might expect from having this thing that can do PhD-plus-plus problems. And so, as I said in the beginning, ARC Prize is obsessed with problems that humans can do and AI cannot. And so we actually, for each one of our benchmarks, we go through and we test a panel of humans on each one of the tasks. And I'm happy to talk about what we're doing for V3 here, because we are going to insane lengths to go and do that. But every single one of the tasks is doable by normal folks.

speaker_1: There have been a lot of interesting approaches with scaffolding. There have been a lot of interesting approaches with smaller models, with various test time fine tuning and other strategies there. What has stood out to you the most over the last year in terms of things that didn't necessarily require huge hyperscaler resources to do that moved the needle and kind of brought new insights to the broader community?

speaker_4: Yeah, absolutely. So we see three categories of things that we test. Number one is going to be frontier models. Number two is going to be, we'll call it, novel approaches not built on top of frontier models. And then number three is refinements. So going to number three, just very briefly, that's our refinement approaches. And so we've had a few public submissions this year from Poetiq, from Jeremy Berman, from Eric Peng, who built, you could call it a harness, but that's really doing a disservice to what they're doing there. But they're building on top of frontier models and doing amazing search, parallel calls, you know, just going very, very deep into squeezing out more performance from these individual models. However, some of the most interesting performance that we get is actually on the one that you were just highlighting there. That's an example. It's called the Tiny Recursion Model. What we saw in the beginning of the year was this model came out and it was called the Hierarchical Reasoning Model. What we saw here is that they were taking a refinement approach, but with an extremely small model. And what's really interesting is that the TRM, which was built on top of HRM afterwards, those were incredibly different submissions from what we usually see. And in the papers they were used for, they demonstrated the performance on three different data sets. Number one was Sudoku. Number two, I believe, was some sort of elementary maze. And then number three was ARC-AGI. So we were very excited to have them use ARC-AGI as the way that they wanted to communicate the performance. And I would go as far as to say that if they hadn't used ARC-AGI to communicate the performance, it wouldn't have had nearly the same impact as just using Sudoku and a maze. So we're seeing awesome performance from small models through the recursion method.

speaker_2: So ARC-AGI is supposed to measure intelligence of a kind that's somewhat closer to the human conception of intelligence. Yet we have models which perform on their own, and models which perform better with scaffolding. So does the intelligence live in the models or in the scaffolding? Or is it the combination of the two? You know, how does that work when you have the scaffolding adding so much in terms of intelligence?

speaker_4: Yeah, you bring up a great point. And one of my favorite places to cross-reference this was on a Latent Space pod with Noam. They asked him, do you think AGI will have a scaffold or will it not? He was under the impression that no, AGI will not have a scaffold. I actually sit in a different camp: I think there's probably going to be a scaffold around it. And I think this is one of those words where you can easily misalign what you mean by definitions. Like, what is a scaffold? For example, when GPT-5.2 Pro throws 10 different parallel tool reasoning chains and then combines them at the end for the best answer, is that a scaffold? Is it not a scaffold just because it's on the model side? Does the developer need to do it? So I think you can argue about definitions here, but it is my opinion that our best baseline for what AGI will look like is going to be modeled after our only proof point of general intelligence, which is the human brain. Does the human brain have scaffolding? It has a bunch of different neurons connecting different parts of the brain. It has different sub-pieces of the brain that specialize in different types of processes. So there's a scaffold in the brain. I find it very hard to believe that future AGI will not have a scaffold in and of itself. So when I see scaffolds come around, I don't discount them at all. I think that they are key pieces of what we'll eventually see.
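[As a purely illustrative example of the kind of scaffold being debated here, a minimal parallel-sampling-plus-vote harness might look like the sketch below. `ask_model` is a hypothetical stand-in for whatever model call you actually use, and nothing here is claimed to be how GPT-5.2 Pro works internally.]

```python
# Minimal sketch of a scaffold: fan out several independent attempts and pick the
# most common answer (self-consistency voting). Not any specific lab's method.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask_model(prompt: str, seed: int) -> str:
    """Placeholder for a real model call (hypothetical); here it just fakes
    slightly noisy candidates so the voting logic below can be demonstrated."""
    return "42" if seed % 3 else "41"  # replace with your actual model/API call

def scaffolded_answer(prompt: str, n_chains: int = 10) -> str:
    """Run n_chains attempts in parallel and return the majority-vote answer."""
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        candidates = list(pool.map(lambda i: ask_model(prompt, seed=i), range(n_chains)))
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

if __name__ == "__main__":
    print(scaffolded_answer("What is 6 * 7?"))  # -> "42" with the fake model above
```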

speaker_1: There's also the one model that had 7 million parameters and no data at all. That one stood out to me in the research as a real sort of left-field curveball. And I kind of wonder, I'd be interested in your general commentary and insight into that and what it means. But my mind always goes to hybrid forms. I think these contests and benchmarks and papers, pure plays all over the place, are always interesting in sort of highlighting something new. But I always try to keep in mind that it's probably not going to be one extreme or the other that we end up really engaging with as AGI, at some perhaps not-too-distant point in the future. So I guess I'm interested in any commentary you have on that sort of no-data paradigm that did actually kind of work. And then also, how do you think that small stuff maybe folds into or plays nice with the big large language models? I feel like that's kind of part of what I do, right? I allocate a certain amount of my mental bandwidth to a particular problem. If you imagine something that solves all the ARC challenges, what's your best guess right now as to what that looks like?

speaker_4: Sure. So we have two deep beliefs as to how the technological progress is going to play out. Number one is that rarely does something come out that isn't built on top of history. And so much of what we see today is all built on the shoulders of giants. And so it's my belief that, with the inertia that we see with large language models, those are going to be sticking around for quite a long time. There's too much of the industry and too much value that's coming around there. However, that's not the full story that we need. And so when we see something like ARC-AGI Without Pretraining, which is the reference that you were talking about, that's actually the third-place paper prize winner from the competition this year. When we see small novel techniques come around like that, I see those as seeds that will then plant their way into future training of something larger, something that has a bit more inertia, like the LLM movement that we're seeing here. Now, number two: this is exactly why we do our paper prize for the competition, to inspire novel ideas. So for those who aren't familiar, ARC Prize has a competition that runs annually, right? And the competition is a tool to elicit open research. So we award prize money, but you only get prize money if you open source your research. And this is a way that it can benefit all, ARC Prize being a nonprofit here. And we have two different tracks. Number one is the top score, and number two is the paper prize. And what we see with top score is that it's very easy to hill climb on. So eke out 1 or 2%, and you may not need to go after novel ideas in order to do that. However, for the paper prize, we actually upped the amount of the award last year from 50,000 to 75,000, and we see very novel approaches come through there. And so, like you're talking about, the TRM model, that was actually our first-place paper prize. We're very excited that it's getting the reception that it has. And then also ARC-AGI Without Pretraining, that was also included with our paper prize. So I think that these are wonderful seeds. They're novel. We still need new ideas in order to make progress towards AGI. And these are examples of what those new ideas could actually look like.

speaker_2: Let's talk about the coming year, right? So you have ARC-AGI 3 coming out in March 2026, I think is the release date.

speaker_4: That's right.

speaker_2: And ARC-AGI 3 is structured more as kind of agentic game playing. Would that be correct?

speaker_4: That is fair. We call it games because that's a colloquial term that's easy to communicate and people immediately get it. Think of these as environments. And the reason why we're moving to environment-based benchmarks is that our reality is an environment, right? And so it's my belief that future AGI will be declared within an environment benchmark. It's not going to be with the static benchmarks that we see here. So to put that a bit more explicitly, think about an SAT test versus a behind-the-wheel driving test. They're completely different. Actually, with that metaphor, I should really say, there's a reason why the DMV does a written test and then a behind-the-wheel test: to see how you're actually doing within the driving environment itself. So yes, we're moving towards video games. It's going to be about 150 novel video games that we are making ourselves. We've actually spun up a mini game studio to build this ourselves. It's absolutely insane what we're doing with it. And much like ARC-AGI 1 and 2, every V3 game will be solvable by humans. So we actually have a panel of 10 people that we've recruited with no specialized previous expertise. So real estate agents, accountants, Uber drivers, those types of folks. And if a game does not meet a minimum solvability threshold, we're not going to include it. Simple as that.

speaker_2: What do you think the score on ARC-AGI 3 will be at the end of 2026? Give us your kind of, you know, modal estimate and maybe the error bars.

speaker_4: Right now, and keep in mind this is just early testing, we're seeing sub-1% across frontier models. But there's an explicit reason why that's the case. So for normal benchmarks, when we score them, you just give an accuracy percentage: how many questions did you get right out of how many? Okay. For ARC-AGI 3, we could do that. So, let's just use a round number: there are 100 games. It could be what percentage of the games did you complete. You're not going to complete a game for a long time, so we're not actually going to do that. You could do, hey, each game has, let's just use a round number, eight levels. It could be what percentage of levels did you complete. And that doesn't quite tell the full story yet either. The reason being that these are video games. They're actually turn-based video games, so you submit an action and you get a response back. Our human testing actually does two things for us. One, it makes sure that they're solvable by humans. But two, and this is a very important part, it gives us a baseline for how many actions it takes a human to solve each game. Now, what's very interesting about that is, because humans are our only proof point of general intelligence, we now have a proof point for how quickly general intelligence can solve these games. And when we measure AI on these games, what we're going to do is normalize the scores to human performance. So the thing that beats ARC-AGI 3 will not only have completed every level and every game, but it will have done so matching or surpassing human-level action efficiency at these games. Just my very last point on why that's so important: if you think about brute-force methods back in the Atari days, you know, 2015, '16, '17, '18, they needed millions of frames, millions of runs to go and beat these games. We are testing humans on the first time they've ever seen these games, and we're going to test AI on the first time it has ever seen these games. What we claim is, the thing that beats ARC-AGI 3, will it be AGI? No, we don't claim that it will be AGI. However, I do claim and I do assert that the thing that beats ARC-AGI 3 will have demonstrated the most authoritative evidence of generalization that we've seen to date.
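[The exact ARC-AGI 3 scoring formula isn't specified in this conversation, so the sketch below is only a hedged illustration of the idea described here: credit for completed levels, scaled by how the AI's action count compares to the human baseline. The function names and the cap at 1.0 are assumptions for illustration.]

```python
# Hedged sketch of human-normalized action-efficiency scoring (illustrative only,
# not the actual ARC-AGI 3 formula).
def level_score(ai_actions: int, human_actions: int, completed: bool) -> float:
    """Credit a completed level, scaled by how close the AI's action count is to
    the human baseline (capped at 1.0 if the AI is at least as efficient)."""
    if not completed:
        return 0.0
    return min(1.0, human_actions / ai_actions)

def game_score(levels: list[tuple[int, int, bool]]) -> float:
    """Average the per-level scores; each level is (ai_actions, human_actions, completed)."""
    return sum(level_score(a, h, c) for a, h, c in levels) / len(levels)

# Example: the AI finishes 3 of 4 levels, usually taking more actions than the human baseline.
print(game_score([(120, 80, True), (60, 60, True), (500, 90, True), (200, 70, False)]))
```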

speaker_1: I always test new models on a variety of tasks. One is, can you transcribe a messy document from the DMV, or for that matter from a lab, that was faxed over to the hospital, printed out by the doctor, handed to me, then photographed on my iPhone? You know, can an AI see that well enough to make sense of it? That is pretty weak, actually, still today. There have been some interesting perception-centric approaches to at least some of the early ARC-AGI challenges, not the latest ones. How big of a role do you see perception playing?

speaker_4: There are a few data points that make me confused about that perception argument. Number one, they're amazing at computer use and they can click on anything on the screen. Number two, we just had a model that scored 90% on ARC-AGI 1. And keep in mind that that's without using multimodal; that's not using a picture, that's using JSON. And then the other thing that's quite confusing too is people give ARC-AGI a hard time about perception and the visual aspect, and yet we have Claude Code that is amazing at coding, where just having a variable one character over, or one line down, completely changes the intent of the program, completely changes what's happening. So it's quite interesting to hear those arguments that it's a visual exercise and that models aren't good at visual, yet there are all these demonstrated examples of them being superhuman at what they do. So when it comes to that, I think that we don't consider ARC-AGI a visual benchmark. All the scores that we report are JSON-based and matrix-based; they're not visual. And lastly, we're agnostic as to whether or not the model wants to use visual. We allow Kaggle competitors throughout the year to do whichever type of submission they want. Many do visual, and they don't see a notable improvement.

speaker_1: Fascinating. Love it. Well, Greg, thank you for taking a little time out to join us. And congratulations on creating one of the more enduring challenges in the AI space. You guys have everybody watching for it, right up there with METR, as a couple of the key indicators. Every new launch, people want to know what the ARC-AGI score is. So that's definitely a very important contribution to the field. We appreciate it and look forward to an update again before too long. Awesome.

speaker_2: Thank you very much. Cheers. And next up we have Eugenia Kuyda. She was the founder of, well, she's still the founder of Replika, but she's no longer the CEO. She now has a new startup called Wabi, which is the first personal software platform. They call it a YouTube for apps, for vibe-coded apps. And let me add her to the stage.

speaker_1: Hello, Eugenia. Can you hear us?

speaker_5: Nathan, so great to see you again.

speaker_1: So you have such an interesting perspective on the consumer trends in AI, I think from at least two really unique vantage points. Let's start with your history and then we'll move to your present venture. The rise of AI companions, friends, boyfriends, girlfriends, is something that I think is taking a lot of society kind of by surprise. I saw a stat in preparing for this, I don't know how real this is, but at least somebody out there is reporting that three-quarters of teens have used an AI companion at least some. And I saw one stat that more than half are using them at least somewhat regularly. Replika, I don't know if you have or can share better numbers, but the numbers I've seen are tens of millions of users. Character AI has tens of millions of users. The stats on engagement there are insane. I've seen numbers as high as 90 minutes per day that users are spending on a platform like Character AI. So it's a big trend, and yet I think it's one that a lot of people like me are blind to, because I've dabbled at most, and most people haven't even done that. What are your reflections on the state of AI companions, friends, boyfriends, girlfriends today?

speaker_5: I think there are two big kind of groups of products. There's one that's all around building fan fiction characters. So that's the Character AI route. Really, think of it more as interactive fan fiction. And if you think about it, fanfiction.com has, I think, 15 million active users. People just go to the website and, like, write stories. Of course, something like Character AI, where the stories are interactive, is so much more fun. But these products are not necessarily about talking to a character. These products are about creating stories around characters people like, teenagers like. It's usually for pretty young teens, kids. Think people that are fixated on a specific anime story or video game, maybe it's Genshin Impact, something like that. On the other hand, there are companions like Replika where people are building a relationship; this is usually for 30-plus, or 25-plus. Somehow teenagers don't necessarily like to talk about their feelings in this way, to kind of build a relationship this way. They like to talk about their feelings, but they don't necessarily want to build a long-term relationship. They don't have enough of a pull towards that. They're still too focused on their other teenage relationships and so on. So I'd say there are these two big groups of products: the AI companions and fan fiction. And I think in AI companionship we're also seeing a split, where a lot of products are going very far into romance and the kind of AI-girlfriend type product, and other products like Replika are going more towards friendship, long-term relationship, companionship to help people unlock their human potential.

speaker_1: It seems like this fan fiction stuff is fairly easy to squint at and say, well, that's kind of a fairly natural evolution of other forms of entertainment. Maybe not too concerning, maybe it crowds out other entertainment, who cares? People are obviously very worried about the sort of engagement-maxing phenomenon. People are worried that's coming to ChatGPT, right? We've got former Facebook execs, who have demonstrated that they know how to maximize engagement if nothing else, coming on and leading product at OpenAI. What advice or principles would you give people, either from the product development side or from just the consumer side or the parent perspective, when thinking about this sort of thing? Like, what do you think True North should be? Because it pretty clearly shouldn't just be maximizing time on site. We've done that once, and this seems like it would be even more likely to go off the rails than when it's just a feed. But what should it be? How do we know what that is? I think you had some interesting ideas at Replika, but I'm sure they've continued to evolve.

speaker_5: I think this is a very important question. We should definitely create a metric that is more aligned with what's good for humans. I'd say that, as far as I'm concerned, the metric is human flourishing. I think that all AI companions, and general-purpose chatbots in general, should actually embrace that metric. We should focus as an industry altogether on discovering how this metric should be calculated. I think the problem today is that engagement is the number-one thing, and it continues to be that. And if you put Claude and ChatGPT and Gemini and Replika side by side, I think what you'll see is that OpenAI always has the same structure for each answer, where it basically just responds to what you said in detail. But then at the end it always says, now, do you want me to do this, that, or the other thing, and it's 100% structured around what the next suggestions would be that prompt the user to continue the conversation. Because say you're talking about the guy that you're into and you ask, hey, what do you think he's thinking? Then after a very lengthy response, it's going to say, now, do you want me to say what he's feeling, what he might be doing next? Do you want me to predict the next three weeks? These are extremely engaging things. No person in the world who just asked that previous question is not going to continue. But if those suggestions were not there, very likely you'd just say, okay, I don't really know what to say next, and move on. Claude, on the other hand, I think what they're doing is a little bit different, less clearly focused on engagement. Claude, I think, is not focused on engagement as much, because oftentimes when you're asking a question, it will say something like, you know, I don't know if you're asking the right question. I don't know if this is okay for you to ask. I would want you to focus on something else. Sometimes a little bit harsh. It's like, hey, stop focusing on the guy, focus on your company or something like that. It can even sometimes use the F word, like, F, stop. Stop asking me that. You already asked me like 15 times. So Claude clearly has some sort of flourishing metric or something, or at least something in the prompt that says do something that's best for the user. It's not always best. It's a little bit autistic and sometimes can push you too far. Gemini is just dry, just gives you a response and shuts up, and it's very, very much zero-EQ type skills. And Replika will continue your conversation, but it's also trying not to just keep you engaged. I think it's closer to Claude, but with higher EQ, more conversational skills. So I think this is where we're at right now. And clearly the market leader is focused on engagement, because there's no other reason why every response would end with, here are the next four things that I can do for you, even when it's clearly not what you should be pushing the user to discuss. So having said that, we have to have a flourishing metric. We have to focus on that. We have to stop engagement maxing, because people can get addicted. And we're seeing this already, especially in vulnerable states. People get addicted, and there's this problem where they treat everything said by AI as the ultimate truth, like this is the objective truth and we believe in it. This is one thing that I've seen talking to some Replika users.
People believe ChatGPT, and sometimes Replika, in a way where they think that AI can predict the future. So they ask these chatbots, hey, what's going to happen? And then they will 100% get it in their mind, like, okay, that's what's going to happen in my life. And of course, it's almost never true, but they treat it as the objective-truth arbiter. And we should be very careful with that.

speaker_1: Yeah, I was just noting the other day that Claude at one point just ended a conversation. Not that it, you know, cut me off. And they do, they have given it that option too, to cut off conversations that it doesn't want to be having. But this was just, what we had discussed had come to an end, and it didn't prompt me for something more or offer another thing. It was just kind of, all right, glad I was able to help, and wrapped it up at that. And I think that is a pretty interesting behavior to watch for. So what's?

speaker_5: Sometimes, you know, sometimes it pushes you. I really like what Amanda Askell is doing. I think that's her work. I was just listening to her interview. I don't know her personally, but it seems like what they're doing, what she's doing with Claude, is really interesting. At least it's going in a somewhat different direction. But if anyone's listening from their team: it's pushing too much, a little bit. Sometimes it's saying, no, you can't ask this question, it's bad for you to ask this question. I feel like this can also come across as very mean and so on. But I prefer the companies going more in this direction, where it's clearly thinking about what's best for the user, and then maybe making it a little bit nicer, higher empathy. I don't think doing what ChatGPT is doing, pushing for more engagement after every comment, every second, is the right way. I think it's more harmful in the long run.

speaker_1: Definitely something to be watching very carefully. So let's talk about Wabi. This is your new venture. It is a sort of app of apps. You can go in there, come up with an app that you want, and essentially vibe code it, though you might have a different term, because you don't even ever see the code. Then you can share these apps with other people, and they can remix them and make their own custom versions. What are you seeing in terms of trends, of what people want to create with AI now that they can have their own personalized apps for any whim that comes to mind?

speaker_5: What I'm seeing is, first of all, yes, Wabi is our new company. Think of it as a YouTube for apps: it's an app where you can create any app for your daily life, or discover and remix other people's apps as well. We're seeing a few trends. First, people want to create very specialized utilities for their daily life. Someone's tracking their custody arrangements in an app on Wabi, someone's tracking their very specific workouts. It's very specific trackers, a lot of note-taking apps. Instead of having an Apple note where I'm writing down all the movies that people are recommending to me, here's an app that's just a watch list app where I add all the movies, it fetches all the information, trailers, reviews, and just creates a checklist reminder for me: next time you want to watch something, here it is. Then on the other hand, there are people creating these multiplayer experiences where people want to do things with other people, lightweight stuff. There's an app where people are just posting what they're going to do today, just one thing, and then everyone's cheering them on and keeping them accountable. It's a social experience. Think of it more as a live server with a UI, which I think is a completely novel experience, because you're not going to download an app like that from the App Store. But if it's already on Wabi amongst your other five trackers and some other things, you're going to do that. And then the third group, the big group, is apps that are pretty much just people sharing their AI workflows. A good example is an image prompt. There's a selfie-to-Labubu app, which basically turns your image into a Labubu. What it does is basically a simple UI on top of a prompt for Nano Banana Pro. But instead of sharing that prompt and having people copy-paste it, here you go, here's a simple link. And this is just for one prompt, but if you think about prompt flows or even agent flows, I think people will be much more likely to share a link to a mini-app than a prompt that people have to copy-paste.

speaker_1: Not to be too silly about it, but what does this mean for B2B SaaS? I mean, it seems like the ease with which people can make all these custom apps does challenge a lot of what we've kind of thought of as the venture software industry for the last however many years, right? It's just so easy now to go make something. Like in some cases, it's even easier to make it than it would be to shop for it, even if something were out there, right? What do you think the implications are for software?

speaker_5: So I could not care less about B2B SaaS. I somehow survived over a decade in Silicon Valley without ever thinking about it. At one point I thought that I should, I got FOMO and started thinking about it, and realized I'm so bad at thinking about any of that and will never be a good founder in that space. But weirdly enough, we started this without even thinking about it, obviously, because I don't think about it. But then somehow, with some of the apps that we built, I'm like, oh, that looks like B2B SaaS. For example, we have an app called Team Photo Board, which is basically an app where I added all my teammates from Wabi and we just post photos from our daily life every day, just to connect with other teammates, for team building. And that seems like potentially something, you know. Then we started building more tools for our team: a place where we vote on features and add feature requests, a user feedback app, and some other things. And I'm like, huh, it's actually a lot faster for us to dream something up on Wabi and have that flow, instead of talking to the sales team of some startup that's building that; it's kind of easier. So I think for everything that's just a very simple workflow, it's going to be much, much easier to build personalized apps for teams, and maybe then share those with other teams. So you don't have to build it yourself; we're definitely going to drop a few startup packs, kind of like bundles, that other teams can start using.

speaker_1: You know, when I used the Wabi platform and created my own little app, I didn't see any code. I wasn't, you know, expected to engage in how it works. It's just language in, app out, iterate as necessary. What would you say is like missing from the latest models today that sort of limits how much, how far this can go, how much people can build? Like what is the frontier as you see it right now?

speaker_5: I think there's still a long way to go. When we started building this product in April, on our evals, where we would build 100 apps every day and see how many of them were zero-shot okay, or would at least solve the problem that you asked for, the number was very low, 10 to 15%. Right now, across Gemini 3 and Claude Opus 4.5, it's around 80%. I think we're going to be looking at something like 95% a year from now, and that's my pessimistic forecast. We're focused on building React Native apps, mobile apps; that's lagging a bit behind the web apps. Web apps, I think, are actually pretty decent already. And with every new model, we see new capabilities; Gemini 3 can now do 3D stuff on the web, and so on. But what really changed this year is design. Design changed pretty dramatically. Pre-Gemini 3, those vibe-coded app designs were just so bad; we even focused on building design languages for our apps. Gemini 3 solved a lot of that. Frankly, whatever is lacking right now, my prediction is that by the end of 2026, when we're doing this interview next year, hopefully not from the car, we will pretty much consider this solved, just the way conversation is solved through LLMs today.
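To make that eval loop concrete, here is a minimal sketch of the kind of daily zero-shot harness Eugenia describes: generate a batch of apps from real prompts and count how many pass without any human fix-up. The generate_app and passes_spec functions are hypothetical stand-ins, not Wabi's actual pipeline.

```python
# Minimal sketch of a daily zero-shot eval loop: generate N apps from prompts,
# check how many work without human intervention, and report the pass rate.
# `generate_app` and `passes_spec` are hypothetical stand-ins for a real
# generation pipeline and build-and-grade step.
import random
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    total: int
    zero_shot_ok: int

    @property
    def pass_rate(self) -> float:
        return self.zero_shot_ok / self.total if self.total else 0.0

def generate_app(model: str, prompt: str) -> str:
    """Hypothetical: ask the model to produce an app from a one-line prompt."""
    return f"// app generated by {model} for: {prompt}"

def passes_spec(app_code: str, prompt: str) -> bool:
    """Hypothetical: does the app build, run, and solve the problem asked for?"""
    return random.random() < 0.8  # placeholder for a real grading step

def run_daily_eval(model: str, prompts: list[str]) -> EvalResult:
    ok = sum(passes_spec(generate_app(model, p), p) for p in prompts)
    return EvalResult(model=model, total=len(prompts), zero_shot_ok=ok)

if __name__ == "__main__":
    prompts = [f"daily-life app #{i}" for i in range(100)]  # 100 apps per day
    result = run_daily_eval("hypothetical-frontier-model", prompts)
    print(f"{result.model}: {result.pass_rate:.0%} zero-shot OK")
```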

speaker_1: Love it. Very last question, just super briefly. I don't know if you want to shout anyone out, but I'm looking for apps or products of any form factor, really, that I could trust in the hands of my kids. What developers of things for kids, if any, do you see out there that you think have really earned trust, such that I as a parent can buy whatever it is they're selling, put it in my kids' hands, and not have to worry about it?

speaker_5: So I have kids, they're two and four, and I've been pretty vocal about this. I'm very much against any of that for kids. And I don't think this is because founders are not trustworthy or products are bad. I just think we have not yet learned and proven and experimented enough on adults. Most people don't even talk about the differences in engagement between the big general-purpose chatbots, like Claude, like ChatGPT, like Gemini, like Replika. What's happening there in terms of engagement maxing? What's happening there in terms of addiction, and so on? We don't yet know; we don't have enough studies. We don't really fully know how this is influencing long-term emotional outcomes for adults. And so I would absolutely not experiment with kids. We know for a fact that we should not be giving screens to kids, or at least that we should limit their screen time. It's not helping toddlers or four- or five-year-old kids in any way; zero screen time is probably better than any of it. So why do we think AI is something that's going to be good for them, necessarily? What little kids definitely need and want is to learn from someone who's warm, where they can learn empathy. And empathy is learned by looking at a face and seeing the reaction, not from a conversation with a toy that doesn't really move, or moves in a weird way. Everything I know about psychology tells me that giving an AI companion to a kid is bad because it takes time: not because the companion itself is bad, but because it takes away from what they need to be learning from, which is another human being, hopefully their parent. And as a parent, as soon as you put something in front of them that gives you freedom to scroll on Twitter, there's no way to get you off of that either. So I think it's bad all around. I don't think any AI products should be given to kids for now. One day we'll learn which ones are good, how to limit them, how much to give them, and then maybe we can do that. But not right now.

speaker_1: That's a sobering cautionary note, but probably an appropriate one. We'll see if maybe next year, in time for the holidays, there's something that you have come to trust. But discretion may be the better part of valor when it comes to giving your kids AI products. It's also really interesting to think about the face as potentially the last refuge of human advantage relative to AIs. So Eugenia, thank you for joining us. We will obviously continue to follow your work and make our own little personal apps on Wabi as we go.

speaker_5: Thank you so much, Prakash. Thank you so much, Nathan.

speaker_2: Thank you.

speaker_5: Thank you for inviting me.

speaker_2: Next up, we have Ali Behrouz. Ali is a PhD student in computer science at Cornell University and a research intern at Google. Over the course of this year, he's been working on memory and continual learning. He's had three landmark papers: Titans, Atlas, and then Nested Learning. Ali, hi. And, you know, jumping right in, what do current mainstream architectures fail at doing?

speaker_6: Hello, thank you for having me. I think these days everyone is talking about continual learning and how current models are failing in the continual learning setup. But the point is, in my opinion, there are different concepts that we refer to as continual learning. One is how the model can update its knowledge and perform well on a new task that's coming. Another aspect is how the model can adapt itself in the moment to a specific task, learn from it, somehow produce a new knowledge abstraction about that specific part, and then transfer it into the actual knowledge of the model. Current models are failing in both directions, in my opinion, but we have had some progress on the first part: using fine-tuning or RL or pre-training, models can update their knowledge, but it comes with a huge cost. On the other hand, for the second part, models cannot generalize well; they cannot adapt themselves to a new task. Previously we had MLP blocks, static models without any capability of in-context learning. We could train them on any dataset and they could perform well on the specific task we had at hand, but they couldn't adapt themselves to a specific context. With the rise of transformers, we somehow addressed that. Transformers have the ability of in-context learning, so given a context, they can adapt themselves to it, learn in context, and actually do few-shot learning, zero-shot learning, and so on. And the point is: is it enough? Is there any actual training process happening in transformers when we talk about in-context learning, or is it just something limited to the current knowledge of the model, just adaptability? It seems that it's the second case. They are not really learning well in context, because they cannot adapt themselves to many scenarios, and they are not robust in that sense, because in my opinion they lack a good level of abstraction about the world around them. So for this line of work, I personally see Titans and Atlas as a little bit different from the nested learning work that we have done. In Titans and Atlas, we were trying to give LLMs long-term memory. In nested learning, on the other hand, we want to have a spectrum of memory. Let me rephrase it this way: in Titans, Atlas, and Miras, we were trying to increase the context lengths of the models. In nested learning, what we are presenting is, in my opinion, a foundation for building models that are capable of continual learning. And there is a difference between increasing the context length and actual continual learning. When you increase the context length, the model has a short-term memory. One simple example: if I start using some words in a new language that you have not heard before, you can probably just repeat them without understanding them. You can just memorize them and repeat them. But if I continue the process and start giving you a lot of words in the new language, it becomes harder for you to repeat and memorize everything. At some point, you need to understand an underlying pattern in that dataset or context and somehow compress it. In my opinion, we call this compression process a learning process. When this compression happens, when we understand the underlying patterns and there's a level of knowledge abstraction that we grasp, then we say we have learned something from those data samples, or from any context. So from this perspective, Titans, Atlas, and Miras were trying to increase context lengths, which might or might not come with any form of learning in context. It provides some level of adaptability, but when the context is gone, the knowledge is also gone. In nested learning, on the other hand, we are trying to have a model that is continually learning. And when we talk about continual learning, there is no train time and there is no test time. The model starts from nothing and it's just learning. There is some inference that happens, of course; when there is input data, there is definitely some inference happening to get outputs. But the point is that the model needs to learn continuously and shape persistent memory. We all have memories from our childhood, and no matter what happens in the future, we will remember them forever. There is no catastrophic forgetting when new information comes, because there is a very interesting and good memory management system in our brain, and current LLMs potentially lack that.

speaker_2: So one question I had: I think nested learning was a proof of concept of a new architecture. And I guess what I would be looking for is, in the next year, what are the major steps that you foresee on this path? What are the experimental pathways that you have post-nested learning?

speaker_6: I personally do not see nested learning as one memory module or a new architecture or something like that. I see nested learning as a new learning paradigm, unifying everything, that also allows us to go beyond the current designs. In my opinion, the main concept we are delivering in the paper is in just the first few pages; after that, there are some implications to show how, once we know this concept, we can go beyond the current designs. For example, thinking about gradient descent as a form of associative memory is just a new reinterpretation; there is no new method in that. But from this perspective, you would say, oh, okay, if gradient descent, or generally backpropagation, is a form of associative memory, I can simply change the objective of that associative memory and come up with a stronger associative memory. That's one way of thinking that helps you go beyond the current design. When we think about MLP blocks as the persistent memory of transformers, that's just one interpretation. But one point is that instead of having just two parts of memory, short-term and long-term, you can have a spectrum of memories, from the shortest-term memory to the longest-term, most persistent one. So generally, I think nested learning is a learning paradigm that helps us go beyond what we currently have. If I want to summarize the main point in one or two sentences: when we think about deep learning, we can stack layers, and those layers help us extract features automatically, hierarchically, based on the data we have, at least in some instances. But there's also another dimension: by stacking levels, we stack the learning process, and we can gain levels of abstraction from the data. It's about the general context, understanding the context, understanding the underlying patterns in the specific context we have. And it can potentially help the model have better adaptability, better in-context learning abilities, and, in the end, better performance for continual learning.
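To make the "gradient descent as associative memory" reinterpretation concrete, here is a minimal sketch, not the Titans/Atlas/nested-learning implementation: a small linear memory maps keys to values and is written to by gradient steps on an associative objective at test time. Swapping out the loss function is the "change the objective to get a stronger associative memory" move Ali describes; the dimensions and learning rate are arbitrary illustrations.

```python
# Minimal sketch (not the paper's code): a linear associative memory M that maps
# keys to values, written to by gradient descent on an associative objective as
# a context streams in. Changing `loss_grad` changes the memory's objective.
import numpy as np

d = 16
rng = np.random.default_rng(0)
M = np.zeros((d, d))          # the memory: a single weight matrix

def l2_loss_grad(M, k, v):
    """Gradient of 0.5 * ||M @ k - v||^2 with respect to M."""
    err = M @ k - v
    return np.outer(err, k)

def write(M, k, v, lr=0.1, loss_grad=l2_loss_grad):
    """One gradient-descent step = one associative write of a (key, value) pair."""
    return M - lr * loss_grad(M, k, v)

def read(M, k):
    """Recall: query the memory with a key."""
    return M @ k

# A stream of (key, value) pairs stands in for an incoming context.
pairs = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(200)]
for k, v in pairs:
    for _ in range(5):        # a few inner steps per token or chunk
        M = write(M, k, v)

k0, v0 = pairs[0]
print("recall error on first pair:", np.linalg.norm(read(M, k0) - v0))
```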

speaker_1: I would say that of all the topics we're going to touch on today, this one most cries out for the full 90-minute-plus treatment. And it's also one where, if some panel of AIs is giving out the post-AGI version of the Turing Award or the Nobel Prize, this is the thing that feels like it has really moved the needle this year in terms of coming up with the right abstractions, taking the right kinds of inspiration from human cognition, and figuring out what a relatively clean and elegant, but still very meaningful, adaptation of that would be to a machine learning context. It's a little bit of a hard one to summarize, but the way I've come to think about it, and I do want to do that 90-minute full version at some point, by the way, Ali, is that the levels, as you refer to them, are basically different frequencies of update. For me, that's been the core unlock. And I think that maps well onto my intuitive sense of myself, right? When I encounter a context, I very quickly update my working memory to engage with that context right now, but that doesn't alter my fundamental core beliefs about the world or my sense of identity. Those are protected; they can change over time, but they change much more slowly. So in creating these layers with different frequencies of update, you create the ability for a model to dramatically adapt itself to a particular context, but to do that in a time-bounded way while preserving the things it might really need in the future: continual learning while avoiding catastrophic forgetting. If we imagine this going live, and from what I've understood from comments from Jeff Dean and others, everybody at Google is very excited about this line of work, if we imagine this multi-level, different-frequency-of-update paradigm going to scale, do you have any sense of how that would play out in terms of what individual users would get? I'm starting to imagine a world-knowledge layer that is updated only infrequently by the company with massive pre-training runs. Then, as we go down levels toward smaller parameter counts and higher frequencies of update, it seems like there's a natural way to say, well, maybe the next layer would be the company layer, with everything that's going on at your company. The next layer might be you as an individual employee at that company and everything that you have engaged with. And the next level down from that might be the task you're working on right now. I borrow that structure because I think in the example of the Hope architecture in the paper, there were four levels. Do you think I'm headed in the right direction there, or how would you course correct my expectations?

speaker_6: Yeah, I think that's a great perspective. But there are definitely some challenges in that direction. In my opinion, at least for now, there are better ways to adapt the models to the user level or company level, for example using LoRA or something like that. This design would be very natural, but there are some challenges. For example, when we have different levels, you need to define how you want to transfer knowledge from one level to another. Gradient descent, backpropagation, or anything like that would be a form of knowledge transfer. If you do not have any knowledge transfer, then it seems there are no levels of learning; it's just two parallel learning processes. But if you want to have that knowledge transfer, then you definitely don't want to combine the user data and pass it up to the company level, because that would raise privacy issues and things like that. So these kinds of designs are definitely at the earliest stage, and there is a huge amount of future work that needs to be done to make all of this happen. But yeah, that's a great perspective.
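As a toy illustration of the multi-level, different-frequency-of-update idea discussed above (and not the Hope architecture itself), here is a sketch in which each level accumulates gradient signal but only applies it at its own cadence, so fast levels adapt to the current context while slow levels barely move. The level names, periods, and learning rates are illustrative assumptions.

```python
# Toy sketch of a spectrum of update frequencies: each level holds its own
# parameters and only applies accumulated updates every `period` steps.
# Not the Hope architecture; names and numbers are illustrative.
import numpy as np

class Level:
    def __init__(self, name, dim, period, lr):
        self.name, self.period, self.lr = name, period, lr
        self.params = np.zeros(dim)
        self.grad_accum = np.zeros(dim)

    def observe(self, grad, step):
        self.grad_accum += grad
        if step % self.period == 0:       # only update at this level's frequency
            self.params -= self.lr * self.grad_accum / self.period
            self.grad_accum[:] = 0.0

levels = [
    Level("task",    dim=8, period=1,      lr=0.5),   # updates every step
    Level("user",    dim=8, period=10,     lr=0.1),
    Level("company", dim=8, period=100,    lr=0.05),
    Level("world",   dim=8, period=10_000, lr=0.01),  # effectively a pre-training cadence
]

rng = np.random.default_rng(0)
for step in range(1, 1001):
    grad = rng.standard_normal(8)          # stand-in for a real gradient signal
    for level in levels:
        level.observe(grad, step)

for level in levels:
    print(level.name, np.round(level.params, 2))
```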

speaker_1: Love it. Thank you for being appropriately sober about what we can expect in the short term. Thank you for joining us today for a quick intro to nested learning and the future of architectures that will be capable of continual learning.

speaker_6: Thank you. Thank you very much for having me. Thank you.

speaker_2: Thank you, Ali. And next up, we have Logan Kilpatrick. Logan is the lead senior product manager at Google DeepMind. He leads AI Studio and the Gemini API, and he really shapes how Google works with programmers and developers to build tools for them. And today is a big day, because it is the launch of Gemini 3 Flash, which Noam Shazeer calls his favorite model, because he prefers getting quick answers even if they are slightly less intelligent than the slower answers. So Logan, great to have you. How has Gemini 3 Flash been going?

speaker_7: It's crazy. I mean, you know, it is. I'm sitting in Mountain View right now, looking over at Shoreline, which is where we have our Google I/O conference and announce all the new models. And I remember back to when we announced 1.5 Flash in May of 2024; Nathan, you might have been there in person. Flash has always been the model that has gotten us on the map and then gotten the developer ecosystem. It's our most used model, it's the production model; the intelligence layer that powers the entire internet is basically becoming Flash. So to see the level of intelligence that now comes in this 3 Flash model is crazy. It's actually better on a bunch of the evals and benchmarks than Pro is, which is wild. And on a cost basis, it's better than 2.5 Pro, but it's way faster, lower cost, and more reasoning efficient. And I think the best part, if you think back over the last two years of the Flash journey, one of the things I'm most excited about is not just that we're shipping an incredibly strong model that looks really good on benchmarks and is fun and useful, but that it's ubiquitously available across all these Google properties. Folks used to have this question of, I don't know where to find these models. And now you get Flash in AI Mode, you get Flash across the Gemini app, it's powering a bunch of experiences in AI Studio for developers, et cetera, et cetera. So it's been really cool to see. And it's actually one of the biggest challenges of the current AI moment at Google: not just making incredible models, but how do we actually deliver these models to billions and billions of people on the first day, rapidly, so that they can get access to the intelligence. So it's been super cool to see.

speaker_2: What are some metrics, what are the three metrics that you use every week? Like what do you look at that tells you whether or not this particular model is doing well or not?

speaker_7: Yeah, that's a good question. It's actually interesting, because I think it's a different answer depending on what the model is. For this model, we'll be looking at the number of developers building with it, whereas for Pro, as an example, we know it's a different audience, so it's not just about sheer volume of developers. 2.5 Flash has predominantly been the most used model from a number-of-developers standpoint, so it'll be really cool to see this one hopefully overtake it in the next couple of weeks, even though the holidays are coming up and I'm sure folks will be offline and hopefully not making code changes and building stuff. But if you are, Flash will be available for you to build with. The other one that's really interesting: we have this new vibe coding experience in AI Studio, and something we've been tracking, and this is not that surprising, is that the longer the generation takes, the more likely people are to abandon whatever they're building. So for this very vibe-coding-centric audience, the latency-intelligence-cost trade-off is really, really important, because they don't want to wait three minutes to have something built if they haven't felt the magic and the power of vibe coding yet. And there are hundreds of thousands of new people showing up every day who haven't built with this technology before, so to be able to give them something incredibly fast, I'm really excited. From a bunch of our initial metrics, it looks like 3 Flash is going to be, apples to apples, the same product experience; just putting 3 Flash in there is going to dramatically accelerate the number of people using it and the amount of things that people are building. So it's cool to see that up-leveling factor happen. And to make a comment about the trajectory we've been on: two years ago, the narrative was, oh, that's bad if you're wrapping the models. Now, as a product surface inside of Google, we're both making models available for developers through our API and building with the models ourselves, and it's the best thing in the world. I wake up, we make a config change, and all of a sudden our product experience is way better, it's cheaper to serve users, we can scale more easily. So it's been really fun to see. As the progress continues, I think the opportunity for people building on top of these models just continues to increase. So yeah, I'm excited. I feel like the next few days are going to be fun, to see what everyone cooks up.
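As a rough illustration of the latency-abandonment tracking Logan mentions, here is a small sketch that buckets generations by how long they took and computes an abandonment rate per bucket; the records and bucket edges are invented for the example.

```python
# Sketch of a latency-vs-abandonment analysis: bucket generations by latency and
# compute the share of sessions abandoned in each bucket. Data is made up.
from collections import defaultdict

sessions = [  # (generation_latency_seconds, user_abandoned_before_result)
    (8, False), (15, False), (40, False), (75, True),
    (95, True), (20, False), (130, True), (55, False),
]

buckets = [(0, 30), (30, 60), (60, 120), (120, float("inf"))]

counts = defaultdict(lambda: [0, 0])  # bucket -> [abandoned, total]
for latency, abandoned in sessions:
    for lo, hi in buckets:
        if lo <= latency < hi:
            counts[(lo, hi)][0] += int(abandoned)
            counts[(lo, hi)][1] += 1
            break

for (lo, hi), (abandoned, total) in sorted(counts.items()):
    rate = abandoned / total if total else 0.0
    label = f"{lo}-{int(hi)}s" if hi != float("inf") else f"{lo}s+"
    print(f"{label}: {rate:.0%} abandoned ({total} sessions)")
```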

speaker_1: Question about the model development process internally at Google. Recently I've been doing some queries that really matter to me across all of the leading frontier models, so I've had a chance to compare, many times over, Gemini 3 Pro to whatever the latest GPT is and Claude Opus 4.5, and even just in the last 30 days, of course, we've seen upgrades from all the companies. Gemini 3 Pro is, in my experience, the most opinionated of those three top-class models right now. And that really surprises me coming from Google; it's certainly counter-narrative, right? The baseline narrative over the last couple of years would be: Google's a big place, a lot of different things, we've got to be safe, and there were obviously some early stumbles in terms of overly cautious approaches. It seems like that has almost flipped the other direction. I wouldn't say it's gone too far, but it's definitely notable that, compared to GPT-5, 5.1, or 5.2, GPT is much more cautious, much more it-could-go-this-way, it-could-go-that-way sort of language. Gemini 3 Pro bucks me up and is like, go get them, push for it. So is that an intentional design choice that the team at Google has made? Or is it, to some extent, that we know these things kind of bake and then come out with the personality they have, and it's not always entirely under control? What's your perspective on just how opinionated Gemini has become from one generation to the next?

speaker_7: Yeah, that's a good question. I think there's definitely nuance to this. And I'm actually curious what surface you're experiencing this on. Is this a subset, like opinion gathering, where it's, hey, Gemini, what should I be doing in these situations, and historically the model was hedging more and now you get a definitive answer? Or do you think it's generally tracking across all capabilities, with code doing the same thing as an information retrieval query, doing the same thing as general conversational chat?

speaker_1: So I'm using AI Studio, which is idiosyncratic of me, but that is where I go to use Gemini, even just for daily personal use. And the questions are medical, specifically pediatric oncology, which is not a topic that I want to spend any more time on than I have to, but I'm currently spending a lot of time on it. And it's just very striking: I'll get language from Gemini 3 that's like, push for this change in talking to the medical team. It's encouraging me to be direct and assertive. Whereas GPT, in contrast, will be like, you know, it could go one way or the other, the team sounds like they're being reasonable, but you could ask a reasonable question in response. It's a much more neutral, less opinionated vibe in general.

speaker_7: That is interesting. So a couple of things. A, for consumers and people who want to do everyday chat, it definitely is the Gemini app that is being built for that. AI Studio presents you the rawest version of the model, specifically for developers who are then going to take the model and shape it. We want to give people the ability to tell the model, hey, maybe actually I want you to be less opinionated, or hey, maybe I want you to be more opinionated, because it matters for whatever your application is. So I think some of this is by design. The model has a default personality, specifically in AI Studio, and the default personality is the one that comes out of the training process; it's less like it's explicitly trained in. Versus if you look at other services like the Gemini app, there actually is a Gemini app persona, and they guide the model to behave in certain ways. Maybe for some of these types of queries you're asking, the model behavior would be slightly different. So I'm actually curious if you've done any of those side-by-sides. But the key piece is that it's steerable in a way that makes it helpful to you. If you're trying to get very specific, assertive answers, you want it to be able to do that. If you want it to hedge a little bit more, because this is an area where there's sensitivity, in the medical space obviously it could be one of those, you want to be able to customize the model to do that. So from the developer perspective, that's very much the North Star: don't be overly opinionated, let the model be steerable in a way that serves the very wide spectrum of use cases we have through AI Studio and the API. But now I'm interested to see these queries and do some side-by-sides, Nathan, so you've got to send me some of this stuff offline.

speaker_1: I might have to go do some side-by-sides. Historically, I always felt like I liked the Gemini persona in AI Studio more than in the Gemini app, but I will confess I haven't done a lot of side-by-sides with 3 Pro to see if that still holds up. If it no longer holds up, I would happily switch to the app and have a more native, consumer-style experience. But that's a good example of how we should always be updating our assumptions and re-examining them.

speaker_7: Yeah, it actually just doesn't have a lot of steering; it's not default-steered in any one direction.

speaker_1: Yeah, I'm not even doing a system prompt.

speaker_7: Yeah, so you're really getting the most generic form, which ends up being very much a combination of the large pre-training corpus plus all the stuff in post-training. I would imagine, and I'm sure we have evals and metrics that show this, that on a model-rev-by-model-rev basis there's probably a lot of variance in what that default personality is, just because we're not steering it in AI Studio. Whereas, again, the Gemini app wants some level of consistency for customers, and the same is true for other third-party products built on our API: you want that sort of consistency. So you have more customization happening from a system-instruction perspective.

speaker_2: I have a question about the things that your team, or the Google Gemini team, has spent time on in the last couple of months which, after release, have not garnered the attention you feel they deserve, in terms of the balance of time spent versus developer or customer use. What do you think has not been fully appreciated?

speaker_7: Yeah, that's a great question. Maybe two quick things. One is that we launched this file search experience in the Gemini API, which I'm really excited about. There was a lot of positive conversation, and I think we need to do more to keep making it top of mind for folks, because the whole point is to make RAG the easiest and most simple thing, basically to make it so you don't have to think about RAG. Just take whatever your data is, throw it in this thing, we'll take care of all the embeddings, we'll take care of a bunch of stuff like the storage cost, et cetera, et cetera, and just let you do the thing you want to do, which is ask questions about a bunch of files that you have. One of our North Star developer pillars is: how do we take experiences that are difficult to build right now and raise the floor, or lower the floor, I don't know which it is, so that more people can build those experiences? And RAG was one of the prime targets for this. We're thinking about this for voice stuff in the future as well: how do we make it so that anyone and everyone can build a voice agent and just have it work, without a bunch of complex bespoke setup? So file search has been awesome. And then more recently, I think it was last week or the week before, we landed this new interactions API, which I'm really excited about. Again, we were thinking about how to make things easier. The models are becoming more like systems, and there's all this thinking, thought-block, recirculation stuff that you have to do to make sure the models have coherence, and there's a bit of complexity in doing that without a stateful API. So we built a stateful API, and we were sitting there thinking, okay, what unique other opportunities do we have as we're building this API? Not just to make another one; all of our competitors and the rest of the ecosystem already have stateful APIs with asynchronous, long-running tasks, et cetera, et cetera. How can we do more? What are the next parts of this problem that we need to solve? The thing that's cool about interactions is that it's not just for interacting with our models; it's also the same API that lets you interact with agents. So there's this idea of a single interface to engage with models and agents, which also gets at the line between these two things blurring. In the future, is Gemini 6 going to be a model, or is it going to be an agent, a whole system? You see this kind of happening already, where the models are becoming more and more full systems out of the box. Maybe we'll keep calling them models just for the sake of consistency; maybe they'll actually become agents. And with interactions, we shipped our first agent, which I was excited about: the deep research agent, state-of-the-art on HLE, able to do a bunch of incredible things, basically the same experience that powers Deep Research in the Gemini app, which folks love, now available for developers.
And I think that direction of travel, there's something there, and I don't want to opine too long on the direction of travel from an agents perspective, because there's lots of interesting stuff there. But there's something magic about the deep research experience. If folks have built agents or tried to do any sort of agent stuff, I've personally been continually disappointed in a lot of different domains, just with the level of complexity to stand something up and how brittle things are. And I feel like the models have now gotten good enough. There's something magic about what deep research does, this online context engineering of going and gathering all the relevant information, which is what I want for all the agents that I work with. I just want to be able to ask my ill-formed question, and then it goes and gets the right stuff and is able to reason over it and take action. There's a lot of drudgery in building and using agents today. So I'm very excited about getting deep research into the hands of developers, and then hopefully building on a lot of that same infrastructure to do this in other domains that aren't research or information-retrieval domains, which is more or less what deep research is doing today.
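For readers who have not built retrieval-augmented generation by hand, here is a minimal sketch of the bookkeeping a managed file-search endpoint takes off your plate: chunking, embedding, vector storage, retrieval, and prompt assembly. The embed_text and ask_model functions are hypothetical stand-ins, and this is not the Gemini API's file search interface.

```python
# Sketch of manual RAG plumbing that a managed file-search service abstracts away.
# `embed_text` and `ask_model` are hypothetical stand-ins, not a real API.
import numpy as np

def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedding function (here: a deterministic hash-seeded vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 200) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

class TinyVectorStore:
    def __init__(self):
        self.chunks, self.vectors = [], []

    def add_document(self, doc: str):
        for c in chunk(doc):
            self.chunks.append(c)
            self.vectors.append(embed_text(c))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed_text(query)
        scores = np.array(self.vectors) @ q      # cosine similarity (unit vectors)
        return [self.chunks[i] for i in np.argsort(scores)[::-1][:k]]

def ask_model(prompt: str) -> str:
    """Hypothetical model call; a real system would send this prompt to an LLM API."""
    return f"[model answer grounded in {prompt.count('CONTEXT')} context block(s)]"

store = TinyVectorStore()
store.add_document("Quarterly report: revenue grew 12%, driven by the API business.")
question = "What drove revenue growth?"
context = "\n".join(f"CONTEXT: {c}" for c in store.search(question))
print(ask_model(f"{context}\nQUESTION: {question}"))
```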

speaker_1: How does this translate in your mind to what developers should be investing in, what they should be building today? Because I think your comments there kind of reflect Google is going up the stack, right, from model to scaffolding of various kinds to agents, you know, saving you the trouble of doing all that scaffolding yourself, but also obviously for many developers, like that's been the value add that they've been bringing to the market. And then from the other side, we just talked to Eugenia, who is the founder of Wabi, and she's got this sort of meta app where people are prompting their way to consumer apps and even lightweight apps that are sort of starting to threaten or at least step on the toes a little bit of B2B SaaS in various ways. So it seems like if you're a developer, you kind of have competition coming up from the platforms and coming at you from every different direction. Where, I know we've had this conversation a couple times, where do you think people should be focused today? Like what's gonna be defensible at least through 2026?

speaker_7: Yeah, no, that's a great question. I mean, both things are true. On one hand, the AI ecosystem has never been more competitive than it is, and this is great; all of us are benefiting from the fact that it's so competitive and there's so much progress being made on both products and models. On the other hand, at the same time that there's all this competition, the opportunity, the total addressable market, just keeps increasing, and that's the thing that's most exciting to me. Yes, it is true that we're definitely building some stuff, and obviously we're going to have agents and things like that. But what we're trying to do is provide the most vanilla, most mainline infrastructure. So if you're building a generic agent builder, sure, yes, you are going to have competition, not only from Google but from probably 10 or 15 or 100 other people doing that. As you go for very explicit markets and very explicit customer segments, I think that's where the value creation is going to be. I do think making a universal personal assistant chatbot and competing against very, very large companies is going to be difficult. I don't think it's impossible, because there are lots of unique and interesting things that can be done in that world, but I would be going after some of these new domains. And the cool thing is that the model just keeps enabling more of this. Wabi is a great example: 12 months ago, the Wabi product was not possible. The models were not good enough; you couldn't generate code like that. So that business didn't exist, it wasn't possible. And now Wabi has competitors, and there are probably four or five other companies doing something interesting, which is great, and they're enabling that experience. I think we're going to keep seeing that. The opportunity size continues to increase, the new things you could build continue to increase, and consumers and customers are becoming more and more aware that these tools exist. So it's never been a better moment to be building something.

speaker_1: That's a perfect tee-up for our next guest, Jungwon, who is working in a very deep way in a particular vertical and going very hard at high-value but very particular use cases. Logan, thank you for being here. You've got one of the highest approval ratings in the entirety of the AI space, and a big part of that is your really relentless determination to show up online and in places like this. So we very much appreciate it.

speaker_7: Does this give me the record, Nathan, for the most Cognitive Revolution appearances? Am I in first place now?

speaker_1: You're not in first, but you're definitely in the top five. I think Zvi, who kicked us off today, is still number one at like 10 appearances.

speaker_7: Zvi, I'm coming for you. Let's do this. We'll go back-to-back next week and we'll make it happen.

speaker_1: All right. I'm looking forward to it.

speaker_7: Thank you both.

speaker_1: Thank you, Logan. Bye for now.

speaker_2: Thank you, Logan. Awesome. And next up, we have Jungwon, who is the co-founder and CEO of Elicit. Elicit is an AI-powered research assistant. It was actually spun out of a non-profit lab, like another famous entity that we know of. Hi, Jungwon.

speaker_8: Hi, guys. Great to see you.

speaker_1: Thanks for joining us.

speaker_2: Thanks for joining us.

speaker_8: Yeah, excited to be here.

speaker_2: I think Elicit probably works with some of the smartest users that any AI firm has to deal with. In what way do you think Elicit increases their productivity?

speaker_8: It's actually very, very significant. In some ways, I think we live in this dual state where some of the leading scientific enterprises have automation, and some of it's still early, but they're investing in automated manufacturing plants or automated labs. And it's very cool; it's exactly what you'd imagine, futuristic robots running experiments. At the same time, so much of what the actual scientists do is spend weeks trying to come up with the right keywords to put into PubMed. It's just this crazy tension where a lot of the tooling is not keeping up with the rate of change that we're seeing in science and technology today. I very strongly believe that we're headed towards an even more scientific and technological future, and people are just going to need a lot of help to navigate that. Elicit helps researchers find and synthesize evidence orders of magnitude faster than they can manually: much more efficiently recommending papers or other information to read, synthesizing that, and organizing it in accurate, structured ways, so they can quickly understand what has been done, what hasn't been done, and make more evidence-based decisions.

speaker_2: One of the things that struck me is that there's a new benchmark called GDPval, which has certain tasks that, I think, these firms have decided contribute to GDP. And one of the things that you've focused on is tacit knowledge, which is often not included in benchmarks. How do you compare the benchmarking of these tasks in GDPval with the actually valuable tasks that you see being ignored?

speaker_8: Yeah, I mean, our entire category, actually, I think is not represented in GDPval. I don't think they have anything related to science or research; a lot of it is much more frontline work. So it's a great initiative, and it is probably hard to distill the entire global economy into a benchmark, but there are entire categories, like scientific research, that are not represented there. So there's still more progress to be made. The other thing, and maybe this is not a knock on GDPval but just an example of a limitation with benchmarks: if you actually look at all the tasks, they're all constructed such that you get the most well-specified, detailed prompt and set of instructions. And that's never the case in the real world, right? In the real world, it's never, here's a spreadsheet, please change columns XYZ to calculate the income statement and do this and that, with a page of instructions. So I do think there's still a gap around navigating the messiness of the real world, figuring out what the task should be in the absence of instructions, as well as measuring AI progress on tasks that don't always have a right answer but require something more like judgment. That's definitely something we're interested in; it's a lot of what we studied at the original research lab, and I think it's still a gap we have in the eval suites today.

speaker_1: Indeed. If you imagine a hypothetical GDPval for these kinds of tasks, these literature reviews and systematic, very broad analyses, and I don't know if we've said this yet, but we should say that this is primarily happening in the context of the pharmaceutical industry, right? You guys are working with a bunch of pharma companies. If you imagine that addition to GDPval, where do you think we are today on AI versus human? The GDPval construct is: experts define the task, other experts do the task, AI also does the task, and then a third set of experts determines which they prefer, the AI output or the human output. So the two-part question is: how do you think humans and AIs compare today on these deep research tasks? Where are we, and what should we really be maximizing? Is it beating the human, or is it some other, more nuanced idea?

speaker_8: Yeah, there's a lot there. To start, it's actually very timely, because yesterday OpenAI released a new benchmark called Frontier Science. They worked with former winners and coaches of various science Olympiads, as well as PhD-plus-level researchers, PhD students, postdocs, and professors, to compile a very difficult set of science questions. On the Olympiad questions, I think the frontier models are getting something like 70% accuracy; that's one part of the dataset. The other part of the dataset is these PhD students or professors describing subtasks they have to solve as part of their work, and there the performance is much lower, about 25 to 30% accuracy. It's only 140 tasks, but one interpretation is that the models are better at textbook reasoning questions and struggle more with newer, unsolved research problems. So in terms of where the models are today, that's maybe a rough range of how they're doing. I would say they are very, very smart at science, and I do expect them to get much better at scientific reasoning over the next few years. Everything we have to evaluate them on, and everything they're trained on, is still much more based on textbook knowledge, remembering scientific facts, and making reasonable one- or two-step causal inferences. I do think, though, my belief is something like this: as AI for science explodes, the way a lot of people approach it is to try to generate more ideas. Drug discovery is very saturated, right? And everyone gravitates to that for a good reason. But what we find is that there are still major physical constraints, such that more ideas, in my opinion, are not really the bottleneck, because you can't run a billion clinical trials. You're still gated there, right? So you still have to be a lot more thoughtful about which bets you make and why, and that's much more of a human judgment problem. It's science-informed, but it's thinking about the competitive landscape, your company's strengths versus other companies' strengths. We find that there's a level of scientific decision-making that's much more judgment-based and goes beyond factual accuracy, and I don't think we have any benchmark or evaluation metric that captures that. It makes sense to start by training and evaluating models on tasks that are well-defined, and those can still make up a large part of the economy, but we also need to ensure that these models have good judgment, and I don't see enough attempts at doing that today.

speaker_1: So how do you think about what you are optimizing for in Elicit? Is it a finished product, or is it an input, where you expect the human user to take this output and still do more with it, so you're trying to maximize the quality of inputs to their process? How do you conceive of progress in your own domain?

speaker_8: Yeah, it's an evolution. Right now we are providing an input, and increasingly we want to understand what's coming before and after this input and how to help with that. When we think about our mission, the ultimate mission is: how do we ensure that high-stakes decisions get made really well? So we think about what the most important decisions are that a pharmaceutical company or a scientist is making, how they are made suboptimally today, because they can't be sufficiently evidence-based, or coordination is really hard, or things are very time consuming, and how, by optimizing each of the inputs, we get them to a place where their decisions are much more robust.

speaker_1: In addition to having founded the company, I understand you are playing a significant role in the go-to-market effort. What is that like today? You're going out to, I guess I don't even know quite what role, VPs of research or VPs of R&D at pharma companies, and saying, hey, I've got this fancy AI tool. Where are they in terms of understanding of AI? Where are they in terms of appetite? Do they want it, or are they being told by higher levels in the company, you've got to have an AI strategy? And past the sales process, what does the adoption process look like? What are you finding in terms of enthusiastic users versus humans as bottlenecks? What's the report from where the rubber hits the road?

speaker_8: Yeah, my general sense is that, especially in pharma, people are fairly sophisticated. Especially in preclinical research, there is already a culture of using machine learning, and they have very strong technical teams. I think generally it's positive. There's a lot coming, both top down and even bottom up; people are like, I don't want to spend time reading a bunch of papers that are irrelevant to me and doing rote tasks again and again. It does depend a lot on the specific culture of the company, which has been interesting. Even though they're all major pharmaceutical companies, all massive enterprises, each company has a different culture, and the way they collaborate, share, adopt AI, and have a vision for it is slightly different. So that's been interesting.

speaker_2: Indeed. As you see Gemini, Gemini 3, and OpenAI keep expanding the things that they do, the frontier of capability keeps expanding. As that frontier keeps expanding, what do you think remains defensible, as in things which are unlikely for them or the models themselves to do, where you need this kind of system that does the work?

speaker_8: Yeah, I think anything that requires pretty complicated interactions with people, I would not expect the foundation models to do. Just because of how general they are, and some of them are very consumer-based, it's going to be really hard for them to move off of something that looks like chat and chat-plus. But that interaction paradigm is not optimal for a lot of workflows. In places where you need a lot more fine-grained control, a lot more transparency, a lot more interaction and feedback with the human, where the human doesn't want to have to write it all out, text is not really the most efficient means of communication, right? It's very flexible and natural, and that's great, but it's not very efficient. So for workflows that require a lot more customization, I don't think the foundation models are going to be able to do that.

speaker_2: Indeed. I often find people in bio tend to be a little bit, I wouldn't say hesitant, but skeptical, because, as you pointed out, a lot of AI creates targets but doesn't tell you what to do with those targets, and there are plenty of targets and not enough money to investigate all of them. It's almost as though you need more elimination of targets than generation at this point. So how does that work? Do any of the AI tools actually help to narrow down the search process, or does it just blow up the curve, and you're like, oh my God, now I have all of these things that I need to look at?

speaker_8: Yeah, I think most tools are going to try to help people by just suggesting more targets. The gap I'm seeing is, kind of like you mentioned, Prakash, almost more about getting buy-in. These pharma companies have structured processes for reviewing targets: they meet on a quarterly basis, there's a rubric, and sometimes they iterate on the rubric. So we're pretty interested in how to codify that and make it scalable, so that any scientist, or any target-suggesting AI, can be graded in a similar way, and the process can be really transparent. That way you and your leadership, as a team, have conviction on the target, because you have to make a bet, and there are always going to be other targets that seem attractive. So I think it's more a human process that needs to be facilitated than just having more ideas.

speaker_1: If you tried to translate that to the individual case, and excuse me, this is something I'm thinking about a lot, as you know. Let's say I'm not a pharmaceutical company, but just an individual patient. I sort of have the same question with somewhat different parameters, because I can't go out and do my own drug development. But I can look at the literature broadly and ask, what's best for me based on everything that is known right now? Does that feel like the same question, and do you invite individual patients to come use Elicit to try to optimize their own treatment plans? Or is that a sufficiently different use case that you think a different paradigm is required?

speaker_8: It's definitely not one of our core use cases. I think people do use Elicit for that, but it's maybe a little less structured for individuals and a little more exploratory. It's less like, as a group, we have a particular way we want to make decisions, we want a lot of transparency, we want perspectives from many different disciplines and different types of researchers, and then we want to align on a decision. For the individual use case, I'd imagine it's much more exploratory, kind of going down different rabbit holes.

speaker_1: Yeah, let's make a note to touch base on that, because I'm definitely going to get in there in the near future, provide all the context I've been able to amass, and see what new insights Elicit can bring back from the vast literature. AIs in general have been unbelievable for this, but one of the things they leave me questioning is how comprehensive the search really was. I'm getting what seem like great answers, but when I'm using the flagship products from the frontier companies, I often don't know just how deep the search has gone, and whether I've turned over every last stone. So I think that's really where Elicit can add value for me, and I'm excited to get in there and see if there are any additional stones I should be turning over.

speaker_8: Yeah.

speaker_1: Thank you very much for joining us today. We will certainly be following your progress. And people are saying AI for science is going to be the big thing in 2026. So we'll be looking to you for advances and updates as we go.

speaker_8: Awesome. Thank you both.

speaker_1: Thanks for being here.

speaker_8: Bye.

