Scaling "Thinking": Gemini 2.5 Tech Lead Jack Rae on Reasoning, Long Context, & the Path to AGI

Scaling "Thinking": Gemini 2.5 Tech Lead Jack Rae on Reasoning, Long Context, & the Path to AGI

In this illuminating episode of The Cognitive Revolution, host Nathan Labenz speaks with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work.


Watch Episode Here


Read Episode Description

In this illuminating episode of The Cognitive Revolution, host Nathan Labenz speaks with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work. They explore the technical breakthroughs behind Google's Gemini 2.5 Pro model, discussing why reasoning techniques are suddenly working so effectively across the industry and whether these advances represent true breakthroughs or incremental progress. The conversation delves into critical questions about the relationship between reasoning and agency, the role of human data in shaping model behavior, and the roadmap from current capabilities to AGI, providing listeners with an insider's perspective on the trajectory of AI development.

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(05:09) Introduction and Welcome
(07:28) RL for Reasoning
(10:46) Research Time Management
(13:41) Convergence in Model Development
(18:31) RL on Smaller Models (Part 1)
(20:01) Sponsors: Oracle Cloud Infrastructure (OCI) | Shopify
(22:35) RL on Smaller Models (Part 2)
(23:30) Sculpting Cognitive Behaviors
(25:05) Language Switching Behavior
(28:02) Sharing Chain of Thought
(32:03) RL on Chain of Thought (Part 1)
(33:46) Sponsors: NetSuite
(35:19) RL on Chain of Thought (Part 2)
(35:26) Eliciting Human Reasoning
(39:27) Reasoning vs. Agency
(40:17) Understanding Model Reasoning
(44:29) Reasoning in Latent Space
(47:54) Interpretability Challenges
(51:36) Platonic Model Hypothesis
(56:05) Roadmap to AGI
(01:00:57) Multimodal Integration
(01:04:38) System Card Questions
(01:07:51) Long Context Capabilities
(01:13:49) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz: (0:00)

Hello, and welcome back to the Cognitive Revolution. Today, I've got the honor of speaking with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work. As one of the key contributors to Google's blockbuster Gemini 2.5 Pro release, Jack has tremendous insight into the technical drivers of large language model progress and a highly credible perspective on the path from here to AGI. Gemini 2.5 Pro, as I'm sure you know, marks a significant milestone on Google's AI journey. It's the first time that many observers, myself included, would rank a Google model as the number one top performing model across many important dimensions. And this is not just about topping leaderboards. In my initial testing of Gemini 2.5, which I conducted before Google's PR team reached out to schedule this interview, I experienced one of those rare moments where a model significantly exceeded my expectations, forcing me to reevaluate my sense of what's possible today and inviting me to reimagine my workflows to take advantage of its unique strength in not just accepting, but actually demonstrating incredibly deep command of hundreds of thousands of tokens of input context. This is a practical step up that I could feel almost immediately. So naturally, I jumped at the chance to talk to Jack about all the work that went into it and how he understands the current state of play along a bunch of critical conceptual dimensions. We begin by asking why techniques like reinforcement learning from correctness signals appear to have suddenly started to work so effectively across the industry. Does this represent a proper breakthrough, or is this more a culmination of steady incremental progress that has finally crossed important thresholds of practical utility? We also unpack the reasons that nearly all frontier model developers are releasing similar reasoning or thinking models in such a short period of time. Is this simultaneous invention driven by obvious next steps, or is there more cross pollination somehow happening behind the scenes? We then consider the relationship between reasoning and agency. Will these reasoning advances translate to agentic capabilities, or is something more still needed? From there, we look at the role of human data in shaping model behavior. How does Google think about collecting human reasoning and step by step task processing data? And how intentional has Google been in training models to follow recognizable cognitive behaviors versus letting them develop their own problem solving approaches during the training process. We also exchange intuitions about the relationship between models internal feature representations and the patterns of behavior they use to leverage them, consider whether reasoning in latent space should scare us or can be made safe via mechanistic interpretability, and discuss whether the application of reinforcement learning pressure to the chain of thought itself should be avoided as OpenAI recently argued in their obfuscated reward hacking paper. Finally, we'll discuss the roadmap from our current capabilities to AGI. What are the remaining bottlenecks? Do we need a memory breakthrough or will continued scaling of context windows be enough to overcome all practical limitations? And should we expect deep integration of more and more modalities as we've recently seen with text and image? Throughout our conversation, Jack provides thoughtful, nuanced responses that absolutely should help us improve our understanding of today's AI systems, the work going on inside Frontier Labs, and the overall trajectory of AI development. Personally, I leave this conversation with the sense that for most developments we see from the Frontier Labs, the simple explanation is the best one. There's still a lot of low hanging fruit left in large language model development. Researchers have internalized the bitter lesson and are trying to keep their approaches as simple and scalable as possible. And the rapid progress we observe is mostly the result of pursuing pretty obvious high level conceptual directions and then methodically chipping away at the practical engineering challenges required to make them work at scale. The teams involved, as you'll hear, are seriously concerned with developing the technology safely, but are also feeling both a high level of genuine excitement and competitive pressure that keeps them moving forward as quickly as possible. As always, if you're finding value in the show, and I definitely think this is one of the higher alpha episodes we've done, we'd appreciate it if you'd share it with friends, read a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. And considering that the future is radically uncertain and the stakes are crazy high, with outcomes from a post scarcity disease free utopia to an existential catastrophe or even outright human extinction, all live possibilities in just the next 2 to 20 years, I take my responsibility in making this show extremely seriously, and I earnestly invite your feedback and suggestions. You can reach us either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now I hope you enjoy this insider's perspective on scaling large language model thinking and the path from here to AGI with Jack Rae, principal research scientist at Google DeepMind. Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work. Welcome to the cognitive revolution.

Jack Rae: (5:20)

Cool. Thank you so much for having me.

Nathan Labenz: (5:22)

I'm excited for this conversation. Congratulations on Gemini 2.5 Pro experimental 03-25, I think it is.

Jack Rae: (5:30)

You know,

Nathan Labenz: (5:30)

the long name doesn't reflect what a big release this is. And, obviously, that's a common trope in the model wars these days, but it is a big deal. You know, in my estimation and in my testing, this has been the first time that I would say a Google DeepMind model has been the number one model in many important respects, and it has also kind of given me one of those hair raising moments that don't come along too often, although, you know, remarkably often. But when I dumped a full research code base into the thing, 400,000 tokens, and said, I wanna extend this. I wanna reuse as much as I can, but I wanna make a really light touch and not mess with other people's code because this is sort of a shared, you know, collaborative space. I was really amazed by how much command the model had of the super long context, and it was hair raising because it did feel like a qualitative difference, you know, a very just immediately noticeable step up. So, you know, we're all still adjusting to what it can do and calibrating ourselves, but I think as the kids say these days, it is safe to say that you guys have cooked on this one. So great work, and I'm really looking forward to understanding a lot of the work that went into it.

Jack Rae: (6:42)

Yeah. And I'm gonna probably say this a lot, but like, you know, we're super happy with this model. We are really happy with the trajectory of our models. And this one was like a true Gemini team effort. Like, I'll probably touch upon this, but, you know, this was a knockout performance from the pretraining team, from thinking, from post training, from many areas across Gemini just really pulling this together. And we feel pretty good about it. We liked it internally. We didn't know exactly how it'd be received. It's great to see that people are really finding it useful. They're really feeling the AGI with it. They're seeing noticeable deltas on real world tasks. So that's been very cool to see. I really appreciate the praise. And yeah, I just want to say this one was a Gemini team full knockout. But I'm really happy to talk about some of the model development and especially things on the thinking side.

Nathan Labenz: (7:30)

Cool. Well, let's get started with a question that I've been thinking about a lot recently, I think a lot of other people have too. And that is why didn't the simple and, of course, I'm sure you guys used, like, more complicated techniques, but here I'm really thinking about the R1-Zero demonstration that a really simple RL setup with a correctness signal can work now. And I'm kinda wondering why didn't that work sooner. I assume many people tried it in many contexts, and I'm not sure if they were missing something or the models were missing something or, you know, what it was that sort of kept that idea at bay for a while and now, of course, you know, it seems to be working everywhere.

Jack Rae: (8:10)

Yeah. You know, I suppose from my vantage point, we've basically been leaning more and more on RL to improve the model's reasoning ability for quite a while, for at least a year within our Gemini large language models. So as we've been releasing models, there has been kind of a greater and greater presence of using reinforcement learning for accuracy based tasks. We're getting a very discrete verifiable reward signal and using that to improve the model's reasoning. And we've actually been doing that before thinking even started and we've been shipping models with that. It's been helping the model's reasoning process. I think really the way I see it is this has been something that's been improving from a lot of amazing reasoning researchers and RL experts for a while. It kind of has hit a bit of an inflection point in progress where it's really captured people's attention. And maybe it feels like there was a kind of a threshold moment for a lot of people maybe around say this DeepSeek technical report. But I think it's been working for a while. There hasn't been one key thing which is like discretely made it working. It's just kind of crossed the capability threshold where people have really taken notice.

Nathan Labenz: (9:30)

Interesting. So fair to say you see that what may seem to outsiders as sort of an emergent phenomenon as more of a mirage. Under the hood, it's a pretty smooth curve.

Jack Rae: (9:34)

I feel that's how I see it. A lot of these capabilities, when we internally track these things, they're kind of going up with sometimes almost like scarily predictable improvement, almost like a Moore's Law style improvement that we see. And I just feel like what I've come to notice, and this happened also from my time in pretraining, is we would have that phenomenon. And each given piece of improvement, each piece of improvement to the reinforcement learning recipe or the model recipe, you don't always know what will help. So there's a bit of stochasticity there. But as you accumulate things in, there is this almost trend of improvement. And then usually what I feel like happens in the public domain is it just crosses these thresholds occasionally where people really take notice and get very excited and it kind of captures people's imagination. And crucially, the model just gets sufficiently good that it really feels like a step change, especially with these kind of discrete releases that we make. Yeah. So that's my perspective on it anyway.

Nathan Labenz: (10:32)

Yeah. That juxtaposition between smooth progress on sort of leading indicator metrics and then the threshold effects of downstream tasks is one of the most, you know, interesting dances in the entire field, I think, and probably will be for a while to come. On your just like personal production function, obviously, there's, you know, everything going exponential in the space right now, and the number of papers and different techniques being published is, you know, in keeping with that. How do you allocate your time or how do you think about allocating your time between reading and keeping up with research that the rest of the field is doing versus, you know, keeping your head down and just, you know, pursuing your own ideas. And are there any AI tools that are making that more manageable for you right now?

Jack Rae: (11:22)

Yeah. I mean, in terms of, like, reading research to kind of doing, like, coding, running experiments, and things, on some level, I don't know whether my own experience is just influenced by also just career progression and changing how I work. But earlier on in my career, I'd spent a lot of time reading research. There's so much to brush up on and also it felt like maybe at conferences and things this is where all of the action happened and it was really about consuming a lot of different ideas and things. Now I feel like, and this could just be partly because I've switched from more of a junior research role to something where we're directing things a little bit more, there's just a lot of very known problems of which there's no research out there that has the solution. The solution is going to be discovered amongst the group of people that I'm working with day to day. The amount of time I spend reading research has gone down definitely a lot since if I took myself from now to 5 years ago or even 10 years ago. But yeah, I still find it very inspiring and useful when people are publishing cool ideas. I still take the time. I use X, I follow people, I use archive filters to try and filter out interesting papers or blog posts or podcast interviews or YouTube videos. A lot of this stuff is coming through different formats now. And in terms of tools, I feel like I have I know this may sound predictable or cliche, but right now I do use Gemini a lot for reading and summarizing and asking questions of papers, especially because that has been one of its forte for a long time. I feel like I can trust its ability to ingest in certainly a whole paper, but even sometimes a collection of papers if I want to add in a bunch of cited papers and then ask questions or ask summaries. That's pretty useful especially because as you read research more and more you start to get a bit more demanding on just cutting straight through to the critical idea and the critical results. And sometimes it's just a bit hard to do that if you don't have the time to pass through the text brute force and look for what you need to know. It's very useful to have the model do this. And that's one thing. Gemini's long context ability is really good. It's been very good at question answering and summarizing for a long span of technical text. So I kinda like it for that and I use that. And that's my go to tool.

Nathan Labenz: (13:43)

Gotcha. So another kind of just striking observation about the field right now is I guess close to all, maybe not quite all of the frontier model developers have pursued what from the outside appears to be, like, a very similar trajectory over the last year. And we basically see now a whole new class of reasoning models which follow a similar paradigm where they sort of have a chain of thought, you know, where they're thinking for a while, and then they give you a final answer. That convergence is something that I would like to understand better. And I don't know if it is just, like, simultaneous invention because, you know, the conditions were just so overdetermined to make that the next logical step. Or, you know, the other theory that you hear is that people are, you know, meeting up at these infamous San Francisco parties and, you know, sharing what they're working on over drinks or whatever. So how would you describe, you know, your understanding of why everybody is kind of developing seemingly very similar ideas in parallel right now?

Jack Rae: (14:50)

Yeah. I think it's just a phenomenon that has existed even before the invention of SF tech parties. People were always looking for where there's avenues of progress. And I think even very small bits of information that we can see a model is improving in a certain way, people very quickly notice that now, especially now we have an unprecedented number of smart people working in AI, an unprecedented amount of compute that allows us to react quickly. And we're seeing that just follow through to an unprecedented level of speed and velocity when there is a new paradigm, let's say test time compute in this case, and there's a bunch of performance and capability to explore in this domain, then people will flood into it very fast. I feel like if even if I think of like how this has kind of unfolded within Google, within Gemini, we assembled the reasoning groups to work on the specific topic of thinking and test time compute in September, October time. And within a month or so of just focusing around this space, we were finding what we felt like were modeling breakthroughs that were very exciting. That led us to shipping our first model in December, an experimental model based on Flash with thinking. And I think if we emulate, if I think about and reflect about how that team's progress went, there was just a very natural process of people exploring in this place and really getting involved and more and more people thinking about it and running experiments and just like progress happened very fast. And I would imagine that is just a common phenomenon now within these very talented research groups. And that's why you get to see suddenly a bunch of reasoning models within a short time span of each other. There's just a very natural phenomenon of curiosity and exploration and talent right now. So people are always super motivated to find the next big breakthrough and explore it as fast as possible.

Nathan Labenz: (16:43)

So can I summarize that as the idea itself was a pretty obvious candidate and the sort of density of low hanging fruit, you know, or the richness of that vein was just so striking once you started to mine it? That sort of accounts for all of the leading developers at least exploring that a bit and then all of them finding that, like, yes, this is really a way we clearly should be investing a lot.

Jack Rae: (17:09)

Yeah. That's how that's at least how things have kind of unfolded within Gemini. And I think also we've been kind of seeing a lot of like initial signs of this making sense and kind of had some initial results. And we also fortunately, this whole thing required a deep confidence in applying reinforcement learning on language models, which is something with Google, we were very comfortable and interested in and working on. So in that respect, there was just a low barrier to entry to really explore this space and then find a bunch of really cool capability breakthroughs from thinking. So it was kind of a natural extension for us. I can't really comment on the other labs but I imagine must be similar things happening across the board.

Nathan Labenz: (17:52)

Yeah. One really like small detail, but I wonder how you would contextualize this for me is in the R1 paper, they had said that they tried reinforcement learning on smaller models and basically couldn't get it to work. And I you know, they seem to be pretty cracked as the kids say, so that, you know, seems like they would have been trying something pretty smart. Later though, it does seem like it's kind of working everywhere. I mean, any light to shed on, you know, what would account for if somebody, you know, in the recent past was trying to apply reinforcement learning on somewhat less powerful base models that couldn't get it to work? Like, does that sound right or wrong to you?

Jack Rae: (18:33)

Yeah. It's completely valid. These things are way more difficult, I think, than often people realize to get working well, even pretraining. Pretraining I think people now consider in the bucket of completely solved, completely obvious. I was working on pretraining let's say 6 years ago where training a large language model 100 billion parameters or more was like a million components of things that could go wrong or diverge and it was kind of in an alchemy stage. Training reinforcement learning on these powerful language models and getting them to reason and think more deeply, I imagine people have tried and failed many times because there's a lot of very key crucial details to get right. So I just think it's hard and it requires a lot of things to be fixed. And when you have 5 things broken, it can be very difficult to you may find one thing that's broken, you fix it and nothing changes and you get disheartened. And at some point maybe you feel like this just doesn't make sense, this won't work. And then it just requires a few iterations of that until more and more things are lined up and then the whole thing starts to shine. I feel like we saw some initial sparks that were very cool last year where just with reinforcement learning the model was using thinking and we started to see really cool phenomenon happening during the thoughts like self correction, exploring different ideas. That's exactly what we would have hoped would just emerge from reinforcement learning but we didn't really know if it was possible until we saw it for ourselves in our own experiments.

Nathan Labenz: (20:02)
>
Hey. We'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better, in test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workload. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Nathan Labenz: (21:17)
>
Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one, and the technology can play important roles for you. Pick the wrong one, and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

Nathan Labenz: (23:14)

Yeah. So how do you those sort of cognitive behaviors as they're increasingly commonly known are you know, obviously there are multiple different ways that those can come to exist in a model. Some you know, one possible explanation for, like, why the RL maybe doesn't work on smaller models is that you need a big enough scale of model and training to have those, you know, sort of begin to take shape at all in a model so that the reinforcement learning can sort of bring them out. But you can maybe also get them to be learned during supervised fine tuning or, you know, or maybe, you know, if you just do enough RL, they can sort of pop out kinda semi randomly. How much work do you guys do to sort of sculpt, you know, and really curate those cognitive behaviors versus how much are you sort of seeing arise, you know, at which stage of the training process?

Jack Rae: (24:08)

Yeah. I think people have different opinions on this. We're a pretty outcome driven team. So at the end of the day, we'll do whatever recipe gives us the best results and best model generalization, the best final result. But I think taking one step back from that, there are some priors and opinions in this space. One school of thought which I'm quite in favor of is whatever the simplest recipe that leads to the model, choose that one. So there's a bit of an Occam's razor. If you can impose fewer and fewer priors into what the cognitive faculties should be and you can still get a really powerful model so everything is more purely learned from data, that always feels like a better approach. That said, we explore human data, we use model based synthetic distillation data, we try and have a lot of things if they can arise from end to end reinforcement learning. So we try everything and then in terms of the final model and the final mixture, we just go with what works best with some kind of preference for simplicity and generalization. So yeah, I don't know if that's a satisfying enough answer. Obviously, we can't like kinda go deep into what our training recipe is and it's also always evolving so fast. But that's the general kind of principles we use.

Nathan Labenz: (25:29)

Yeah. That makes sense. Don't expect you to spill all the secrets. The human data is obviously, you know, has some nice upsides and that we would expect that models trained on it might be a little more human like. I obviously don't wanna overstate how human like they become, but, you know, you would at least hope you might avoid. I guess I wonder, have you seen you know, one of the famous tidbits from the R1 paper was that they reported this language switching behavior in the context of the chain of thought. I've also personally seen that from Grok. I have not seen it from Gemini. Is that something that you guys you know, that or other sort of weirdness in the chain of thought? Were those things that you observed, and did you take any action to try to select against those sort of weird behaviors? Or maybe not necessarily select against, but, like, set, you know, proper priors so they didn't come online in the first place?

Jack Rae: (26:21)

Yeah. I think well, this one is not we okay. I think ultimately one principle is that we want the model to use its thinking tokens to just be a smarter and better model. From that perspective there may be some of a slightly weird phenomena happening and thinking tokens might get quite cyclic or it may appear to be doing to emitting text that's not so useful all the time. But if it leads to the model then being much stronger in solving the problem, one philosophy is that you should just let it do that. This is supposed to be a scratch space for the model to figure out how to respond with the best accuracy, safety, factuality, etcetera. That said, we did notice some things about the thoughts. One, the Gemini thoughts are usually in English. They usually prefer to be in English. We actually found the model was quite strong at reasoning tasks, i18n, so we call it i18n, but non-English reasoning tasks. It would mostly perform its reasoning though in English and that was one question of is this a bad product experience or should we allow it to do that if it allows the model to maybe it's useful that it does that to be quite strong at these reasoning tasks. So that was like one debate over this one. It's kind of like not language switching you could say for the thoughts and just sticking in one language. Another was some of the thoughts especially in the original Flash Thinking launch were quite templated. The model would often choose to use quite like a formulaic structure of how to break down the problem and then formulate a request. And that was another line of research. Do we want this to be very templated? Ideally not. It should be quite natural. It should be the model thinking through the problem and not necessarily always. It feels like if it's maybe always adopting a particular template then maybe it's not getting the most benefit out of that thinking compute. And other things, those are the aspects of the thinking token. We obviously want it to be efficient and maximally benefit the capability of the model. So those are some topics we're always thinking about.

Nathan Labenz: (28:23)

Yeah. Cool. Okay. That's interesting. Just to make sure I have a clear understanding of what I am looking at when I look at the chain of thought, is it fair to say that what is being shown I actually mostly use the AI Studio. Maybe you could comment on that if it's at all different from the, you know, the Gemini app itself. Yeah. But am I correct that I am seeing the full raw, like, unmodified chain of thought?

Jack Rae: (28:47)

Yeah. That's right. We launched in December and then launched again in January. And with 2.5 Pro, in all cases you're seeing the raw chain of thought tokens from the model both in AI Studio and on the Gemini app. This is something we're always thinking about. It's not clear what the best thing to do honestly is. People do like to see the raw tokens. At the same time they can be quite verbose. We might want to create summaries that are actually more useful. We might want to do other transformations. There was a cool piece of work in NotebookLM where there's kind of like a thought explorer with a graph and you can follow different ideas in a graph structure. It's still a pretty new space and I think the best way to surface thoughts we haven't finalized on. Right now, they're the raw thoughts.

Nathan Labenz: (29:34)

Yeah. Interesting. So I was just wondering, you know, sort of what the what if any debate went into the decision to share the full chain of thought? Because obviously, OpenAI initially chose not to and cited a mix of reasons, but I think most people sort of interpreted it primarily as a competitive consideration that they didn't want to share the full chain of thoughts. Everybody could just go run and, you know, distill or, you know, do SFT or whatever on their work. That, you know, does not seem to have proven a durable moat for them, but I wonder kind of, you know, what considerations or what debates you guys had as you decided, yeah, let's go ahead and share the whole thing.

Jack Rae: (30:11)

Yeah. I feel like these kind of decisions, they're often a mixture of input from safety team, from the researchers, from leadership and it really is kind of a complex decision. I couldn't give you like a very specific roadmap but for each release it's like very carefully considered. We have our leaders like Koray, Demis, who will often want to have a very good understanding of the pros and cons. And for me, this isn't something I weigh in on so I'm not really the best person to ask. But I just try and make sure all the models are incredibly strong and we have a lot of good options on the table. So I think it's an area of active exploration. We haven't settled, we're not fixed on one particular way of surfacing these thoughts. And in fairness also for OpenAI, I don't know why they chose to show summaries. We could speculate. They did give us some reasons but I'm sure there could be a mixture of reasons that go beyond just things like distillation to other aspects. I think there was an initial worry from some group of people that maybe if we show thoughts then we have to then start RLHF-ing thoughts to make them look really nice to users and maybe we don't want to encourage models to have deceitful thoughts. There's another school of thought which is once you have these thoughts it's great for interpretability and you can understand how the model formed its output. So I guess there's just a whole debate going on about what's the best way to ingest and communicate this content. From my perspective I just wanted to make sure the thoughts are resulting in a way stronger answer, way more capable model, and that's my main kind of concern.

Nathan Labenz: (31:45)

Yeah. So is it fair to say then that you don't concern yourself with how the chain of thought looks to the user? I mean, OpenAI, of course, recently also put out this obfuscated reward hacking paper where they showed that, you know, fears of reinforcement learning on the chain of thought are not entirely unfounded. You know, they showed that the when they started off with a model that learned to reward hack and then they put pressure on the chain of thought to not reason about reward hacking, that initially would tamp down the reward hacking behavior, but then later, you'd see the reward hacking behavior come back without the reasoning showing up in the chain of thought, thus the obfuscated reward hacking. So it seems like there is something, you know, quite concerning there. I guess, do you see that as, like, concerning? Do you sort of endorse the what I take to be the conclusion of that paper which is like thou shalt not, you know, select intensively on the quality of the chain of thought?

Jack Rae: (32:43)

I think we show the chain of thought right now as part of these experimental model releases and we're trying to get feedback and learn from real user behavior. This is often an incredibly important aspect of releasing any technology. And then we're very seriously taking in feedback, looking at how these things are used in practice and then make more educated decisions on how to kind of surface information from chain of thought in future. And safety is definitely one thing that plays into a big part of that decision.

Nathan Labenz: (33:16)

Yeah. But I guess to just put a little bit finer point on it, like, you could do RL on the chain of thought for any number of different objectives, right, to try to make it more readable or to try to avoid, you know, weird cyclic behaviors or to try to tamp down reasoning about reward hacking, which, you know, may have this downstream negative effect. But, you know, there's definitely a strong school of thought out there that says, don't do that. Like, do you see that as a strong taboo because of the obfuscation that it can create or do you think there's like some way to do it and not have such a big problem?

Jack Rae: (33:54)

I think it's a pretty safe angle to say that we want these thoughts to actually improve the factuality, safety capability of the model. We want it to have that scratch space. We also do, like if we're going to be showing thoughts then we want them to be interpretable and faithful to the computation that the model is taking, and we probably don't want to add training objectives which would encourage things like deceit. So that's I think that's a very valid point.

Nathan Labenz: (34:24)
>
Hey. We'll continue our interview in a moment after a word from our sponsors. It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz: (35:53)

Yeah. Okay. Cool. Going back to the mix of different data types and human data for a second. Yeah. I have tried in my own work a bit to get people to record their chain of thought, you know, even back before all this reasoning stuff. Like, I just personally found that when fine tuning a model by simply including example chain of thought in my fine tuning dataset, I would get, you know, on usually just like one or a few, you know, very small number of tasks, I would just get a lot better performance. So yeah. Because I've worked with other people to try to, you know, help them build their AI applications or automations or whatever, I very often am like, okay. What I need you to do is staple your pants to the chair, and I don't really care how you do it. You could do it in text. You could, like, turn on your webcam and record yourself, whatever. But I need your live chain of thought as you, the expert, you know, whose work we're gonna try to automate actually do the work. Like, we need to know not just what your inputs and outputs are, but how you're thinking about it, you know, why you're making these little incremental decisions along the way. Yeah. I find that to be really hard to get out of people in a lot of situations, And this may be a little bit outside of your, you know, specific responsibility set, but I wonder what you slash, you know, the broader team have kind of learned about how to coax that data out of people, if anything. Or maybe it's just like so hard for you guys as well that you're sort of just like, oh god, we just we'll go with synthetic. But sort of what's the state of actually eliciting human chain of thought out of humans?

Jack Rae: (37:23)

Yeah. Well, you know, your question I guess had 2 components. One was like how do you get that process data so it's not just prompt and then like solution or response but it's actually what was the process that led to the solution. And then there's something like chain of thought which I guess is one instance of that. Funnily enough I think it's really hard to get people to transcribe actual chain of thought faithfully. It's a pretty latent thing. And actually I think part of the reason all of these models, especially Gemini, are able to click into this mode well is because people have already detailed their own thinking process. Maybe when not put it under kind of the task of doing this explicitly but even in essays or various pieces of work or online discussion, people will often break down how they're going to solve the problem or why they're writing what they're writing. So there's already in the pretrained model a bunch of examples of what does it mean to reason through a process. And that's partly why even before we were really trying to bring this out and make it really powerful with reinforcement learning, you could do things like prompt the model, like let's think about this step by step. It was basically doing this zero shot. What I found is though when you put people artificially and say now you have to record all of your reasoning towards a problem, so when it's not happening organically but when it's happening under a directive, it seems to be quite hard to get a lot of value out of that kind of data. But I think that is a bit separate to your other question which was like how can you record a process? And I think that is very valuable. If we can get more and more examples and training of the process in which people are naturally using to solve their tasks, that feels very valuable. I'm just not so sure people are so good at describing their inner monologue and then training on that and that being useful when asked to do that. Yeah.

Nathan Labenz: (39:19)

So when you talk about recording process, are you imagining, like, computer use, like, how people click around and interact with the environment, or what sort of recording are you envisioning there?

Jack Rae: (39:33)

Yeah. I think more in this kind of space where you're gonna solve a more open ended task and you have to do a lot of things, intermediate calculations, maybe actions for example. Yeah. I think that's kind of what I have in mind. Yeah. But you know, this is really kind of I would say part of this question is really kind of then moving into like what's the best way of kind of getting more kind of agentic data and things. You know that kind of side of things, it really isn't my area of expertise so I would be not the best person to chat to about that.

Nathan Labenz: (40:02)

Does that mean you see like a significant distinction between reasoning and agentic behavior? Because I think a lot of people right now have the sense that the reasoning is gonna be the unlock for the agentic behavior.

Jack Rae: (40:15)

No, absolutely. I just feel like reasoning and agentic behavior as a research thing is very tightly coupled. But there is a part you can still kind of segment which part, the critical research questions for acting and creating environments for agents. That part, we have a really good group and we compartmentalize it and there is a group of people that work on that. The thinking area really collaborates when it comes down to the reasoning behind actions or behind responses.

Nathan Labenz: (40:44)

Yeah. Okay. So you mentioned a minute ago that people struggle to write down their thoughts in part because it's a sort of latent thing. So yeah. I wanna take a turn into the latent space with you, if you will. First of all, I'd love to just give you a sort of you know, undoubtedly, overly simplified understanding of what's going on in a model as its reasoning, and then have you kind of sure. You know, critique or elaborate or expand upon it. So my general working model has been that the pretraining process determines what abstractions or representations or features, whatever you wanna call them, a model has to work with, what concepts it has, basically. Yep. And then post training determines the patterns of behavior by which it sort of deploys those concepts and, you know, puts them in juxtaposition against each other and sort of, you know, tries to figure out a path through to a solution. Yeah. My sense is well, react to that.

Jack Rae: (41:52)

Yeah. I think yeah. You know, one way of maybe paraphrasing what you're saying, I largely agree, is pretraining can learn this massive bag of function approximators that allows you to model the whole distribution of both good, bad behavior, strong reasoning behavior, like incorrect reasoning behavior, you get kind of everything. You can try and mold it a little bit with your selection of your pretraining data, but it still is really trying to reflect all types of behaviors and really just trying to understand. So the better you can predict the next token, better you can compress this text, maybe even the better you can understand the whole distribution. During post training you're going to drop a lot of modes, you're going drop a lot of types of behavior and really try and fixate on a couple of types of ways of reasoning, ways of responding or acting on to various different tasks that are important. And then hopefully if we do reinforcement learning really well, you are also going to then learn to compose maybe some more primitive skills to build up your skill set towards like the smaller set of important tasks. I feel like I don't know if I'm critiquing or exactly mirroring what you're saying, but that's how I think of it.

Nathan Labenz: (43:04)

I guess the distinction that and maybe this will sort of blur. I guess part of the premise of the idea has been that, you know, the vast majority of the compute goes into the pretraining and then the post training is, you know, by comparison, a very small, maybe, like, 2 orders of magnitude less. And I think now, obviously, the, you know, reinforcement learning scale is going up as well. And, you know, maybe this is sort of this sort of dichotomy is, you know, ultimately gonna become a spectrum that, you know, certainly is a common theme in everything that I study. Maybe one way to put it is, like, do models learn new, like, fundamental concepts about the world during the post training, or is that largely learned during the pretraining? And is that gonna change as we go from, you know, 1% to 10% or whatever of flops being deployed in that post training phase?

Jack Rae: (43:57)

My sense is they have to. Like it's absolutely crucial if we're going to build AGI that during reinforcement learning stage where we're not just kind of reshaping known concepts but we have to learn new skills especially if we want these models to eventually completely surpass us at very critical tasks. It can't then just be reshaping the knowledge that it's seen from behavioral cloning during the pre training stage. And I think, yeah, that's one of the most exciting research directions we're all in right now is how do we get the composition of reinforcement learning to help cycle up these models capabilities to being incredibly powerful and general and robust. And yeah, I would totally bet on it being during reinforcement learning.

Nathan Labenz: (44:44)

So another big, you know, maybe the one frontier model developer that hasn't joined the reasoning party in full force at this point would be Meta. They did put out though, I thought, a very interesting paper, although kind of a scary paper from some points of view at least, about reasoning in latent space where instead of actually cashing out to a token at the end of a forward pass, they would just take the last latent state before that final decoding, pass that, you know, in as the token for the next as the embedding for the next token position and just kind of let the model chew on its own thoughts for, you know, however many forward passes in a row. To me, there is something quite scary about that. Like, I would like to be able to know what my AI is thinking as much as possible. There also were some nice features about it. You know, there's an attractor state there, I think, where it's like yeah. It required fewer forward passes to kind of reach similar performance. Yeah. And there was some evidence that they could do breadth first search as opposed to, you know, having to go depth first, which seems to be kind of more of the pattern that, you know, the explicit chain of thought yeah. Lends itself to. So what do you think about reasoning in latent space? Like, should we be scared of it? Should we taboo that? Or, you know, are there some ways that we could embrace it safely?

Jack Rae: (46:07)

Yeah. Okay. I think tabooing a piece of technology before it's been, like, researched and understood are never in favor of unless there's, like, incredibly strong arguments to do so. In this case, I would say the reason that people could raise a question mark over it is this interpretability question. We need those latent vectors to be interpretable. I actually want to draw an analogy. So I'd say we should pursue it. If it leads to better thinking and it can be interpretable and made safe, why not explore this direction? It seems very promising. And actually I want to draw one analogy to, I don't know if this is kind of pre-LLMs but MuZero, MuZero was an extension. We had AlphaGo and AlphaZero and MuZero. So there's kind of a series of development. Obviously AlphaGo was the moment where we had a reinforcement learning model kind of be the world champion at Go. The difference from AlphaZero which was essentially only use self play, no SFT and there's many other algorithm improvements, but that's the tagline, to MuZero was instead of essentially unrolling over states, which is happening in AlphaZero, they unrolled in latent vectors. Those vectors could still be decoded into states. And there was a lot of advantages that they found with MuZero to being able to search in this latent space. So I was pretty inspired by that and often when I think about thinking in latent space, I think of this MuZero. That was definitely the most powerful way of that series, it was the most powerful progression. And they still could make it kind of interpretable because they could decode states from these latent vectors. So I think it's quite possible that this could be a very promising direction. I wouldn't rule it out at this stage. Yeah. It seemed a good idea.

Nathan Labenz: (47:51)

Yeah. I guess the skeptic or the safety hawk might say, you know, it's all well and good when you're talking about game states that you can sort of decode to in a, like, quite high confidence way. Right? I mean yeah. Ultimately, there is a game state that this thing sort of has to operate in, and we know that what that is. And it's like, you know, it can't go off into, like, far far away places. But we don't have a similar sort of ground state that we can feel so confident in when it comes to what exactly is going on inside a general purpose AI. And yeah. You know, I've spent quite a few hours reading the outstanding work that Anthropic just put out about tracing language model thoughts. Yeah. And, you know, I think the headlines of that have unfortunately maybe led a lot of people to who are not in the field, you know, to, like, a high level of overconfidence in our ability to really understand what's going on. I mean, much as I think the work itself was awesome, I tend to, you know, also look at, like, well, jeez, the, you know, the replacement models that they create can only explain 50% of behaviors, and there's, like, a lot of error terms that are being added in, you know, to sort of make sure all this is being explained. So I guess, you know, big picture, like, do you think my sense is that the field at large does not think that we're gonna get interpretability working well enough by the time we expect to have powerful or transformative or whatever you wanna call it AI to really be confident in what the models are thinking or, like, why they're doing what they're doing. What's your overall outlook for interpretability? Like do you think it will get there faster and we really will know what they're thinking as we get these like powerful systems everybody's expecting?

Jack Rae: (49:39)

Yeah. Yeah. Like there's a rapid advancement in capability. What I usually believe is that these also transfer not only to the models doing tasks like coding or agentic tasks that people find to be useful in the real world, it also does accelerate mechanistic interpretability. So if we have more powerful models, we have more powerful tools to examine these questions. So it's not super clear to me that the capability is going to improve exponentially and our ability to do mechanistic interpretability or safety work is going to improve linearly and we're to have a massive mismatch. I would imagine the two are going to track each other. But actually to your question about latent vectors versus thoughts in tokens, this is a really good point. In any case you want some really good piece of research and tools, eventually artifacts, that can try and trace how close the actual content of the thoughts is to maybe the underlying computation thus what the outcome of the model's answer will be. I feel like that is just a very interesting research problem and yeah, it's great. That was a really cool piece of work from Anthropic. We have really cool people working on this within Gemini. It's a really important problem and we should try and solve this in any case whether it's latent vectors or continuous tokens. Seems like people like this kind of and need this kind of interpretability from the model.

Nathan Labenz: (51:10)

Yeah. I think it's a huge especially if we're gonna have these things like running large swaths of economy or, you know, heaven forbid, the military, which seems to be more and more the kind of thing certain people are dreaming about. Knowing why they're doing what they're doing is, you know, it seems to me an imperative. One of the big challenges there with the interpretability where, like, the auto interpretability might be a huge unlock or might be sort of a spinning plate that we could, you know, see sort of crash at any given time is the auto labeling of features. And this is another one where, you know and again, the Anthropic work is just beautiful. The interface, the way they've, you know, published it where any of these features that appear in line in the post, you can kind of expand and see, like, what are the actual passages from the dataset that caused this feature to fire. Some of them, I have to say, I look at them and I'm like, I would not have come up with that label. You know? And so this becomes, like, quite philosophical. Maybe I'll ask it in a philosophical way. I'm sure you've seen the paper called the Platonic Representation Hypothesis. I wonder to what degree you buy that hypothesis. And what that means to me is like, a, there's sort of a convergence between models with growing scale, which seems to suggest that they're maybe converging on some one true world model. Do you think that that is actually what is happening and, you know, by extension, like, with further scale, should we be more confident in our reading of what the models are doing?

Jack Rae: (52:41)

Maybe could you like paraphrase the question a little bit? Are you saying across all the different models that are being trained as they're growing in scale they will start to converge more or actually, I wasn't sure exactly.

Nathan Labenz: (52:54)

Yeah. And maybe more like deeply and philosophically, like, are they converging on some actual representation of reality that, you know, we can trust as being well grounded?

Jack Rae: (53:08)

Well, I would say that the only place I feel like I have a very strong theoretical kind of conviction is what is happening with pretraining where as we're approaching and just decreasing perplexity, improving the compression of the text that we see, if we could do this to kind of hit the noise floor, hit the entropy of the text and kind of have like this base optimal text compression, then we would have the model which best understands best understands the world model which generated this text. That is a thing I feel like has a very clear mathematical grounding from Ray Solomonoff's work, even Claude Shannon's work, it's always referenced of how we can, how the optimal text compressor would have the best world model of generation process which generates this text. That does sometimes feel like a philosophical argument though because even that object is not what we really want for AGI. It's not just something which has the optimal understanding of the dynamics that generates text that exists today. We want the model to be trained to then go and do something useful and to be able to faithfully kind of follow the instructions that we give and to do complex tasks that maybe never been done before, to generalize to completely new and unseen environments. So all of those aspects I feel like are not covered by that kind of world model description of what's happening in pretraining and that's why I think even though I'd spent most of my career on pretraining, pretraining is not the only component to building AGI. So in some level I think it sounds like maybe I agree with the hypothesis that you said but also its relevance. I don't think it's the full story of how we build AGI so maybe it's kind of something that's been down weighted in my mind as being the only story I should think about. I do think that once you are starting to get into the realm of training these models with reinforcement learning at scale, they're definitely not all converging to one model. Actually there's a lot of responsibility in doing this well such that we really build the systems that are useful. And I don't feel like right now, you can even see it on the ground, the models are quite different already. There's already a lot of different pros and cons across them and a lot of capabilities that we work very discretely on within Gemini to kind of make them more useful in certain domains that I don't think just naturally arise across the board, across all models. So yeah, it still feels very steerable. It doesn't feel like one eventual process towards one kind of world model of everything. It still feels like very directable from the research side. But you know, I'm not a philosopher. I just try and make these things work really well. So I feel like I would be very interested in hearing what a couple of philosophers that are keeping up to date on AI would think about this.

Nathan Labenz: (56:02)

Yeah. It seems like to summarize and, you know, I think the empirical sciences definitely have a lot to inform the philosophers on as well, especially these days. It seems like you're sort of saying the world model itself is something that maybe everything is converging on, but how you navigate that world behaviorally is, you know, still a vast scope of information or vast scope of possibility where there's not a single right answer, and, you know, that's kind of where the taste and the safety and all these sort of things have a lot of space to explore and diverge. So maybe, you know, last 10 minutes or so, like, how do you map the road map from here to AGI? And, obviously, I don't mean, like, in a detailed technical sense, but sort of, you know, one big thing that Gemini has is really long context. Yeah. Do you think that we can just sort of you know, I guess my read of that is that, like, nothing too crazy is going on there, that it's basically just a matter of scale it up and, you know, have some data where you have to actually have command of long context to succeed and the model will learn from that. I may be oversimplifying. Tell me if I am. But is just kind of continuing to push on that gonna be enough, or are we gonna need some sort of, like, more integrated, you know, more sort of holistic process of memory and forgetting to really have these sort of long running agents that people imagine. I guess, is memory something you think is already solved if we just push on our current levers, or do we need some sort of conceptual breakthrough?

Jack Rae: (57:35)

Yeah. I mean it's a good question. When I joined DeepMind in 2014 I started on in an area that was called episodic memory. Memory is like what Demis, he did his kind of PhD looking into episodic memory and imagination and things. And so yeah, I've always been very inspired by human memory, human episodic memory, the hippocampus. And my own PhD was on lifelong reasoning with sparse and compressive memories. Like how do we have a memory system in a neural network that is expressive and has this huge range of time spans as we have in our own mind. And when I started that PhD I would have never imagined how much progress we'd have made. We now have something that 1 million, 10 million, these kind of context lengths, depending on how you represent your text or your video, they are starting to verge on lifelong scales. But I still don't think memory is solved. I don't think it's all done yet. Think there's some really cool breakthroughs we'll have, even in the memory space. And there was a lot of very cool ideas we had at DeepMind. We had this neural Turing machine, differentiable neural computer. These were a mix of large attention systems but with a lot of different read write mechanisms. My sense is probably something in this space will prevail and this will be a very cool way of having extremely long, infinite lifelong memory. But it's still an active research area. But roadmap towards AGI, I suppose with each piece that we make it does seem to compound very well. So a year ago we released what we felt was a breakthrough in long context and that has ended up stacking really well with our current reasoning and thinking work because we found that there's just a really useful coupling of being able to think very long and deeply about a problem and also being able to use a ton of context, maybe 1 million or millions of tokens. And that has ended up unblocking a bunch of extra problems that we now can solve that if we didn't have both of them we would have needed. So I think that the path remaining to AGI, obviously agents is kind of a super high priority area. Thinking and reasoning, it's not like we're at the kind of endpoint. These models have a long way to go in terms of being so reliable and so general that you really feel like you can trust their response on more and more open ended tasks. So from our perspective, there's still a lot of just make the system better. There's a lot of known bottlenecks right now and we will just continue doing that. Make thinking better and there will be within agents make agents better. But I feel like with combinations of much better agentic kind of capabilities, better reasoning, even ideally better memory systems such that we can have almost like a lifelong range of understanding and reasoning across time, then that will really feel like AGI to a lot of people. The current systems to me feel like AGI. I feel the AGI using 2.5 Pro, it can now one shot complex code bases and that was something we felt like was a futuristic piece of technology 3 years ago and now it's just there and it works, we're always hungry for the next thing. But I think, yeah, those combination of things, much better memory system with a much deeper thinking and reasoning system with a capability to work on with many different tools and action space that's very open ended, that will really feel like AGI. And when it's coming, I think it's hard to say but it's all kind of being developed actively right now. So I feel like it's coming quite fast. And yeah. And that's kinda I feel

Nathan Labenz: (1:01:25)

that too. Yeah. Okay. Two more quick questions, and then I'll give you the floor to share any final thoughts that you have. Yeah. One thing I didn't hear you mention in that description is integration of more modalities. Yeah. And I've been inspired to think these last couple weeks as we've seen Gemini 2 Flash image out and also the, you know, the GPT-4o image out that, boy, there is a lot of power in a deep integration of the text and the image modality as opposed to a sort of arm's length, you know, tool call type of integration. Yeah. Do you see that happening across many more modalities? You know, is there a is there a world in the future where, you know, Gemini whatever pro instead of calling AlphaFold is sort of deeply integrated with AlphaFold such that those latent spaces are actually merged and sort of co-navigated in the way that we are now seeing with language and image? Yeah. AlphaFold's a

Jack Rae: (1:02:30)

good question. I would say, okay, multimodal, you know, I feel a very good design decision for Gemini was that we'd make it multimodal first and it's been incredibly strong with image understanding, video understanding. It had native image generation trained within Gemini 1, actually it's in the technical report. It didn't end up getting released immediately as it was in its first form. But yeah, think to your world model question, having everything deeply multimodal is super important and training everything and getting that world model not just over text but over multimodal video, images, audio. That's been a cool aspect of Gemini and it's great to see now these things are launching, people really liked the native image generation. They loved the fact that suddenly you can edit images, you can do a lot more interactions instead of just calling what would be just a pure text to image model as a tool and then it's very static. So anything that you can bring in to the world model and train jointly, you're going to have a much deeper experience understanding. Think that's very cool. And then it goes to what's the dividing line? Where do you decide when to bring things into the pretraining mix and have them jointly understood? That's a really difficult question. I think right now what you're seeing across the board is a pragmatic choice of almost like the most compressed information sources and large information sources first and then they've been built out. So that was why text I think was a very natural starting place for a lot of these large generative models started with text that's so compressed and so knowledge rich and is available at scale. But then the decision of how to grow this out to maybe smaller scale sources of data or slightly less information compressed is a difficult one. I know in bio for example, genomics, it's very cool to try and co-train generative models with a large language model. People look into that. And I don't know where the dividing line is, but it's going to be something about how much you get from co-training versus just calling as a tool, how much positive transfer there is from all the world knowledge within your text and video and image space to this new task. If there's not much positive transfer, maybe there's not much benefit in co-training it and maybe you just want to learn to use it as a tool. Yeah. I think those are the main decision factors of whether you should bring it all into one world model or leave it as a separate expert system.

Nathan Labenz: (1:04:56)

Well, I'm betting on the one world model, we'll continue to watch the space. Yeah. So last question, and I really appreciate your time and, you know, coming and sharing so much alpha with the community here. But one question people would definitely be upset with me if I didn't ask is yeah. Where is the system card for Gemini 2.5 yeah. Pro. We sort of thought we were gonna get them and it seems like the last couple models we haven't and so I don't know if there's a, you know, a policy on that that, you know, determines like when a model actually gets the full right. Technical report treatment.

Jack Rae: (1:05:30)

Yeah. The kind of approaches with experimental releases, we release these models because we really want to get them into the hands of consumers and developers, get real feedback, understand their limitations. But they are released. This experimental tag means we don't do the full provisioning of these models, we don't necessarily have all the artifacts like system cards. We are moving as fast as we can to get these into a stable state where we feel like they're ready for general availability, there will be system cards when the model is made generally available.

Nathan Labenz: (1:06:05)

Has all the sort of safety testing been done though at this point?

Jack Rae: (1:06:11)

We do extensive, like probably industry, like unprecedented level of safety testing before we release models. But we do have kind of the experimental models may like be kind of, maybe there's like a different level, a tier of kind of testing that we take and partially part of the experimental releases is getting this real world feedback which is also a useful part of testing process. But for these releases it goes through a very standard process in terms of policy team, safety team. There's a lot of red teaming and things. That is happening. Right now we're in this experimental stage and we're racing to get towards general availability which will have even better kind of provisioning and things like system cards.

Nathan Labenz: (1:07:04)

Yeah. Okay. Cool.

Jack Rae: (1:07:06)

Thank you. A lot of the questions I was at a cloud event last week and there was a lot of like, when will it be made available on Vertex? And I'm like, oh, soon. And then it ended up being the next day. So in some of these cases, we kind of under promise, over deliver. Like, these things are happening pretty fast. The technology is also moving very fast. So, yeah, we appreciate

Nathan Labenz: (1:07:27)

Does that red teaming process include the, like, third party red teamers? Do you guys work with, like, anybody like Apollo or Haize Labs or METR? I mean, know the usual suspects.

Jack Rae: (1:07:39)

Yeah. So we publish these Gemini technical reports and we usually detail external red team, but I can't comment on who our partners are at this stage. There's, I think good reasons why we don't always discuss like who our red teaming partners are, but yet we do work with external red teamers.

Nathan Labenz: (1:07:56)

Gotcha. And that will when the technical report comes out, that will have the roster of the external partners?

Jack Rae: (1:08:04)

I'd have to check, but my understanding is in our past technical reports, this is something we acknowledge. Yeah.

Nathan Labenz: (1:08:11)

Yeah. Okay. Cool. Fantastic conversation. I really appreciate you working through all these questions with me. And I guess maybe just in closing, any other thoughts or yeah. Notions that we didn't touch on that you'd like to leave people with?

Jack Rae: (1:08:25)

Yeah. I'm curious if so you've played with 2.5 Pro a little bit so far. Are there any things that kinda you found that it was unlocking that you haven't seen before or any feedback you had?

Nathan Labenz: (1:08:35)

The long context for me was the thing that felt different. You know? I was in a I have a kind of general sort of complaint with, like, almost all RAG apps, you know, regardless of whether it's, you know, a IDE integrated one or otherwise, where I feel like they don't and often this is like a sort of business problem more so than a technical problem. Right? There's like, I pay a flat monthly amount for whatever product. They wanna have some margin. You know? So they sort of set the hyperparameters in a way that tries to, like, give me the best performance they can while also, like, not spending too much money, you know, and just burning off all the cash they have. Right? So that typically, I find, leads to not enough context being included in the model calls. And then I feel like, oh god. You know, so often, there's just something that was could have been there, that wasn't there, that was leading me to, like, you know, not get as good of an answer as I could have. Mhmm. So what I often do is, if I can, I'll just print my entire code base to a single text file and then just paste that into the model. And I do a lot of, like, small, you know, personal projects, proof of concepts. So usually, I can get away with, you know, 100,000 or whatever. I can put that into any of the leading yeah. Models. But this recent one with the research code base, I happen to be the least valuable author on the emergent misalignment paper. Long story, but yeah. I call myself the Forrest Gump of AI because I sometimes wander through these, like, important scenes as an extra, and this happened again here. Yeah. But, you know, I kind of had this research code base and, you know, it's not production code. It's like, you know, folders are sort of like, you know, Daniel folder, Nathan folder. Right? It's like the no. It's not best practices software engineering. We all know that, but, you know, we're just all kind of exploring stuff. This was 400,000 tokens. So it was like unwieldy. Significantly too much for me to put into any other model. Yeah. And the command that it had of it was just incredible. I, you know, really was like, boy. Previous Gemini models obviously could handle that much, but I was never 100% sure if it was really in full command or, you know, only in sort of partial command. But this felt to me like really incredibly strong command of that full context window, and that to me did feel like a real game changer. I you know, without having strong benchmarks or anything, you know, to really ground myself in, my feeling is that I can take dumps of information and have much higher confidence. I know I still don't wanna be overly trusting, of course, but I feel like I can take dumps of information that I don't even really necessarily know what's in there yeah. And be much more, if not, you know, still, of course, not fully confident that the 2.5 model will latch on to what is actually important and help me navigate this yeah. Super deep context even if, you know, I myself, like, don't have a good sense of what's in there at the start. So yeah. That to me does feel like a huge difference because it's one thing to be able to sort of help you navigate long context if you know the long context yourself, but it's a very, very different thing if it can help you navigate long context that you don't have great command of. And yeah. I think there's more work to do to really, like, validate that for myself and obviously, you know, the community at large and you guys all working together, but it feels different. I can say that for sure.

Jack Rae: (1:12:14)

I mean, that's great to hear because I know, like, I worked with a lot of the long context people last year when we were kind of in the run up to the original breakthrough, and I communicate a lot with because I used to be in pretraining for a long time with some of the people that have been particularly focused on making long context really good for 2.5 Pro and yeah, there was a lot of work not only just in the initial phase to make 1 million, 2 million and we'll see more happen but also to make it really effective. So with the 2.5 Pro release, I actually forget the name of the external leaderboard but I think there is an external leaderboard, it's shared on X a lot where it's like 128k context, Gemini 2.5 Pro, it's using it way more effectively than basically any other model out there right now, which is cool. So it's not only that it can go to 1 million but now especially we're seeing it with 2.5 Pro, it feels like it's read everything and it's not dropping things, it's not missing out key details, it feels like it's read and studied all that information. And that kind of gives people a bit of a AGI feel where it's like within a second you feel like you've studied a very large code base and know every kind of detail in quite good understanding level. That's quite a remarkable kind of thing. But, yeah, that's great to hear.

Nathan Labenz: (1:13:35)

Yeah. Well, it's well deserved praise. I mean, these step changes I'll never forget where I was when I first tried GPT-4, and there aren't that many moments in the last, you know, 2 and a half years where I felt like, oh, this is, like, qualitatively different than everything I had used up until, you know, that particular moment. But this was one. It really, it did have that quality where it was like, okay. I can feel a new level of unlock. I'm gonna have to kinda recalibrate myself a little bit to what this makes possible. So that's awesome. Definitely an exciting time. This has been fantastic. I really appreciate it. The final send off, of course, Jack Rae, principal research scientist at Google DeepMind, thank you for being part of the cognitive revolution.

Jack Rae: (1:14:20)

Great. Thank you so much for having me. Cheers.

Nathan Labenz: (1:14:23)

It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to The Cognitive Revolution.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.