In this episode of the Cognitive Revolution podcast, Logan Kilpatrick, Product Manager at Google DeepMind, returns to discuss the latest updates on the Gemini API and AI Studio. Logan delves into his experiences transitioning to DeepMind and the restructuring within Google to focus on AI. He highlights new product releases, including the Gemini 2.0 models, and their implications for developers. Logan also touches on the future of AI in text-to-app creation, the impact of reasoning and long context in models, and the broader industry trends. The conversation wraps up with insights into fine-tuning, reinforcement learning, vision language models, and startup opportunities in the AI space.
SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance at 50% lower cost for compute and 80% lower cost for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive
CHAPTERS:
(00:00) Teaser
(00:54) Introduction and Welcome
(03:10) Exciting Customer Success Stories
(03:56) The Future of Text App Creation
(05:15) Multimodal API and Real-World Applications
(10:37) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite
(13:17) The Evolution of Long Context and Reasoning
(19:19) Vision Language Models and Passive Applications
(21:50) New Launches and Future Prospects (Part 1)
(28:35) Sponsors: Shopify
(29:55) New Launches and Future Prospects (Part 2)
(31:29) Cost Management in AI Models
(31:55) Flash Lite Models and Cost Efficiency
(34:36) Pro Models and Frontier Applications
(36:53) Model Naming and Scaling Challenges
(39:52) Evaluating AI Models
(48:57) Fine-Tuning and Reinforcement Learning
(51:42) Opportunities for Startups
(55:52) Conclusion and Final Thoughts
(56:59) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
PRODUCED BY:
https://aipodcast.ing
Transcript
Logan Kilpatrick: (0:00) We released the experimental first iteration of Gemini 2.0 Flash back in December. Today, we brought Gemini 2.0 Flash, an updated version of it, into production so that developers can actually continue to build with it. We announced pricing: 10¢ per million input tokens, 40¢ per million output tokens, which is a huge accomplishment for us to pull off. We're gonna have the world's best coding model at Google. And I still believe this deeply, and I think Pro is going to be that model. A bunch of the reasoning work that we're doing is gonna be that model that continues to push the frontier for us. The world needs a platform that's hosting all of the publicly available benchmarks and leaderboards and stuff like that. I find it incredibly difficult to just navigate and get a snapshot of how good is this model? There's 20 random benchmarks here and 50 random ones here. They're all split out over the place, and it's just hard to keep track as a developer.
Nathan Labenz: (0:55) Logan Kilpatrick from Google DeepMind, product manager of the Gemini API and AI Studio. Welcome back to the Cognitive Revolution.
Logan Kilpatrick: (1:03) Thank you for having me, Nathan. I'm excited. I'm hopeful that I'm getting close to the record for the most times on your podcast. So I appreciate you for that.
Nathan Labenz: (1:10) I think this might be setting the record at 4 if my count is correct. So, yes, congratulations. That's a rare honor and well deserved. So it's launch day. We'll get to everything that you've launched and what we should be thinking about building with it. Quick little detour though before we get there. You're now part of DeepMind. So, Google obviously is a vast company and is continuing to align, restructure, streamline itself to focus more and more on AI. What's the story from the inside on what it's like to be at DeepMind now specifically?
Logan Kilpatrick: (1:49) Yeah. I'm super excited about this. So we've been—I joined Google 10 or 11 months ago. Literally from day 1, it's been a deep collaboration with DeepMind. DeepMind's gone through all of these evolutions over the last few years transitioning from an organization doing fundamental research to actually productionizing models. And then within the last 3 months with the Gemini app moving over and then AI Studio and the Gemini API, it's now actually an organization that, end to end, does the research, creates models, and then actually brings them to products inside of Google. And I think that's been a shift for them. But from my personal vantage point, I think this is the thing that makes the most sense. Being really close to research—and we already were really close to research through this collaboration we've had—but removing as much friction as possible for us to bring the researchers who actually know how to get the most capabilities out of the models, bringing those 2 things together makes a lot of sense, and it's gonna be a ton of fun. So as an external person who doesn't care about Google reorgs, which is most of the world, the thing that you'll hopefully see is an acceleration of model progress, but also an acceleration of product progress because we bring these 2 teams together.
Nathan Labenz: (3:00) Well, it sure seems from my vantage point on the outside that everything is accelerating, and we've had previews of some of the stuff that is now going general availability today over the last few months. And, of course, there's been just 1 advance after another from DeepMind and others over the last few months. Looking back a little bit, what would you say are the customer success stories or just coolest apps that you have seen come online that have been built with the Gemini API in recent time?
Logan Kilpatrick: (3:32) Yeah. I think the thing that I'm most excited about—and also feels like we have the biggest opportunity here still—is around all these text-to-app creation softwares. And there's a bunch of examples of these. Bolt.new just went live, I think, yesterday with Gemini support. Cursor has Gemini support now and is using 2.0 Flash. Hopefully, we'll see others like Lovable and v0, etcetera, have that support as well. If you look at just the economics of running those products, it's extremely cost intensive, especially as the number of people who know that you can actually do that use case today—put in a text prompt and get a basically working app slash website for free—is a very small number of people. And it feels like that's this new frontier use case slash product paradigm that I think is going to be picked up across all the big players, but I think also there's gonna be a ton of startup activity in this. How do you just build domain specific software for people without those people actually having to know how to code? So I'm really excited about that use case, and I think we have a lot more model progress to still do to become the world's best model at doing coding. But I think even for 2.0 Flash and 2.0 Pro from where we were 6 months ago, it's just an incredible amount of progress. And I think trying to keep pushing on progress in the context of Flash without increasing the price in any dramatic way has been the biggest win for us. I've seen tweets of the LLM usage cost for some of these startups, and it's on the order of 40 or $50,000 a month, and you can imagine what those costs would be with Flash. It's probably $1,000 or less or something like that. It's a 40x cost reduction, which is just crazy. So that's 1 of them that I'm excited about. I think we also previewed in December the multimodal live API, which allows you to do this collaborative real-time conversational video, text interface with the models, and that feels like it is getting us closer to the future that I think we have all been promised with AI, which is this co-presence that's able to see the things that you do and interact with the services you do. So I'm really interested, and I've gotten the entire spectrum of outreach from people who are using it to help them do coding to people who aren't developers who are just using this product experience and they're blind and they're actually just trying to navigate their daily lives using this tool. And it's crazy that the product that we're building as a demo for developers to drive adoption of the API is actually helping people who are just trying to live their daily lives with this product. I think it speaks to where we are in the adoption of this technology, which is everyone is waking up and trying to find the best way to use these tools, which is really interesting to see happen.
Nathan Labenz: (6:19) Let's put a pin in the Lovable and Bolt discussion because I actually—you're teeing me up perfectly because I think immediately after this episode, the next 2 episodes are gonna be with the founders of those 2 companies.
Logan Kilpatrick: (6:32) I love that.
Nathan Labenz: (6:33) We're calling it software supernova because it does seem to be—we're kind of mid-transition, mid-tipping point right now, I feel like, where it is becoming pretty realistic for people that don't know how to code to create at least basic full stack applications. Obviously, they're not gonna create enterprise platforms just yet, but the models continue to advance. So it feels like we're in a very different world quite soon with products and paradigms like that. In terms of living my daily life, I have tried both the AI Studio version, which is the desktop experience where I can share my screen with the multimodal Gemini, and I've also been using it on mobile. And I don't know why, but for whatever reason, OpenAI only has that experience enabled on the mobile app as far as I know. So I've been trying both of them. This is too stupid, but it's also too good of an example. So I got my kids the Nintendo Switch for the holidays, and we're going through historical video games. Right? Because I feel like they're young and the modern games are too overstimulating and plus, let's burn through the old catalog first. We'll work our way up. So we're doing original Nintendo games, Nintendo 64. So we're playing Mario 64 on the Nintendo 64, which is this open world game where you go around and hunt stars and whatever. But the frustration for me is I often don't know what to do. So I've been sitting there with the advanced voice mode on and sometimes showing it the screen and just telling it what level I'm in and having it tell me what to do in the game. What is my objective? Where should I go to find the star? And my kids are really getting used to this. And even my 1-year-old sometimes now comes and wants to take the phone, and he's trying to talk to AI. So it is, I think—and for seniors too, I think about my grandmother all the time. This switch from "you type to it and it gives you text back" to "it really can be co-present with you everywhere." I feel like I've still only dipped a toe in that world, but, man, does it feel like a very different world for people that are not anchored to their desk all day doing computer work. It's like, wow. That could really just be—I mentioned putting it in glasses too. From the company that made Google Glass, I'm sure there's a lot of thought going into that sort of thing. So yeah, that's my experience.
Logan Kilpatrick: (9:00) I feel like—and my guess is we're actually gonna see a lot of this—if folks have been watching this closely, it takes time for this progression to actually happen. I think text was the best example of this. It was kind of a toy demo a year and a half ago, and now at large scale, text LLM applications are broadly being used throughout the world at billion-user scale. I think this co-presence—to be humble about the place that we're in today—we're gonna have to iterate on the API. We're gonna have to bring the cost down. We're gonna have to make it so that it's actually something that developers can build because you can imagine the co-presence cost of having AI with you all the time is probably pretty expensive. There's a reason that we limit the sessions to only 10 minutes right now because there's a whole lot of challenges to scale this up beyond that and have it maintain memory and state and context of all the things that you were just talking to it about. But to me, it's very clear that that's the direction that we're going to go in, and I think it'll be interesting. It does feel like everyone—2 years ago was talking about, oh, everything's a wrapper. There's not a lot of value to be created. And it just feels like that continues to be wrong. It feels like there's a new thing and then all of a sudden, all of these new things that were not possible before can now be created, which continues to just get me excited about the future and also making sure that we're enabling those next things so that people show up and create the experiences that don't yet exist today. So it's gonna be fun. I'm excited. And, yeah, send over feedback as you keep using the real-time mode.
Nathan Labenz: (10:36) Cool. Yeah. Will do. Hey. We'll continue our interview in a moment after a word from our sponsors.
Sponsor: (10:43) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt.
Sponsor: (10:56) You need total visibility from global shipments to tariff impacts to real-time cash flow, and that's NetSuite by Oracle, your AI-powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real-time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.
Nathan Labenz: (12:08) Just taking 1 more beat on the cool stuff with the Gemini API. Last time we talked about a couple things I wanted to check in on the status of. 1 was really long context. Another is just insane affordability, as you mentioned, a huge drop compared to other options. And then, while it's sort of unclear exactly how natively the video is being consumed, you can just feed video and audio multimodal inputs into the Gemini API as well. What would you say is kind of the status of those things right now? Are you seeing use cases where people are like, 200,000 tokens is not enough. I really need to provide more, and therefore, Gemini is the 1 thing I can use? Or, cost-wise, this just wouldn't be affordable if I didn't have this 10¢ per million input token pricing?
Logan Kilpatrick: (13:07) Yeah. This is a great question. I had a long conversation with Jack Rae yesterday, who's 1 of the co-leads for the reasoning efforts, and he previously worked on the long context breakthroughs with Gemini and enabling that from a research perspective and is now 1 of the co-leads with Noam on the reasoning models. And we opined for a long time about how it is funny that maybe the real unlock for long context ends up being reasoning. Because long context is extremely impressive and it's really useful and we do see people using it in production. And I do think there's cases where 200,000 makes a lot of difference versus 1,000,000 or 2,000,000. But 1 of the inherent challenges is just, how many things can the model attend to in the context window? And it works really, really well if you're just asking questions about a couple of things that are in the context window. But if you're trying to put together 1,000 different things that are in the 2,000,000 context window, it gets really hard to do that just because of the inherent nature of how the models are trained and set up. I think reasoning is maybe where this starts to change. So I think us having this really long context window where the models can actually just think through and perhaps in the future, use tools and bring information in and out of the context window, I think is where long context is going to start to make a lot more sense and truly become an enabler. That's where my head's at from a long context point of view. I think video stuff—video, audio still happening natively, image still happening natively inside of the models. With the 2.0 release, we showcased some of the next steps of this native multimodality, which is the models actually being able to output images and audio, and that's available to early testers. I don't know if you're in the early access program or not, Nathan, if you've played around with it yourself. It's pretty good. I think there's more quality work that needs to happen still. But I think where this really starts to change—and we're actually about to roll out Imagen 3 in the Gemini API tomorrow, at the time of this recording, which I'm super excited about. And if you sort of look at, why do we still have state-of-the-art image generation models when we know that the models can natively have this capability themselves? There's definitely this quality trade-off in some domains where you trade off quality for world knowledge. And I think the world knowledge case is actually where this is really interesting. There's a ton of models out there that generate really, really pretty pictures and can do all types of cool stuff, but they lack this world knowledge that the Gemini models have because it's this native capability that's coming as part of the training process. So I think there's gonna be this whole new onslaught of use cases that we'll see happen with native image generation, use cases which didn't work before because the models weren't smart, they were just good at generating images. So hopefully, we'll get that out soon as we continue to hill climb on quality.
Nathan Labenz: (16:01) Cool. Yeah. That's interesting. I think your point about the need to have long chain of thought in order to really fully take advantage of super long context is quite interesting. I have been noticing for myself in application development, I do find I want the hardest thought. Right? So I'll go to the model that's gonna think for me the longest and then I sometimes have to contort myself or contort my inputs to get it to fit into the context. But I do feel like, yeah, I wouldn't 10x dump context into a model that is gonna immediately jump to the answer. And I hadn't really put together why that might be, but I think that's a pretty interesting hypothesis that I could imagine. I look forward to dumping my full million token code bases into the Flash reasoning sooner rather than later.
Logan Kilpatrick: (16:59) Yeah. I'd love to see if you have use cases that haven't worked well historically for long context. I'm super curious if folks in the audience have some—you can do the compare mode right in AI Studio, try it with 2.0 Pro with long context, and try it with reasoning with long context, and see whether or not the extra reasoning steps actually make a difference. And my intuition is that it is going to, which is exciting.
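For anyone who wants to run that comparison from code rather than the AI Studio compare view, here is a minimal sketch using the Python google-genai SDK. The model IDs, file name, and environment variable are placeholders to swap for whatever is current, not anything confirmed in the episode.

```python
import os
from google import genai

# Placeholder model IDs for "Pro with long context" vs. "Flash thinking";
# check the current model list in AI Studio before running.
MODELS = ["gemini-2.0-pro-exp-02-05", "gemini-2.0-flash-thinking-exp"]

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Hypothetical long-context payload plus a question that requires pulling
# many pieces of that context together.
long_context = open("big_codebase_dump.txt").read()
question = "Which modules would need to change to add per-user rate limiting, and why?"

for model_id in MODELS:
    response = client.models.generate_content(
        model=model_id,
        contents=[long_context, question],
    )
    print(f"--- {model_id} ---")
    print(response.text)
```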
Nathan Labenz: (17:22) Yeah. Cool. Okay. I like that. Have you seen anything passive? This is another area where I feel like Flash—you famously said more people should try to spend a dollar a day on Flash, and that is a lot of tokens. And so it's kind of like you need passive applications for most people to get there. A couple of people have asked me about, could I use a vision language model to monitor my factory floor for safety incidents or policy violations? Or my grandmother lives in a senior living community and the seniors hate to wear their fall monitors so that's a constant battle. And probably a lot of them would—to the degree that they have, and my grandmother does have the ability to make her own decisions on this—probably would accept not having to wear that thing if there was another sort of visual monitor that could keep track of her state at any given time. Have you seen anything like that where people are just kind of truly passively, almost Internet of Things-like, sending signals into the API?
Logan Kilpatrick: (18:33) This is a great question. I think it is so core to my thesis of what's going to happen for a lot of these domain-specific use cases because if you take a step back, how would people solve that problem today? You would need to go and buy some custom software that does that, which probably is expensive. It might not generalize well. In my past life, I was a machine learning engineer and we did a bunch of stuff with security cameras and what it would look like to track someone moving from 1 frame where a camera is visible to another frame and keep object permanence of that person. It's incredibly hard. It is not an easy problem to solve with traditional computer vision technologies. And I think vision language models just do this task incredibly well. And the cost basis now, with Flash, is so low. I haven't talked to anyone who actually has this in production, but I have to imagine that this is the opportunity that people are going after. And it's actually not just the bounding box use case and image understanding use case, which I think is really, really powerful—being able to know, here's where an object is. We have a good demo of this in AI Studio if folks haven't tried the bounding box capabilities. If you go to starter apps, there's one, I think it's called spatial understanding, or maybe it's just called bounding boxes. I forgot what the name of it is, but there's an example in there and you can put in images and ask it to identify the objects and it'll throw out bounding boxes pretty much just like you would get out of 1 of those custom bounding box models that you could probably find in open source or something like that. I think these use cases just take time, but my guess is also that as vision becomes more and more prominent, we're gonna see the whole YC startup wave go after all of these ecosystems and industries where they're using domain-specific vision models and not using a general purpose model and your cost basis is just gonna be wildly different. And also you unlock all these use cases which those models are just not actually capable of doing. They're very, very rigid and can't be fault tolerant in a lot of those cases. So I'm super excited about this.
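As a rough sketch of the bounding-box use case described above, assuming the Python google-genai SDK and a local image: the file name, prompt wording, and the 0-1000 normalized coordinate convention are assumptions to check against the current documentation.

```python
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Hypothetical image of a space you want to monitor passively.
with open("factory_floor.jpg", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Detect every person in this image. Return a JSON list where each item has "
    "a 'label' and a 'box_2d' given as [ymin, xmin, ymax, xmax] normalized to 0-1000."
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
)
# Parse the returned JSON and map the boxes back to pixel coordinates downstream.
print(response.text)
```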
Nathan Labenz: (20:41) So let's get into what you're launching today, then. I saw an interesting tweet. You mentioned YC, and the idea was basically that every time the frontier advances, all the YC companies go and kind of see, can this work for my use case? And the report was every time a new model comes out, some subset of the current YC batch companies' products start to work. I saw others are kind of just waiting and continuing to build all the other stuff with the expectation that they're gonna get a model that's gonna tip them from kinda not quite working to working. So what are you launching and, to the degree you can speculate, what is it going to make work that wasn't previously working?
Logan Kilpatrick: (21:27) Yeah, it's a great question. I think—and just to draw a broader point, 1 of the really interesting observations I've had is that, because of how much excitement there is about AI, resource constraints have not made people think as deeply as they need to about this problem. And maybe this is somewhat me, because of Gemini models being at the frontier of cost per intelligence if you look at that as a ratio. But the YC companies are really just so well funded that when we bring the cost of intelligence down by some reasonable factor, it actually, in a lot of cases, doesn't move the needle for these startups, just because they have millions of dollars. And I think it's interesting to see what the outcome of that trend is going to be. And if I had to guess, actually, I think it gives a lot of power to these individual developers who don't have this large amount of financial backing from tier 1 VCs, as an example, that can actually push the frontier of some of these use cases and capabilities, which I think is a really interesting and cool phenomenon. But to answer your question specifically about what we're launching, it's a whole suite of Gemini 2.0 models. So we released the experimental first iteration of Gemini 2.0 Flash back in December. Today, we brought Gemini 2.0 Flash, an updated version of it, into production so that developers can actually continue to build with it. We announced pricing: 10¢ per million input tokens, 40¢ per million output tokens, which is a huge accomplishment for us to pull that off. And then we announced a preview of Flash Lite, which is the smaller variant of the Flash model, which we intend to make available for production use very soon, along with the pricing for that. And then we released the experimental variant of 2.0 Pro, which is the most capable frontier model we have, rounding out the full offering with the Flash reasoning model. So now we have the reasoning Flash model. We have Flash Lite, the smallest model, Flash, which is the most performant cost trade-off-wise, and then Pro, which is the most capable model.
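As a back-of-the-envelope illustration of what those announced prices mean in practice, here is a tiny calculation sketch. The prices are the ones Logan quotes for 2.0 Flash; the monthly token volumes are invented for illustration.

```python
# 2.0 Flash prices as quoted in the episode, in USD per million tokens.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40

def monthly_cost(input_tokens: float, output_tokens: float) -> float:
    """Estimated monthly bill in USD for a given token volume at Flash prices."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# e.g. a hypothetical workload of 5 billion input tokens and 1 billion output tokens per month:
print(monthly_cost(5e9, 1e9))  # -> 900.0 dollars
```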
Nathan Labenz: (23:39) And only the—well, wait. Let's go down the availability too, because you guys have—in my typical work, which is usually focused on proof of concept type stuff (honestly, the way I live my life these days), I never have to really worry about the harder work of making something actually production ready, and I much more get to focus on that easier, faster ascent portion of making the proof of concept work. So when I go to AI Studio and I grab code and go do stuff with it, with 1 exception which I'll mention maybe in a minute, I basically never hit rate limits. So even when something's still in kind of preview, for my purposes, there's enough headroom there for me to do usually all the testing that I want to do. But if I'm Bolt or Lovable or certainly Cursor or whatever, then they would be hitting those limits. So what is super scalable now versus still in kind of experimental access, and just give us the concrete details on how much of these different models we can use.
Logan Kilpatrick: (24:44) Yeah. So 2.0 Flash during experimental, I think the AI Studio UI is different—we don't publish the limits because they can change dynamically based on how much capacity we have. We don't usually change it, but we try not to publish it just in case we need to and don't want to make people sad. But in the API, I think it's 10 or 15 requests per minute on the free tier and then I think 4 million tokens per minute, which is a lot of tokens per minute, to be honest with you. So I think that's probably why a lot of people don't hit it. Unless you have users or are trying to run—people have internal evals and things, or people who are running leaderboards as another example. Those are usually the use cases where people reach out and are getting rate limited. So with the production availability, you can now, if you're on the paid tier—the free tier stuff all stays the same, you can keep using the model, keep tinkering, doing all that. If you're on the paid tier, there's no daily request limit. You can send as many requests every day as you want. I think it defaults to 2,000 requests per minute and still stays at 4 million tokens per minute. And then we're rolling out tonight at midnight new quota tiers. So as you continue to scale usage, you can unlock things like 10 million tokens per minute, 10,000 requests per minute as well to help those who need to keep scaling up.
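For developers scaling toward the paid-tier quotas Logan describes, a minimal client-side throttle along these lines can help stay under a requests-per-minute cap. This is a sketch rather than an official pattern; the 2,000 RPM figure mirrors the default he mentions, but your project's actual limits may differ.

```python
import time
from collections import deque

class RpmLimiter:
    """Client-side throttle to stay under a requests-per-minute quota."""

    def __init__(self, max_requests_per_minute: int = 2000):  # example paid-tier default
        self.max_rpm = max_requests_per_minute
        self.timestamps = deque()  # monotonic times of requests in the last 60 seconds

    def wait_for_slot(self) -> None:
        """Block until issuing one more request keeps us under the per-minute cap."""
        while True:
            now = time.monotonic()
            # Forget requests that have aged out of the 60-second window.
            while self.timestamps and now - self.timestamps[0] > 60:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_rpm:
                self.timestamps.append(now)
                return
            # Sleep until the oldest request falls out of the window, then re-check.
            time.sleep(60 - (now - self.timestamps[0]) + 0.01)

limiter = RpmLimiter()
# limiter.wait_for_slot()  # call before each generate_content request
```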
Nathan Labenz: (26:11) It's wild. The infrastructure behind that is truly an incredible accomplishment.
Logan Kilpatrick: (26:17) Yeah. Lots of TPUs to make all this stuff happen. And lots of—it's, I think for people who aren't at LLM companies, you don't think a lot about this, but there's just a lot of complexity in how many models there are. And we, you know—I see a lot of memes online about the bad naming conventions that we have with our models and stuff like that, which I appreciate. But I think it actually hits at a different point, which is there's just so many different model variants, and it's really difficult. And this is why a lot of the feedback about our experimental model release chain has been like, "Hey, we love these models. Let us use them in production." And the challenge is we have to be really picky about what models we use in production just because the compute footprint that it takes to actually make it so that the Cursors, Bolts, the other YC startups, the developers trying to scale and build companies can get the compute they need is just—it's hard and it takes a lot of compute. So we have to be a little bit more intentional about how we do it. Ideally, we would just GA every model and everyone would take every model to production, and we wouldn't need to worry about it. But yeah, there's just a lot of constraints with doing that.
Nathan Labenz: (27:25) Hey, we'll continue our interview in a moment after a word from our sponsors. Yeah, if you're gonna support millions of tokens a minute, I can imagine why you'd need to be prepared for it. So maybe help me develop my intuition for how should I think about Flash Lite as it relates to Flash. Because my totally candid initial reaction was Flash is already so cheap and pretty fast. Flash Lite comes in 25% cheaper and presumably faster, but also slightly weaker. You've got the table, of course, of all the benchmarks, and it's a little less on most of them. Have you been getting demand for an even cheaper, faster model than Flash? That seems almost hard for me to wrap my head around, to be honest.
Logan Kilpatrick: (28:14) Yeah. So I think the positioning is twofold. One, we wanted to give—technically, by default, the 2.0 Flash price is more expensive than the 1.5 Flash price. And especially given how much we had leaned into the low cost per intelligence of the models historically, we wanted to give people a one-to-one. If 7.5 cents per million tokens was the thing that was enabling your business and we showed up and said, "Hey, by the way, now it's 10 cents," it just didn't feel like a great story for developers given how much we leaned into that narrative. So it was about being able to have an option where it really was a direct one-to-one—not only a better model, but the same exact cost as you were getting before. And I think for the 2.0 Flash model, it just—because of a bunch of constraints—it wasn't gonna be possible for us to keep that same level of price. So it was really about wanting to make sure that we didn't mislead customers into thinking that they were going to be able to continue to push on whatever the Pareto frontier is that they care about with cost-intelligence. And yeah, there's also some other things that as we think about how do we make the cost lower for these models—there's a bunch of features that aren't supported, more of the high-end things. Those Flash Lite models, for example, will never be able to do native image generation or native audio generation. There are a lot of things we can do to keep the cost down as far as serving those models at scale, which the 2.0 Flash default version doesn't give us. It's also kind of similar to—I don't remember when we talked last about this, but we also put out that Flash 8 billion parameter model. We're not releasing the size of the Flash Lite model publicly, but you can kind of think about it as another version of the small model train that we'd previously done with Flash and with Flash 8B. Yeah, Flash 8B—if folks have looked at OpenRouter before—was the highest token volume usage model on OpenRouter, which again is a reasonable proxy for model usage in some context. So the clear feedback for us was developers love low-cost models and there's a huge amount of new use cases you can unlock by continuing to reduce the cost down. So I think if we could have made an even cheaper model, I would have pushed for that as well.
Nathan Labenz: (30:36) Yeah. Interesting. I would love to hear if anybody's listening who fits that description, where a 7.5 to 10 cent price change per million tokens would have made a meaningful difference for what you're trying to do in the world—reach out to me. I want to hear that story. I really find that pretty hard to imagine. I can easily imagine how people might just look at a menu and say, "I'll take the cheapest one" because I'm just processing whatever and extracting addresses out of whatever stream of data. So I can imagine choosing the cheapest by default, but I have a hard time imagining how a business model gets disrupted by that kind of change. But I'd welcome that.
Logan Kilpatrick: (31:18) I think you're right about this. I think a lot of this is just about how we tell the story to the world of how we show up. And I think it'd be easy for that to become the narrative we don't want, especially given how much we've pushed on reducing the cost for developers—you know, "Google's raising the price for developers." That's not the narrative we want. So it was really important to us to preserve the continuity of making sure that we have that low price point available for developers who care about this. But I think, generally, I agree with you. Especially given that the other signal we've gotten, and the story of a lot of the models in the ecosystem, is that if you have great models, people will pay for them. That's actually not the limiter in a lot of cases.
Nathan Labenz: (31:58) For sure. They're all cheap compared to human labor. That's, I think, such a striking two-by-four to the forehead, the fact that it is often kind of glossed over. Let's do the Pro side. So how should I think about Pro? You compare it to Flash. You compare it to other frontier models that are out there, but how should I kind of understand it in the increasingly busy constellation of available models?
Logan Kilpatrick: (32:24) Yeah. I think Pro is where you're not bound by cost in a lot of ways. We didn't release the price of the Pro model yet, because it's still experimental, but it's gonna probably roughly follow the similar patterns to what some of the Pro models have been in the past, which is just a lot more expensive. So it has to be. And I think there's actually—if you look at what is the traditional advice for developers as they're building sort of the frontier applications, it's go with the model that's best, even if it's a premium, make your use case work and then figure out a way to bring the cost down over time by switching to a smaller model or optimizing, doing fine-tuning, whatever it is. So I think it's important for us to continue to honor that flow, which I think actually works. There's a lot of—this is the default experimentation path that developers go on today. I think specifically, the use case where we're seeing the best performance relative to the other domains is coding. And I think we're gonna continue to. And I had a glib tweet a while—I don't even remember when it was, the day before o3, I think, got announced—about how we're gonna have the world's best coding model at Google. And I still believe this deeply, and I think Pro is going to be that model. And a bunch of the reasoning work that we're doing is gonna be that model that continues to push the frontier for us in coding. And it's a domain we need to win, especially if you believe in—and we were talking about the text-to-app creation stuff before—especially if you believe in that sort of trend continuing and the acceleration of developers continuing from an internal software engineering productivity standpoint. So I think that's probably the best use case for it. It still has 2 million context. Maybe it'll have longer context in the future, so that might be something as well.
Nathan Labenz: (34:09) Yeah. We've heard whispers of infinite context.
Logan Kilpatrick: (34:13) You and I were sitting in that room together, I'm pretty sure, when Jeff Dean said we were getting infinite context at some point in the future. He didn't put a date on it at I/O last year in that session we were in together. So, yeah, hopefully, I think that's going to be a huge unlock for folks once we land it.
Nathan Labenz: (34:28) How would you guide people to think about the non-release of—it was Ultra, right, was the largest scale. We've kind of seen the same thing obviously from Anthropic where it's like, "Wait a second. What happened to Opus?" Is there anything you can share to help people just generally understand why we seem to have gone from small, medium, and large to now just small and medium?
Logan Kilpatrick: (34:55) Yeah. That's a good question. I think the historical context on Ultra was basically trying to prove out the research direction that this scaling was going to continue. The reality is they sort of proved with Gemini 1 when the original Ultra model candidate came out that that was the case. But also then there was just all of this continual rapid innovation of, like, all of a sudden, the Pro model was better. Then all of a sudden, now I'm pretty sure Flash Lite is better than the original Ultra model. So it just becomes this question of the cost trade-offs and the infrastructure equation to keep in mind. And you could imagine a world where we have an Ultra model and it's 5% better on every benchmark and it's five times larger and costs five times more than Pro. And then you just start to do the math and think about from a research perspective, where does it make sense to spend our time and energy? And also from an infrastructure footprint perspective, where does it make sense to spend our time and energy? And we continue to see gains on making models much higher quality at the same size or even a factor of a size less than previous models. And I think now especially with reasoning, there's even more question marks in my mind of how much—if we could get a model that was even 10% better on every benchmark, does that make sense given a world where reasoning has so much scaling and there's just a lot of low-hanging fruit work that we can do there? I think back to the conversation I had with Jack yesterday, we talked a lot about this—pre-training scaling and reasoning scaling. I think Google is still scaling pre-training. So this isn't that we're not going to keep scaling pre-training. Whether or not we officially release an Ultra model, I think, is an open technical research question just because of all of those constraints. And it's also kind of tongue-in-cheek where we could rename all the models and say that we have the Ultra model if we wanted to, and maybe that would have been the right thing to do historically, making Pro Ultra and then remapping all the other names. But trying to honor the essence of what was intended through that Ultra model is, I think, part of the thinking.
Nathan Labenz: (37:16) Yeah. I'm on the side of anything that maintains any clarity in naming scheme. So I'm with the consistency—that's the only reason I even know to ask this question. Right? I mean, if it was all renamed, I would be even more confused than I am. So, okay, let's go back to coding and maybe think about the likes of Bolt, Lovable, and Cursor, and Devin, the increasingly many, many options. I can't even keep up with all my AI coding paradigms these days. And they probably can't keep up with all the models. So the question I have for you is, okay, we got all these benchmarks. Probably the new one versus the old one kind of wins on some benchmarks, loses on some benchmarks, whatever. An interesting thread that I wouldn't say is dominant, but has definitely got some weight behind it over the last, say, month or two has been, "Yeah, o1 is great. o3 mini is great. Gemini Pro is great, but for some reason, people still seem to think Claude 3.5 Sonnet new is the best coder and the best coding assistant." And they haven't, in fairness, had time to compare it against necessarily everything. But how would you suggest people think about this? You know, the simplest thing to do would be swap in a model in exactly the same situation that they currently have another model performing the best, but you could very easily say, "Well, you're probably leaving some performance on the table because these two models are probably not going to have maximum performance under the exact same conditions. They're each gonna have their different conditions that elicit the best from them." But how do you find it? And how do you know how much time to invest in that? And, yeah, I don't know. It seems very difficult, and it's even difficult for me as just an obsessive hobbyist. So what's the best practice, or how should one think about absorbing a new Gemini Pro and comparing it to a Claude 3.5 Sonnet and an o3 mini in today's world?
Logan Kilpatrick: (39:14) Yeah. This is such a tough problem space, and I just have an incredible amount of empathy for developers and founders and people who are building stuff because there is no silver bullet. I'll give a couple of reactions to this general thread. First, my own personal example that I think underscores the second point that I'll make. I was doing a bunch of—I was doing the normal web developer thing the other day of trying to get the corners of a table rounded. And I was smacking my head on this problem because of a bunch of weird constraints in the environment that I was working in and using the Gemini models. And I became fed up at one point, and I was like, "You know what? Maybe the Gemini models just aren't that good. I'm gonna go try Claude and see what it does." And I went out of the environment—I went out of AI Studio, went into Cursor, tried it on this very simple prompt and it worked. Single shot, everything just worked. And I started to ping a bunch of people frustrated and I was like, "This is, you know, we need to keep making coding better yada yada yada." And then I went in just for my own sanity. I went and reran the exact same prompt and stuff that I did again with the Gemini models, and it also worked. And I think this was a good example of—really, I was just doing a bad job prompting, and I pulled myself out of this environment that I was in, of this iterative loop with the model. And I started over from scratch again and formatted the question and the context in a different way. And it worked with both models. And I think this point underscores this vibes-based, incredibly unstructured way in which people make these decisions today. I won't name names of products, but I work super closely with lots of teams who have LLMs in production. And you would be surprised—insert your favorite LLM product—by how few evals people actually have, in a lot of cases, to understand what are the metrics that matter for us as we build our product and service. And I think there are two sides of this coin. I think, one, the world needs a platform that's hosting all of the publicly available benchmarks and leaderboards and stuff like that. I find it incredibly difficult to just navigate and get a snapshot of how good is this model? There's 20 random benchmarks here and 50 random ones here, and I gotta go look at Minecraft because that benchmark's really cool. I love that benchmark, but they're all split out over the place, and it's just hard to keep track as a developer. And to your point, this is my job. This is what I spend every waking moment of my life doing, and it's still difficult to keep track of this. And you could imagine for people who have much less time and are doing other things, it's just hard to keep up with it. So I think someone needs to build this platform and maybe this is the Y Combinator call for startups thing. And that is the call for startups—build this platform to help people really bring all this stuff together.
And then separately, I think—and the folks at Kaggle actually are pushing on this notion of having personal evals and being able to build a platform in which, as new models are released and made available to the world, your personal eval will just run behind the scenes on those new models, and you'll just get an email that says, "Hey, based on what you've told us, this model might actually be one that you should spend your time checking out because it's really good at these things and those are the things that you indicated you care a lot about." So I think that type of platform and product experience—taking the burden off of developers to just do this—is gonna be awesome. The challenge in that case is you still have to make your personal eval to begin with, but I think that one-time cost is a lot less than having to do all the spin-up cost every time a new model hits the market.
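To make that idea concrete, here is a toy sketch of a personal eval harness, assuming the Python google-genai SDK. The prompts, checks, model ID, and environment variable are all placeholders rather than anything from an actual product.

```python
import json
import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def is_valid_json(text: str) -> bool:
    """Cheap programmatic check; in practice you may need to strip markdown code fences first."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Tiny personal eval: prompts you actually care about, each paired with a cheap check.
PERSONAL_EVAL = [
    ("Summarize the following release note in exactly three bullet points: ...",
     lambda out: out.count("\n-") + out.count("\n*") >= 2),
    ("Return only valid JSON with keys 'name' and 'score' for this record: ...",
     lambda out: is_valid_json(out.strip())),
]

def run_personal_eval(model_id: str) -> float:
    """Return the pass rate of the personal eval against one model."""
    passed = 0
    for prompt, check in PERSONAL_EVAL:
        text = client.models.generate_content(model=model_id, contents=prompt).text
        passed += bool(check(text))
    return passed / len(PERSONAL_EVAL)

# Re-run this whenever a new model ID shows up, and compare pass rates over time.
print(run_personal_eval("gemini-2.0-flash"))
```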
Nathan Labenz: (43:01) Yeah. It almost seems like if we're gonna take one sort of practical recommendation out of all that for developers, I might say let your users choose because at least that way you can get some data and they can sort of feel a little more agency and, you know, maybe there's something good to find there. I think the personal benchmarks too, but the things I care about typically don't have a right answer. You know, at Waymark, it's always the same challenge—what makes a good video? We can have a language model judge, but now we're in hall of mirrors territory, and we just don't have any—I'll cop to being like, we don't really have evals that work for us. You know? We can detect outright failure to follow the structure or, you know, it's too long. So we have some guardrails that could be the clear "thou shalt nots," I call them. The thou shalt nots of this task—we can detect those. But past that, it is really still vibes. And, yeah, it's interesting. We don't actually let our users choose, but that could be interesting. Our users also don't even know—you know, they're not in this world. So the developer's use case is more what I had in mind when I was thinking, well, let your users choose.
Logan Kilpatrick: (44:14) I think you hit the nail on the head with that trend. And I don't even know how much of this is a conscious decision that founders are making. But if you look across—again, some of the, you know, choose your favorite LLM products—with the exception of Waymark, developers, and actually end users, have this choice today. You go in, there's a model dropdown in most of those products and a lot of them that are not created by large model providers have most, if not all of the models available. And the timely example of this is Copilot. Copilot was historically, two years ago, just powered by GPT models. And given the developer community and where the world has moved, it now has a model dropdown, and you can choose the Gemini models or the Anthropic models or the OpenAI models. And I think more and more products are going to go that route. Also—this is maybe a corollary to a point that a lot of people make, but I think, you know, people talk all the time about the commoditization or the contraction of the delta between these models. I think that's actually not true. I think there's a lot of this weird nuance that is going to continue to sprawl out over time where you will get substantively different answers from, you know, insert whoever your favorite model provider is over time. Even if the capability is similar on paper, there's still all of this stuff that will be different. And I think it'll be important that people continue to have the choice because I think that that subtle difference makes a lot of impact on the end product experience people are going to have.
Nathan Labenz: (45:46) Yeah. No doubt about that. Even DeepSeek over the last couple weeks has, I think, shown a very different profile and—perhaps not surprisingly given its source—closest to a base model is probably my best description of it. And for that reason, it can sometimes write in ways that are extremely compelling, but it's also much less behaviorally refined in the way that the Western, you know, at least top-tier model provider-created models are. So, yeah, a lot is—and that's all with very similar benchmark results. Right? So there's a lot beneath the surface that remains to be unpacked with any major new model release. Let's talk for a minute. I know you have to go before too long. Last things I wanted to cover were briefly fine-tuning because as a developer, that's something I'm always interested in and would love to know the status there. Then maybe what's up with reinforcement learning at Google? I mean, we've seen the thinking model. It struck me that there wasn't much actual mention of reinforcement learning, whereas the other developers are coming out and saying, "We did a reinforcement learning model." It seems like DeepMind has kind of positioned it differently, although I assume that a lot of the same techniques are happening under the hood, but I don't know. And then I was actually gonna ask for your call for startups too because I know that you have recently raised a solo venture fund, and we'll see if we can get you some deal flow. So, yeah, fine-tuning, reinforcement learning, your fund, take as long as you have.
Logan Kilpatrick: (47:22) Yeah. Fine-tuning is something—I continue to be incredibly bullish on fine-tuning. I think the future is one where everyone in the world is using their own fine-tuned model, a version of the model that has the context that they need it to have, where the context isn't being overly impactful on the priors of how the model makes decisions—I feel like that's the rough shape of how I think about fine-tuning. So we don't have it yet for 2.0 Flash. I think we need to. I think we've been having a lot of internal discussions about what is the size of investment we want to make in fine-tuning. To me personally, I think this is one of the biggest opportunities. Your point about developers wanting this, I completely agree. So we'll keep pushing on it. It's not available yet. Hopefully, it will be available soon. And even more than that—there are a lot of limitations on how we do fine-tuning with 1.5 Flash today. You can't do images. There's a bunch of rough edges. So we need to solve all those things and make it so that it's really a first-class experience. Your second question.
Nathan Labenz: (48:26) Reinforcement learning.
Logan Kilpatrick: (48:26) Reinforcement learning. Yeah. RL is part of making those reasoning models. Part of why we historically have not talked a lot about it is that we did this sort of normal, low-key approach to doing the release, which is what we've historically done for our experimental models. I think as the world is getting more and more excited about what's happening with reasoning models, we're going to start talking a lot more about that work. And yeah, I'm excited for us to tell more of the story about the work that we're doing. Actually, the reasoning narrative inside of Google is one of the narratives that gets me most excited about the direction that we're going in. There's just been so much progress and a ton of breakthroughs. So hopefully soon, we'll tell that story to the world. Yeah, we'll get you some people on the podcast to talk about it hopefully soon.
And then, yeah, startups. This is the moment to build impactful, interesting companies. Vision is one of those things that I'm really excited about. I still think, take the entire ecosystem that's built on domain-specific computer vision models. I think all of that is up for grabs with vision language models. I think there's gonna be a huge amount of startup activity in that space.
I think reasoning is going to make agents work, which I'm excited about just because of how many... I'm not an agent investor in a lot of ways, but I think there's just so many companies that are trying to tackle problems with agents. And a lot of them just don't work today. And I think that's actually going to be the biggest unlock. We were talking about before, the new models enabling some percentage of Y Combinator startups to actually work all of a sudden. I think reasoning is going to be that continual breakthrough that more than anything else that happens in the next two years is going to make those companies' products actually work, which I think is really exciting for them. I think it's exciting for the world as we sort of figure these things out.
I think there's a whole class of startups which don't yet exist, and I've started to talk to some of the people building these companies around how the fundamental nature of the internet changes in a world where you can't assume the only thing that's visiting your website is a human, with the exception of index crawlers to make your content more discoverable, which historically was the only non-human thing that was visiting your website. The social contract of the internet is not set up for that, and I think it's going to be very, very interesting to see how fundamentally things change with websites being locked down or what the protections that are in place are or all of that type of stuff.
I think there's just all of this new, very basic "how we engage with the internet" experience that is going to change over the next few years. And I think there's a lot of people who have businesses and websites and companies that are going to have to... they're not going to be able to solve that problem themselves. They're gonna need somebody else to solve that problem for them. And I think there's some interesting companies that can be built to enable that sort of thing: how your website talks to agents, how attribution happens, how you're able to capture your slice of the value creation that's taking place. So I'm interested to see what happens in that space.
Nathan Labenz: (51:48) Yeah. That seems like a really good candidate for what's gonna change the world most in the not too distant future. First of all, it seems like it's really likely to work with a reinforcement learning paradigm because you're gonna get a pretty clear signal from a lot of tasks as to whether or not it's succeeded. And yeah, I've been watching Payman a little bit recently. Any specific companies you would suggest people look at or any specific problems you think are most... come pitch Logan if you're working on this?
Logan Kilpatrick: (52:19) Yeah. I think evals is the other one. Actually going back to this point, even for you as somebody who I would classify as incredibly sophisticated in this space, you understand what's happening in the ecosystem, you understand why evals might be important. The fact that you can't... the fact that it's so hard for humans to articulate your taste and your perspective on some of these problems into something that can happen programmatically, I think is really, really interesting. I don't know how that problem gets solved. But actually, if you've spent too much time in the evals rabbit hole like I have, then one of the big realizations is that most of the problems in life end up being eval problems. It's just very interesting, if you follow that chain of thought, how things end up happening.
Nathan Labenz: (53:10) Yeah. For now, I just do expert demonstrations. For Waymark, we just go and say, creative team, write us a bunch of good stuff, fine-tune on that, or put those into a few-shot examples and hope for the best. And from there, it becomes vibes. But yeah, it would be nice to have something better. So yeah. Well, where can folks find you if they wanna pitch you or if they have... and you're incredible online, as everybody knows, at responding to questions, concerns, and issues around the API. I don't want to bring more of that to you than you already have, but where should folks find you if they want to either point out an issue or pitch you a startup?
Logan Kilpatrick: (53:49) Someone responded to one of my tweets the other day and they said, I miss the old Logan. He used to reply to all of his replies on Twitter, and I then went and looked at how much time every week I spend on Twitter. And I was like, my time doesn't scale here. I'm already putting in... Yeah. You're gonna need an agent. Yeah. Way too many hours. But yeah, Twitter, LinkedIn, everywhere that the internet exists. Hopefully, I'm there helping people with Gemini stuff.
Nathan Labenz: (54:16) Cool. Well, Gemini 2.0 Flash is out today. General availability, it is good, fast, and cheap. So definitely one to check out. Logan Kilpatrick from Google DeepMind now. Thank you again for being part of the Cognitive Revolution.
Logan Kilpatrick: (54:31) Thanks for having me, Nathan. This was fun.
Nathan Labenz: (54:33) Always a pleasure. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.