New in Nature: Google Agents Beat Human Doctors, Make Scientific Discoveries – With Vivek Natarajan and Anil Palepu

In this episode, Nathan Labenz speaks with Vivek Natarajan and Anil Palepu from Google DeepMind about their groundbreaking work on AMIE (Articulate Medical Intelligence Explorer) and Co-Scientist. The conversation reveals how these AI systems are already outperforming human physicians in diagnostic accuracy and treatment recommendations, with AMIE now entering clinical trials at a Harvard Medical School teaching hospital. Even more remarkably, the Co-Scientist system demonstrates genuine scientific discovery capabilities, independently proposing the exact same mechanism for bacterial drug resistance that human scientists had recently discovered but not yet published—signaling a threshold moment where AI is becoming a legitimate thought partner in humanity's most complex intellectual endeavors.

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive

Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(06:26) Introduction and Welcome
(07:49) AMIE Medical Intelligence Overview
(11:29) Chat-Based Medical Interactions
(13:59) Specialized Medicine Results
(18:17) Co-Scientist System Headlines (Part 1)
(18:21) Sponsors: Oracle Cloud Infrastructure (OCI) | Shopify
(21:36) Co-Scientist System Headlines (Part 2)
(26:32) Bacterial Drug Resistance Discovery
(31:41) AI Scientific Discovery Process
(35:35) Hallucinations vs. Creativity (Part 1)
(35:37) Sponsors: NetSuite
(37:10) Hallucinations vs. Creativity (Part 2)
(42:04) Agent Design Architecture
(49:17) Long Context Benefits
(55:35) Computational Requirements
(01:01:10) Specialist Models Integration
(01:07:01) Future Model Integration
(01:12:31) Tournament Evaluation Methods
(01:19:09) AI Question Generation
(01:22:42) Real-World Deployment Plans
(01:25:58) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz: (00:00) Hello, and welcome back to the Cognitive Revolution. Today's episode features an eye-opening conversation with Vivek Natarajan and Anil Palepu from Google DeepMind. Their groundbreaking work on AMIE, the Articulate Medical Intelligence Explorer, and Co-Scientist represents what seems to me an important threshold moment in AI capabilities. I always say that if people truly understood what AI can already do today, many would be fundamentally rethinking their plans. And these projects provide perhaps the clearest evidence yet that AI systems are beginning to outperform highly intelligent humans in domains that require years of specialized training. Remarkably, this work was accomplished without special continued pretraining or extensive custom post-training that could only have been done within Google. On the contrary, these approaches could have been developed and can be replicated by Google's API customers using commercially available models, advanced prompting techniques, and thoughtful agent design. We begin by discussing AMIE. A year ago already, Vivek and coauthors showed that AMIE was able to outperform human general practitioners in diagnostic accuracy. Now, with just a few important caveats remaining, Anil and team have demonstrated that it also beats human primary care physicians in analysis and treatment recommendations. The implications for health care access are obviously profound and are beginning to extend into specialized medicine too. The second AMIE paper we cover shows that the AI system is already surpassing medical fellows in both cardiology and oncology and closing in on, but still falling a bit short of, attending-level performance.
Notably, when cardiologists have access to AMIE, their performance dramatically improves across almost every metric, suggesting a short-to-medium-term future in which AI doctors have the potential to both raise the floor for access to quality care globally and also raise the reliability ceiling even for those of us fortunate enough to have access to first-world specialized care. This is, to put it plainly, crazy, and I am super excited that Google is moving AMIE into something like a clinical trial, in partnership with Beth Israel Deaconess Medical Center, a Harvard Medical School teaching hospital in Boston, for real-world validation. All that said, somehow, in Vivek and team's Co-Scientist paper, we see something equally, if not even more, amazing. This multi-agent AI scientist system, which is capable of accepting human input and feedback at any step in its process, was tested in fully autonomous mode on 3 increasingly complicated scientific challenges. First, drug repurposing, an advanced but reasonably well-defined task amenable to combinatorial analysis. Second, therapeutic target identification, a more open-ended challenge requiring the AI to understand and/or make quality hypotheses about causal relationships within cells. And third, and definitely most dauntingly, the wholly open-ended challenge of understanding the process by which bacteria achieve drug resistance. As you might have guessed, Co-Scientist, which, by the way, Google is now making available to trusted partners, succeeded on all 3 of these tasks. And on the challenge of understanding drug resistance in particular, it blew everyone's minds by proposing the exact same mechanism that Google's independent scientific collaborators had recently discovered experimentally but had not yet published at the time of Co-Scientist's analysis. Overall, Co-Scientist demonstrates that AI systems are now capable of generating novel insights by connecting the dots between far-flung bits of hard-won human knowledge.
This system is not simply regurgitating its training data. On the contrary, it's performing meaningful synthesis and proposing novel hypotheses that even human expert scientists recognize as both insightful and significant. If all that's not enough alpha for 1 episode, the implementation details behind these systems offer valuable lessons for AI engineers everywhere. First, structured reasoning proves far more effective than simple chain of thought approaches, especially when working with lots of input context. Both of these systems demonstrate the value of thinking carefully about exactly how you want your AI system to reason about specific types of problems. Second, finding ways to add new information or even just a bit of entropy, such as by giving the model access to search, is key to making self critique and self improvement schemes work over many rounds of successive iteration. And third, for now at least, the tournament style evaluation process used to surface the best candidate hypotheses out of the many that were generated seems to be an industry best practice that you can and should use in your own work. What's most amazing to me about all of this is that it was achieved before Gemini 2.5 Pro was available to use, meaning that everything we talk about today is still subject to a step change improvement that should come more or less for free with a simple model upgrade. With this level of performance already established and core model progress continuing, the path to an AI doctor in your pocket and data centers full of AI geniuses is honestly becoming quite clear. AIs are no longer just tools for routine tasks. They are becoming legitimate thought partners in some of humanity's most complex intellectual endeavors, from diagnosing disease to expanding the very frontiers of scientific knowledge. 
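For readers who want a concrete picture of the tournament-style evaluation mentioned above, here is a minimal sketch in Python. This is an illustration only, not the Co-Scientist implementation: the `judge` function stands in for an LLM pairwise comparison (here it trivially prefers longer strings), and the function names, Elo parameters, and round count are all assumptions.

```python
import itertools
import random

def judge(a: str, b: str) -> str:
    """Stand-in for an LLM pairwise comparison; here, the longer
    hypothesis simply wins. A real system would prompt a model to
    pick the stronger of the two hypotheses."""
    return a if len(a) >= len(b) else b

def elo_update(ra: float, rb: float, a_won: bool, k: float = 32.0):
    """Standard Elo update: the winner gains rating from the loser,
    scaled by how surprising the result was."""
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = 1.0 if a_won else 0.0
    ra += k * (score_a - expected_a)
    rb += k * ((1.0 - score_a) - (1.0 - expected_a))
    return ra, rb

def tournament(hypotheses, rounds=3, seed=0):
    """Run round-robin pairwise comparisons for several rounds and
    return the hypotheses sorted best-first by final rating."""
    rng = random.Random(seed)
    ratings = {h: 1200.0 for h in hypotheses}
    for _ in range(rounds):
        pairs = list(itertools.combinations(hypotheses, 2))
        rng.shuffle(pairs)  # vary match order between rounds
        for a, b in pairs:
            winner = judge(a, b)
            ratings[a], ratings[b] = elo_update(
                ratings[a], ratings[b], winner == a
            )
    return sorted(hypotheses, key=ratings.get, reverse=True)

ranked = tournament(["h1 short", "h2 medium length", "h3 the longest hypothesis"])
print(ranked[0])  # → "h3 the longest hypothesis"
```

The key design point is that pairwise comparisons plus a rating system surface the best candidate without requiring the judge to score hypotheses on an absolute scale, which language models tend to do inconsistently.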
As always, if you're finding value in the show, please take a moment to share it with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. And please know that I sincerely value your feedback and suggestions. Whether we get to live in a post-scarcity society in which we all enjoy instant access to superhuman AI doctors, or perhaps on the other extreme end up going extinct due to some crazy AI-driven scientific accident, seems to me to depend largely on how responsibly we handle the upcoming AI transition. And I take my role in AI discourse very seriously. If you think I can be doing better, please contact us via our website, cognitiverevolution.ai, or feel free to DM me on your favorite social network. For now, I hope you enjoy this conversation on the emergence of genuine AI expertise, which you almost certainly would have considered to be AGI just a few short years ago and which I think you should still find absolutely mind-blowing today, with Vivek Natarajan and Anil Palepu from Google DeepMind. Vivek Natarajan and Anil Palepu, authors of Co-Scientist and AMIE from Google DeepMind, welcome to the Cognitive Revolution.

Vivek Natarajan: (06:35) Thanks for having us.

Nathan Labenz: (06:37) Hi, guys. Again, so Vivek, this is your fourth time. What an unbelievable heater you and the team at Google have been on. I always say, people, if they just had a little bit better sense of what is already out there today, you know, they would be updating their plans in many ways that I just don't see people doing. So this is really an unbelievable example of that. There's 3 papers we're gonna go down the rabbit hole on today. And I came away from this feeling like it might be time to call it. The AIs might in fact now be clearly smarter than me. And, you know, we can get into the nuances of what they're still missing a little bit, but I think almost everybody would read these reports on what you've been able to get the AIs to do and feel like they would have a very hard time matching that. And to get to this level that a single AI model, you know, scaffolded in different ways and put to different purposes, has reached, it would be like years of undertaking for me, for sure. So let's do some headlines. So there's 2 papers with AMIE. We covered this once before. This is the what is it? The Articulate

Vivek Natarajan: (07:51) Articulate Medical Intelligence Explorer.

Anil Palepu: (07:53) Yeah. Sorry

Nathan Labenz: (07:53) about that. The Articulate Medical Intelligence Explorer. So I've been using a graph in some slides that I occasionally present, for like the last year or so since the first AMIE paper came out, showing that basically when a patient chats to the AI, the AI is more accurate in its ability to diagnose the person than human primary care physicians are, as judged by other human doctors. So now we've got 2 new extensions to that. The first 1 is, and there are some caveats here that I think are definitely worth unpacking, I'll give you a chance to do that, it's basically now outperforming general practitioners not just on the diagnosis part but also on reasoning through what to do about it and ultimately recommending treatments, which is obviously a big part of what the doctor is meant to do for you. So, yeah, just unpack that headline for me a little bit. I mean, it's crazy that that is out there in the world today. And again, I always say, like, I swear when I was a kid, if something like this happened it would have been headline news, you know, everybody would be talking about it, and there's just so much going on that some of this stuff, dramatic breakthrough as it is, doesn't seem to crack the consciousness. So tell us more about AMIE outperforming now, not just on diagnosis, but also on recommending treatments.

Vivek Natarajan: (09:12) Yeah. Yeah. I mean, like, prior to those first 2 AMIE papers, a lot of the work in this space was, you know, on medical question answering. And there was some notion that these language models do encode clinical information well and have a lot to offer. I think with those papers, we were trying to start to ask the question of, okay, but that's not clinical practice. Like, in clinical practice the doctor is interacting with patients. They have to gather this information themselves. They're not really presented with all the information up front. And so that was the study, right? It was doing this objective structured clinical examination format. And it was basically trying to see, like, can the doctor, or the AI in this case, interact with patients, gather that information, and still get to that diagnostic endpoint? And of course, I think we do a really good job in the paper, and I encourage people to read it, of describing the many limitations, right? Like, this is a text-based chat; that's not how doctors talk to people. And our direction from there was really about starting to unpack some of these limitations. And so 1 of those limitations is this idea that there's more to clinical care than diagnosing a patient from the first visit. It's about managing a patient over, you know, multiple visits. The endpoint the first time you see a patient might really be that we need to just order the right test and set them in the right direction. It's not always that you know exactly what to do with the patient after seeing them 1 time. And so I think the management reasoning paper is really trying to unpack that. And, you know, we'll talk more about how that study was designed. I think it's, like, super interesting, right?

But there are other aspects to it as well. Like, rather than kind of more general recommendations, can we get really precise, and can we ground in accepted clinical practice guidelines? Can we ground in medication labels and start to turn these into slightly more actionable things? And similarly, with the specialty papers, right, it's trying to expand beyond kind of the bread-and-butter common presentations of common diseases. Like, how does this work in more niche areas of medicine? And so I think we have a lot more work kind of just trying to expand on some of these limitations that we identified in the first paper.

Nathan Labenz: (11:35) So how would you summarize the, I mean, if I understand correctly, this is still a chat-based interaction today. Right? So 1 major possible extension would be to go multimodal. But you guys have also done work on that separately. Right? So maybe, like, why is it not multimodal in this particular study? Was there a reason not to just let people, you know, throw selfies into the chat?

Anil Palepu: (12:02) Yeah. I think it's more like when you're doing research, you want to isolate the components that you're studying and do that well. And so when you add multimodal artifacts, that necessarily adds, like, more confounders over here. So we try to avoid that, and it's just easier to study a text-based system to begin with. But clearly, we've done work on multimodal before. I know that, like, on 2 podcasts we've spoken about, like, the Med-PaLM M work and Med-Gemini. So all those components and pieces exist, and so very soon you'll see the multimodal 1 also come out. So yeah.

Nathan Labenz: (12:34) Yeah. Okay. The pace is relentless. That's for sure. So just to put a bottom line on this 1 more time: basically, we have, with the caveat that it's not yet multimodal in this particular paper, although that's coming soon, multi-visit, you know, kind of longer-time-horizon interactions between patients and doctors, where the AI doctor is outperforming the human doctors on both the diagnosis and the reasoning through what to do, and ultimately landing on sort of standard-of-care, you know, accepted proper treatment for these conditions. Anything else that, you know, we should, should I soften that at all, or is that like a good

Vivek Natarajan: (13:16) I mean, I think,

Nathan Labenz: (13:17) because that should be like on billboards, right? I

Anil Palepu: (13:19) mean Yeah.

Vivek Natarajan: (13:19) I think the biggest thing, right, is, you know, and we'll talk more about how we're trying to test this in the real world, but these are simulated consultations. Like, these are patient actors, they're not real patients. And obviously there's a whole new set of challenges when you get real patients, where all kinds of things can happen and they're not necessarily going to stick to a script. So I think that's a whole other thing that we need to test and validate, that our results truly do translate to the real-world setting. That being said, I think the simulated consultations do show that we have promise in this setting. And we personally are very optimistic that these results would translate, at least to a certain degree.

Nathan Labenz: (14:02) Yeah. Okay. So the next AMIE headline is moving to specialized medicine, and there you look at cardiology and oncology. And here, I would summarize the findings as: the AMIE system is surpassing fellows and closing in on, but not yet hitting, the level of attending physicians in these specialist domains. So what further, you know, complications or caveats should we have to understand that?

Vivek Natarajan: (14:35) Yeah, I think largely I'd agree with that framing: you know, we still have room for improvement in terms of being as consistent across all domains as the most experienced attendings. I think the real headline for me, especially if you look at the cardiology paper, is that the types of errors they make are pretty different, the AI and the general cardiologists. And I think the really exciting thing is we see that they're quite complementary. So when we compared AMIE to the general cardiologists head to head, there was some uncertainty about which was better. Each was better in some areas. But when we compared the general cardiologists with access to AMIE's assessments to the general cardiologists alone, it was like a landslide, right? In that case, in almost every aspect, they were considered superior when they had assistance. And I can talk a little bit more about maybe why I think that might be. Obviously, there's more investigation needed, but I think that is a really exciting aspect of this, that it seems to be just a helpful system in use by these experts.

Anil Palepu: (15:43) And maybe to just contextualize that work a little bit more. I think if you look at, like, access to specialists in the country today, I believe for getting a consultation with a neurologist, for example, it's, like, a 12-to-18-month wait time. And that's simply not sustainable. Right? And so the question is, okay, clearly we have, like, better-reasoning AI systems that seem to show promise in medicine. So how can we do better over here? Can we improve the status quo? We should be able to do that radically. Like, no 1 should be waiting 18 months to get a consultation on a rather serious thing. Like, neurology is not straightforward. So that's kind of the motivation. There are a lot of access issues around specialist care, cost issues around specialist care. And so how can we do better? That's the key question that we're trying to address. And then I think the second thing is, generally, if you look at how medicine has evolved, it has led to these silos or these compartments and specializations. And so you have, like, this primary care, which is kind of the front face to everything else, that's the door, and then you have all these silos. But the way that it has evolved, I think, is primarily because of the cognitive limitations of the human mind. Like, there's only so much expertise that we can cram into a given brain. So because of those limitations, we have to go study, like, neurology or cardiology or, I don't know, internal medicine, but not everything together. But AI systems don't need to have that kind of limitation. Given what we're seeing, they should be able to integrate knowledge from multiple different sources, multiple different disciplines.

And so there's a fundamental rethinking that is happening. Does the new age of AI-powered health care need those silos? It could be possible that you not only have a PCP in your pocket, but, like, an expert neurologist in your pocket, obviously with caveats and things like that. So that's another class of question that we are trying to address over here. And then maybe the third point I'll just add on is: this study is, like, what, 3, 4, 5 months old now?

Vivek Natarajan: (17:52) Yeah.

Anil Palepu: (17:53) Yeah. So that's primarily done with the PaLM version of the models, if I'm not wrong.

Vivek Natarajan: (17:57) The cardiology 1 was done with Flash, I think.

Anil Palepu: (18:01) Flash. Okay. 1.5?

Vivek Natarajan: (18:02) Flash. Yeah. Yeah. 1.5.

Anil Palepu: (18:04) Okay. Yeah. So it's 1.5. And since then, we've had 2 and 2.5. So yeah. I mean, in that study, obviously, what we saw was that 1.5 was not as good as, say, attendings, but who knows with the new

Nathan Labenz: (18:17) models? Yeah.

Vivek Natarajan: (18:18) Yeah. Yeah.

Nathan Labenz: (18:21) Yeah. That's an important caveat.

Hey. We'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: (19:36) Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just 1 of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right 1, and the technology can play important roles for you. Pick the wrong 1, and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

Nathan Labenz: (21:33) Okay. Well, let's do the headlines then for Co-Scientist, because these are, I would say, similarly striking headlines. So Co-Scientist is basically, I guess I would describe it as an agent scaffolding type of setup, and we can get into the granular details. But we've seen different things like this before. I did an episode with James Zou, who has kind of a similar thing with the AI virtual lab, and there was the original Coscientist that was made just within the first couple weeks of GPT-4, which is crazy to think that they got anything out of that with 8,000 tokens of context. But, you know, even then, we were seeing some interesting stuff. But this I would describe as just kind of taking all of the lessons learned about how to make agents work from the last 2 years, hitting the gas on all of them, and then coming up with a system that basically can do science. Right? I mean, it's not executing the actual physical experiments at this point, but I was really amazed by the different things that you tested the system on. There are kind of 3 levels of challenge in the problems the system is given. The first 1 is a relatively well-scoped thing that you could kind of grind through. That might be the 1 that I might be able to do with some real effort, and that was drug repurposing. Right? So take a drug that's out there, look for other things that it might be useful for, and, you know, you have kind of a combinatorial approach that's available to you. So if you set that up and go through it systematically, and you have decent judgment around each individual sub-question that you ask, you can sort of imagine how an AI system would be able to do something like that.

But then you go up a level to the second scope of task, which is identifying new therapeutic targets within a particular kind of diseased cell. And this is starting to get at what I've often called 1 of the grand challenges in biology, which is just: what causes what? You know, we obviously know that there are super complicated causal graphs going on in the cell, of this promotes this but inhibits that, and yada yada yada. And I don't know if you guys would venture a number, but I've kind of understood broadly that we have a long way to go in terms of really mapping that out. Maybe we understand, you know, 10% of what the graph is in today's world. So the challenge here is to basically go into a cell, so to speak, and try to figure out: can we identify something that we can target with a drug that will actually make things better? And, obviously, this is hard given the vast complexity and the many, many unknowns in cells. So okay. That was just the middle 1. Then the third 1 is: can you figure out why or how bacteria are becoming drug resistant? There was a little bit of a hint, I think, because there was 1 sort of observation, right? Something that was conserved across a couple of different species, a notable observation that sort of served as a seed to unpack the challenge. But beyond that: go figure it out. Right? Super open-ended, really, really tough. And that is, like, a daunting question for me to consider, and it really would have you kind of spinning your wheels, I would think, in just an absolutely vast literature for a long time before you would even, at least for me, before I would even have any sense that I might be able to start to contribute to the discussion. Okay. Those 3 problems. Bottom line is, the AI is able to do all 3 of them pretty well.

And in that last case, it actually surfaced, as its number 1 candidate idea for the mechanism of the drug resistance, something that had been discovered experimentally but was not yet published. Right? So you guys partnered with an academic group who was doing this research, and they had the answer, but nobody else had the answer. It's not in the literature. And the system was able to grind through all this content that it had available, you know, the whole vast body of medical and biological literature. And, I mean, crazy. Right? It landed on the exact number 1 hypothesis that turned out to be the actual answer. I was really blown away by just how successful it was. So tell me more, I guess. In terms of what else, are there any caveats that I glossed over, or are there any other eureka moments that you would want to highlight from those results?

Vivek Natarajan: (26:31) I think the last 1 was kind of interesting and also, like, funny, because I don't think Jose and Tiago, who are collaborators at Imperial, would take offense at me saying this: I don't think they really believed that AI could do this thing. So we'd been trying, like, you know, we said, okay, we have the system, do you wanna try it? And then it took us, like, a few months before we got enough time with them. And then I think after, like, enough pestering, they were like, okay, we have some experimental results in the lab. We'll just try and challenge your AI system. Let's see what your AI system can do. And so, yeah, I think it was roughly around, like, Thanksgiving. And so they sent us this prompt, and I think that's detailed in another preprint, which was co-timed with the Co-Scientist paper. And as you said, I think there were some clues in there, but it was not totally giving it away. And so we take that prompt, we set our system to it, and it runs for, like, a couple of days. It spits something out, and then Juraj, who is, like, the first author on that paper, and he's an amazing technical fellow, he's probably the least well-known technical genius at Google in some ways, because he likes to keep a low profile. But yeah. So he just sent it over to them. And then I think it was Thanksgiving. It was, like, late in the evening, and Juraj and Alan were both based in Europe. They were, like, offline. And then, like, within 10 minutes, Jose, who was also based in London, sends us, like, an email, and he was like, I need to talk to you right now. And I was like, okay. I didn't understand the seriousness of it, but I was like, okay, I mean, I'm not doing anything better. We can talk. And then he was like, Vivek, are you, like, reading my email? And I was like, I'm not sure. I was like, what are you asking? And he was like, no. No. No.
It seems like you're lead reading your email. And then I was like, we do many things at Google, but reading your email is not 1 of them. And then he goes into it. He was like, I've not published this anywhere, but your AI system came up with the same set of results that, like, we hypothesized and we found in our experiments. So I'm, like, really, really surprised. And so he was like, okay. Do you guys get responses from, like, ChatGPT? I was like, no. No. No. That's not possible. That doesn't happen. And so he was like, okay. So if you're not reading my emails and then if you're not if you don't have, like, any information from ChatGPT, then it's likely that you have something, like, really, really magical. And that was kind of his response. And then he said, like, yeah. The first 1 is great, but we also, like, send him 4 more. And he was like, all the other 4 also make a lot of sense. And I think immediately after the Thanksgiving break, he was like, okay. I'm gonna, like, set a few of my postdocs to work on this, and they've been, like, working on validating those other hypotheses. And so it was kind of that moment, like, where someone who's, like, very pragmatic, very experienced has spent, like, a decade, in fact, several decades on the field. Like, when they have these kind of reactions, so that was kind of like, okay. That told us, like, okay. We might be onto something over here. But, again, I think, like, it it was not, like, 1 single moment, but it was like, okay. It's it's a very hard thing to do, right, I mean, to get, like, AI to not just, like, synthesize and integrate and summarize information, but help traverse history of knowledge and, like, uncover new original things and knowledge and facts about the world. And to do that reliably, That's a super hard problem. In some ways, that's like the holy grail of AI. And to think that, okay. Like, a system which is relatively simple in nature, like, we we'll probably talk about this more. 
It's like, it's probably the simplest version of the system that you can imagine. And, like, throwing a bunch of compute at it, you're already seeing evidence of that. Like, you're seeing evidence of this happening reliably. That felt like, okay, super magical to us.

Nathan Labenz: (30:02) Yeah, it's crazy. This is something people have been discussing online recently. Dwarkesh has advanced the idea that AIs have this incredible breadth of knowledge, so shouldn't we be seeing more connections made, more insights across this super diverse knowledge base? His contention is that if he had all that knowledge, surely he would come up with more insights than the AIs currently seem to. I think he even went as far as to say he hadn't seen a single example of this. I feel like some of the things I've covered could be said to count, but there are always a lot of details and caveats and eye-of-the-beholder stuff. This one seems pretty clear to me, though: this was essentially independently discovered by human scientists in a lab and by an AI system in a data center over a couple of days, in parallel, to the point where the scientist accused you of reading his email. Presumably not super seriously, but nevertheless: is there any reason we should not take this as a genuine discovery, a qualitative eureka moment, from an AI system?

Anil Palepu: (31:39) Yeah, and this is not the first evidence. Back in 2023, right after we published MedPaLM, which is actually where the genesis of the Co-Scientist work lies, there was this professor from Stanford, Dr. Gary Peltz, who reached out to us. You probably remember Tao, who came on one of the previous episodes. Gary called both of us up and said, Vivek and Tao, you don't know me, but your AI system can potentially help millions of people with rare diseases. I said, okay, Gary, that's a nice introduction, please go on. And he said, I know your models are trained on a lot of scientific literature, and I think they can help me discover useful facts about genetic diseases. That felt interesting enough for us: if we were able to help him, we could probably help a lot of people. So we started working with him on this problem of genetic discovery: can language models come up with the right causative factors responsible for a given combination of phenotypes or symptoms? We started doing that with MedPaLM, and later with MedGemini, and we saw these systems were able to do this. In fact, with MedPaLM, Gary was working on mouse models of a very specific kind of hearing loss, for which he had an NIH grant back then. I don't think he'll ever get that kind of grant again, but that's a separate discussion. One of the hypotheses the model came up with was a biogenic model for hearing loss, which Gary had not thought of before. So he went ahead and did CRISPR knock-in experiments in his lab, and he was able to reverse the cause of the disease.

We wrote that up, and it's still under review at a prestigious venue. Later on, we used the more advanced version of our LLM, MedGemini, and extended that work to human variants of unknown significance. Even there, based on retrospective data, we see these systems are able to do pretty interesting work in genetic discovery, which can be cast as a hypothesis generation problem. But the key thing is, at that point in time we were using these LLMs in a pretty crude, single-shot fashion, and they were very unreliable at this act of hypothesis generation. For MedPaLM to come up with one hypothesis that was very, very helpful, it had to come up with thousands of things that were utter garbage. We were very grateful to work with Gary, who had the expertise to very quickly discard the things that were nonsense, and also the patience to work through all of them. You could easily imagine another scientist, or someone inexperienced, looking at the first five and saying, this is all garbage, it's not working, and then we would not have pursued this line of work at all. So it's about getting together with the right people who believe in this. That put us on this journey: how do we make this more reliable? We should not be sampling thousands of times to get something useful; rather, every single generation should be useful, and the system should be well calibrated. The simplest way to do this would be to call an LLM repeatedly and hope it leads to something useful.

But that very quickly fails; it leads to degenerate solutions, this mode collapse. So it became important to introduce net-new knowledge into the system, and also to think about how you can gamify that process in some ways, introducing repeated, helpful feedback that helps the system improve. That led us to the design we eventually had. In hindsight, it should be quite obvious: it very naturally follows how the scientific method works. If you ask a scientist how they come up with new ideas, they will roughly compartmentalize it into the set of agents that we had; they're doing the same thing. But it was an iterative process: we tried something, it didn't work, we tried to improve it, and we eventually ended up with this design, which is, I think, remarkably intuitive.

Nathan Labenz: (35:34)

Hey. We'll continue our interview in a moment after a word from our sponsors. It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz: (37:04) There are a couple of maybe quasi-philosophical questions that come to mind there. One is, how do you think about the relationship between hallucinations on the one hand and creativity, or hypothesis generation, on the other? People have very different intuitions about that, and I'm not quite sure myself. When we do extensive post-training to try to minimize hallucinations, does that help with hypothesis generation, because it makes the models more disciplined in their reasoning? Or does it in some ways hurt, because they're less willing to come up with a very random idea, and every once in a while it's those random ideas that are the big ones? Maybe that's a false trade-off, but how do you think about the relationship between those behaviors?

Anil Palepu: (38:08) Yeah. Vivek, let's hear your take.

Vivek Natarajan: (38:10) Yeah. I mean, intuitively there's some aspect of hallucination that does foster creativity. In some ways, the model is interpolating between the data it has seen, and it's kind of necessary for hypothesis generation to deviate from the script a bit. But I don't know if I have a clear answer to that; intuitively, at least, that's how I feel.

Anil Palepu: (38:42) Yeah. I always used to think that hallucination and creativity are two sides of the same coin in some ways. I don't think this holds true any longer, but what we used to see with the previous generation of models was that it was actually much more helpful to use the models without post-training for this task, because they were more likely to come up with these crazy ideas. Right? But now I don't think that's the case, because we've been able to systematize it; we've been able to put a structure around that process of coming up with new ideas. So now the process is much more reliable, but maybe we are sacrificing some crazy new things that would require a raw, non-post-trained model to come up with. We don't know. So I think we've gained reliability, in terms of consistently coming up with new original thoughts, but we don't know if we're sacrificing something else here. Yeah.

Vivek Natarajan: (39:42) And I think one advantage of our system is that we allow a diversity of models, so we can kind of get the best of both worlds. Through this tournament process, through this ranking, we have a diverse set of hypotheses, and those can be re-ranked, and hopefully we see the high-quality hypotheses bubble up to the top.

Nathan Labenz: (40:04) Yeah, I mean, I think maybe it's both; that's what I take away from the hallucination reflections. If you're in a context where there's a rich literature and it's more about really working your way through it, then maybe you want a disciplined reasoner. And if you're doing something where there really isn't much to go on, maybe you want a sort of

Anil Palepu: (40:27) more

Nathan Labenz: (40:27) whimsical hallucinator. And as you said, maybe you get the best of both worlds with some of these setups. So let's describe the different setups. There are multiple systems here, and they have their intricacies, but at a high level, if I understand correctly, what you're doing to design these systems is basically introspecting, or maybe interviewing people, and asking one of my favorite questions in AI and automation in general: how do you think about it? What do you do next? Then once you do that, what do you do? You're taking all the steps of a process, whether it's the scientific method, the diagnostic process, or the reasoning-to-treatment process, mapping it out with a subject matter expert, yourself or somebody else, then creating little sub-agents that are prompted to do those subtasks, and scaffolding them together. There are also some interesting details on giving them tools. Literature search is obviously a huge one, and I was interested to see that in one project the model had access to AlphaFold and maybe some other things. So you're increasingly giving it something like the full complement of tools that a human could use. And then it seems like after that, it's turn all the hyperparameters up: more rounds, more generations, more evaluations, more rounds of feedback. My impression is that if you do that and you have the budget for it, in today's world you could probably be successful at almost anything. But tell me if that's wrong.
Are there things about the designs of these systems that you think are actually important hinges, where if you had designed it a little differently, it wouldn't work? Within that general framework I outlined, could you go a bunch of different directions, or do you feel it's actually a narrow design space that works?

Anil Palepu: (42:55) Yeah, I think maybe there's a high-level narrative change happening here. This is my fourth time on the podcast, and previously we discussed MedPaLM and MedGemini. The key with all of them was the "Med": we were taking some generalist model and trying to fine-tune and specialize it. I think the key difference with the current versions of AMIE and Co-Scientist is that we're no longer trying to do that fine-tuning or specialization step. Part of it is that some of the data that went into that fine-tuning and specialization step is now upstream, part of Gemini. But it also just feels like a better way to set things up is simply having agents with specialized prompts, chained together in a nice manner. That gives a lot of flexibility and control, and it does away with the need for fine-tuning and specialization. I'm not sure what you think about that.

Vivek Natarajan: (43:48) Yeah, I kind of agree. Not that there's no role for post-training, but certainly in our first AMIE paper and MedPaLM, all those papers, it was really about the medical data we were curating or creating; that's what was driving the model's success. In these latest papers, that has really taken a backseat to how we're designing the system to perform these sorts of tasks at inference time. And I would say I don't think anything is particularly hyper-optimized. There are probably ways someone could design a system that does this stuff better; I'm sure there are. Our goals have largely been to build a functional prototype, get there quickly, and do the kinds of tasks we're interested in for each study. So there's so much room for improvement in all of these.

Nathan Labenz: (44:46) How many rounds of iteration did you go through as you built the system? Were there any moments when it was like, oh, we tweaked this prompt, or we put this agent before that agent and re-scaffolded it that way, and all of a sudden you saw a big leap, or

Vivek Natarajan: (45:07) Yeah.

Nathan Labenz: (45:07) Yeah. So tell me about

Anil Palepu: (45:09) those.

Vivek Natarajan: (45:09) I mean, for the AMIE work, we rely really heavily on auto-evaluations. We do try a lot of different prompting, a lot of different configurations, but ultimately there's only a finite number of things you can try before you say, this seems good enough. So we use that as a rough signal. Honestly, we also rely pretty heavily on vibe checks. And you can pretty quickly start to see differences. With our management reasoning agent, for example, we saw a big difference when we started drafting concurrent plans and then refining them together. We saw a big difference when we did this top Can

Nathan Labenz: (45:52) you unpack that just a little bit more? Like what was the before and what was the after?

Vivek Natarajan: (45:56) So before, we were just generating one plan. Now we're generating four different plans. They might have some similarities and some differences, but we found that, in a sort of self-consistency-style manner, the model was able to combine these plans in a way that took the good stuff from each plan and left out the bad. And once we tried it, our internal auto-evaluation signal was very clear that this was making a huge difference. In terms of hyperparameters, we went with four plans; we could have gone with eight, we could have gone with two. I don't think we optimized every hyperparameter in that sense, but the big things we were able to pick up from this kind of signal.
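The plan-drafting step Vivek describes, generating several candidate plans in parallel and letting the model merge them, is essentially self-consistency applied to planning. A minimal sketch of that orchestration, with a hypothetical `complete` callable standing in for the model API (none of these names come from the AMIE papers):

```python
from typing import Callable, List

def draft_and_merge_plans(
    complete: Callable[[str], str],  # any text-completion function (assumed interface)
    case_summary: str,
    n_drafts: int = 4,               # four drafts per the interview; two or eight would also work
) -> str:
    """Draft several management plans independently, then ask the model to
    merge them, keeping elements that appear consistently across drafts."""
    drafts: List[str] = [
        complete(f"Draft management plan #{i + 1} for this case:\n{case_summary}")
        for i in range(n_drafts)
    ]
    numbered = "\n\n".join(f"PLAN {i + 1}:\n{d}" for i, d in enumerate(drafts))
    merge_prompt = (
        "Combine the following candidate plans into one refined plan, "
        "keeping the good elements of each and discarding outliers:\n\n" + numbered
    )
    return complete(merge_prompt)
```

The merge call sees all drafts side by side, which is what lets the model keep the shared, consistent recommendations and drop one-off mistakes.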

Anil Palepu: (46:41) Yeah. On the Co-Scientist side, it was a much longer iteration, almost an 18-month project in some ways, driven by this need to make the process of hypothesis generation more reliable. So it was: we decompose the tasks and try to get individual models to work on them. And the good thing it showed, again, is how good individual LLMs are getting at instruction following. That also does away with the need to fine-tune and specialize, because if a model has the knowledge and you give it a precise set of instructions, it can just follow them. That makes it much, much easier to create agents and chain them together, where the agents are specialized to do specific tasks but are just prompted versions of the general-purpose models. So it was a process of setting up some evals, seeing how well the system does, finding where the weaknesses are, and then iteratively going in and fixing those weaknesses by adding specialized agents or fixing the prompts. And ultimately there was a push to simplify the architecture: we don't need all of these things. That led to this design. So, a lot of iteration, figuring out where the weaknesses are, covering them, and then, at the end, a simplification push.
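The pattern Anil describes, specialized agents that are just prompted versions of one general-purpose model, chained together, can be sketched in a few lines. The `Agent` and `chain` names and the `complete` interface are illustrative assumptions, not the Co-Scientist codebase:

```python
from dataclasses import dataclass
from typing import Callable, List

Complete = Callable[[str], str]  # any general-purpose LLM call (assumed interface)

@dataclass
class Agent:
    """A 'specialized' agent is just the general model plus a role prompt."""
    role_prompt: str
    complete: Complete

    def run(self, payload: str) -> str:
        return self.complete(self.role_prompt + "\n\n" + payload)

def chain(agents: List[Agent], initial_input: str) -> str:
    """Pipe each agent's output into the next: generate -> review -> refine, etc."""
    state = initial_input
    for agent in agents:
        state = agent.run(state)
    return state
```

The point of the design is that no weights change anywhere; swapping a weak step for a better one is just editing a `role_prompt`.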

Nathan Labenz: (48:02) Yeah, I've been through that on less world-changing projects myself, multiple times: a new model can do wonders for the simplification of your system. I've definitely lived that. So in terms of what is driving the improvements: it's all happening so fast, right? I even have some lingering sympathy for the stochastic-parrot people, because as of GPT-2 that was probably still mostly true; at this point, it pretty clearly is not. But 18 months ago, models just couldn't do certain things that now they can. It seems like core model progress is the tide that is lifting all boats dramatically. Is there anything else you think is super important, or does it really come down to foundation models getting better and grinding out the process of figuring out how to use them?

Anil Palepu: (49:12) Yeah. Again, on the Co-Scientist side, I feel like long context was an important part of it as well, because we don't have an explicit memory store in the system. The fact that the Gemini models we instantiate can take up to 2 million tokens means we can just generate ideas, generate reviews of them, run debates over these ideas, and produce these walls of text, which become feedback, and then put all of it back into the context of the model in the next round. The model figures out how to make sense of it and uses the feedback in a very implicit manner to improve. If you did not have these long-context abilities, the ability to reason reliably over millions of tokens, you would not be able to do that. You would have to engineer RAG-based systems, and maybe train them end to end, and that would have been a lot more complex. All of that has been made remarkably simple by the long-context ability of Gemini, which I think is a little underappreciated in the field. We don't have enough of these Studio Ghibli-style viral moments with long context, but it enables a lot of these practical applications. And I think that's the same with
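The implicit-memory loop described here, feeding every prior idea, review, and debate back into a long context window rather than a retrieval store, might look like this in outline (the function name and prompt wording are hypothetical, and `complete` again stands in for the model API):

```python
from typing import Callable, List

def iterate_with_full_history(
    complete: Callable[[str], str],
    goal: str,
    rounds: int = 3,
) -> List[str]:
    """Instead of a retrieval/memory store, keep every past output verbatim
    in the prompt and let a long-context model use it implicitly as feedback."""
    history: List[str] = []
    for r in range(rounds):
        context = "\n\n".join(history)  # grows every round; relies on a very large context window
        prompt = (
            f"Research goal: {goal}\n\n"
            f"Everything produced so far (ideas, reviews, debates):\n{context}\n\n"
            "Propose an improved hypothesis, taking the feedback above into account."
        )
        history.append(f"ROUND {r + 1}: " + complete(prompt))
    return history
```

With a 2-million-token window, `context` can hold days' worth of generations before anything has to be summarized or dropped, which is the simplification Anil is pointing at.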

Vivek Natarajan: (50:18) With the management reasoning work, you'll recall, for that paper the management agent's whole point is to take in clinical guidelines and reason over them to produce management plans. If we were solely reliant on always picking out the right guidelines, we would struggle, right? But with long context, we don't really need to worry about it: we take in a bunch of guidelines, and with 256k tokens we're going to catch something relevant, regardless of whether our retrieval system would have been able to do that. So yeah, I definitely agree on long context. And with Co-Scientist in particular, we're talking about a really long timescale; we had this thing running for days. I think that increase in inference compute is offering a lot of benefit as well.

Nathan Labenz: (51:08) I'm just looking up something a friend sent me in the last day or two: Fiction.LiveBench, for long-context deep comprehension, obviously one of many benchmarks that look at these things. Gemini 2.5, even relative to Gemini 2, is absolutely crushing it in its command of long context, and I've definitely felt that in my initial testing of it. This does feel like something that's hard to go viral with: hard to go viral on the notion that I had hundreds of thousands or a million tokens of context that's all very idiosyncratic to whatever I'm working on. It's hard for people to even know what you're talking about when you post that on Twitter; the contrast between those context windows and the length of a tweet is pretty severe. But it's amazing that this was working that well, because looking at these benchmark results, 2.5 stands out in a massive way relative to everything that came before. Qualitatively, or on vibes, up until 2.5 I would have said: I'm not so sure I can just dump hundreds of thousands of tokens in. Yes, it can handle it, but does it really have command? And especially if it's material I don't have full command of myself, it can be very hard to evaluate. So how did you handle that? How did you know if it was actually making effective use of the super-long context?

Vivek Natarajan: (52:50) Yeah, in the AMIE setting, as I mentioned before, we really relied on auto-evaluation. I have to shout out Valentin, my team member on the paper, who did a great job setting this up. Essentially, we also didn't know how well it would work if we stuffed a bunch of guidelines into context. Initially we were going to go all the way up to a million tokens, but we found we actually seemed to get better performance when we dropped it down to 256k. Ultimately, though, there was a clear difference between having that much knowledge present versus trying to do it zero-shot, or with one guideline, or with some retrieval. So we were also uncertain, but we tested it internally, at least, and that seemed to work.

Anil Palepu: (53:34) Yeah, on the Co-Scientist side, it might be a little more unscientific. What we relied on primarily was redundancy: you generate some ideas, you review them, there's a tournament, and you get a bunch of feedback from the tournament. When you put that back into the state of the system, you're not generating only once; you're generating numerous ideas. The hope is that, because of that redundancy, at least one of the generations will catch the key elements of the feedback that has been propagated back into the system. So I wouldn't say we have a specific target measuring how well the long context is doing; rather, by engineering in this redundancy, we're hoping it will be effective. The other distinguishing factor of this work, at least for me, is that there are a lot of science-assistant-style, scientific-discovery-style projects in a lot of different places, but where they get a little hung up is in curating these really nice, cozy benchmarks you can hill-climb on. That was not our philosophy at all. For us, the key deal is: if the system does something useful, we should sprint straight ahead so we can validate it in the lab, and then hopefully take it onward toward a real, meaningful discovery. That's what we were most focused on. If we engineer a system, we go straight to a scientist who's an expert in the field, we show them the idea, and if they like it, we try to convince them to validate it. And if they validate it and it becomes a discovery, then, yeah, great.

So we really wanted to avoid micro-optimizing on specific benchmarks and hill-climbing on them. Because there are so many different components in the system, I think this work would easily have taken another year if we had tried to make each one the best. Instead, we were just focused on: let's chain them all up together, let's get the system doing something reasonable, and then we'll do end-to-end validation.
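The tournament mentioned throughout this exchange runs head-to-head comparisons between hypotheses, and the Co-Scientist paper describes Elo-style auto-ratings for ranking them. A toy version, with a `judge` callable standing in for an LLM-run scientific debate (the function name and the default rating constants are my assumptions):

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

def elo_tournament(
    hypotheses: List[str],
    judge: Callable[[str, str], str],  # returns the winner of a head-to-head debate
    k: float = 32.0,                   # standard Elo update factor
) -> List[Tuple[str, float]]:
    """Rank hypotheses by playing every pairwise matchup and updating
    Elo-style ratings, so stronger ideas bubble up to the top."""
    ratings: Dict[str, float] = {h: 1200.0 for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        # Expected score of `a` given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if judge(a, b) == a else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Comparing every pair, rather than repeatedly refining one idea, is also the source of the variance Vivek mentions below: each matchup produces fresh feedback that flows back into the next generation round.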

Nathan Labenz: (55:29) Yeah, okay. It's a brave new world out there. One thing that seems to be changing now: not too long ago, I could have cited a handful of papers showing that typically three to six rounds of self-critique, of automated self-improvement, seemed to be where GPT-4 would max out, and if anything, performance would decline if you kept running it longer than that. You guys are talking about running these things for days. Maybe you could tell us a little about the budgets for this. How many tokens are we talking about? If we translate that to retail price, what would the inference bill be for finding the mechanism for microbial drug resistance? And is there any limit at this point to how many rounds of this you can run, or are we already at the point where you can just run the thing longer and longer, potentially indefinitely?

Anil Palepu: (56:45) Yeah, it's a fascinating question. It's something that has, I wouldn't say bothered me, but intrigued me as well, because I remember reading one of these posts where he talks about leaving a CNN training running over the winter break, and it magically ended up outperforming on some benchmark. I forget which one it was; all he had to do was let it run for something like 40 days, which was a lot of compute back then. It was unprecedented in some ways. So for us, with the Co-Scientist, the key thing is the fact that the system is not closed-loop. The fact that it has access to these different kinds of tools means that in every round of self-critique or iteration, it can bring new information into the system, which increases the entropy. And when that happens, it prevents mode collapse and degenerate solutions. So that is the key thing: the system can go do web searches, browse interesting parts of the web, pull information out, integrate it with the knowledge it already has, and do that in an effective manner. That is what lets more computation be spent efficiently and effectively here. And it's not just websites, right? Increasingly, we'll be able to get feedback from other kinds of knowledge bases and specialized tools, AlphaFold and more. As the quality, or the surface area, of the hypotheses increases, the more different kinds of feedback we'll be able to plug into the system.
And I expect that it won't mode-collapse then, and there's likely going to be even more value in spending more time on computation in this setup. But if you were to strip that away, then I think it comes down to the quality of published information in any given domain, and also to the complexity of the problem. I don't know how to put a precise definition on it, but if there's a problem with a clear unknown that is impossible to solve, then no amount of reasoning or test-time compute is going to get you that information if the system doesn't have the capacity to get it. And on the other hand, there are problems that are so trivial it doesn't matter; you'd probably get the answer in the first few tries. So there's a sweet spot, where the problem is within limits but also not trivially easy, where spending this computation helps. And my hypothesis is that a good chunk of the problems we as humanity care about today actually fall in that sweet spot, where we can spend a lot of computation in silico and get very useful, interesting ideas and answers, which I think is exciting. I don't know what you think.
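Anil's open-loop point can be sketched as a refinement loop where every round injects fresh outside evidence instead of only re-consuming the model's own output. The `llm` and `search` callables below are toy stand-ins, not actual Co-Scientist components:

```python
def refine(hypothesis, rounds, llm, search):
    """Open-loop refinement: each round pulls in new external evidence
    (web search, knowledge bases, tools) before revising the hypothesis,
    which keeps new entropy flowing in and guards against mode collapse."""
    for i in range(rounds):
        evidence = search(hypothesis, i)        # fresh information each round
        hypothesis = llm(hypothesis, evidence)  # revise with the new context
    return hypothesis

# Toy stand-ins: the "LLM" just appends whatever evidence it was handed.
final = refine("h0", 3,
               llm=lambda h, e: f"{h}+{e}",
               search=lambda h, i: f"e{i}")
print(final)  # prints "h0+e0+e1+e2"
```

A closed-loop variant would call `llm` on its own previous output with no `search` step, which is the setup where a few rounds of self-critique tend to plateau.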

Vivek Natarajan: (59:31) Yeah, just a small thing to add: our system also allows humans to input ideas, so there are other avenues to add to this entropy. One other important thing to think about is that we're comparing every combination of hypotheses pairwise. There's a lot of variability there compared to, say, other papers on self-critique where you just keep trying to improve the same idea. There are many directions from which we're getting this variance.

Nathan Labenz: (1:00:04) Yeah, that's quite insightful. So do you actually know the total number of tokens for the microbe project? I'm going to guess it was something like 10 billion.

Anil Palepu: (1:00:16) No, no, I don't think it's that bad. It should fit within the context limits of these models. I haven't done an exact analysis, but I would think it's less than 10 million.

Nathan Labenz: (1:00:26) Ten million total inference tokens for the whole thing?

Anil Palepu: (1:00:29) Yeah. I think so.

Vivek Natarajan: (1:00:32) For the whole tournament? Oh, for the...

Anil Palepu: (1:00:34) The whole... no, no. Okay. Yeah.

Nathan Labenz: (1:00:35) I mean, from the time you give it the question to the time it spits out your answer, what would my API bill be?

Anil Palepu: (1:00:40) It might be a little bit more than that. But we did some back-of-the-envelope calculations, and based on current prices on GCP, we expect that most queries would be just a few dollars, less than $10. That's most queries, including the whole tournament and all the inference. So it should be fairly feasible. And it's probably just going to come down more and more in the next 6, 12, 18 months.

Nathan Labenz: (1:01:04) Yeah, that's really striking, especially if the performance per token also continues to go up. There are a lot of tailwinds.

Anil Palepu: (1:01:11) And maybe the other thing I would say is that this is probably the dumbest and most inefficient version of the system that we have. There are so many things here we could improve in efficiency, and from an intelligence perspective we can make it much better. So the bang you get for each token generated and each dollar spent is going to be much, much greater as we keep improving the efficiency and capabilities of the system.

Nathan Labenz: (1:01:35) Are there other narrow specialist models that it has access to? How important is that now, and how important do you think it will be in the future as a source of new entropy? Because that seems like a potentially dramatic unlock, right? The model itself is already trained on most of the literature. Searching the literature again at runtime helps with grounding, helps with things that maybe weren't in the training data, or that it can find after the cutoff date, what have you. But the ability to actually go run simulations and bring that kind of information back, when maybe nobody has ever run that simulation before at all, seems like a potentially pretty big step change.

Anil Palepu: (1:02:16) Yeah, I totally agree. And that's why I feel like this is probably day zero or day one of this journey in many ways, because we've primarily scratched the surface of information that is written down in papers, peer reviewed, and published, but a lot of scientific information is just not in that format. For example, we don't publish negative results, because of the incentives around scientific publication. So that's a kind of dark matter that's hidden away, and we'll have to figure out a good incentive mechanism to have that kind of data also flow into the system, but I think that's going to be important going ahead. Even more excitingly, a lot of these papers and publications have supplementary data files containing giant experimental datasets, and those hold a lot of useful nuggets of information. You can imagine a system like this paired up with another system, like the data science agent that came out recently; I don't know if you saw it. The Co-Scientist could generate hypotheses, the data science agent could go analyze the data, and the right kind of feedback could flow back into the system to improve it. At small-scale datasets that can happen automatically, but I'm even more excited about what we might be able to do when we pair up the Co-Scientist with elements of the data science agent and go after, say, the virtual cell atlas the Arc Institute recently put out: gene perturbations for something like 300 million cells. That's such a vast space. Even for teams of humans, it would take years to explore the data, see the richness in there, and come up with interesting insights.
But we could set up these agents to generate hypotheses, look at the data, analyze it, and come back with feedback. So I'm just super excited about the insights that would unlock in basic biology, target discovery, and things like that, when we pair up these systems and set them loose on the giant datasets we're generating right now, which even for teams of humans are very impractical to analyze.

Nathan Labenz: (1:04:19) Okay, here's a very high-concept question that you may think is totally misguided, or you might think is the future. Obviously, over the last three years, we've seen a dramatic convergence of a couple of core modalities: language, vision, speech. For me, one of the early eureka moments, where I realized I was going to study this subject for the rest of my life, or at least until the singularity, was when I realized that it's basically the same architecture doing all these things. At the time, I was trying to make a video generation product work, and I had all these different specialist models. But I could see pretty clearly that if these fundamentally similar architectures could do these different tasks independently, then some integration was going to happen, and a single model would be able to do them all. And this has come with very viral moments from both Gemini Flash and GPT-4o in the last two weeks. Now it seems like that might happen again with the reasoning models and the narrow specialist models. Right now you have a model calling AlphaFold and getting results back, but this is akin to an earlier language model calling DALL-E to generate an image. There's a very lossy language bottleneck there, which was definitely a point of major frustration for people trying to generate images that looked the way they wanted. Now that this integration has happened, that problem is basically no more. So the question is, do you think that will happen for these other modalities as well?
My sense is that an early superintelligence might be reasoning models akin to what we have today, but with these other modalities integrated in a deeper way, so they're not bottlenecked through API calls but are actually able to do some of this reasoning in biological space, or materials science space, or transcriptome space, or whatever. There are a lot of spaces out there, obviously. And it seems like we already see pretty superhuman performance from the narrow models. I'm not aware of any human who can look at an amino acid sequence and intuit what the shape is going to be. There may be a couple of savants out there, but certainly it's not common. Do you think that happens? What's your reaction to that possibility?

Vivek Natarajan: (1:06:58) Yeah, I'm not sure, to be honest. The API call thing, clearly, I think we're very close to being able to do pretty well. A deeper integration, I don't know exactly how easy that is for all of these kinds of specialist areas.

Anil Palepu: (1:07:18) Yeah, I think it actually depends on the incentives of the people at the frontier labs developing these models. It makes a lot of sense to combine speech and vision and language. But if you look at the data going into these models, it's primarily non-specialized public data. When you're thinking about biology modalities and datasets, they're nothing like the public-Internet datasets going in over here; they're very, very different. And from some personal experience, what we have seen is that when you try to introduce some of these more special modalities, even medical imaging modalities, so natural images versus medical images, they lead to regressions in benchmark performance. So the question becomes: are you willing to accept some regressions on, say, LMSYS or some other benchmark that's generally considered important, to make your system a little more capable on medical images, which are obviously not going to be as big a fraction of your users? So I think it's a question of incentives right now, and I can see this tension at many of the frontier model companies: when you introduce these interesting new modalities of information, it leads to sacrifices in other areas, sometimes sacrificing benchmark performance. I would argue that's fine; the benchmark performance doesn't matter so much, and you should aim for practical utility. But in the absence of clear measures of utility, it becomes difficult to convince people. So I think it's not a question of whether we can do it. The architectures exist, and the compute exists at most of these places. But do the incentives exist?
That part I don't know, and I don't know when it will happen, because it's unclear. At a high level, I think we all agree that if we were to encode the biomedical universe, that model should be able to do a lot of interesting things. But that sometimes conflicts with benchmark performance on LMSYS or whatever else you want to use right now. And so it just becomes a question of incentives.

Nathan Labenz: (1:09:30) Yeah. Okay. It reminds me of something I've brought up a couple of times, so I'll keep it brief, but I heard Yi Tay, who was at Reka at the time, on the Latent Space podcast talk about how the separation between vision and language models was sort of a reflection of the research history. At one point you would have a language team and a vision team, and then it was like, maybe we can bridge these together. But then you would have late fusion models, because those things would already be done and baked, and the question was whether you could get them to talk to each other via cross-attention or whatever. And then it sort of became: well, if this works, it'll probably work even better if we do it all with interwoven datasets from the beginning. I can see how that same thing might be about to play out again. I hadn't really heard so much about the benchmark thing, though. You're saying you've observed that adding, for example, image capability makes benchmarks decline?

Anil Palepu: (1:10:36) Yeah, I can't give you full details, but we've tried adding medical images or, say, genomic information. It depends on how you train these models, but we've tried to do things in a pretty standardized, and maybe also the easiest, way. And what we've seen is that while performance obviously goes up on the benchmarks reflective of the new modalities you're adding, you're sacrificing performance on the main original benchmarks that are language-understanding or vision focused. So you have to do this Pareto optimization, and that also requires compute in its own way.

Nathan Labenz: (1:11:11) But that's all in a late fusion paradigm, right? You're starting with a model that's already scoring on benchmarks and then trying to kind of...

Anil Palepu: (1:11:19) Yeah, the easiest thing to do is continued pretraining or SFT. The experiments I'm talking about are not really late fusion; they're more continued pretraining and SFT. Obviously, the better thing to do would be to put all that data back into pretraining and train everything together, but motivating that kind of undertaking requires you to show some benefits. Typically, how this works is you take a pretrained checkpoint, do continued pretraining or SFT, and if you show that your new data improves performance on benchmarks that everyone else cares about, then that data goes back in. And that is where these tensions arise, because these are super esoteric modalities. So we could spend all our time doing that Pareto optimization, throwing compute at it and trying to figure out the best combination, or we take this other approach where we say: okay, it doesn't matter, we'll train our own models, we'll train our own agents, and then we'll system them up together. I think that's where we are right now. It's a little bit of a local optimum, but it's still fine, because it still allows us to do a lot of these interesting things.

Nathan Labenz: (1:12:19) Yeah, the proof is in the pudding. That's really interesting; I'm going to remember to come back and ask a follow-up on that next time you're here. So let's see, not too much time left, and there's so much in these papers that we could cover. Maybe a couple of things for practical utility, for people who are building their own systems out there. One is this tournament-style head-to-head evaluation. It seems to be becoming kind of an industry standard. I don't know if you'd go quite that far, but there's a strong trend, it seems, toward trying to surface the best ideas by doing pairwise comparisons, with some sort of World Cup style round-robin approach. I don't know if you'd add anything else to that. And then I also wanted to talk about structured reasoning as opposed to chain of thought, because I think that's one that a lot of people listening could probably apply to their projects and get a boost, like, tomorrow. So maybe unpack those two things.

Anil Palepu: (1:13:29) Yeah, the tournament one is interesting, because we were actually motivated by AlphaStar, where you have these tournaments of agents competing, and that led to a lot of strong results in that setting. I don't know if it's an industry standard, because I also feel it's somewhat inefficient, and it hints at the limitations of these models in some ways: they're maybe not able to independently score and verify ideas, but rather have to do these n-squared pairwise comparisons. So we do have to do some optimizations where we cluster ideas and group them together to reduce the computation, rather than doing a thousand-by-a-thousand idea comparison, because that would be too expensive. I would think this becomes a bit more computationally efficient going ahead. And the more of that overall ranking you can do in latent space, the more it's going to lead to interesting results, better reasoning, and so on. So I expect the idea to stay, but not to happen as explicitly as it's happening right now; more of it will happen in the latent space of the reasoning of these models. I don't know what you think.

Vivek Natarajan: (1:14:34) And regarding the structured reasoning approach, I think there are practical reasons for it, right? When we have two agents that we want to talk to each other, it helps to have a data structure that we're passing, and that makes the engineering itself a little bit easier. But beyond that, compared to just asking the model to reason about something in a chain-of-thought style, the advantage of defining a reasoning structure is that we can better enforce that it follows a certain path. In our case, we wanted it to do a long analysis before going into higher-level management goals, before finally forming its management plan. Being able to define that structure forced the model to take that path through its reasoning rather than going free-form. If we allowed it to go free-form, maybe it would start to form its management plan before it had done those higher-level steps we want it to go through.

Nathan Labenz: (1:15:38) Yes. To just try to describe this for people who might want to implement it: for one thing, it takes advantage of another notable feature that models have gained over the last year or so, which is that we can now specify, as part of an API call, the exact JSON data structure the model is supposed to return. That's huge, because it makes it really easy to set that up and then get something back. It's a little bit like an airline checklist, where you're saying: okay, I want you to go through these steps absolutely every time, and if you do that, we're confident you're going to get better results in the end, versus just walking out there, randomly wandering around the plane, and coming back saying, yeah, it all looks good to me. So the intuition for that is pretty simple. How dynamic did you make it? Because I've never actually done a dynamic structured output.

Vivek Natarajan: (1:16:38) Yeah, in our case we tried a lot of different structures and strategies for generating the ultimate management plan. When we tried to get a little too fine-grained, like, okay, first summarize the patient's chief complaint, then summarize the next thing, it tended to do worse. Having the flexibility of these higher-level items, like analysis and management goals, things that are pretty general, tended to lead to better management plans. Of course, this is all under our own auto-evaluation and vibe checks, as I mentioned, so that's maybe up for debate. It was dynamic in the sense that it can have any number of analysis items, a list of however many items, any number of management goals. But we tried not to constrain it too much, just to the point where we wanted it to go through a certain reasoning structure.

Nathan Labenz: (1:17:47) Yeah. Okay. So, structured outputs, people: don't overdo it, but definitely use it. That was one of the things that seemed like it really drove a pretty big lift. Going back to the comparisons for one more second, Vivek: are you basically saying that you think in the future it won't have to be so head-to-head, and instead it'll just be, here's ten things, pick the best?

Anil Palepu: (1:18:09) Yeah, I would hope so. In some ways, what we're doing now is forcing the model to do the tree search explicitly: come up with new ideas, go to different nodes, and then do the comparisons. It's almost like, okay, you come up with ideas, you write them down, then you review them, and then you figure out what's best. But the question is, can all of that happen in its head, in some ways? Does it have to be explicitly written down? Do you have to explicitly generate all those tokens? Can you do something in the way you set up the architecture itself, or some other mechanism, so that all of it happens in the latent space and you're more efficient with the tokens you generate? So I feel there's an inefficiency right now. Sure, it helps with interpretability and other aspects, but I think there's a lot to be gained by encoding that search within the latent reasoning of the models, and we don't do that well right now.

Nathan Labenz: (1:19:04) Yeah. Okay, cool, that's helpful. Maybe the last two things. What do you think would happen if you just prepended a question-identification agent to the Co-Scientist and had the thing run in a loop, where the first thing it did, instead of taking a question from a human scientist, was go out on the Internet, search around, come up with an interesting question for itself, and then try to answer its own question? Is there anything about that that you think wouldn't be effective?

Anil Palepu: (1:19:46) We like to talk about the concept of root node problems at Google and DeepMind, and we felt that once you have a question, generating novel, original solutions to it is a root node problem. But in some ways, what you're describing is an even more upstream root node problem: how do you ask the right question? I feel like the day we get AI systems to reach that is the day we can truly say, okay, we have geniuses in data centers. That is going to be the most impactful and important unlock. And my feeling is there should already be decent capability in these models to go surf the Internet, read information, and figure out the right questions to ask. So, yeah, if it's okay, we'll go ahead and give it a try and get back to you.

Nathan Labenz: (1:20:34) Have you had a chance, this is a bonus question, have you had a chance yet to try this with Gemini 2.5? From my qualitative assessment, it seems like it would be a lot better.

Anil Palepu: (1:20:47) Yeah, that's the exciting part, because everything we described in the paper was all Gemini 2, so that should be coming very soon. We're super excited about that.

Nathan Labenz: (1:20:57) It feels like the path to geniuses in a data center is honestly pretty clear at this point, which is a crazy thing to say. But do you see any big barrier questions that you feel are just fundamentally unanswered still? Programmers often call something "a simple matter of programming": it's going to be work, but we can make it work. Is that the mindset right now for you guys, or are there questions where you're like, we really don't have a good answer for how we're going to get over that part?

Anil Palepu: (1:21:32) We have all the building blocks here, so we could probably build something that would look very close to what you're talking about. Whether it's the most beautiful or most elegant version, we don't know, but does that even matter? I think it doesn't. That's why it's truly exciting: we have line of sight to one solution that feels like it will get us where we want to go, and that in turn is going to lead to a lot of new unlocks. So I would say, for the next couple of years at least, it feels to me mostly like an engineering challenge rather than trying to answer fundamental unknowns.

Nathan Labenz: (1:22:05) Yeah. And of course, the other big challenge is going to be the social challenge of introducing this stuff to the world and getting scientists to pay attention. Maybe to close, do you want to talk a little bit about what you're doing in that regard? I was excited to see that we're now getting to the point where you're inviting scientists to reach out and partner with you on this, and also going into, I don't know if you would officially call it a clinical trial, but something in the actual field of medical practice with real patients too. Tell us what you're doing on the deployment side.

Anil Palepu: (1:22:37) Yeah, I think the Co-Scientist is a little bit easier for us to deploy; there are maybe fewer questions around regulation and things like that. The only thing that maybe bothers us a little is that the system is highly capable, and so there are also many ways in which it could possibly not do so well, right? We just want to ensure that as we scale up the system, we do it in a responsible manner. That's why we have the Trusted Tester program. We've already been working with close to a hundred scientists around the world, all world-leading experts, and with the Trusted Tester program we want to invite more organizations. Our hope is that we can do this in batches and waves; with every batch we get feedback, identify the weak points of the system, and improve it for the next batch of scientists. And this shouldn't take too long. By the end of the year, if I'm optimistic, millions of scientists around the world will have access to this tool, and hopefully it raises the bar and the ceiling for all of them and helps them do more creative and interesting work. The one with AMIE is a little more tricky; it's obviously a more complex space. But again, there's a known path to taking such systems, which can give diagnoses and treatment recommendations, out into the real world. So we're very excited about the clinical trial we have coming up. It's going to be, I think, one of the first studies of its kind where an LLM-based system is going to be interacting with real patients.

And the nice thing about this setup is that we're deploying it in a clinic with a sufficient presence of clinical experts who can oversee the system. That's a very safe environment to deploy in, with enough doctors who can take over if something goes wrong. Our hope is that in that study not a lot goes wrong, which then allows us to dial down the amount of oversight that's needed. And if things go well, we'd probably scale it out to more centers, reduce the amount of required expert oversight, and also introduce net-new capabilities into the system we're trialing and make them more patient-facing. So that's the exciting part: the research has progressed quite a bit, and it's now time to take it for a drive in the real world.

Nathan Labenz: (1:24:52) It's exciting times, guys. Really outstanding work. People should be paying more attention than they are, and hopefully we'll put a little dent in the consciousness by bringing some more attention to this. Just really mind-blowing stuff, and quite a series of work that you guys have put out. Anything else you want to share in parting, or any other thoughts you want to leave people with?

Vivek Natarajan: (1:25:15) No, I mean, it's been a pleasure to work on these projects, obviously, and we have a lot cooking still. I'm excited for the future of this.

Nathan Labenz: (1:25:24) Yeah.

Anil Palepu: (1:25:25) Yeah, likewise. For me, talking to Nathan is just a lot of fun, and it's a real pleasure every time. I think there's going to be at least one more thing big enough that we'll come back and talk again in the next few months.

Nathan Labenz: (1:25:38) I just hope you don't get too big for me. That's my...

Anil Palepu: (1:25:40) Nah, we'll obviously be here. I think it's so much fun.

Nathan Labenz: (1:25:43) Cool. Well, really appreciate it. Again, fantastic work. Vivek Natarajan and Anil Palepu, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
