Dive into the future of medicine with Google researchers Khaled Saab and Vivek Natarajan, discussing the breakthroughs in AI for healthcare. Discover how AI doctors may soon provide high-quality medical advice, enhancing global access to healthcare. Learn about the innovative use of large language models in medical applications and the potential for AI to outperform human doctors. Listen as we explore the significance of AI's role in healthcare and its implications for future medical practices, featuring insights from leaders in AI research.
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave search API for free for up to 2,000 queries per month at https://bit.ly/BraveTCR
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
RECOMMENDED PODCAST:
Byrne Hobart, the writer of The Diff, is revered in Silicon Valley. You can get an hour with him each week. See for yourself how his thinking can upgrade yours.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...
CHAPTERS:
(00:00:00) About the Show
(00:04:34) Introduction
(00:06:35) Flamingo
(00:11:50) Importance of data quality
(00:13:29) AMIE: AI doctor
(00:18:44) Simulation Learning Environment
(00:23:26) Sponsors: Oracle | Brave
(00:25:34) Training the Agents
(00:27:29) Tens of thousands of data points
(00:30:35) How to incorporate new knowledge
(00:33:21) Med-Gemini
(00:34:51) Sponsors: Omneky | Squad
(00:36:38) Uncertainty-guided search
(00:39:29) Generalist models
(00:41:16) Med-Gemini, Gemini, multimodal, medical images
(00:44:57) Future work, integration, consolidation
(00:46:00) Cost of AI
(00:52:38) When will AI Doctors be Deployed?
(00:58:22) The Speed of Trust
(00:59:58) Societal Acceptance of AI Doctors
(01:02:06) Uncertainty-Guided Search
(01:06:02) Med-Gemini: Chest X-Rays and CT Scans
(01:17:38) AI Scientist
(01:20:01) What are the principles at play?
(01:22:58) What should we do about AI?
(01:28:24) LLM for democratizing access to care
(01:34:22) Final Message
(01:35:38) Closing
(01:36:45) Outro
Full Transcript
Nathan Labenz: (0:00)
Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg.
Hello and welcome back to the Cognitive Revolution. Today, I am thrilled to share my conversation with Khaled Saab of Google DeepMind and Vivek Natarajan of Google Research. This is Vivek's third appearance on the show, the most of any researcher, and for good reason. As regular listeners will know, when I'm asked why we should be excited about AI, my standard response is to point to the incredible potential value of AI doctors. In a world where access to medical professionals is all too scarce, even in rich countries and so much more in poorer parts of the world, the prospect that anyone, globally, one day soon, could have access to high quality medical advice from their personal device anytime day or night for less than 1% the cost of current first world access, that is simply too valuable to ignore.
And the good news is that Vivek, Khaled, and their colleagues at Google who focus on the medical applications of the latest AI models have made extremely impressive progress over the last 18 months, demonstrating that with a mix of techniques, including strategic data curation and filtering, repeated fine-tuning, uncertainty modeling, and painstaking evaluation, large language models can be effectively adapted to medical applications, including radiology, diagnosis, multimodal understanding of medical records, and many more, often rivaling and increasingly even surpassing human doctors' performance on the exact same tasks. That is exciting stuff.
But what's even more exciting about their work today, to me, against the backdrop of continued hyperscaling and the prospect of an international AI arms race, is how their results demonstrate the transformative value that we can already achieve with current models if people are willing and able to put in the hard work needed to dial in and validate performance. Of course, it's undeniable that each new generation of model brings greater capabilities and makes application development easier, but it's worth noting that some of the human competitive results we discuss today are based on the Flamingo model, which Google originally published more than 2 years ago now in April 2022.
And I was fascinated to hear Khaled and Vivek tentatively forecast that they could probably achieve their vision of a high quality general practice AI doctor even if Gemini 1.5 Pro were the most powerful model that they ever had the chance to build on. This to me strongly suggests not only that current models are indeed in some sort of a sweet spot where they're powerful enough to be extremely useful, but not so powerful as to risk catastrophic harm, but also that we can afford to move with caution through further orders of magnitude of scaling, confident that we won't be leaving all the value we hope for on the table. In other words, there really might be solid ground from which to defend my adoption accelerationist, hyperscaling pauser position.
While Khaled and Vivek are proceeding very responsibly and methodically, cognizant of the fact that people generally need overwhelming evidence before they'll be comfortable trusting AI systems in critical contexts, I personally would advocate for a warp speed project for AI doctors powered by 10 to the 26 class models, while the new sciences of interpretability and AI control are given time to develop. In my wildest dreams, that might even be a joint project that we could work together on with China.
I'm really grateful to Khaled and Vivek for joining me and for the amazing work that they are doing. I truly believe this technology will save many lives in the years ahead, and I hope that by spotlighting it, I can help inspire others to work on high value applications of today's technology rather than waiting for further scaling to solve all of our problems.
As always, but even more so for this conversation, which I think is extremely important, I would appreciate it if you take a moment to share the show with friends. Know too that your feedback and suggestions are always welcome either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Finally, I'm still interested in connecting with AI engineers and AI advisors who are looking for new opportunities. Earlier today, I collected all the responses I've received into a spreadsheet, and soon I will be circulating that with a few friendly companies. So if you want to be part of that, please ping me ASAP.
Please enjoy this deep dive into the medical applications that are already being built on today's foundation models with Google's Khaled Saab and Vivek Natarajan. Khaled Saab and Vivek Natarajan from Google DeepMind and Google Research. Welcome back to the Cognitive Revolution.
Khaled Saab: (4:40)
Thank you for having us.
Vivek Natarajan: (4:41)
Yeah. It's a real delight to be here. I think there are very few podcasts that I listen to in the AI space, and I can positively say that you're in the top two, Nathan. So, yeah, real pleasure to be back here.
Nathan Labenz: (4:50)
Thank you. That's kind of you to say. And I don't know how you have time to listen to any AI podcast given the incredible tear of first party publications that you guys are involved with. So what we're going to do today is just pick up where we left off a handful of months ago, and that is, there have been at least 5 notable publications that I've landed on that I want to walk through in terms of the results that you've delivered, specifically in the application of large language models and the fine-tuning and adaptation of large language models to the medical domain. I think people are still largely sleeping on what is out there in today's world. And so, hopefully, we can call them to attention on all the great work that you guys have been doing, get into a little bit of the techniques, and then in the second half, we can get a little bit more philosophical and figure out how and where these things should be used and where all this is going. How's that sound?
Khaled Saab: (5:45)
Sounds perfect. Yeah.
Nathan Labenz: (5:47)
Cool. The 5 papers that I have queued up are one on radiology, one on diagnosis, then you have the most famous one, which is the Med-Gemini, which is bringing the native multimodality to the medical domain, an immediate follow-on extension of that brings even more modalities to the Gemini or the Med-Gemini family of models. And then one from just this week that is starting to look at the interweaving of chemical structure and notation in with natural language as well. I think that's a super fascinating extension of all this work. So we've got a lot to cover. I think it's important to flag as we go what the sort of base was on this, and we'll touch back on these different base models later when we get to the philosophical section. Let's start with the radiology paper. This one was based on Flamingo, which was funny for me to read because I hadn't thought about Flamingo in a while. That was an early one for me; when I saw the results from the Flamingo paper, I was like, oh, okay. We're not just stopping at language. Basically, in that paper, you started to compare the ability of a fine-tuned vision language model to do radiology reports against what human radiologists can do. You want to summarize those results for us?
Vivek Natarajan: (7:07)
Yeah, sure. So again, I think the common strand among all the work that we've been doing here is we're building on the shoulders of giants in this field, and I think Flamingo is one of the OG multimodal models and papers. And it was probably one of the first to demonstrate few-shot capabilities in the multimodal domain generalizing beyond language. And so this work has been, I would say, a couple of years in the making. And radiology is one of those spaces where I think a lot of the medical AI research has been focused on for a long period of time. And I know that Dr. Geoff Hinton famously had this quote on AI replacing radiologists, which has turned out to be a little bit infamous right now, but I think he's not very far off.
Nathan Labenz: (7:47)
I guess, you know, the question of replacement of radiologists is an interesting one. I was just with my wife's cousin not long ago, who is a doctor, and said that in the hospital where she works, they often wait as much as 60 days to get a scan read and get that report back. And I was like, man, 60 days. That's a long time. And she said, yeah. We're all very frustrated by it. So in some ways, I think some of these replacement or non-replacement questions, at least in the short term, are like, we have a massive shortage, and we need to just do something about that before we'd even start to worry about replacement.
Vivek Natarajan: (8:28)
Yeah. I think the results in the Flamingo paper, I think, are for the first time showing that for the clinically relevant task of report generation and not just classification of pathologies, we are starting to have models that are approaching clinical utility. So I think that was a really cool result. And, obviously, the way we got there was by taking, I would say, a relatively modest and small vision language model compared to the sizes that we have today, but bringing in a lot of high quality data. And, again, a lot of it is not proprietary, but rather open source. And, again, credit goes to folks who have been behind the MIMIC data repository who put that together, and so we utilize that one there. And so when we combined a very strong multimodal few-shot learner and fine-tuned that with high quality radiological data, we start to see this very strong performance on radiology report generation and it starts to approach that of what radiologists can do. So I think that was a very cool result. I think it's one of the first papers to show that. And since then, there's been several other papers, including some of our own work that has shown more progress there.
But I think the more interesting bit in that paper at least was the human AI collaboration bit because we had an arm or a study there where we show that if we passed the AI reports to a human radiologist and they did edits, then when we look at the composite result or the composite systems results, those tend to be better than the AI alone or the human alone. And so that's an interesting nugget. And this goes back to something that I think Dr. Curt Langlotz at Stanford who heads up the AIMI group, he said that AI is not going to replace radiologists, but radiologists who use AI will replace radiologists who don't use AI. And so this is one of those first results that shows what he was saying is maybe not hallucinating, but rather he's speaking the truth. So this is one of those first milestones towards that direction. And again, there's been a lot of other incredible work in this space, not just from our group, but from folks at MIT, Stanford, and beyond. But I think those were the two key or interesting results there in that paper.
Nathan Labenz: (10:26)
Yeah. It's striking to me that, first of all, I think it was April 2022 that Flamingo was first published. It's striking to me that it is relevant even 18 months later. This paper, I believe, was either very late last year or January, but more than 18 months from publication of one paper to the other. That's a long shelf life for a foundation model these days. There's some interesting stuff in there too that I think we've covered in probably enough different episodes, including some of the past ones with you where there are people, if you're listening to this show, you're probably well familiar at this point with the distinction between the narrow classification models of putting an image in and does it or does it not have this particular condition, whereas here what we are getting out is a full natural language report of the sort that your radiologist would ultimately provide back to a patient and their general practitioner. And there's some interesting nuances there too around how data has to be manipulated and preprocessed. You have, if you just train on sort of naive reports, you have a lot of references back to history, which may not always be available. And so you have to clean up that kind of stuff to teach the model exactly how you want it to behave, which is to reason just from this thing and not to infer or hallucinate previous medical history that it doesn't actually have access to. So some of those techniques I think are increasingly well established.
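The preprocessing Nathan describes, stripping out references to prior studies so the model learns to reason only from the image in front of it, can be sketched roughly as a sentence-level filter. This is purely illustrative: the function name and phrase list here are invented, not the actual pipeline or patterns used in the published work.

```python
import re

# Phrases that typically reference a prior study; an invented,
# illustrative list, not the one used in the actual pipeline.
PRIOR_PATTERNS = [
    r"\bcompared to (the )?prior\b",
    r"\bsince the previous (exam|study|radiograph)\b",
    r"\bagain seen\b",
    r"\binterval (change|improvement|worsening)\b",
]
PRIOR_RE = re.compile("|".join(PRIOR_PATTERNS), re.IGNORECASE)

def strip_prior_references(report: str) -> str:
    """Drop sentences that compare against a prior study the model can't see."""
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    kept = [s for s in sentences if not PRIOR_RE.search(s)]
    return " ".join(kept)

report = (
    "Heart size is normal. "
    "Interval improvement in the left lower lobe opacity. "
    "No pneumothorax."
)
print(strip_prior_references(report))
# Heart size is normal. No pneumothorax.
```

Training on reports cleaned this way teaches the model not to hallucinate a medical history it doesn't have access to.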
Vivek Natarajan: (11:50)
The question on data quality, it's one of those important things. Again, the last time we two were talking on a podcast, we were talking about Med-PaLM, and we used the same datasets, but we did not end up using the cleaned up version of the report generation data that we have, which did not have references to prior reports and things like that. And that cleaned up version was curated by Dr. Pranav Rajpurkar and others at Harvard. So there's this notion that in some ways progress is happening in silos between academia and industry, but that's actually not quite true. Right? I think academia does a lot in terms of investing in datasets and curating and providing it and also creating benchmarks and beyond. And I think what we can do from an industry perspective is releasing our frameworks and sometimes open source models that academia can bring on. And so there's this already symbiotic relationship that has been existing for a long period of time. But people sometimes tend to miss that when you look at the discourse on Twitter. So that's maybe one of the things that I would stress.
And then, again, the importance of the right kind of data. And so we had missed that in the prior work on Med-PaLM, but then when we used a well-curated, cleaned-up dataset, and when you combine that even with a relatively small language model, the performance can match up with larger sized models and things like that. So I can never overstate the importance of high quality data here. These are all building blocks of a recipe, and so it's very well possible that the final outcome completely depends on how you combine them together. And so that's a lot of the know-how and knowledge that you gain from working on these systems for a long period of time. And then, you know, okay, what are the ideal combinations of things that you can bring forth together to make them more optimal.
Khaled Saab: (13:29)
Yeah. And I think I'll also add to that. I think you mentioned something interesting there where it took around 18 months from Flamingo to then Flamingo CXR to come out. And there's this general challenge now with generative AI on evaluation. Right? So we're not outputting a single label anymore, and so we can't use our favorite metrics like accuracy. So I think evaluation is a big challenge now. And one of the really cool things I thought about in the Flamingo CXR paper is that it really goes in-depth with doing the gold standard when it comes to evaluation, which would be evaluating these model outputs with a panel of clinical experts. And that's what takes a lot longer is doing those thorough evaluations, but they're necessary in this generative AI space.
Vivek Natarajan: (14:23)
And especially in the medical domain.
Khaled Saab: (14:24)
In the medical domain especially.
Nathan Labenz: (14:26)
Yeah. That's a huge theme. And I think partly applicable to the next paper as well, which is known as AMIE. The headline of this, and this is one that I've clipped one of the main figures from this paper and included in a number of presentation slides. Anywhere where I'm trying to alert people to the fact that there are big things happening in AI that they may not even have heard of, I pretty much always include the AMIE result at this point because, again, for those in the know, this is based on PaLM 2. So we're still, you know, a generation behind the latest and greatest that's come out of Google. But it's funny, right, because I always say too, while AI is moving so fast that PaLM 2 seems old, at the same time, if you'd gone back to 2020 and dropped PaLM 2 in, it would have been absolutely revelatory, mind-blowing. Even 2022 probably would have still felt that way. So PaLM 2 is the base. And this one is, for me, just an incredibly compelling result, which is that, and there is one major caveat, which is that through a text-only chat-based interaction system, an AI doctor, if you will, the system, the AI, can diagnose through a differential diagnosis process more accurately than the human doctors did, human general practitioners, if I understand correctly. And I think that's evaluated probably in multiple ways. Essentially they come out ahead, the AI comes out ahead on all the dimensions, right? It's evaluated as more accurate, the other doctors are giving it higher marks, and the patients are even giving it higher marks in terms of how it scores on empathy and making people feel heard and feel important. I can't get that one out of my head. What other aspects of that result would you highlight for people?
Vivek Natarajan: (16:16)
Yeah. I think in many ways, we were also surprised by the progress on that one. I remember a bunch of us after the Med-PaLM paper came out, we were thinking about this kind of a task or a setup. And we were having a chat. And we were like, okay, how long before we are able to crack diagnostic medical conversations? So the AMIE project had two papers that came out. One was the diagnostic medical conversations paper. The other one was the New England Journal of Medicine Case Challenges. And so we were roughly having this discussion, okay, how long would it take for us to solve them? I think everyone was just unanimous, it will take at least 2 years. And then literally 6 months down the line, we had AI that was exceeding general practitioners on these case challenges, but also on diagnostic medical conversations.
And maybe the first thing I would highlight is this is a very different task in a couple of ways. Right? So until this paper, we had very well curated vignettes that summarized very cleanly the clinical case descriptions and reports and things like that and presented that to the model. And then you're asking you to do a diagnosis. And while that's a challenging task and has been a grand challenge in the field for decades, it's not reflective of everyday clinical practice. Right? So when you go to a doctor, they have to, I mean, they'll first see you. They'll ask you about your symptoms and medical history and things like that. And then they'll embark on this investigative journey, and then they'll ask you to maybe do a few lab tests or give you some suggestions on interventions like medications or things like that, and then gradually come to a differential diagnosis and a treatment plan over a period of time. And so there is this decision making under uncertainty and efficient acquisition of new information. And I would say pretty much until this paper, I haven't seen in any field that we've shown that AI or LLMs are capable of doing that kind of work.
And so this kind of behavior is very different from what you see even with general purpose LLMs out there today, like ChatGPT or Gemini, in the sense that when you have a conversation with them, ultimately, at the end of the day, the way it's set up is for you to guide the conversation and be the driver of the conversation, and the model is out there to assist you. Whereas in a medical context, it's very different. The model has a task and a goal for sure, which is to help you. But the model has to drive the conversation and figure out the optimum plan in terms of asking the right sort of questions and helping you in your care journey, basically. And so it's a very different task. And to me, at least, it was not clear and obvious that AI systems can be trained to do that. But turns out we can. And I think the key reason behind that is the simulation learning environment that we set up there for diagnostic dialogue learning. And I think Khaled can talk a lot more about that. But turns out when you do that with the progress in generative AI, you can actually set up self-play based systems that can learn over several orders of magnitude more data than any human doctor can possibly see, at least in the text realm and text domain. And when you bring that power of simulation into the learning process, then you're starting to see some of this magic starting to happen here in the learning process. Right? And so that's the key thing.
And, again, I wouldn't want to overhype the results too much. And so the last bit I would state is I think the comparisons are not totally fair to human doctors because typically they are not used to text based conversational interfaces. Rather, they're much more used to doing it in person or video calls where you can use other cues to convey empathy and rapport and things like that. Right? And then the second thing is, obviously, AI systems don't tire. So they can, every interaction, they can bring their best possible selves, give you very detailed answers, and things like that. Whereas it's very possible that some of the doctors that we were working with in the course of the study, they might be coming after a long day of clinical practice and they would be perhaps a little bit tired. It's those nuances that we also need to account for, but it is undeniable that significant progress has been made here. Khaled, do you want to talk about the simulation learning environment?
Khaled Saab: (20:02)
Yeah. Yeah. Yeah. Our goal here is to train an AI model to do diagnostic conversation, and so that requires the model to try and get the information needed, what's the patient going through, their symptoms. Once it understands those symptoms, it asks follow-up questions on those. And then the final goal is to come to an accurate and confident diagnosis and perhaps even a treatment plan. When you have this goal in mind, you want to train an AI model to do this, the first thing that came to our mind is, okay, the training data, right? Let's collect training data on actual doctors interacting with patients in the real world and then transcribing those and then training on those interactions. So we did that, but in the first version of AMIE, we saw it wasn't having the high quality conversations that we were hoping for. And when we tried to dig into why that was, we saw that it's because the transcribed conversations from the real world interactions weren't high quality because as we have conversations with each other, unless we're very well rehearsed, we're going to have some awkward pauses, utterances, might be pointing to something. And so all those things add noise, and a lot of the information gets lost when we're transcribing those to text.
We realized we had to come up with something where we curate our own training data, and we were very inspired by a lot of the works that show with these capable LLMs, you can actually curate high quality synthetic data. So that's where we came up with this multi-agent synthetic data framework, where we had one agent acting as the patient, and we would give that agent a patient profile. Then we'd have another agent acting as the doctor and give that agent doctor specific instructions. And then those two agents go back and forth in a dialogue. I want to highlight two things here. With training data, quality is important, as we've been mentioning, but also coverage of medical conditions. In a synthetic simulation environment, you can solve that with what we call a vignette generator. A vignette is like the patient profile. It describes what the patient condition is, what the symptoms are, and maybe social history, medical history, family history, all that information you'd need. We generate tens of thousands of these patient profiles using web search and other tools. So now that solves the coverage issue. So now we have tens of thousands of patient profiles. And now we do this back and forth between the patient agent and doctor agent. But then to improve the quality, we also have this third agent called the critic. And so we give the critic instructions on what a really high quality dialogue is. And so the critic gives feedback on those synthetically generated conversations, and then we integrate that feedback to do another round of that back and forth generation of dialogues. And so that really helped us scale the data coverage and improve the conversational quality and allowed us to train AMIE to see a lot more of these conversations. And I think that was, yeah, one of the key drivers there.
Nathan Labenz: (23:26)
Hey. We'll continue our interview in a moment after a word from our sponsors.
So just a couple questions on exactly how that works. Are you starting with the same base model for all of those different agents within the system? They all start off as, in this case, PaLM 2 with just different prompting, and then do they diverge then over time? Are you fine-tuning the patient agent model distinctly from the doctor agent model distinct from the critic?
Khaled Saab: (23:59)
Yeah. So, yeah, we start off with a base LLM. In this case, it was PaLM 2. And there's some prompt engineering here to make sure that the responses from the patient agent and the doctor agent are as we would expect. So we do some prompt engineering there. And I guess the more capable your LLM is, the less prompt engineering you probably have to do. I think it still is possible to do this with some other LLMs, but maybe some more effort would have to go in there. But the better your LLM is at instruction following, the easier it is to set up these kinds of synthetic generation frameworks. But then to answer your question, we train AMIE. So after we generate those synthetic dialogues, we train AMIE for both the patient role and the doctor role. And so you can do this by having a different instruction prompt. You're the patient, and then you train on the patient turns. And then also have another instruction: You're the doctor now, and then train on the doctor turns. So this AMIE model becomes better at both simulating the patient and simulating the doctor in this case. So then, as AMIE gets better at playing both roles, we then generate the synthetic data again. And so that's what we call the outer self-play loop, where we did this a few times and, yeah, we're using the same model to do both roles. All three with the critic as well.
Vivek Natarajan: (25:26)
Yeah.
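The structure Khaled describes, one base model playing both roles via role-specific instruction prompts, with a critic gating which dialogues feed the next fine-tune, can be sketched roughly as below. Every name here is a made-up stand-in (the `critic` and `finetune` callables, the toy model), not the actual AMIE implementation; the toy stubs exist only so the loop can be exercised end to end.

```python
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    vignette: str                                # patient profile driving this encounter
    turns: list = field(default_factory=list)    # (role, text) pairs

def run_dialogue(model, vignette, max_turns=3):
    """One simulated encounter: the same base model plays both roles,
    steered only by a role-specific instruction prompt."""
    d = Dialogue(vignette)
    for _ in range(max_turns):
        d.turns.append(("doctor", model("You are the doctor.", d)))
        d.turns.append(("patient", model("You are the patient: " + vignette, d)))
    return d

def outer_self_play(model, vignettes, critic, finetune, rounds=7):
    """Each round: roll out dialogues, keep those the critic approves,
    fine-tune on them, and regenerate with the improved model."""
    for _ in range(rounds):
        dialogues = [run_dialogue(model, v) for v in vignettes]
        approved = [d for d in dialogues if critic(d)]
        model = finetune(model, approved)
    return model

# Toy stand-ins so the loop structure runs as-is.
def toy_model(instruction, dialogue):
    return f"turn {len(dialogue.turns)}"

trained = outer_self_play(
    toy_model,
    vignettes=["45F, chest pain on exertion", "8M, fever and rash"],
    critic=lambda d: len(d.turns) == 6,   # approve only complete dialogues
    finetune=lambda model, data: model,   # no-op in this sketch
    rounds=2,
)
```

In the real system, the critic is itself an LLM given a rubric for high-quality dialogue, and its feedback is folded back into regeneration rather than used as a simple pass/fail gate.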
Nathan Labenz: (25:27)
It sounds like it's remarkably not that much data, and maybe I'm not sure how many rounds of fine-tuning. But, again, this sort of foreshadows some of the more high level discussion I want to get into a little bit later. But tens of thousands of data points is obviously pretty minuscule in a world where Llama was trained on 15 trillion tokens. Right? We're talking a very small fraction of kind of pre-training scale. I don't know if you can share any data around how many times you had to turn that crank and what the sort of filter percentage was if you had tens of thousands of cases. Did you have to run a million conversations to get the next 10,000 best out of those? What does the enrichment kind of process look like in a little more detail, I think would be of interest if you can share that.
Vivek Natarajan: (26:21)
Yeah. So I think the key thing is we are not training the model from scratch. We're already building on top of PaLM 2, which has seen roughly the same order of magnitude of tokens that a LLaMA 3 or GPT-4 has seen. So there is that base, and you don't have to reinvent all that.
And I think the second useful comparator is: if you're trying to match and compare with human doctors, how many encounters do they have over the course of their career? If you do back-of-the-envelope estimates, that comes out roughly to the order of tens of thousands. With AI systems, as Khaled mentioned, we have tens of thousands of patient profiles, and we can scale that up very easily with net new data. So you can cover a wide range of disease presentations and symptoms, and not just that, but also socioeconomic statuses, medication history, travel history, and things like that. Very quickly, when you do this combinatorially, you can have millions, hundreds of millions of patient profiles.
And then maybe the other interesting bit is when you're doing the simulation, you can also simulate patient personalities. You might have some patients who are very talkative, some patients who are very worried, some patients who are a little bit adversarial in nature. And a human doctor has to deal with all of them in a composed, rational manner, and we would expect the same of AI systems as well. So you can start simulating a lot of variety, a lot of diversity. And very quickly, when you do the math and do the rollout of the conversations, that goes into the order of millions. And we haven't stopped since the paper. Obviously the checkpoint was frozen at some point in time to do the study, but you can imagine the number of conversations that we have easily going into the realm of hundreds of millions and even beyond.
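To make the combinatorial point concrete, here is a small Python sketch; the attribute values below are invented for illustration, and the real profile schema is surely much richer.

```python
# Illustrative only: made-up attribute values showing how patient profiles
# multiply combinatorially across a few axes of variation.
from itertools import product

conditions    = ["migraine", "type 2 diabetes", "asthma"]
ses_levels    = ["low income", "middle income", "high income"]
med_histories = ["no prior medication", "on beta blockers"]
travel        = ["no recent travel", "recent travel abroad"]
personalities = ["talkative", "worried", "adversarial"]

profiles = [
    {"condition": c, "ses": s, "meds": m, "travel": t, "personality": p}
    for c, s, m, t, p in product(conditions, ses_levels, med_histories,
                                 travel, personalities)
]

# 13 attribute values yield 3 * 3 * 2 * 2 * 3 = 108 distinct profiles;
# with realistic value lists per axis, this quickly reaches the millions.
print(len(profiles))  # 108
```

Each generated profile can then seed one simulated dialogue, which is why the number of distinct conversations scales so much faster than the number of hand-authored vignettes.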
But the key thing is every time you're generating data, that needs to add net new information to the model that it had not previously seen before. Otherwise, it would end up saturating. And so the trick is how do you add that net new information so that the model is learning something new and reducing its overall uncertainty in terms of solving this task? That's the key bit. And specifically for this paper, I believe we had seven rounds of iteration of the outer self-play, and that's where we stopped. But again, the project has not stopped. We're just getting started. So you can imagine a lot more work that's going on since that paper, and this is now several months old.
Nathan Labenz: (28:32)
Yeah. Okay. Two really interesting points there. One, I think seven rounds is maybe the most I've seen. There have obviously been a bunch of papers that have demonstrated this self-improvement-via-critic dynamic, and I feel like they usually top out around five. So even getting to seven goes past what I've mostly seen. That's interesting.
Another interesting question is—this is top of mind right now because I've just been looking into this ARC challenge that's just been announced for these little visual puzzles that today's language models can't do. And there's a big debate as to is this something that you have to have if you're going to qualify something as AGI, or is it not necessarily so important? But the key question I'm wondering about is the world changes. That was one of the big comments—something like COVID pops up. How do you think about incorporating new knowledge or new circumstances into something like this? Do you have to go back and rerun all the loops and fold it in from the beginning? Can you just add it into the eighth loop when a new disease or a new technique or whatever pops up onto the scene?
Facts, in my experience, at least with LoRA low-rank type techniques, have been really hard to get models to learn. Patterns of behavior seem to be a lot easier, but actual concrete facts—I could not get a fine-tuned GPT-3 to know my name. It would know that I was Nathan, but would not know that I was Nathan Labenz, even though I was trying to train it to write as me. I could never get that fact deeply learned, seemingly, at least in any reasonable time frame. So you've done all this, but something changes in the world. How do you respond to that in the context of this sort of system?
Vivek Natarajan: (30:17)
Yeah. I think there are multiple tools at our disposal. I think one of the key things with every round of self-play was the sampling of the patient vignettes that we had. It was not from a fixed set, but rather we were also expanding that. And so in that sense, it was adding net new knowledge already. But we can do that in a more principled way. For example, if a new COVID variant—hopefully not—but if that pops up, then we can start doing that, and we can add it to the fine-tuning mixture. And hopefully, that can be learned well by the model already.
But then I think this is a nice segue into our latest work on Med-Gemini, where we are now starting to give these models access to tools such as web search, which can retrieve real-time information. And when you combine the ability to retrieve real-time high-quality information with strong and advanced clinical reasoning capabilities, then I think we're starting to get towards solutions to the problem that you have outlined, which is a very important and key problem. But I think we have multiple tools at our disposal that can help us address a lot of these challenges already.
Khaled Saab: (31:18)
Yeah. I totally agree. We saw that there are some interesting tricks to get LLMs to interact with tools like web search. If the model isn't confident about something or doesn't know something, it can generate questions and then use web search to fill in those potential gaps for short-term updates in knowledge. But I do believe that we should probably do both: always continue the self-play loop we were talking about, where we keep training on new information so the model becomes more confident about it, but also, for quick short-term things, if something came out the next day and the model needs to know about it, use tools like web search as well.
Vivek Natarajan: (32:03)
Yeah. And maybe the last thing, and I think it's perhaps the most important in the medical context, is this notion of scalable oversight that we are building towards. And that requires models to have better inherent uncertainty estimates. That means more reliable behavior, such as, when a model doesn't know about something, it needs to be able to communicate that back to the patient, but also maybe back off and ask for help from experts in the loop. And so that's the other kind of behavior that we are building towards. That will hopefully enable more scalable oversight of these powerful systems and allow us to safely deploy them in the real world. So that's one bit that we haven't spoken about completely yet, because we still need to do some studies, but I think that's the other missing piece of this puzzle.
Nathan Labenz: (32:46)
Gotcha. Hey, we'll continue our interview in a moment after a word from our sponsors.
Okay. So with Med-Gemini, the part that sort of previews what you're describing there is the uncertainty-guided search. But let me take even one step back and try to summarize Med-Gemini. It's a hard-to-summarize paper because it reports on a family of models, first of all, with multiple different base models. I'd be interested if you can give a little insight into why some of them are based on Gemini 1 and others are based on 1.5. I think there's even a Flamingo-based one, maybe, still part of that mix.
If one were telling a naive story of, okay, here's AMIE, this thing can have a chat interaction and it can do diagnosis, what comes next now that there's Gemini and it's multimodal? You would think, geez, probably chat with video or chat with the ability to put in images. But I would have guessed that the next step would still be this sort of holistic patient-doctor interaction. And that's not really what this paper was about so much. It's much more a lot of different models very narrowly scoped and dialed into very particular tasks. A lot of different tasks, a lot of different state-of-the-art results, a lot of different modalities, whether it's pathology imagery, or scans of various types, or even some genetic information. And then with the extension, you've also got 2D and even 3D scans, so just a ton of different stuff.
I guess maybe why the Swiss Army knife approach there is one thing, and then we can get back into a little bit more of the uncertainty-guided search.
Vivek Natarajan: (34:30)
Yeah. Again, great questions. So I think a good way to think about Med-Gemini is it's the analog to the Med-PaLM paper, but with obviously a much more capable base model. And so if I were to summarize, okay, what is the key takeaway here? I think for me, it is the fact that now we have native multimodal understanding over millions of tokens and million-plus context windows. And that seems like a very big advance.
And so in some ways with Med-PaLM, the key thing that we relied on in order to capture the imagination of the public and show how much progress we've made was the USMLE analog. I think with Med-Gemini, maybe it was a little bit difficult for us to say, okay, this is the one thing that it enables. Because when you have multimodality over million-plus context windows, a lot of new things become possible. And so that's why there's not just this one thing you can point to and say, oh, this is the one thing that's happening, but rather a huge array of possibilities that now suddenly become feasible. And so maybe that's the takeaway for me. It's hard to distill it down to one thing; rather, it's a platform. It enables a lot more new things.
And coming back to the specifics, why do we not have a generalist model anymore? I think it's also a little bit of an evolution of our own research theories and which way we want to build towards. And so I think this time last year, we had not yet demonstrated the idea of a generalist model in medicine. There were a few examples of that in robotics and then maybe in a few other domains, but that still felt like something that no one had demonstrated yet. And so it was interesting to do that, and we did that with Med-PaLM when we showed the capabilities.
But then when, over the course of the last year, we were trying to think about, okay, how do we deploy this in the real world and enable people to make use of it, we realized that there are a lot of trade-offs you need to make. Trade-offs in terms of the kind of data that is brought into the mix, what tasks you are optimizing for, and what the latency and throughput requirements are, which in turn ties back to the size of the models that you're serving. And so it felt like with this generalist approach, where you're trying to cram everything into one model, which invariably means large-ish models, sometimes the cost is not worth it for specific applications. What specific applications demand is trade-offs. And so you need to be able to provide a menu of options.
And so that is why we stepped back from this notion of building one generalist model to do everything, because it felt like, from a research perspective, we had shown that challenge is no longer out there; it's possible to solve. But from an application and real-world deployment perspective, what is really needed is an array of specialist models developed with different constraints in mind so that we can easily deploy them. And so that's the goal. Some applications might demand not a lot of reasoning capabilities, not a lot of multimodal capabilities, but rather very low latency and high throughput. For that, we have the smaller-sized models. And then some would demand advanced clinical reasoning, million-plus context windows, and so on. And for them, we have options that cater to that as well. So that's the reason.
Khaled Saab: (37:31)
Yeah. And I agree with everything Vivek said. I think about the Med-Gemini paper as our first exploration of how well Gemini can do in the medical field and how we should start thinking about specializing Gemini for it. And so in this first exploration, we definitely wanted to look into textual reasoning, like what Med-PaLM and Med-PaLM 2 did, and also advance the techniques there, and that's where the web search and agentic framework came in.
But then also, because Gemini is a multimodal-first language model, we wanted to look into how to specialize it for things like medical images. What was really interesting is that during this exploration, we had this breakthrough in Gemini of the 1-million and now 2-million-token context length. And so we also wanted to start thinking about, okay, how can we explore leveraging that for the medical domain? So it was: Gemini has all these amazing things, and we have all these ideas on how to specialize it to the medical domain. First, what works best, and how well does it do? So let's first report on a broad range of benchmarks. And that's what that paper mainly focused on.
And then we showcased some of the qualitative aspects, having multimodal conversations, which is also a very important part because just having one question and one answer is very helpful, but the ability to interact and have follow-up conversations is what really, I believe, has some clinical translation to the real world and utility. So we showcased some of those qualitative things. And then with the long context, we were looking at new tasks, like if you have multiple EHR logs and visits, some things that are extremely long that couldn't fit into the context window previously, how well would Med-Gemini do in processing that and being able to look up specific conditions or details in that history of the patient? And then also things like surgical video and having conversations with a surgical video, and then asking questions across 12 different genetic papers.
It was just: look at all these things that Gemini can do, and look, these are the techniques that worked really well for specializing Gemini. But then again, you're also hitting on the point of why not an AMIE-style paper. And that's definitely something that we're thinking very hard about. It's just that that requires more rigorous evaluations with clinical experts and takes a bit longer. That's why the first paper was more benchmarks and qualitative examples, and we're working hard on the more rigorous evaluations with clinical experts.
Vivek Natarajan: (40:22)
Yeah, and maybe just to add on to that, I think we're at the stage right now where our evaluation setups and benchmarks are no longer keeping pace with capability advancements. And doing that rigorously takes time. So we still have to rely on imperfect measures like existing benchmarks, which clearly don't capture the full set of capabilities. And sometimes you say, oh, look at this: on this benchmark, say on MMLU, you just got a 0.6% improvement. But that's just one tiny fraction of the capabilities of the entire system. And how do we showcase the rest? That's an open challenge for us as well. We've resorted to doing a mix of quantitative evaluation on benchmarks plus qualitative demonstrations. But clearly, I think we need better measures of progress than what we have right now, which is largely driven by static benchmarks at this point in time. And that's true, I think, for the field overall, but definitely for progress in the medical domain.
Nathan Labenz: (41:12)
Yeah. That makes sense. So I think one takeaway from that is we can stay tuned for future work that might start to consolidate this. This is the proliferation phase, look at all the many different things we tried, and we can expect a consolidation, or maybe integration is the better word, phase to follow that in the not-too-distant future.
Maybe that's what we should expect in general: a new fundamental advance, a new level of scale, a new foundation model, whatever it may be, comes out. Now it's, hey, we've got to go broad and just try this thing on 50 different things and see, in a very broad way, can we characterize it? And then we can bring all that into some form factor that would actually be intuitively usable, whether for a doctor in the field or even a patient. And that does make a lot of sense. And I think the doctors will certainly want to see that sort of detailed breakdown as prior work, or qualifying work, before they would be ready to trust such a system.
One thing that you had said that caught my ear a little bit was cost, and you had said for some things it might be too expensive. I am of the opinion, and I would even say this outside of medicine, that people are overly worried about the price of their AI products today. I hear this fairly often from people that are building general-purpose AI assistant-type products. And I'll say, aren't you using just the very best model available and just stuffing whatever context in there that you need to make it work? And sometimes they'll say, well, it's too expensive. The product would have to be $500 a month or whatever. And I say to that, give me something that works, and I'll pay the $500 a month. It's still probably often order-of-magnitude cheaper than what it's replacing, even if it were $500 a month or whatever. And on top of that, obviously, we have these dramatic price drop trends that show no signs of stopping. Gemini Pro 1.5 is $7 per million input tokens. So it seems like the cost is ultimately not going to be that much of an issue, but maybe I'm missing something. How are you thinking about the cost? What is the current concern, and where do you see that going?
Vivek Natarajan: (43:29)
Yeah. I think the trends are overall in the right direction, as you say. The cost per token is going down dramatically, and it's probably going to continue that way as we do more of this hardware-software integration across the entire stack. And 1.5 Pro is great, but Flash, I think, is hitting the sweet spot in terms of the capabilities-versus-cost trade-off. And we can go even lower for on-device stuff with the Nano models. So I think all that's great.
But maybe the one key thing is, at least at Google, and this is something I got used to over a period of time by being here, the tendency is to not just aim for the top 1% of the population, but rather to think at billion-user, global scale. And for a large part of the world, in addition to the high amount of utility you get from different kinds of Google services, it is attractive that these services are, at the end of the day, free to access. And so $500 per year may not seem so high to someone here. We're very fortunate to be in California, probably the best place in the world to be, at least as far as AI goes. But for large parts of the world, in India and Africa, that's simply a no-go. So how do we bring these technologies sustainably out into the real world in a very affordable manner? I think that requires us to go down even further.
Because one thing that can end up happening is we build up technologies, but the pricing is so terrible that it becomes a barrier to access for 99% of people, and that's completely a non-goal. What we really want to do at the end of the day is leverage the advancements in technology not to amplify existing disparities in care, but to really enable people at planetary scale to have access to the best possible health care. And that requires us to continue pushing the boundaries in terms of cost and access and things like that.
And again, I would just caveat that we are still very early. I think we'll make a lot more fundamental technological advancements before, say, overcoming challenges on the regulatory side to deploy these things, and, generally speaking, achieving societal acceptance of such technologies. But I think it's important not to accept the status quo. I think we can do a lot better, and we will do a lot better.
Khaled Saab: (45:43)
Yeah. And especially if we think about our text-based systems. I guess the cost would definitely change depending on whether we're having a conversation over text versus a conversation with video or audio, which would drive the cost up. And with text alone, as we've shown with AMIE, you can do so much. I've been lucky to be able to interact with AMIE, and it's a really amazing experience, the way it follows up and asks you questions if you have some kind of worry or a condition you're not too certain about. I think the comparison to make here is chatting with something like AMIE versus doing your own research or talking with a clinical expert. And the cost of the text-based approach alone is already dramatically less. I'm just excited to try and democratize it.
Nathan Labenz: (46:35)
Yeah, I certainly share the vision and the excitement for the broad accessibility of this technology. It literally is my first go-to answer whenever anybody's just broadly skeptical of what's going on in AI. Like, why should we—are we going to put all of ourselves out of work or whatever? And I'm like, hey, AI doctor. Let's start with that. A lot of people can't see one, have to take off work, drive a long way, make real sacrifices if they could do it at all. This could be a total game changer for a lot of people. So I certainly share that.
I'm just wondering about the path. I'll maybe ask it from the other end. The first question is: why not pursue the Tesla strategy? Because it does seem like there's something weird with all these systems, where self-driving cars are probably the number one example in my mind, but the AI doctor is probably not too far behind, where it's not enough for it to be similarly effective or even a bit better than the human-provided service; it has to be 10 times safer, it seems, for self-driving cars to be accepted. And we might even be closing in on that, or soon to be, but until it gets to the point where it's totally undeniable, people are just not quite ready to embrace the technology. The Tesla strategy is to make an expensive version first and use that to subsidize the development of bringing the cost down. Google has enough resources that maybe it doesn't even need to go that route.
But asking the similarly provocative question from the other end, I feel like a lot of people around the world literally can't do better than AMIE today. And so I wonder at what point does it become almost a societal or global obligation to actually deploy the technology even if it isn't 10 times better than the best doctor? I think we suffer a lot of times from weirdness in our comparison. This isn't necessarily as good as the care that I might get at Stanford Medical Center, therefore it's not fit for anyone, when in reality it's like, that is not scaling. This could scale. And I feel like a lot of people around the world would be very grateful to have it. I'm sure that's something you guys talk about internally. At what point does this hit a point where it almost becomes a moral requirement that we put it out into the world? Where do you think we are on that journey? Because at some point, we're getting there. I can't imagine the 2027 where you guys continue down this path and we don't have something that would be almost a moral imperative to deploy. How close do you think we are to that? And how do you think you approach it as you hit that threshold?
Khaled Saab: (49:21)
Before you answer, I wanted to share something. When I first joined Google last year, I remember we were in a summit and I asked this. I just raised my hand. I'm like, these LLM services are free, people are using them, so why can't we just have the same thing but for these AI doctor-type models? And I learned a lot from that answer. The medical space just has so many subtleties when it comes to FDA regulations and the way you need to evaluate things. But I wanted to point out that I was thinking along those same lines too. I just wish it were that simple.
Vivek Natarajan: (50:00)
Yeah, honestly, I'd say it's probably the single biggest question or dilemma that keeps me up at night these days. We are in a very privileged position that we can build out such powerful and capable technologies that can have such societal-level impact. But the key thing is to do that in a safe and responsible manner as well. And I think one of the challenges is being able to communicate progress, but do that in a responsible manner. A lot of the progress that we've been communicating has been capability advancements. And we've tried to make sure that when we put out our papers or one of our products, we add the necessary disclaimers that capability does not mean reliability in the field. There is more evidence that we need to accrue to show that these things are actually better than the standard of care. And that requires us to do controlled studies by integrating them into real-world clinical workflows and putting them as interventions in people's care journeys. We are definitely doing that, and we hope to announce something very soon about the kind of work we're doing in this space. But as a matter of fact, as things stand today, we don't have enough data to say that these things are meaningfully improving the standard of care. Yes, every now and then on Twitter there's a post that goes viral saying, oh, I used ChatGPT to diagnose my condition, and this is much better than doctors. But how much can you rely on anecdotes? Will regulatory agencies be convinced by that? I don't think so. I think you need to be more rigorous. And there is a well-defined process for doing that. In some ways, the way you develop and bring these technologies into the real world is not too different from how we bring drugs to market, for example. There's a well-established process where you first evaluate things in simulation, which is the analog of in vitro lab settings.
And then if things look promising, you progressively step through different phases of clinical trials. And then you finally bring it to market. Obviously, it's a very painful process. It takes years and costs a lot of money. But there's a reason things are set up that way, and that is primarily to ensure the safety and well-being of the patient at the end of the day. So in many ways, we are embarking on a similar journey. With the AMIE papers, for example, we showed in simulation, with patient actors rather than real patients, that this thing works. And so now the next step for us is the equivalent of the actual clinical trials: put these systems into real-world clinical workflows, put them as interventions into patient care journeys, and get the data readouts. And the first thing to check in those readouts is not even the efficacy of the system, but rather its safety. Once we are confident about that, we can gradually step up the kinds of capabilities we expose and take them through these different phases. I hope it doesn't take us 10 years, but rather much shorter than that. But once we do that, I think we will have enough rigorous data to say that, yes, this thing can now be safely deployed in the real world. Because the other thing that I would really personally hate is this: I come from a part of the world where we don't generally have access to the best resources, and we are almost always further behind. But I don't think it's ethically the right thing to do to take unproven, untested technologies and dump them in those parts of the world. So I think it's very, very important to test the safety and efficacy of these systems. But it turns out that to test the safety of these systems, you need oversight. And oversight necessarily means the availability of doctor resources to immediately step in when something goes wrong.
And so imagine AMIE having an interaction with a patient, and maybe it says something that's incorrect. There is a liability. And so the only way to ensure patient safety is a doctor overseeing that entire conversation and immediately stepping in with a phone call or something like that, saying, this has gone wrong, this is not the right thing to do. Because otherwise, it can even have life-and-death consequences. And so that's a challenge: even testing the safety of the system requires resources, medical resources. As things stand today, those kinds of resources are available only in the Western world. And so that's why we are partnering with well-resourced healthcare organizations to test the safety of the system first. Then, once we have promising data readouts, we can step up through the process. But I guess there are two key things in healthcare overall. One is that there are clearly no shortcuts. If we end up taking shortcuts, then we are going to set the field back by several decades. So it's important to do things the right way. And the second thing is that everything in healthcare moves at the speed of trust. So the more trust we can build up in the system with all these different stakeholders, whether that's the patients themselves, the doctors, or folks in the regulatory agencies, the better it is. And it may be a little bit counterintuitive, but I think doing things the right way, not taking shortcuts, will accelerate us and get us sooner to the future that I think we all envision.
Nathan Labenz: (54:37)
Yeah. It's funny. I feel like AI scrambles so many different debates, and even my own perspective I feel is weirdly scrambling. I feel alien to myself sometimes when I think about how for most of my life, I've thought about technology and regulation versus how I think about it now. And I do find myself in this weird place where I'm actually afraid of just full pedal to the metal AI scaling. Let's make the very most powerful systems that we can. And on the other hand, I'm like, but I do want my AI doctor sooner rather than later, even if it's only on par with the human equivalent or only slightly better. It does feel like waiting until it's that 10x better leaves a lot on the table, but that might just be the only way to achieve the trust that you talked about. And that is a great sound bite. Things move at the speed of trust in medicine is definitely an important insight for technology people to keep in mind.
Vivek Natarajan: (55:42)
Yeah, maybe one quick thing I'll also add: the other thing that's not very obvious to me yet is societal acceptance of such technologies, this quote-unquote AI doctor. It's very obvious that as we keep progressing capabilities, we'll have superhuman AI diagnosticians. But the key question to ask is, do people really want that? To me, as things stand today and beyond the Twitter bubble that we live in, the answer is maybe not obvious, because for most people, their interactions with AI systems are actually quite terrible. I was talking to my mom the other day, and she doesn't use ChatGPT that regularly. The only places she has encountered AI are when she's trying to call up an airline or a travel agency, and there's a weird AI wall that prevents her from getting to a human and getting things done. And so for many people like that, their experiences with AI, and a lot of it is AI from the previous generation, even things like Alexa or Google Home, are not the best. So the expectations that people have of AI are maybe quite low at this point in time, and the trust they have in AI systems is also quite low. That's primarily because in San Francisco, in Silicon Valley, and maybe in the Twitter bubble that we live in, technology tends to diffuse very quickly. But in the real world, diffusing at scale, at billion-user scale, takes a long time. It takes decades. I would not be surprised if only one-tenth of the population has seen GPT-4 or Gemini yet, and we're still so far away from real diffusion and adoption of this technology at scale. And so that's the other key thing. For such a technology to be really useful and helpful, it requires that adoption at scale. And we are nowhere near that yet in terms of societal acceptance of such technologies.
Nathan Labenz: (57:29)
Yeah. That definitely is going to take time. I have similar conversations with my parents as well, not infrequently. And it is striking. My mom is a Gemini Advanced, or Google One, subscriber, and she said, oh, I tried asking Gemini to plan me a trip, and it didn't do very well. And I was like, oh, yeah, they've got something coming for that, for one thing, but also you might want to go to a specialist tool that ties in APIs. And so there are a lot of things where even just selecting the right tool in today's world is...
Khaled Saab: (58:00)
Yeah.
Nathan Labenz: (58:00)
Not easy. I just want to go quickly through a couple final things here and then get back to this question of where we're going, hyperscaling versus domain specialization, because I think that might be the most important question for the big picture over the next couple of years. I guess just briefly on the uncertainty-guided search, because we had teased that and hadn't closed the loop on it. If I understand it correctly, it's basically just that you run the generation multiple times and see if you're getting a consistent answer. I don't know if you're looking more deeply at perplexity scores or something like that, but basically asking, do we appear to be confident here in what we're saying, or do we have enough uncertainty that we need to go acquire more info? That would also seem to be a basis for a flag up to human oversight, or even just a disclosure to the user that this is off the happy path. Any more that you would want to highlight on the uncertainty-guided search?
Khaled Saab: (59:09)
Yeah. So there were two things that we realized we had to get right for this to work. If you just use web search naively, where you tell Gemini, for each question, to generate some search queries that will help solve the question, and then you fetch those search results and add them to the input, you might in fact confuse the model, because you might be pulling in irrelevant facts, and the more information you give the model, the more possibilities there are for something to go wrong or for it to latch onto something incorrect. To really get that right, there are two parts: one is generating the search queries, and two is how you integrate the search results. For generating the search queries, we realized we had to do it in a way that was very specific to what the model was confused about, not just about the general question itself. The way we did that was by telling Gemini to generate search queries by looking at conflicting responses. So, as you were saying, we would generate multiple responses, and when we had a lot of conflicts or disagreements among the responses, we would take those conflicting responses and generate search queries from them. So the search queries target what the model is specifically confused about. That was the first part. The second part, how to integrate the search results, is where we saw the self-training helping. In the training part of that algorithm, we train with search results in the context, so that MedGemini was used to seeing search results. And in training, MedGemini knows what the right answer is, so it also learns how to extract the specific relevant parts of the search results and not over-rely on them. Those two, I think, were the key ingredients to getting it to work well.
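The loop Khaled describes can be sketched roughly as follows. This is a toy illustration, not the MedGemini implementation: the stub `sample_answer` and `web_search` functions, the agreement threshold, and the query format are all hypothetical stand-ins for the real model and retrieval components.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    # Stub standing in for a sampled Gemini call; a real system would hit an LLM API.
    return rng.choice(["aortic stenosis", "aortic stenosis", "mitral regurgitation"])

def web_search(query: str) -> str:
    # Stub retrieval step; a real system would call a search API.
    return f"[search results for: {query}]"

def answer_with_uncertainty_guided_search(question: str, n_samples: int = 5,
                                          agreement_threshold: float = 0.8) -> str:
    rng = random.Random(0)  # fixed seed so this toy example is reproducible
    samples = [sample_answer(question, rng) for _ in range(n_samples)]
    counts = Counter(samples)
    top_answer, top_count = counts.most_common(1)[0]

    # High agreement across samples: treat the answer as confident.
    if top_count / n_samples >= agreement_threshold:
        return top_answer

    # Otherwise, build a search query aimed at the *conflict* itself,
    # not just the original question, then answer again with results in context.
    conflicting = sorted(counts)
    query = f"{question} -- distinguish between: {', '.join(conflicting)}"
    context = web_search(query)
    return sample_answer(f"{context}\n{question}", rng)

print(answer_with_uncertainty_guided_search("Most likely diagnosis?"))
```

The key design point, per Khaled, is that the retrieval query is derived from the disagreement between sampled responses rather than from the question alone, so search targets exactly what the model is uncertain about.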
Nathan Labenz: (1:01:11)
Yeah. Cool. Yeah. Those are good tips. That makes sense as to why it wouldn't just be a token level perplexity indicator, but an actual contrast of full responses. So that's interesting.
Vivek Natarajan: (1:01:24)
Yeah. Maybe one thing: it was an incredible amount of work from Khaled and Thao, and I think Khaled's wife, Kristen, has a video of the effort that Khaled put in over a three-month period, videos of him coding at gas stations and things like that. But yeah, in some ways, I think the work that we did there was maybe not the most optimal, I would say. Ideally, we wouldn't have to regenerate every single time; we would rather have the model produce its own verbalized uncertainty estimates. And so we're moving towards that. The other thing is, again, these are the baby steps of an agent framework being put in place here. Search is one of the tools that the model will have in its arsenal, but it'll have a lot more, including the ability to talk to maybe more specialized systems. That's what we are building towards, so there are hints and clues of what we are doing. But for this specific paper, a lot of it was optimized towards doing well on the benchmarks, and that's why we have that specific instantiation of the technique and the approach.
Nathan Labenz: (1:02:16)
Gotcha. Cool. I think this is something that feels to me to be much underappreciated when people talk about whether we're going to run out of data. For one thing, it's obviously been a big topic recently, and that tends to focus attention on high-quality text data. But I'm always like, man, there are a lot of other modalities out there, and a lot of that stuff is just massive deposits of data that have not been tapped into, when you think of all the X-rays and the MRIs and the pathology tissue imagery and even just the chemical information itself. Of the last two papers, one is an extension of MedGemini into even more modalities. I'll read a quick quote from that one because I've got a couple notable things: chest X-ray report generation across two separate datasets improved by an absolute margin of 1 and 12 percent, where 57 percent and 96 percent of AI reports on normal cases, and 43 and 65 percent on abnormal cases, are evaluated as equivalent or better than the original radiologists' reports. So with these 2D chest X-ray type tasks, we do seem to be hitting the point of basic clinical utility. Right? It's hard to read that another way. And then there was another quote that I pulled out: this is the first-ever large multimodal model-based report generation for 3D computed tomography, a.k.a. CT scans, with 53 percent of AI reports considered to be clinically acceptable. So those are pretty notable results. And I again just think, man, as we start to integrate these other modalities of data into potentially even the pretraining mix... If I understand correctly, all this stuff is done in a kind of post-training phase, such that the models have seen whatever they saw on the Internet, but they weren't really concentrating on CT scans, for example, in pretraining.
You're bringing a relatively modest dataset to this, creating these distinct encoders, basically creating a framework to add all these different new modalities in after the bulk of training is done. Do you think that it continues to be that way, or do you think that where we're going is more like CT scans and all this stuff just get dumped into the pretraining mix and it all is just native at some point in the future? Natively multimodal means literally all the modalities in, say, a year or two?
Vivek Natarajan: (1:04:47)
Yeah. I think it's an open question for me. The part I would like to see is, again, more generalized encoders. So I wouldn't want to see hundreds of 2D encoders and hundreds of 3D encoders, but rather I would want any 2D modality to be processed by one single encoder, regardless of whether that's a natural image or a medical image. And same with 3D. Right? I would hope that we have encoders that are able to process videos and 3D medical images equally well. But as things stand today, that is not the case. You do get quite a bit of boost when you have specialized encoders. And I think that's where maybe also the question or the challenge of whether you throw that into the pretraining mix or whether you do adaptations or post training also comes in. So if we have generalized encoders that are shown to work well across different types of data modalities, then I can imagine a lot of this data starting to be mixed up with the pretraining mixtures. And then you don't have to maybe do a lot of the specialization or the adapters or the post training work that you do. But if we don't get there, then maybe the specialization route would hold on for a little bit longer. So I think that's the interesting question that needs to be answered in the next few months. So can you, for example, train a 3D encoder that does equally well on 3D medical imagery, like CT scans or OCTs or MRIs or whatever, and also video data, for example? And if you're able to start showing that, then you can reduce the need for specialized encoders, and that in turn makes a lot more things feasible at the pretraining space. Because yeah, again, if you imagine the different stages of model development, pretraining is the most general catch-all. And then you gradually have a narrowing of the funnel with more specialization. 
And so the more general purpose encoders that you have and the more you can show that those general purpose encoders can interpret a lot richer modalities of data, the more you can start throwing in interesting mixtures there. But we are not there yet right now.
Khaled Saab: (1:06:36)
Yeah. I agree. It would be nice to move to just general encoders that can really capture everything really well. But I would also argue that for something like the medical domain, it is definitely worthwhile to put in more resources into that specialization piece. And then even if we have these really broad and good encoders that work very well for the medical domain and we don't need the specialization, I think we would still need specialization for how we want the models to behave in a clinical context, so like the diagnostic conversation and how to follow up with the patient and things like that. So the behavior piece, I still see that as, even if we solve the encoder piece, we would need that specialization.
Nathan Labenz: (1:07:23)
Yeah, that makes a lot of sense. Okay, so the last paper on my list is TX-LLM. This is another interesting one where, again, essentially more modalities. This time, if I understand correctly, representing chemical structure basically just as text tokens in the way that any former pre-med student can recall the Cs and the Hs and the Os and the way that those are laid out in a string to represent a chemical. The interesting finding here is that you can weave those into training data alongside other text and lo and behold, this is obviously becoming a recurring theme, the model learns how to deal with that kind of stuff too and seems to be developing a sort of set of higher order concepts. This is the remarkable finding of interpretability that the models are representing these higher order human recognizable concepts, love, justice, fairness, unfairness, whatever, as a means to predicting the next token. It seems like here there's something similar happening, but what I find so fascinating about these other modalities being woven in this way is that probably in a lot of cases, we don't even know what those higher order concepts are. So I guess my expectation for this sort of work is that it's a remarkable first step that we can weave this stuff in and then we can start to get these guesses out. But I also see interpretability being turned on these models and doing this sort of sparse autoencoder type work to identify what are the internal concepts and then actually discovering new concepts about the world that the models have actually learned as a means to next token prediction that we didn't even have coming in. Is that how you understand what's going on here? Is that where you see things going as well?
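As a concrete illustration of "chemical structure as text tokens": SMILES is the standard plain-text notation Nathan is alluding to here. The prompt format below is purely hypothetical (the conversation doesn't show TX-LLM's actual prompts); it just shows how a molecule can sit inside a text sequence like any other tokens.

```python
# Illustrative only: the exact TX-LLM prompt format isn't described in this
# conversation. SMILES is the standard plain-text line notation for a molecule.
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # acetylsalicylic acid (aspirin)

# Hypothetical instruction-tuning style prompt weaving the molecule into text.
prompt = (
    "Instruction: Given the SMILES string of a drug candidate, "
    "predict whether it crosses the blood-brain barrier.\n"
    f"Molecule: {aspirin_smiles}\n"
    "Answer:"
)

# A character-level view makes clear this is just a text sequence; real models
# use subword tokenizers, but the principle is the same.
tokens = list(aspirin_smiles)
print(len(tokens))  # 24 characters
```

Because the molecule is ordinary text, it can be interleaved with natural-language instructions in the training mix, which is the "weaving in" Nathan describes.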
Vivek Natarajan: (1:09:18)
Yeah, maybe I'll take a step back and explain the motivation of this work a little bit more. So I think if you look at scientific history and if you look at scientists like Alexander Fleming or Jonas Salk and others who discovered penicillin and the polio vaccine, they used to all be practicing clinicians. And so they would see patients during the day and then use those insights and come back at night to do the laboratory experiments. And so in many ways, obviously, our primary mission is to build out medical superintelligence that can democratize access to healthcare. But the process of doing that naturally feels like trying to encode the biomedical universe. And that means data from across the entire biological stack. And so starting all the way from subcellular molecular measurements, DNA sequencing, RNA sequencing, protein data, all the way up to medical imagery, EHR clinical data, and population health level data. Right? And so when you start encoding that at scale and train these models to learn useful representations, then these models can start doing interesting things. And the overall composite system that you have or you probably will end up having is like a hybrid AI physician-scientist. And so this model is not only going to help us fundamentally democratize access to healthcare, but it's perhaps also going to help us improve our understanding of human biology, help maybe design better therapies, and really help scale personalized healthcare to everyone. And so that's the broader longer-term vision, and you're seeing specific instantiations of this overall composite system through the work that we did on AMIE and also this one. And I think in this one, again, this is a little bit of an older work, and so a lot of the techniques that we used are from the Med-PaLM era, if I may use that phrase. And so all the data is textual data in there. And the model was not necessarily optimized for conversations. It was just instruction tuning at the end of the day. 
But the interesting result is the fact that you can possibly have a single generalist model that you can use across the entire therapeutics drug discovery pipeline. And so that includes tasks at discovery slash target identification phase all the way up to clinical trials and things like that at the final stages of the pipeline. And so it was interesting. I think there are some things where we are still much further behind state-of-the-art specialist models, but then there are other things where we are already very good. And the interesting thing is this transfer learning that is happening between different therapeutic modalities and learning of interesting representations. I wouldn't say that we've done enough work in terms of uncovering what sort of unique insights that these models are learning. So there's a little bit more work to be done there, but I think the results are promising. And then I guess maybe one other paper that I would want to talk about is the work that we're doing a little bit on the biomedical discovery slash genetic discovery side of things. And so there, it's not the inherent representations that these models learn themselves that we are using for discovery, but more rather the generative outputs that these models produce. And so if you just generally think about LLMs and modern multimodal language models, by training on a lot of data and especially scientific literature and things like that, they are already encoding more knowledge than any human can, than any scientist can. And obviously, hallucination is a big challenge with these systems, but the flip side of hallucinations is creativity. And fundamentally, for advancing science and discovery, you need creativity. And so the idea there is can we tap into that ability and capability of these systems? And so in very preliminary work from late last year, we showed that you can use these models to identify causative genetic factors of different kinds of rare diseases. 
And so specifically, the work that we did in collaboration with some awesome collaborators at Stanford was that we used these models to come up with a hypothesis for a hearing loss phenotype in mice. And then the collaborators did CRISPR knockout experiments to validate the hypothesis. And so that itself, the data is very promising. And since then, we've used it to also look at human variants of unknown significance because, again, rare diseases and undiagnosed diseases are a huge challenge. And the more we can get at in terms of identifying causative genetic factors, the better we can do in terms of providing care to such people. And so to me, that's one of the most exciting bits about the kind of systems that we are developing. I mean, we focus quite a bit on the clinical aspects, like diagnosis and treatment management. But for me, the other exciting bit is in general the fundamental understanding of human biology, the fundamental understanding of causative mechanisms of diseases and being able to then design better therapies and things like that to really just scale personalized healthcare. Right? I think that's where we are stepping towards.
Khaled Saab: (1:13:54)
And this is why it's an honor to work with Vivek because he's an ambitious visionary in the field of medical AI, and there's no shortage of amazing things to work on.
Vivek Natarajan: (1:14:04)
You're too kind.
Nathan Labenz: (1:14:06)
Yeah, that definitely resonates with me. It seems like we're headed for, a singularity might be a little strong, but I am struck by just how on track we are for a lot of the late nineties, early 2000s, Kurzweil style predictions and also how timely they have been. We're not that far off from the curves that they drew in those books 20, 25 years ago now. It seems like in addition to the foundation models learning these new modalities, whether it's genetic sequences or these all these different scan type modalities or even these chemical notations, they're also going to be trained to use the specialist tools. So if I just try to project a little bit out into the future, it's not only that the core models get better and that they get this sort of more agentic post training finishing to be able to recover when they run into obstacles and come up with a new approach to try to accomplish their goal, which a lot of times these days, I feel like they are smart enough to do it, but they just don't quite have the pattern of behavior that I need them to have to get over humps on things. But then in addition to all that, it's like they're also going to have AlphaFold 4 to call on when they need to, and they'll presumably be trained to use all these. And there's obviously a version of that for material science, and there's a version of that for basically everything under the sun. So it seems like we very much are on track to the AI scientist. And I guess I both am very inspired by that and a little bit fearful of it just because I do think these things seem like they are on track to become more powerful than any of us individually certainly are. Maybe not necessarily to overwhelm the collective in the next couple of years, but definitely I would not bet on myself to go head to head with AI scientists and try to out-discover it over the next couple years. I guess, is that your expectation? And if so, how do you feel about it?
Vivek Natarajan: (1:16:14)
No, I think I share your sentiment. There's this repeated joke on Twitter right now where people say, "Oh, we were promised flying cars," and then we got 280 characters. It feels like we were promised flying cars and we got much better: GPT-4 and Gemini-style AI agents, AlphaFold models that can do protein structure prediction, self-driving cars, progress in AR, VR, Vision Pro, and things like that. And taking a step back, I would say that the technological progress that has happened over the last decade has been quite incredible, and it feels like multiple different technologies are converging, and we're going to be accelerating quite a bit. Again, at the end of the day, I think the key thing is, how do you wield such powerful systems and technologies? What are the principles that apply here? With any kind of technology, there's always dual-use capabilities. It was true 5,000 years back when man discovered fire. It was true when we discovered the steam engine, electricity, and nuclear energy. And I think it's the same as we now advance with AI capabilities. If you look at what has happened in history, obviously there are concerns with such powerful technologies, but we've ultimately, as a society, figured out a way to use these systems in a manner that optimally benefits a lot of people. And I would expect that to be no different here. Obviously, there's going to be a lot of discussions, a lot of churn, and a lot of societal-level questions that need to be answered before we get there with AI. But I'm hopeful in humanity for sure. I think we'll figure it out.
Khaled Saab: (1:17:47)
Yeah, and perhaps to get there, to that AI scientist vision and to make progress there, the one key blocker is building a simulation environment to test hypotheses. Because right now, as Vivek was describing, the language model gives us a hypothesis. We have to go to the wet lab and do testing, and that takes months. If we can improve simulators—biological or whatever domain—then we can have that iteration loop be much faster and try out many more hypotheses. I think that would be a key breakthrough there as well.
Vivek Natarajan: (1:18:26)
Yeah. Maybe one final quick thing: engineering the right amount of safety in these systems. Obviously, we have this AI physician-scientist vision, but the key thing that we've been stressing is always having the expert in the loop to be able to validate or control the kind of experiments that are being run. We are not trying to, for example, integrate the system with a robotic process automation lab environment because we currently don't have a very good handle on how to safely control this. So it's about doing things in a manner where you're very clear that development is happening with safety first and foremost as paramount. Obviously, we will figure things out. We'll get better at simulators. We'll get better at controlling these systems, and that will help us in turn accelerate progress further. But yeah, the key thing is having the expert in the loop right now.
Nathan Labenz: (1:19:13)
So the last big topic I wanted to bounce off you guys is my own position. I always say we need to be more focused on what is first before we can jump to what ought to be done about it. Almost everything I've done in this process of making this podcast has been trying to just get really clear on what is, why is it working, all that kind of stuff. But at this point, we've got legislation proposed, and there's definitely a shift toward "what should we do about it?" So I'm trying not to shirk my duty as a commenter and at least have some go-to answer. My answer right now, if people were to just say, "Hey, Nathan, what do you think we should do about AI broadly?" I use this phrase—I describe myself as an "adoption accelerationist, hyperscaling pauser." What I mean by that is basically that I'm really excited about the AI doctor and the AI coder and all these different applications, but I am worried about creating something that is genuinely superhuman. It does seem like that's quite plausible to arrive with a few more orders of magnitude of scale. Obviously, nobody knows—that could totally flatline and not work, or it could work faster than people expect. Maybe it's here—it seems like we are hearing increasingly 2027 expectations from people. When I say "pause," I do mean genuine pause in the sense that I don't think we need to stop forever. I do think that we are making incredible progress on interpretability and control, and that line of research is actually going much better than I had expected it to go 18 months ago. But it does feel like it's maybe still struggling to keep up with just the raw scaling drumbeat that's happening. What I wanted to ask you is: if that's my recommendation, is that a coherent position? And when I say "is that a coherent position," it means can I have my AI doctor without too much more scaling? I think right now we're in the sweet spot where GPT-4 and Gemini 1.5 are powerful enough to be really useful. 
Probably the next generation still is in the sweet spot where they're powerful enough to be really useful, but they're not so powerful that I have to worry about major accidents happening or just overall systems-level stuff getting out of control. But then the question becomes: can we actually get what we want from that level of power? You can answer this in any number of ways, but ways that I was thinking of putting the question very specifically would be: if Gemini 1.5 Pro or if we imagine a Gemini 1.5 Ultra or whatever was the best language model that you could have, do you think you could still achieve your vision of the AI doctor with all of the post-training optimizations, validations, and elbow grease that you're putting into it? Or do you think there is something fundamental that still is yet to be unlocked that needs the Gemini 2 or Gemini 3 to actually get there?
Khaled Saab: (1:22:13)
Yeah. Maybe I'll say a couple thoughts.
Vivek Natarajan: (1:22:17)
I think it—
Khaled Saab: (1:22:18)
Maybe the idea of superhuman intelligence—I don't know why we necessarily should be scared of that, because in a lot of ways, the tools we have now are already, in some ways, better than humans. A calculator can be better at doing calculations than a math PhD, right? Or a computer might be better at recalling certain things or having larger memory. And so our tools are already superhuman on a lot of axes. As long as those tools are fully controllable by us and we're able to use them to benefit humanity, I think that's a great thing. For me—maybe it's because I'm an AI researcher—but it's easy for me to put both hats on, where one hat is, "Wow, these AI systems can already do amazing things, and the results with AI are pretty mind-blowing." I think there's no argument that the pace of AI improvement is incredible. But at the same time, it's easy for me to switch that hat and go, "Wow, these AI systems still have so much to improve on." Hallucination is the first thing that comes to mind, and it is a serious issue. Even with AI doctors, I think that is an issue that keeps me up at night: how do we fix this hallucination issue? I think that we can do a lot with Gemini 1.5 to benefit humanity. If progress did pause at that, we could still get a lot of benefit from building AI on top of Gemini 1.5. But to really solve the issues that keep me up at night with hallucination, especially in those settings, I think we need to continue improving these models, because they can fail on very basic things, as we see in viral posts on Twitter that show a very concerning lack of reasoning. At the same time, they do amazing things. So I think we do need to improve those concerning error modes. Of course, it's an open problem. I don't know if scaling is the solution, but I think we need to keep thinking hard, in a research sense, about how to close those gaps when it comes to those concerning reasoning errors.
Vivek Natarajan: (1:24:39)
Yeah, I'm very confident we will solve hallucinations. I tend to agree with Khaled here. If you interact with these systems day to day, you simultaneously feel like these are the smartest systems you have ever seen and also, in some ways, perhaps the most stupid. And maybe to answer your question very directly: do I think we can build some sort of system that can democratize access to care with where current LLMs are today? I think that is possible, and I take a lot of inspiration from how self-driving cars and Waymo have evolved. We don't have AGI for sure, depending on the definition, but we have self-driving cars that are functioning and are perhaps the most magical experience that AI has to offer. They have shown the way, and the key thing is that it didn't happen magically, but rather through a lot of systematic development and investment in safety, simulation, and overall systems engineering. I think for us to get to a system that can democratize access to care, we will need the same kind of thing. The real challenges are things like hallucinations, but also dealing with the long tail that you see here. That ultimately comes down to the kind of data that you can bring to bear for these systems to learn and experience from, and also how you actually train these systems. There, I feel like we still don't have the best possible handle. We need to get better at things like process supervision to really improve the overall reasoning process towards solving a given task. I would say that with the base capabilities we have today, we can definitely build it, but it'll take us a bit more time. If we have more capable systems, it will take us a little less time. That's one of the best parts about working at a place like Google, where we can continuously integrate the advancements that are happening in base capabilities and build on top of them. 
If at any point in time progress stalls, that's fine—we'll keep pushing the vision. But the fact that we're able to do this allows us to perhaps get to the overall vision and the place that we all want to go to sooner.
Nathan Labenz: (1:26:33)
I think that checks out to me. My intuition was that you guys can get there even if the base model didn't get any better. Obviously, all these things suffer from major definition problems, including what is AGI, what counts as superintelligence, and even just what's good enough to use. If we're defining a pause, hypothetically, as only being able to train foundation models at the general FLOP scale at which they've already been trained, or maybe one more order of magnitude beyond that, there is still a lot of low-hanging fruit in terms of all the different techniques that have proliferated over the last couple of years. I don't think any single system has integrated all of those techniques into one model or one system at this point. So that feels like probably a couple of years' worth of work just to integrate all those different techniques, and then a lot of work of the sort that you guys are doing to really dial in performance. It does feel like we're pretty much to the point where it can happen even without massive further scaling. Now, I will also agree with you that it probably is undeniable that it would get there faster, and that's maybe where my "adoption accelerationist, hyperscaling pauser" position becomes at least partially incoherent. But I do love the work that you guys are doing, because I think we are headed—I don't know if you feel this, but to me it feels like we are headed for—especially last week we had the situational awareness manuscript published, and there's this sort of prophecy, which might in some ways become a self-fulfilling prophecy, of an AI arms race with China and this race to scale and the trillion-dollar data center and whatever. That feels to me like a recipe for an unstable world. I'm like, man, is there any way that we can avoid an AI arms race with China? I would really like to avoid an AI arms race with China. 
And I think that the work that you guys are doing is so important in that it demonstrates how much value there already is—in some cases, even from a spring 2022 model, but certainly from the latest models—how far those things can really take us and how much the practical utility depends on... not to say it doesn't depend at all on further scaling, but certainly even in the absence of further scaling or even with a model that's now two years old, with focused attention, with a commitment to really dialing in the performance, you can get so much value. I think that is something I want people to pay a lot more attention to: the fact that we already have a lot of power and we're still figuring out how to wield it effectively in these high-value domains. That's the big—obviously, I just want the AI doctor too, but in the sort of big picture of AI and the dynamics, are we going to race with China to some unknown superintelligence as fast as we possibly can and try to achieve decisive advantage over each other? Oh my god. I think it's really helpful to keep in mind that the AI doctor doesn't necessarily depend on that and that it could be with this level or maybe a little bit more or whatever, but also it's just dedicated work of people that are really committed to the problem that is going to make the difference—maybe even more and probably more so than the next level of scale. Because the next level of scale in all likelihood is going to be smarter, obviously, and can maybe do some more really crazy things, but you're still going to have that problem of, as you said, Khaled, earlier: behaviorally, you've got to dial it in. Reliability-wise, you've got to dial it in. A lot of the same problems are still going to be there even if the model is incrementally more capable just because of scale. So I love what you guys are doing. I really commend it. 
I think people should be much more aware of this line of work, and I appreciate all the effort and all the great results that you guys are delivering. That might be a good note to end on. Is there anything else that—and I appreciate all the time that you've given today too, not just today, but in your third appearance, Vivek—anything else you want to touch on before we call it for today?
Vivek Natarajan: (1:30:54)
No. I think I completely agree with what you said. I couldn't have summarized it any better, I think. Perhaps the only thing that would change if progress plateaus is what sort of moats exist for businesses that are built, or the go-to-market and commercialization strategies. But that doesn't detract from the fact that the kind of things we envision are truly possible. It'll just maybe take a little bit longer or take a different path. So in that sense, it's very exciting to know that such things are no longer in the realm of science fiction. And to me, again, it's a real privilege to be able to go deep and talk on specific topics with you, and I think you are an awesome communicator of AI progress. This is perhaps, as I said, the best podcast. And the key thing for maybe a lot of people to know is—I think it comes down to the fact that we need to have rational optimism about these things. And I think you, perhaps more than anyone else, manage to strike that balance. And so that's the key. I think we need everyone. We can keep doing the work, but if it's projected or hyped up in a different way, then I think that's also quite bad. I really hope you keep doing this.
Khaled Saab: (1:31:54)
Yeah, Nathan, I totally agree with what you said earlier. For me, it feels like it's our responsibility to try and bring this amazing technology towards something like healthcare or an AI doctor and democratize access to good quality healthcare for a lot of people. And yeah, it's been a pleasure also listening to your podcast in the past. Actually, funny story: you might be one of the reasons why I am here at Google, because I was listening to the Med-PaLM talks—you know, the podcast episodes that Vivek was on. And I learned a lot about those papers through your line of questioning, and I think that allowed me to maybe impress Vivek a bit to get in.
Vivek Natarajan: (1:32:39)
I didn't know that, actually.
Khaled Saab: (1:32:41)
Yeah. So thank you for democratizing AI education and doing this and giving us this opportunity to talk more about our work.
Nathan Labenz: (1:32:52)
Cool. That is awesome to hear. I really appreciate it. Khaled Saab and Vivek Natarajan, thank you for being part of the Cognitive Revolution.
Vivek Natarajan: (1:32:59)
Thank you.
Nathan Labenz: (1:33:00)
It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.