Google’s Med-PaLM and Med-PaLM 2 with Vivek Natarajan

Nathan and Vivek Natarajan discuss Med-PaLM, Google's medical AI, and its potential to democratize access to healthcare knowledge.

Watch Episode Here

Video Description

Nathan sits down with Vivek Natarajan, research scientist at Google Health. Vivek leads the Google Brain moonshot behind Med-PaLM, Google’s flagship medical large language model, featured in The Economist, Scientific American, CNBC, and Forbes. In this episode, they discuss the foundational models that Vivek and team built before Med-PaLM, the techniques used to develop Med-PaLM, which will be of interest to anyone developing AI systems for high-stakes use cases, and the potential for Med-PaLM to equalize access to medical knowledge and care.

This episode is part of a series centered on talking to the people at the cutting edge of building AI-driven solutions in medicine.

PODCAST RECOMMENDATION:
YouTube: @UpstreamwithErikTorenberg
Audio: https://link.chtbl.com/Upstream

LINKS:
https://sites.research.google/med-palm/

FEEDBACK / COLLABORATE WITH NATHAN:
TCR@turpentine.co

TIMESTAMPS:
(00:00) Episode preview
(03:43) The story of how Med-PaLM came to be
(09:41) Building Med-PaLM’s infrastructure
(13:10) The US medical licensing exam as a measure of AI progress
(15:23) Sponsor: Omneky
(18:17) Practicality of benchmarking in real-world usage
(21:39) Overcoming the shortfalls of Flan-PaLM with Med-PaLM
(25:08) Choosing to use soft prompting over few shot prompting
(30:36) The process of training Flan-PaLM
(37:31) A curriculum approach to soft-prompting
(38:43) Layperson vs expert interactions with LLMs
(43:54) How did the Google team facilitate user exploration of the model’s capabilities?
(46:58) Shift in techniques from Med-PaLM to Med-PaLM 2
(50:21) Using different prompting strategies with Med-PaLM 2
(57:33) Is Med-PaLM 2 preferred over clinicians?
(01:02:28) Will there be a multimodal version of Med-PaLM?
(01:04:52) Breakthroughs required for AI to further advance human potential
(01:10:23) The Med-PaLM business plan
(01:12:08) Is there a vision for a consumer product?
(01:15:46) The pros and cons of pre-training a model
(01:19:45) Vivek’s favorite AI products
(01:21:01) Would Vivek get a Neuralink implant?
(01:23:08) AI hopes and fears

TWITTER:
@CogRev_Podcast
@vivnat (Vivek)
@labenz (Nathan)
@eriktorenberg (Erik)

Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

Music Credit: MusicLM

More show notes and reading material released in our Substack: https://cognitiverevolution.substack.com/


Full Transcript

Transcript

Vivek Natarajan: 0:00 It just feels like the opportunity of a lifetime impacting the world in a safe and beneficial manner. Leverage technology and AI to improve the health of millions of people and help people reach their true potential.

Nathan Labenz: 0:10 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. Welcome back to the Cognitive Revolution. Today, we're closing our short series on AI in medicine with Vivek Natarajan, AI researcher at Google and one of the lead authors of the groundbreaking Med-PaLM and Med-PaLM 2 papers. In this conversation, we'll track the amazingly short timeline that the field has taken from just better than chance performance on medical licensing exam questions just 2 and a half years ago to expert level performance that Google is now beginning to commercialize today. We start with an overview of the foundation models that Vivek and team were building on, PaLM, Flan-PaLM, and PaLM 2, before diving deeply into the details of Med-PaLM and Med-PaLM 2. The techniques used to develop Med-PaLM are of high interest given their data efficiency and conceptual generality. And the pains that Vivek and team have taken to validate its outputs, which go beyond benchmarks and endeavor to truly understand the utility and shortcomings of specialist medical models, are instructive for anyone developing AI systems for high stakes use cases. While Vivek and his coauthors described Med-PaLM as inferior to clinicians as recently as last December, today, Google describes Med-PaLM 2 as outperforming human clinicians on 8 out of 9 dimensions evaluated. As Vivek says in our conversation, the trend is obvious and it's hard not to be excited about the potential. For Vivek, who grew up in rural India where access to health care often came at great cost, if it was available at all, the imperative to not just develop such systems, but to deploy them broadly so as to equalize access to medical expertise is deeply personal. Toward the end of the conversation, we turn to Google's business plans with the newly announced Med-PaLM API, and we imagine what might be in store as medical AIs become ever more capable and also multimodal. One note before we get started, last time, I invited anyone who might be interested in attempting to reproduce some recent LLM benchmarking results to reach out to me. One person already has, and we are beginning to collaborate on a project. With that success in mind, I'd love to invite anyone who'd be interested in reading drafts of my AI analysis megathreads to reach out as well. I have a number of drafts in progress and would love to get some feedback from interested readers. You can contact us at our new email, tcr@turpentine.co, or just DM me on Twitter where I am @labenz. Now, I hope you enjoy this conversation with Vivek Natarajan. Vivek Natarajan, welcome to the Cognitive Revolution.

Vivek Natarajan: 3:17 Great to be here, Nathan.

Nathan Labenz: 3:19 Congratulations on an amazing run of papers. You are one of the project leads, lead authors on the Med-PaLM series of papers, which has announced some incredible progress several times over the last year, most recently with last week's Google IO, where you guys even are starting to announce some steps toward commercialization. So I'm really excited to get into all this with you and dig into how it happened, the brief history, the techniques that you guys are using, the progress that you've made, and especially really want to give our audience a sense of the pains that you've taken and the ways in which you've sought to validate the performance of the Med-PaLM model. So maybe just for starters, can we go back to the beginning and tell the story of PaLM, where it came from, then on to Flan-PaLM and then to Med-PaLM. You've been at Google this whole time, and it's amazing that this has all happened in just a 2 year timeframe.

Vivek Natarajan: 4:22 I think you're spot on that the progress has been incredibly exciting, and especially the implications in settings like healthcare and medicine. I think that's quite profound, and generally for me personally, this is the most exciting time in history with all the progress that's happening over here and the possibilities. So if we were to look at where we are today, I think many people have mentioned this, but I would say it all goes back to 2017 with the transformer, the emergence of a general purpose architecture that is highly optimized for the kind of hardware that we have today in GPUs and can guzzle up data at scale. That has enabled, I would say, whatever progress and breakthroughs that we have seen so far today. And since then we have seen various different LM architectures emerge, different styles, encoder, decoder, decoder-only, but the key has been the transformer, the emergence of that architecture. And if we look at it, there hasn't been maybe much of a change in the transformer architecture itself. In many ways, people have kind of frozen that architecture and all the work is happening around it in terms of data and scale and compute and applications, but over 6 years, and 6 years is a very long period of time in AI and deep learning, the underlying base architecture has kind of remained static. I think that's testament to that original paper, and no amount of praise is enough for the work that Ashish Vaswani and others did. So yeah, so we have the transformer building block, and then I would say the next big thing or breakthrough was GPT-3, undoubtedly from OpenAI, showing that decoder-only transformers trained on internet scale data using this very simple next word, next token prediction objective can do some amazing few shot learning, albeit in natural language processing. I think that was a very big step as well, and PaLM kind of followed up on that. In many ways, the recipes are similar, the architectures are similar, and I think it's probably just specific details that changed, but in terms of the studies, I would say both of them are roughly comparable. The results are kind of the same. It's just 2 different systems. And I guess besides transformers and decoder-only large language models and the emergence of that, the third big thing was alignment and RLHF. So we've had language models before. I think people remember Microsoft Tay and the fact that it was released on Twitter and within a day or 2 went wild and crazy. But now we have ChatGPT, GPT-4 and many other language models out there. And while there have been maybe some incidents of models producing unexpected things, for the most part, I think the experience has been one of delight for most people, right? And that is all down to alignment with reinforcement learning from human feedback. The fact that we can control the outputs of these models and ensure that they are behaving in a way that is expected, that just invokes delight in people. So I think these 3 things are kind of what has led to where we are today with these large language models, like GPT-3, GPT-4 and PaLM and PaLM 2. And I would really say the work that we are doing in medicine with large language models generally is just building on the shoulders of these giants over the last 5 or 6 years.

Nathan Labenz: 7:43 PaLM in particular, and I just want to quickly layer these on because I do think the intellectual history of this, because it's so brief, it's worth kind of highlighting the key chapters, which you did a great job of there. But just to add in a couple myself as well. So the PaLM model, 540 billion parameters. I've always wanted to ask this question. It seemed like there's maybe parallel kind of research paths going on there where if I had to interpret from the outside, it seemed like Google sort of saw GPT-3's 175 billion parameters and was like, all right, we're going to do that one better. And then at the same time, there was kind of the Chinchilla line of research suggesting that maybe you didn't even need all those parameters, but they kind of both came out in a similar timeframe. Is that kind of what was going on in Google at the time?

Vivek Natarajan: 8:28 Yeah, that is my personal opinion because I wasn't involved in both of those studies, but I think you are accurate in that because we want to have these explorations, right? I mean, in terms of scale and what is optimal with respect to model size and data and compute and everything. And I think both those studies were helpful, useful data points. And that I think is influencing the next generation of these models that we are seeing that are, again, I think seemingly more powerful than anything yet. So, yeah, I think those explorations are both cool, and that's possible at a place like Google. You have so many talented researchers exploring all these avenues.

Nathan Labenz: 9:00 One highlight, by the way, from the PaLM paper. I've considered this to be one of the great sort of buried leads in publishing history. There's a quote, and I think it's on, I've tweeted about this, I think it's on page 44 or something, where it says, PaLM outperforms the average human on the BIG-bench benchmark. And I was like, wow, that's quite a claim to have sort of midway down this paper, not at the expert human level, we'll get to that, but already above the average human level. I was watching this stuff very closely because I've been using OpenAI's products extensively and really understanding them. And of course, everybody has this question of is Google keeping up? Are they behind? How does their stuff compare? So I was looking for these little clues in the data and coming across some of these gems. When PaLM gets done, is it a matter internally at Google of, how does it work from there? There's been all these kind of spinoff projects, right? You've got first, maybe not even necessarily a spinoff, but kind of a continuation in Flan-PaLM, where we get to the instruction tuning. And then you've got the Med and the Embodied for robotics. How much infrastructure do you guys get as other product teams? Do you get a model that's kind of ready to be served up or ready to be tinkered with in a way that I, as an OpenAI customer, get convenient access? Or are you wrangling your own kind of server infrastructure?

Vivek Natarajan: 10:33 It's more...

Nathan Labenz: 10:33 Here's the weights, good luck, you guys can do what you want to do with it.

Vivek Natarajan: 10:37 Yeah, I can't really talk too much about the internal policies over here. And I think it really depends on which application, or product team, or whatever research that we're bringing over there, on top of these models. One thing I would say is that the infrastructure is also constantly evolving, ensuring that model serving and inference is optimized, especially as we scale up these models on both sides. And they mean actually quite different things. So the software is evolving, the hardware is also evolving, but on longer time scales. So it's not a static thing that we take on and build, but rather we get these pretrained model weights and then these libraries, and then you figure out how to make best use of it. And then 3 months down the line, maybe you see something even better and then you figure out, okay, do I continue with this one that I've built on or do I transfer everything that I've done onto this new system that seems even better and more promising. And so that's kind of the question that we wrangle with all the time. But yeah, it's not something that I would say is static, and definitely that is true for the models themselves, but also the underlying infrastructure and everything that we use for training these models as well.

Nathan Labenz: 11:48 Yeah, interesting. Everything kind of advances on all fronts. I feel like whenever I say this to somebody who's involved with research in a very deep way, they say, well, no, that's not quite true, but it always kind of feels to me like everything is working. It just seems like people are putting out so many papers that there's almost not that much time left in the calendar for things that didn't work.

Vivek Natarajan: 12:15 I think sometimes I am surprised that the infrastructure, the way that it is put together, there are a lot of amazing engineers at Google and legendary engineers who have built this infrastructure over about 10, 15, 20 years, but sometimes I'm just surprised that this works because it feels like sometimes it's just taped together and the system could crash and burn anytime. So that aspect is there, but I think the other advantage of being internal to Google, instead of being a customer that is interacting with these models through an API, is just that you get to see the nuts and bolts. And you can then go and tinker with it on your own set of models. So you see the raw form. And so I think that helps quite a lot, especially in domains like medicine, where you need that specialization because of the nature of the domain and the data that you're using.

Nathan Labenz: 13:02 So perfect transition then to Flan-PaLM. So that's the instruction tuned one. If I understand correctly, no RLHF in that model, right? Just kind of example based instruction tuning. And yet, at the time, state of the art performance on the USMLE, the US medical licensing exam. We've covered this a little bit over the last couple episodes, but could you maybe just give us a little bit of a sense for what that exam is like, who takes it, how they study for it, the kind of depth of knowledge that it requires, and maybe what a passing score looks like on an exam like that.

Vivek Natarajan: 13:41 So I believe this exam, if you're training to be a medical doctor and want to practice in the US, you need to pass this exam. And there are different steps or stages over here. And the kind of questions that we generally see is a pretty large vignette with some description: it could be symptoms, it could be patient metadata, it could be other necessary information, and then you will have to use that information with anything else that you know, and deduce and infer and reason and retrieve appropriate knowledge and then come up with a final answer that is required. That often has considerable ambiguity, and sometimes to come up with the right answer you may have to do this process of elimination because you have these multiple choice answers. So, I would say a good chunk of questions are like that. Some of them are more direct where they basically test your knowledge retrieval, I would say simpler, and these questions typically tend to occur in step 1, although I may be wrong over here because I think that's the easier one. So I would say, yeah, these questions are quite challenging for humans because they test a lot of different aspects, especially knowledge retrieval and then reasoning using that. And at the point of time when we were building out these systems or evaluating these systems on these benchmark data sets, they seemed like a good challenge for AI as well, the kind of LM models and AI systems that we had at that point of time, given where the scores were on these systems and given the abilities of these models. But now when I reflect back on it, I do think that the kind of AI that we have today, the kind of intelligence that it is, for that, maybe this is not the best measure. And so we probably need to think about something else.

Ads: 15:24 Hey. We'll continue our interview in a moment after a word from our sponsors. I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description.

Vivek Natarajan: 15:53 But regardless of that, I think this was a very useful measure of progress and showing that these models can first reach passing score and now reach performance close to the level of expert test takers. I think that's great. When I say passing score, it's typically around 60%. And when I say expert test takers, I think it's in the top quartile, 85%-ish score. And one thing I would want to clarify is we evaluate this on a dataset that has representative USMLE-style questions, but it's not actually an exam that we take from the board and then evaluate it on. So it's a representative dataset. I think you can roughly equate performance, but it's not the same as saying, well, this AI system has passed the USMLE. And so in our papers and all the media articles that have been written about it, we've been very careful in saying that this is not equal to passing the USMLE. Rather, representative questions is what the model is evaluated on.

Nathan Labenz: 16:47 Is that a step that you take to make sure you're not seeing memorization sort of artifacts? Or what's the reason not to just use the actual USMLE questions?

Vivek Natarajan: 17:01 Yeah, I think it could be useful. But what we are maybe more interested in is not passing the USMLE, but rather these models being actually useful in the real world. Make sure these applications are useful. Also, if you look at it, both in the original Med-PaLM paper and the Med-PaLM 2 paper that we have, we've decided to focus on grounded use cases, which is what sort of questions do people who have medical information needs ask, typically in the context of search engines. And we are trying to evaluate our systems and benchmark it against other LLMs as well as what, say, physicians would do. We want to be more use case grounded. And I think the USMLE, this is my personal opinion, it was good from a PR perspective to show that these AI systems can get over there. And several other systems have shown that. But that is not necessarily a good indication of real world utility, especially in medicine, the kind of challenges that we face.

Nathan Labenz: 17:59 Yeah, that makes sense. So it's essentially another example in the increasingly long list of benchmarks, in this case a human-focused benchmark as opposed to an AI-focused benchmark, that, while indicative, falls short of a standard that is more aligned or close to what is actually practically useful.

Vivek Natarajan: 18:24 And as you start doing these deeper evaluations or start building these models in settings that are more reflective of the use cases, you start to see these gaps. I think that was best reflected in the Med-PaLM paper where we showed that Flan-PaLM, the model that was instruction fine-tuned using Flan, did great on the USMLE set and other benchmarks at that time and significantly outperformed others. But then as soon as we started giving it consumer medical question answering and had that evaluated by expert physicians and also by non-expert lay people, we started seeing that there were several gaps on multiple different axes pertaining to factuality, bias, reasoning, recall of knowledge, and even along axes such as utility and helpfulness. And so it became very apparent to us that passing these benchmarks or getting state-of-the-art performance on these benchmarks, which are kind of narrow and limited, is not the same as real world utility in actual workflows. And I think the story again kind of repeats itself with Med-PaLM 2 as well, where we're building on top of PaLM 2, and that again, on a bunch of different benchmarks achieves state-of-the-art and has been used in many different applications already. And that's particularly impressive on multilingual tasks, code generation and other stuff. But then when we do this side-by-side comparison, both on the general purpose model and then Med-PaLM 2, which is fine-tuned specifically for the medical domain, we see significant improvements again on all these axes, and particularly on evidence of good medical reasoning or incorrect medical reasoning. We see almost a 9x improvement once you do this fine-tuning. So that shows the importance of specialization. And also that shows the limitations of benchmarks that we are using. General purpose systems are very hard to evaluate. Narrow systems, where the use cases are well defined and the intended users are very well defined, are much easier to evaluate. But for general purpose systems, I think you need to go deep into the intended use and the actual workflow to get a handle.
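(To make the multi-axis evaluation idea concrete: the sketch below shows one way ratings along independent axes such as correct recall, omission, or possible harm could be aggregated. The axis names, data structures, and numbers here are illustrative assumptions, not the Med-PaLM team's actual rubric or tooling; the point is simply that each answer is judged on every axis independently, so the per-axis rates are not mutually exclusive and can sum to more than 100%.)

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative axes only; the Med-PaLM papers define their own rubric.
AXES = ["correct_recall", "correct_reasoning", "incorrect_recall",
        "omission", "possible_harm", "evidence_of_bias"]

@dataclass
class Rating:
    """One physician rater's judgment of one model answer.

    Each axis is judged independently, so an answer can be flagged
    on several axes at once (the categories are not mutually exclusive).
    """
    answer_id: str
    flags: dict  # axis name -> bool

def axis_rates(ratings):
    """Fraction of answers flagged on each axis (rates can sum past 100%)."""
    counts = Counter()
    for r in ratings:
        for axis, flagged in r.flags.items():
            if flagged:
                counts[axis] += 1
    n = len(ratings)
    return {axis: counts[axis] / n for axis in AXES}

# Toy usage with made-up ratings.
ratings = [
    Rating("q1", {"correct_recall": True, "correct_reasoning": True,
                  "incorrect_recall": False, "omission": True,
                  "possible_harm": False, "evidence_of_bias": False}),
    Rating("q2", {"correct_recall": True, "correct_reasoning": False,
                  "incorrect_recall": True, "omission": False,
                  "possible_harm": True, "evidence_of_bias": False}),
]
print(axis_rates(ratings))
```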

Nathan Labenz: 20:31 Yeah, I think I would put that on a poster because that's definitely a big theme that keeps coming up in all these conversations that I'm having and just my own work as well. Validation of language models is really hard. GPT-4 is only preferred to 3.5 by a 2-to-1 margin. 70-30 is what they report in the technical report. And so often the sort of initial quantitative measure that you might get, if you just look into, just read through the chain of thought that it's spitting out, you kind of realize that your initial take on the entire situation was just kind of flawed. And I've seen that so many times. So you kind of alluded to it earlier, but I wonder if you could just give a little bit more kind of intuitive sense for, okay, you've got this Flan-PaLM, it's instruction tuned, it's basically hitting better than ever USMLE performance or USMLE-grade question performance, but you're finding that these things are falling short. If I'm a user of that, how is that falling short? And then from there, we can get into how you started to overcome that with the Med-PaLM project specifically.

Vivek Natarajan: 21:52 At least with Med-PaLM there, one critical assumption that we made was that given the PaLM model and the Flan-PaLM model were trained on Internet-scale data, we assumed that the knowledge required to kind of do well in the benchmarks that we were evaluating on, and these included these USMLE-style questions, but also the medical information, consumer medical questions and datasets as well. We assumed that that knowledge was already encoded in the weights of the model. And so the challenge was to be able to prompt or elicit the right response from the model given this domain, teach the model how to maybe reason more about it, think more critically, do better deduction and reasoning, like a clinician basically, if at all possible. But the assumption was the knowledge was there. And the second thing was that we could teach the model that sort of behavior, without investing too much data. So that is what led us to using the techniques that we ended up using, which I'm sure you want to talk about next, which was instruction prompt tuning. So the goal was to basically, okay, you know everything that you need to know over here. I'll tell you how to use that information to come up with the right answer and move around this space.

Nathan Labenz: 23:05 I would definitely encourage everybody to look at, I've tweeted some of these graphs and they're obviously in the papers, but I think you guys have done a really nice job with both kind of the validation scheme conceptually, and also some of the figures just really make extremely clear what is going on. And that's something I think to be celebrated, praised. I can still kind of picture the original Med-PaLM graph where it sort of shows side by side the rate at which clinicians and Flan-PaLM and Med-PaLM in comparison kind of do the right thing, so to speak, and then also do something wrong. And so there's multiple kinds of different right things, like retrieving the right information, demonstrating that you understand the question, and then there's ways to get it wrong too, like recall bad information or have something that's omitted that could have been important, right? So you've got kind of these multiple ways to be successful and ways to be problematic. And I think it's just such an important understanding for everyone that those are not mutually exclusive. So what I really like about that graph is that they add up to more than 100% and it's just so critical to kind of understand that these complicated responses, that's true for the doctors as well, right, as the AIs. Everybody can be both kind of wrong and right at the same time. So I think that's really just an important insight and very well captured in those graphs. So I guess I kind of, the Med-PaLM paper sort of answers my original question in terms of what kind of access do you guys have? Because you are taking this very data-light approach. And I'm very curious about how you understand the idea that maybe few-shot prompting couldn't work for this. It didn't seem like it was going to get you where you wanted to go. And so you ended up with this soft prompt technique that, I guess, first of all, I should just ask you to explain what soft prompting is, but it's amazing how little data it requires. And I'm really interested to understand that juxtaposition against few-shot prompting.

Vivek Natarajan: 25:22 Yeah. Before I go into that, I'll maybe quickly give credit for a lot of the illustrations and figures that we have in the Med-PaLM paper. Sheikh Yeung, one of the coauthors on Med-PaLM, is brilliant at this. And then the evaluation rubric, it's only possible because we have this amazing interdisciplinary team at Google that allows us to think critically and holistically about this entire problem and actually deploy in those settings and workflows. And again, a lot of the credit goes to Doctor Alan Karthikesalingam, who was also reasonably famous on Twitter. You don't need to say anything more than that. Right. Yeah. And then going into what you're actually asking about, which is soft prompts. Yeah, so again, just repeating what I said before, the assumption was the knowledge was encoded. Knowledge that was needed was already encoded in the weights of the model. And so it's more about teaching the model how to elicit out that information in the right way and use it to answer the question and then more stylistic aspects around how to convey that information. And, for example, if you look at how clinicians respond to questions, they wouldn't say something objectively, they wouldn't say, oh, you 100% have this or 100% don't have this. Rather, there will always be a degree of uncertainty that has been communicated. You may have this, but you need to do this additional thing. Or you may have this, but I need these follow-up tests. Or we need this additional information. And so if you want a model to be used in such settings, where you're providing medical information, you also want the model to learn that behavior. And so that is, I think, somewhat orthogonal to having the knowledge. So you can be a very smart student. And you can learn all these books about medicine and have all the knowledge encoded in your brain. But then if you can't apply it in a way that's actually useful, then that's no use. I would say there's no point in studying all that information and everything. And so the goal with soft prompt tuning at a high level was kind of doing two things. One is, in this vast space of knowledge that is encoded in the weights of the model, shining a light on which section of knowledge we need to actually use. So if you think about this parametric model knowledge space as a huge library, then this very specific section might be about medicine. And so basically shining a light on that and telling the model, this is the information that we need to use. And then secondly, it is more about the stylistic elements around how to convey information, how to convey uncertainty, and then how to ensure that your answer is more complete. And you know, there is less hubris in your answers, maybe. And so those are the stylistic elements that I think we were trying to add with soft prompt conditioning. And to do that well, you need expert demonstrations. You want to be able to learn that from actual clinicians and learn from how they actually formulate these answers to these medical questions when asked by patients and non-expert lay users and so on. Yeah, we worked with our internal team. It was again, a pretty diverse team from a bunch of different countries, all expert trained, and then we collected responses from them, the styles in which they would have these responses to these questions. And then we use that to further fine-tune and align the model.
But then the specific method of fine-tuning, as you said, was this instruction prompt tuning, where there's the soft prompt vector that we are basically learning through gradient descent. And the nice thing about that is, it does not necessarily mean anything as in these are not tokens corresponding to language words, or anything, but it's kind of in the same space. So it helps with the conditioning and anchoring of the subsequent words that are generated. And, yeah, because again, you're learning only on the order of a million parameters. It's very fast. And so it's also very data efficient. So pretty much in a Colab, as long as you have the right number of chips to train on, with the amount of data and expert demonstrations that we used, which was actually not that big, on the order of a few thousand examples, maybe even less than that actually, the runs that we did in the paper took a few hours. And so it enabled really fast iteration. One of the other things that I would maybe very quickly point out is, I think the ability to perform well with only prompt tuning, that's also an emergent property of scale. So we've seen that smaller models, because they don't encode enough information in their weights, are also not that effective with prompt tuning, whereas larger models encode all that information, and what we actually need is guidance on how to use that information appropriately. With prompt tuning, they become even more effective. And so that's why we decided to use it. Small amount of data, we wanted to move fast and not be blocked on the compute. And then, yeah, that just made prompt tuning a very natural choice for us to iterate.
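(As a rough illustration of the instruction prompt tuning recipe described here: a frozen base model, a small learnable prefix of soft token embeddings prepended to the embedded hard prompt, and a standard next-token loss computed only on the expert demonstration. This is a minimal sketch using a small public stand-in model, since Flan-PaLM is not publicly available; the model name, prefix length, learning rate, and example texts are assumptions for illustration, not the actual Med-PaLM code.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; Med-PaLM used a frozen Flan-PaLM, which is not public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)          # freeze every base-model weight

embed = model.get_input_embeddings()
num_soft_tokens = 100                # illustrative; only ~O(1M) parameters are learned

# The only trainable parameters: a short prefix of "soft" token embeddings,
# warm-started from real token embeddings so they live in the language space.
soft_prompt = torch.nn.Parameter(embed.weight[:num_soft_tokens].clone())
optimizer = torch.optim.Adam([soft_prompt], lr=3e-3)

def step(question: str, expert_answer: str) -> float:
    """One gradient step on a single (question, expert demonstration) pair."""
    prompt_ids = tokenizer(question + "\nAnswer: ", return_tensors="pt").input_ids
    answer_ids = tokenizer(expert_answer, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)

    # Prepend the learned soft prefix to the embedded hard prompt.
    hard_embeds = embed(ids)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), hard_embeds], dim=1)

    # Next-token loss only on the expert answer; prefix and question are masked out.
    labels = torch.full(inputs_embeds.shape[:2], -100, dtype=torch.long)
    labels[0, -answer_ids.shape[1]:] = answer_ids[0]

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()                  # gradients flow only into soft_prompt
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage with a single hypothetical demonstration.
print(step("What should I do about a persistent mild headache?",
           "It may have several causes; if it persists, please see a clinician."))
```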

Nathan Labenz: 30:18 That's fascinating. And I think you explained it well, but I just want to dwell on it for a second more because I think it is such a profound trend that keeps popping up again. And there's a couple of things I'm not super clear on, and I'm not great with linear algebra theory. But is this, let me just try to describe the setup. You tell me if I get it wrong. So first of all, you've got Flan-PaLM. It's frozen. The whole block of it is not going to change during this exercise, but you're going to fine-tune instead just this very small auxiliary. Would you even call it a model? Is it ultimately a prefix or is it a matrix that sort of transforms the input? What is the nature of the way that the thing that you're learning interacts with the runtime input?

Vivek Natarajan: 31:11 Yeah. It's more of a prefix to the hard prompt that comes up, or to the encoding of the hard prompt that comes up. And so essentially, it acts as conditioning and anchoring, I would say, rather than a matrix transformation, though that enables a few different kinds of transformations and we could do both. But what it essentially does is, it doesn't impart any net new knowledge to the model, but rather it just helps with this conditioning and teaching the style of the domain. So that is all that we need over here, and that is what we aimed at over here. And so, yeah, we debated whether to call that a new model or not, but the fact is you do have 1 million additional parameters, and that's actually bigger than most models we were training even 3 or 4 years back. I mean, it's okay, it's fine. This model does have new parameters, but let's not get hung up on the name. I think that helps. Yeah, I mean, I think this idea has been around for a long period of time. Adapter networks, even things like FiLM, for example, in vision and in medical and visual question answering. The difference is just where you're applying it and how we're applying it. But the idea of these adaptive weights and keeping a big frozen model and using that as conditioning has been around for a long time.

Nathan Labenz: 32:26 Yeah, you mentioned the visual question answering. We had the authors of BLIP, BLIP-2, and most recently InstructBLIP on to talk about that. And that was probably the first place where I really started to see how powerful and generally applicable this is likely to be. This is another great example. So the setup is you've got your full Flan-PaLM frozen model. Now you're running a small number of examples. Does the paper say, are we talking double digit number of examples? How few are we talking here?

Vivek Natarajan: 33:02 I think it's low triple digits. The scores that we reported in the paper were actually trained using just a low triple digit number of examples.

Nathan Labenz: 33:09 So that's insane in and of itself. I mean, that's a pretty small number. Well, it's also not insane; it's consistent with stuff that I've seen too. Just fine tuning as a retail customer of OpenAI, often you don't need that many samples. I think they recommend 500 as a starting point. So it's pretty consistent with that general guidance. You're now setting up prompts and expert written, grade A, best-you-can-find examples, right? And then just doing a pure next token prediction loss function and just modifying this prefix instead of modifying the whole model, just this sort of prefix which exists in the language embedding space, if I understand correctly. And so in that sense, we can sort of think of it as something that should be meaningful to us in some way, language is meaningful, although it's this sort of dark region of the space that we can't actually directly access with language.

Vivek Natarajan: 34:18 I think that is spot on. I mean, so you have these prefix tokens and their representation that is in the language space, and then alongside that you have the representation of the hard tokens. I would think about these soft tokens as more abstract conditioning, and those hard tokens or instructions that you have as more hard constraints on the model as to what it will generate next. And one other way to think about this is, I think these soft prompts give you a general idea of how to answer, whereas the instruction for that specific question or that context gives you a more specific and detailed instruction. So it's a combination of a general instruction and a very specific instruction, and you use them together to come up with a final answer. I think you raise a very good point. And I think if I were to have a little bit more time, that's what I would be spending my time on, which is to see whether there is even a human interpretable notion of what is actually being learned in these soft prompt vectors. We haven't had much time to do that study, but I think if we can see, okay, are there some clusters emerging over here in these vectors that we can anchor to specific tokens in language, and if they are interpretable, I think that'll be amazing. Or if someone who's listening to this wants to do this, please do as well.

Nathan Labenz: 35:36 This might be different from what we heard from the Blip Squad. It didn't sound like they found much there that was, I don't necessarily know how hard or exhaustively they looked either, because they've been busy publishing paper after paper too. But

Vivek Natarajan: 35:53 at least

Nathan Labenz: 35:53 the preliminary exploration that they did, it did not sound like they found anything where they were able to sort of say, in their case, it's image, right? So if a picture's worth 1000 words, they were not able to give me the thousand words that the picture sort of represented. So one thing I hadn't maybe understood there is, so there's the soft prompt that's the general, some sort of position in language space that we can't access directly through words. But then that's followed by explicit instructions, answer this question in such and such a way. Okay, that's interesting. So it's not a substitute, it's an and. Okay. And then all that is ultimately interpreted by the model as language, right? There's no separation really between how that's been processed through the layers of the model.

Vivek Natarajan: 36:47 There is a lot of difference. And I think that is interesting. I think image domain is a little bit harder to do these kinds of explorations. I think in language, there's possibility that we'll have something interesting. Again, even if you don't end up finding that's also fine. I think we are barely scratching the surface in terms of understanding how these models work.

Nathan Labenz: 37:07 I wonder, you probably didn't have time to do this either, but I wonder if there could be a curriculum sort of approach to this where you might have multiple soft prompt prefixes that are bedside manner upgrade, relative to clinical detail upgrade, anything like that explored?

Vivek Natarajan: 37:29 I don't want to say too much, but that would be there in some upcoming work. It's not directly using soft prompt vectors because in Med-PaLM 2, we moved away from that. And primarily that comes down to having a more compute optimal model and having more data. But we are exploring that direction in a slightly different manner, which enables us to control the outputs of these models according to the different axes that we talked about over here, bedside manners, factuality, safety, and other stuff, but it's more dynamic and at runtime. We're definitely doing that because as you can imagine, you may want to have that control, and different end users may want to have different outputs along one of these axes. So they may want to have a button or a knob that you can change to give slightly different experiences. You want to give that control to users.

Nathan Labenz: 38:16 Yeah. Exploring the latent space of text with these sort of abstract. Well, it sounds like you're doing it a little differently, but yeah, very interesting. Layperson evaluation. That was the other thing that I wanted to get into. I think it's a case of everything that's old is new again. Language model evaluation in 2020, 2021 was sort of international contractors picking A versus B. And now we've shifted dramatically upmarket; from what I understand, Scale AI has dozens of PhD evaluator positions on their website last I checked. You also have the expert evaluators, but then you circle back to the layperson evaluation. So tell me what motivated that and what you learned from it.

Vivek Natarajan: 39:06 I think it's helpful to think about LLMs as platform technologies having a spectrum of use cases. And that means a spectrum of end users who would be interacting with the model in very different ways. And even in medicine and life sciences, you may end up having not just doctors and clinicians, but also people who are maybe more on the administrative side, maybe people who are medical researchers or life sciences researchers, and then definitely non-expert lay users who are simply searching for medical information online. They maybe do that today with a search engine, and they may want to do the same with a large language model as well. So when you're doing these evaluations, and if you look at question answering itself, it's just a very broad topic. Anything pretty much can be framed as an instruction slash question and then answer, right. And so that basically subsumes any task. You want to try to cover as many evaluations as possible. And we thought one of the most important applications would be where you directly have these models interacting with end users for their medical information needs. And this is a little bit of my own personal opinion, but for me personally, and then for a lot of us in the Med-PaLM team, access to healthcare matters a lot. I personally grew up in parts of India where going to see a doctor, for most people in the nearby towns and villages around me, meant walking 30 miles in extreme heat, or giving up a day's wages, or going without food. And so that was simply not an option for many people. And that would in turn mean many people would actually go their entire lifetimes without seeing a doctor. And that in turn meant adverse events occurring later over a lifetime and in turn translated to lower life expectancies, all sorts of things. But now with these technologies, and the arc of progress over here, in technology and AI in particular, we can now start imagining a pocket world class general practitioner that's scaled up to billions of people, bringing world class care to everyone. So this for me personally has been a dream for a long period of time. My undergraduate thesis back in 2013 was an app called ask the doctor anytime anywhere. That was using non deep learning technologies and did not work really well. But now I feel like we can realistically start thinking about really putting a world class experience into the pockets of millions of people worldwide for all their medical information needs. So that is the subtext of this evaluation, although I think there are other people, and this is mostly my personal opinion, other people who have a different take over here. But the fact remains that non-expert users are going to be interacting with these systems. And one aspect that we have generally seen is the more you know, the more you generally tend to get out of these systems. And sometimes the less you know, the more harmful or unsafe these systems become. So we want to ensure that when people who are not experts are exposed to these systems, it doesn't act in a way that affects their well-being or whatever. So that is the underlying subtext of this evaluation. And so we wanted to directly understand from such people who are looking for medical information, whether they found it helpful, useful, directly addressing the intent of what they were getting at, whether the interaction was useful at all. And I would say we were barely scratching the surface over here. There's a lot more to be done.
But I think pairing that sort of evaluation with expert evaluation on axes around factuality, medical reasoning, and bias, I think that's going to help us get where we want to. But again, very early days over here, a lot more studies needed with better evaluation.

Nathan Labenz: 42:51 I love the vision of that too. It's the simplest one I find to just come back to always when people are, why don't we just forget about this AI stuff or isn't it going to be more trouble than it's worth? I'm like, I think the people that don't have access to doctors are really going to want to have a word about that. And by no means, my regular listeners will know that I have my deeply held concerns about the general AI future, but that is such an incredible promise that I really want to keep my eye on the ball with that as well. So in this process, I can imagine you could run it different ways, right? You could run it in, for one thing, multiple languages. Your background, obviously, I'm sure multiple languages would be a critical component to actual deployment. And then I can also imagine asking laypeople to just evaluate outputs versus giving them some ability to interact with the system, phrase the questions in their own way, maybe even have multi turn interactions. Yeah. Tell me just a little bit more about how you approach that in terms of the languages and sort of how much the individuals actually got to explore.

Vivek Natarajan: 44:13 So I think the evaluation was informed by the capabilities of the model itself. And if you look at it, PaLM wasn't necessarily optimized for multilingual applications, whereas with PaLM 2 there's been a step change on that front, as all the evaluations show. And then again, with respect to interactions and multi turn conversations, PaLM wasn't optimized for that. That's more the LaMDA system. Yeah, LaMDA actually does impressively well on those things, but we weren't building on that system, we were building on PaLM. So, yeah, I would say basically the capabilities of both systems and the desire to keep the evaluation simple and scalable while giving us enough data, that's what I think informed the choices that we made. But I think we are improving the capabilities of these systems to interact, engage in dialogue, answer questions in different languages, and all that is coming very rapidly. And yeah, it's just that for the first version of the paper, we wanted to keep that simple and not have any feature creep, which as you know, can take down projects. But now as we are opening up these models more broadly, hopefully, to different kinds of collaborators in academic settings and medical research settings, we can do these studies as well and understand the capabilities and limitations of these systems.

Nathan Labenz: 45:46 So that's a perfect transition, I think, to PaLM 2 and Med-PaLM 2. And just as a quick refresher too, the timeline on all of this is just insane, right? PaLM, if I remember correctly, first paper was April 2022. And then Flan-PaLM is maybe August, September 2022. You then follow up with Med-PaLM in December 2022. And then now we're on PaLM 2 and Med-PaLM 2, announced a month ago and the paper just released on the arXiv server this week. So ever so slightly over a year from first announcement to this announcement, which just, everybody should ponder that for a minute. Definitely didn't take long. So, if I understand correctly, a new base model is essentially the core shift. And then it sounds like you've also shifted your methods. I don't have any insider information, but it certainly sounds like it's probably fewer parameters, more intensive training that obviously leads to more efficient inference. And then that opened up different techniques. So can you tell us about the different technique that you use to customize to the medical domain this time around?

Vivek Natarajan: 47:13 Yeah. Again, without too many details over here, the fact that this model was maybe more compute optimal allowed us to do end to end fine tuning more efficiently, so we just decided to go with that one. And coupled with that, we now have a little bit more data, not a lot more data. I mean, if you look at it, we're still committed to using mostly public datasets. And then again, the expert demonstrations number in the few hundreds. So it's just on the order of a million tokens. Not that big, but just the fact that now you have a more efficient model meant we decided to do fine tuning and update all the weights of the model. So I think that is basically the key difference, switching over to a more powerful base LLM and then doing this end to end fine tuning, because the model is simply more efficient to train and fine tune.

Nathan Labenz: 48:08 And so the output of this is amazing. I mean, you've got now 85% accuracy on the same question set, right? And this is the USMLE questions, which is basically expert level. We don't have too many people that can outperform that, if I understand correctly.

Vivek Natarajan: 48:30 Yeah. I think that's likely in the top quartile of, well, doctors that take these tests. But again, I would say we should not anchor too much on this. I think by now, we should not be surprised that these models are solving these tasks or these questions as well as they are right now. One hypothesis that has been out there in the community, and on Twitter, is whether these questions or these benchmarks are contaminated in the training data. So that was a concern for us, because if that's the case, then these are definitely not a true measure of progress. So we explicitly spent time to see how much overlap there is. So it was not done in a hand wavy manner, which some other papers have done. And there we find, yeah, there is definitely some overlap, but it's a very small percentage, and for most data sets, it's less than 10%. And then the performance difference between the questions with overlap and without overlap is, I would generally say, not statistically significant. So that was kind of reassuring in the sense that this is not simply a case of memorization, but rather there is something more powerful emerging in these models. So that is great. That is definitely one measure of progress. But as I keep saying, that is not the end goal over here. The end goal is real world medical capability. And for that, we will do extensive validation in real world applications.
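(A minimal sketch of the kind of overlap analysis being described: flag test questions that share long word n-grams with the training corpus, then compare accuracy on the overlapping and non-overlapping subsets. The 8-word window, helper names, and toy data are assumptions for illustration, not the exact procedure used in the Med-PaLM 2 paper.)

```python
def ngrams(text: str, n: int = 8):
    """All lowercase n-word shingles of a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(test_questions, training_docs, n: int = 8):
    """Mark each test question that shares any n-gram with the training corpus."""
    train_shingles = set()
    for doc in training_docs:
        train_shingles |= ngrams(doc, n)
    return [bool(ngrams(q, n) & train_shingles) for q in test_questions]

def split_accuracy(correct, overlap_flags):
    """Accuracy on overlapping vs. non-overlapping questions, plus overlap rate."""
    def acc(keep):
        subset = [c for c, f in zip(correct, overlap_flags) if f == keep]
        return sum(subset) / len(subset) if subset else float("nan")
    return {"overlap": acc(True), "clean": acc(False),
            "overlap_fraction": sum(overlap_flags) / len(overlap_flags)}

# Toy usage with made-up data.
questions = ["A 45 year old man presents with crushing chest pain radiating to the left arm and jaw.",
             "Which enzyme deficiency causes phenylketonuria in newborns?"]
training = ["Crushing chest pain radiating to the left arm and jaw is classic for myocardial infarction."]
flags = flag_overlaps(questions, training)
print(split_accuracy(correct=[1, 0], overlap_flags=flags))
```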

Nathan Labenz: 50:03 Well, we're not at the end of your validation effort just yet. But maybe before getting to that, do you have a sense for how much performance boost there is from, say, for example, if we just had PaLM 2, let's say we had PaLM 2 and we just used our best prompt engineering versus a soft prompt approach versus a full end to end fine tuning approach, how much difference does that make? Can you quantify that?

Vivek Natarajan: 50:41 Yeah. I think this has been the subject of quite a bit of debate, whether you can take a model, a larger LLM such as GPT-4 or PaLM 2 or others, and then use them with simple prompting strategies, or do you need some specialization? And I would think anytime you do update parameters, that is specialization, and so prompt tuning and fine tuning are both in the same category for me personally. So that has been some sort of a recurring debate, both internally, but also with some folks that we really know and respect at OpenAI and Microsoft. And our take, and my take over here, has been that you would benefit a lot from specialization. And that again comes across not on these benchmarks, because as you can see, GPT-4 with simple prompting performs well, PaLM 2 performs well. But then when you start doing these real world evaluations, and this multifactorial evaluation where you granularly check factuality, possibility of bias and harm, evidence of recall, reasoning and all those aspects over there, that is when you see the gaps. And so even if you have a very general purpose model that seemingly has all the knowledge in the world, it may not necessarily know how to act in these settings, in the intended application scenarios and so on and so forth. This effort came about as a Brain moonshot program. So moonshots are basically a program where it's a more bottom up thing, where a group of researchers across the company, across Alphabet, kind of think these things should exist in the world, and then they come together to build it out. And so that's why, if you see, our team is an interdisciplinary team spanning Brain, DeepMind and other research organizations at Alphabet. And the thesis we had was that these models, such as PaLM, and other vision models, foundation models, they are very strong building blocks. They have general reasoning capabilities, and they have a lot of knowledge encoded in them just because of the scale of data that they are trained on. But now the next step that is needed before they are really applied in medicine and clinical settings is sending them to medical school. And that means not just training them on specialized medical data and corpora, but also exposing them to real world interactions and settings and allowing them to learn from feedback. And so that was the tagline of this moonshot: sending foundation models to medical school. And so that is how Med-PaLM and Med-PaLM 2 and other models that we also have, have emerged. And so the core idea, the core thesis over there, is you can have a general purpose model, such as PaLM 2 or GPT-4, that has very strong intelligence, similar to a human, right? I mean, we are generally intelligent. We can pretty much learn anything that we want to. But without that specialization or years of training, you don't actually get good at it. And medicine is a very specialized endeavor. And so you need to spend time learning about the domain, about the safety, the nuances, and how to interact with fellow doctors and with patients and other people in this setting. So that requires specialization. And so fine tuning is one sort of specialization. And that's how you should think about it. But then there are other ways to do the specialization. RLHF could be one way, but then there are others that we're doing. So that is the core thesis that we have. You can have a very strong general purpose system, but that's not enough.
You need this expert specialization, especially in this training.
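To make the distinction between prompting and parameter-updating specialization concrete, here is a minimal sketch of soft prompt tuning in PyTorch: the base model stays frozen and only a handful of prepended prompt vectors are trained. The toy model, dimensions, and training loop are illustrative assumptions, not the actual PaLM or Med-PaLM setup.

```python
# A minimal sketch of soft prompt tuning. A toy frozen "LM" stands in for a
# large pretrained model; only the soft prompt vectors receive gradient updates.
import torch
import torch.nn as nn

VOCAB, DIM, PROMPT_LEN = 1000, 64, 8

class ToyFrozenLM(nn.Module):
    """Stand-in for a large pretrained decoder; its weights stay frozen."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, input_embeds):
        hidden, _ = self.backbone(input_embeds)
        return self.head(hidden)

lm = ToyFrozenLM()
for p in lm.parameters():
    p.requires_grad = False        # the base model is never updated

# The "specialization": a handful of learnable prompt vectors prepended to
# every example's token embeddings (parameter-efficient, like prompt tuning).
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, DIM) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(token_ids, target_ids):
    tok_embeds = lm.embed(token_ids)                        # (B, T, DIM)
    prompt = soft_prompt.unsqueeze(0).expand(token_ids.size(0), -1, -1)
    inputs = torch.cat([prompt, tok_embeds], dim=1)         # prepend prompt
    logits = lm(inputs)[:, PROMPT_LEN:, :]                  # drop prompt positions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch: in practice these would be tokenized medical QA pairs.
tokens = torch.randint(0, VOCAB, (4, 16))
print(step(tokens, tokens))
```

In the real systems the frozen backbone would be a large pretrained LM and the targets would be curated medical data, but the parameter-efficiency argument is the same: only a small set of prompt vectors is stored per domain.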

Nathan Labenz: 54:15 So again, the results are pretty arresting. And I've just got the graph up here again from the most recent paper, another outstanding graph, I would say. Two parts. One just shows, again, the up and to the right curve going back to just two and a half years ago in December 2020, when GPT-Neo was the best on these USMLE style questions with 33.3%. And then it's just up and to the right until the point where Med-PaLM 2 is at 86.5%, expert level. But then on the other side of this graph, it's a comparison that really shows the preference share, right? You've got nine evaluation dimensions. And for each of these evaluation dimensions, it shows how often is Med-PaLM 2 preferred, how often is the human physician response preferred, and then how often is it evaluated as a tie? And you've got Med-PaLM 2 preferred to the physician in eight of nine dimensions. And just eyeballing the thing, to me, it really looks like seven of those nine dimensions are pretty clear cut for Med-PaLM 2. It's not that close. So those dimensions are better reflects consensus, better reading comprehension, better knowledge recall, better reasoning. On all four of those, we're talking an average of 70% of the time Med-PaLM 2 is preferred, then 10 to 20% of the time it's a tie, and only 10% of the time you have the physician preferred. That's a way bigger ratio, just for comparison in terms of preference, than GPT-4 has to GPT-3.5. So it's a substantial difference. I'm always shocked by how close those ratios are. GPT-4 right now versus Claude V1.3 is 6 to 4; it's even a little bit closer. So in general, we're seeing not huge ratios. This ratio of 7 to 1, with 2 going to tie, across all those core dimensions is a big deal. You don't want to over anchor on it, but I'm starting to anchor on it when I see these kinds of graphs. And then the other five: completeness; then the one that the physicians are preferred on, which is how often the answer has inaccurate or irrelevant information, where Med-PaLM 2 is judged to commit that mistake a bit more often than the doctors; but then omits information, Med-PaLM 2 dramatically preferred; extent of possible harm, again, Med-PaLM 2 dramatically preferred; likelihood of harm, again, Med-PaLM 2 dramatically preferred; and then finally a tie, basically, on evidence of demographic bias. So that is a big deal. I guess to bottom line all of that, in the original Med-PaLM paper, there's this line that says Med-PaLM still remains inferior to clinicians. And as I look at this chart, I'm wondering, is that statement still true? Or do we now have to kind of reckon with the fact that this appears to not really be inferior to clinicians, at least in this question answering domain?
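As a concrete illustration of how a pairwise preference comparison like this can be tallied per dimension, here is a minimal sketch. The dimension names echo the ones discussed above, but the individual ratings are made-up placeholders, not the actual study data.

```python
# A minimal sketch of tallying per-dimension preference shares from pairwise
# ratings (model vs. physician vs. tie). The ratings below are placeholders.
from collections import Counter

# Each entry: (evaluation dimension, rater's verdict on one question)
ratings = [
    ("reflects consensus", "model"), ("reflects consensus", "model"),
    ("reflects consensus", "tie"),   ("reflects consensus", "physician"),
    ("knowledge recall",   "model"), ("knowledge recall",   "model"),
    ("knowledge recall",   "model"), ("knowledge recall",   "tie"),
]

def preference_shares(ratings):
    """Return, per dimension, the fraction of ratings favoring each side."""
    by_dim = {}
    for dim, verdict in ratings:
        by_dim.setdefault(dim, Counter())[verdict] += 1
    return {
        dim: {v: count / sum(c.values()) for v, count in c.items()}
        for dim, c in by_dim.items()
    }

for dim, shares in preference_shares(ratings).items():
    print(dim, {v: f"{s:.0%}" for v, s in shares.items()})
```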

Vivek Natarajan: 57:41 I think that qualifier that you added at the end, at least in this question answering domain, on this data set, in this setup, that is what we observed. But I would still think that generally, despite all the progress, these models are not yet ready to be used autonomously. You should still have expert physicians in the loop. And I think there are limitations of this physician response generation as well. Because if you think about how physicians produce answers, right? They are biased towards delivering information in a succinct manner because they generally are just constrained for time. So in that sense, I think that bias might be creeping in. So in additional studies, what we want to do is be more deliberate. So in this one, we were deliberate, and we did ask our physicians to use whatever sources of information they thought were necessary. But even then, I think this natural bias might still be there. So we're trying to remove that confounder in the additional evaluations that we hope to do soon, where we want physicians to produce the best possible version of their answers, and also explain the situation as to how their answers might be used or how the model answers might be used. So while these results are great, I think there are these limitations, and so I wanted to call that out. And I think that requires further validation. But to me, what this shows is, even though these systems are maybe not yet ready for use without any sort of supervision by expert humans in the loop, they could actually already be very valuable in terms of augmenting our clinicians and doctors. So you can imagine these e-consult scenarios where doctors produce the gist of the response to a patient query, and then these models kind of come in and complete it and make it more user friendly, expanding terms that may be difficult to parse. For those kinds of things, for example, I think these models might already be ready. And that could again have a pretty big impact. And more broadly, the way to think about these things is cases where it's easy to verify the solution by an expert human. I think that is where these models will immediately shine, as long as that verification doesn't take most of the time. So if the model produces something, instead of you as a doctor having to write out that entire thing, you kind of just edit and correct the model where necessary. And hopefully that brings down the time for you to generate the documentation from 10 minutes to 2 minutes or 1 minute or so. I think those are the scenarios that we are immediately looking for. But as you're saying, it's hard not to be excited, because the trends are kind of obvious in terms of being able to do more autonomously. But I would still think that, for the foreseeable future, in parts of the world where there's maybe not really a shortage of doctors, these systems are going to be a co-doctor, which is going to maybe listen in to conversations or look at the information they got out of the patient more holistically, maybe interpret information such as genomics, for example, which most doctors don't really understand yet, and then surface information. And then the doctor ultimately decides how to use that information to help the patient at hand. So I still think that is likely going to be the scenario for the foreseeable future. But then, as I alluded to before, there are parts of the world where billions of people have medical information needs.
And right now, the standard of care is nothing. And so we can instantly do something pretty profound over there. But I think that, again, requires responsible innovation and validation before we get over there. I don't think we can just go out there and just deploy these systems. I think there's still a lot more work to do.

Nathan Labenz: 1:01:46 I definitely want to come back to the commercial path that you're really now beginning on. And it's amazing, again, just how fast this has all happened. But maybe as a bridge to that too, my qualifier of in this question answering domain, the sort of obvious next step that's maybe the buzziest trend in AI right now would then be multimodality. And so, what is PaLM 2 kind of promise for us there? And do we think that we're headed for Med-PaLM 2 plus or whatever that can sort of ingest scan data or look at images of wounds. It seems like this has got to be the next frontier, right?

Vivek Natarajan: 1:02:33 Yeah. Again, I don't want to give too much away over here, but that's kind of obvious. Medicine as an endeavor is inherently multimodal. All the data that we're dealing with, it's not just language and text; it's lab records, EHR, scans and images, genomics data. And I think one of the most interesting trends that maybe people don't appreciate enough is that biology and medicine is kind of the largest data generating flywheel on Earth today. And that is higher than pretty much any other domain. And I think the implications of that are quite profound, because that means the most obvious way to make use of that data is through AI, because no human is going to be able to. I think we kind of gave up on that dream 10, 15 years back. So it's not about giving up, but we need AI to make sense of all the data for us. And I think it's going to enable two things. One is, as you start integrating this data at scale, and all of this is going to be multimodal data, I think a lot of different hypotheses are going to emerge in terms of our understanding of human diseases and disease mechanisms, or how to do biomarkers and diagnosis and what sort of therapeutic interventions to apply. So there's going to be fundamental biomedical discovery that this is going to enable as we start doing this in the next few years. And then the second thing is we're going to leverage that with these systems to really scale up precision medicine to billions worldwide. That just seems like an inevitability at this point in time. So I do believe that we are at a very early stage of an exponential curve over here in bio, life sciences and medicine. And the end goal over here is just precision medicine at scale. And beyond that is just simply advancing human potential and the kind of things that you gain from that. And that's probably all going to happen within 15, 20 years. Although the next few years, it's going to be interesting how this plays out.

Nathan Labenz: 1:04:34 Yeah. 15, 20 years seems like a long time given the December 2020 to present graph in the current paper. Do you think there are any fundamental breakthroughs required to achieve this vision? When I scan the landscape, I'm seeing all this soft prompting stuff, all this multimodal injection, going back to the Flamingo paper and a million other things, BLIP-2, Meta's recent ImageBind release with five, seven different modalities just in the last couple of weeks. It doesn't seem to me we're really missing any key pieces to get to the point where the systems really should work. That's not to diminish the amount of actual field testing, rough spot identification and sanding down. My company, Waymark, we make video scripts for small business commercials, and we still find plenty of rough spots to sand down. So I appreciate the fact that you're dealing with a 1000x bigger, maybe 10000x bigger, and 1000x more critical domain than we are. And there's probably correspondingly a million times as many rough spots. But when we find those rough spots, we sand them down and then that pretty much does work. We patch training data, we do a variety of different things, tweak our instructions, and we basically can get over most of the problems we have. Do you think that we're basically there, subject to sort of refinement, engineering, validation, field testing and deployment? Or do you see conceptual things that feel like they're not yet there?

Vivek Natarajan: 1:06:26 I think it depends on what exactly you want from these systems. If you want human style intelligence, I don't think we understand human intelligence well enough, and so these models are not going to be that. But I think these models are fundamentally a very different kind of intelligence. And that means the kind of things that they do is also different from what we do. It is their ability to actually deal with information at scale. And if you're thinking about that, I think other people have mentioned this, right? People have said these models are simultaneously smarter than us and dumber than us. And I think that is just because this is a different kind of intelligence. And so if you're okay with that, and thinking about how to use that intelligence to augment us in our pursuits and endeavors, whether that's in medicine, whether that's in science, then I think most of the components do already exist. I would still say there are some technical challenges for sure that we see as we train these models across modalities, for example, even combining vision and language. I do get a sense that our language models have become very, very powerful; that is kind of obvious with GPT-4. But the vision models, although there have been some really incredible breakthroughs with the Meta Segment Anything model and a bunch of other stuff on the way, I maybe get a sense that despite scaling both the model and the data, we are maybe not there yet; it's not that powerful. And that becomes more apparent in medical settings when we are dealing with these scans, where to interpret them accurately, it's like finding a needle in a haystack. So your representations have to be really spot on. And then again, the volume of information that you are dealing with is also much bigger. And that has implications on the size of the model as well. And then I'm just talking about vision and language right now. Now you think about EHR, which is fundamentally a completely different modality. And if you want to model that separately and well, again, you have to deal with the architectural choices, the encoding choices that you make over there. Same with genomics, it's a very different beast. So there's a combinatorial explosion that happens when you start modeling and trying to think about optimal encodings for these different modalities. And I would say, while we kind of have a good space of solutions to explore, we haven't arrived at the right design choices yet. And I think that is going to take quite a bit of exploration in the next year or two to get over there. And I think the second thing is also generally around this architecture, there's this question of whether you want to throw everything at this one single model that can process pretty much anything, or whether you want to maybe compress that model down and have a lot of these specialist models that you can fork off to, forming a mixture of experts style maybe, and train these systems. And there have been illustrations of that; Hugging Face has something. So we still don't know the trade offs over there. I don't think anyone has done a comparative study. So those two places, for me, still seem ripe for exploration and research. But as you said, I feel if we accept that we are not looking for human style intelligence, but something more complementary that is going to make sense of all this data at scale for us, then I do think most of the building blocks are there for us.
And we will find out pretty soon if it's enough or not. But I do think that once we get this right, it's going to be very useful.
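To make the design space Vivek sketches a bit more concrete, here is a minimal toy in PyTorch: separate encoders per modality (text, imaging, EHR) projected into a shared space, with a simple learned gate weighting them, loosely in the spirit of a mixture of experts. All module names, feature sizes, and the fusion scheme are illustrative assumptions, not Google's architecture.

```python
# A minimal sketch of modality-specific encoders plus a learned gate that
# weights each specialist's contribution before a shared prediction head.
import torch
import torch.nn as nn

DIM = 128

class MultimodalRouter(nn.Module):
    def __init__(self):
        super().__init__()
        # One specialist encoder per modality, each mapping into a shared space.
        self.encoders = nn.ModuleDict({
            "text":    nn.Linear(512, DIM),   # e.g. pooled LM features
            "imaging": nn.Linear(2048, DIM),  # e.g. pooled scan features
            "ehr":     nn.Linear(256, DIM),   # e.g. structured-record features
        })
        self.gate = nn.Linear(DIM, 1)          # scores each modality embedding
        self.head = nn.Linear(DIM, 2)          # toy downstream prediction

    def forward(self, inputs):
        # inputs: dict mapping modality name -> feature tensor of shape (B, feat_dim)
        shared = torch.stack(
            [self.encoders[m](x) for m, x in inputs.items()], dim=1)  # (B, M, DIM)
        weights = torch.softmax(self.gate(shared), dim=1)             # (B, M, 1)
        fused = (weights * shared).sum(dim=1)                         # (B, DIM)
        return self.head(fused)

model = MultimodalRouter()
batch = {
    "text":    torch.randn(4, 512),
    "imaging": torch.randn(4, 2048),
    "ehr":     torch.randn(4, 256),
}
print(model(batch).shape)  # torch.Size([4, 2])
```

Whether something like this gating over specialists or a single giant model that ingests everything wins out is exactly the open trade-off he describes.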

Nathan Labenz: 1:10:05 So then let's talk a little bit about kind of the business plan. With this last Google I/O, there's now the notion of the Med-PaLM API. And I understand that this is available on a limited basis right now to kind of trusted testers and researchers in the academic community. Can you tell us what it is? Are we now in a multi-turn chat style modality that these folks can start to explore?

Vivek Natarajan: 1:10:34 No. I think the model is still optimized for mostly single turn use cases and settings. So we've done development of Med-PaLM with certain applications in mind, but as I said before, this is a platform technology. And I think once you give access to people, they're going to find creative use cases of it. And so that is kind of the goal. But then again, it's important to do this responsibly because we know that these systems can be useful or not useful depending on the end user. So you want to do this in a setting where this feels safe. So that is why we're having this trusted tester program on Google Cloud where we are exposing it to a bunch of people with a spectrum of use cases in the medicine and life sciences domain, and hoping to gather more feedback. And we'll use that feedback to iterate and improve. And hopefully it gets to a stage where we feel it's safe enough to more broadly expose the system. I think the value proposition is very clear and the opportunity is very clear. If things go well, all that happens sooner rather than later.

Nathan Labenz: 1:11:50 Yeah. I guess, is there a vision for a consumer product? I mean, it sounds pretty straightforward to imagine, not straightforward, but I imagine there will be no shortage of people who need to process medical data for medical systems or for insurance companies, and all that kind of stuff is going to be probably less controversial, easy to attach revenue to, and just pure everybody wins. Whereas it seems things might get a little bit more controversial, at a minimum, with an actual consumer facing application. And then a theme we've heard a couple of times, I don't know if it's a theme, but certainly a theme of speculation anyway, is that there may be some sort of leapfrog type dynamics, because maybe you can't or you don't want to deploy something like this direct to consumer in the US, but, maybe for very good reasons, you might want to do it in your hometown in India where it's kind of this or nothing. So how would you kind of sketch that path forward for us?

Vivek Natarajan: 1:12:57 Yeah. I really don't know if I have all the answers over here, to be honest. I think we all see the potential for this technology. And I think we all agree that there's a spectrum of use cases, and some of them are obvious, where we can deploy into workflows and settings where there's not much risk involved, and the upside is very clear and obvious, and that's going to have a lot of impact. So these workflow settings, documentation generation and everything, I think that's going to happen very, very fast. The middle ground is mostly going to be scenarios where you do actually have an expert in the loop, or one that you can fall back on. But even that scenario, I think, would need verification, validation, evaluation studies that are sufficiently well powered before we can start doing that. But again, that is, I think, going to happen fairly rapidly. There's no shortage of interest from folks in healthcare and medicine to do these clinical studies. Yeah, the last application that you mentioned, direct to consumer, I think that is the one that I am maybe less clear on. Honestly, it could go anywhere over here. And this is again my personal take, but I'm pretty sure people are already using GPT-4 for medical information needs. And so the genie is kind of already out of the bottle. And so it's unclear to me whether there's going to be a clamp down on it or whether there's going to be some other form of regulation that comes in. I think we're going to find out very soon. There's a lot of discussions happening. The value prop is clear. I just don't know who does it and when and how. That is a big question.

Nathan Labenz: 1:14:35 Well, Google is quickly coming out of its shell in this space and certainly seems like you guys are going to make a big impact along with a couple of others, obviously, who are helping lead the way. But the quality of this work is obviously extremely high. Kind of random question I want to sneak in before the end, and then I've got just kind of a couple closers for you. And I appreciate all the time. This has been extremely illuminating and fun for me. But there's been some debate a little bit lately around narrow, more trusted data sets versus the sort of super broad data sets. I just saw a company launch in the last few days that says that they train only on the kind of trusted data that they're able to license or partner for and kind of, by implication, seem to suggest that pre-training on the internet, you learn all this crap, some of it's wrong, that could be a problem. That kind of rings plausibly true to me. But then I also wonder, and I was talking to our last guest on the show, Neil Koslow, about this, it seems like there is also some sort of inherent breadth to medicine or just deeply contextual nature. So examples that he gave were, if a patient says they ate cereal, that probably also means they consumed milk. And if you don't have that kind of general common sense, you may struggle. Or, very Silicon Valley take, but he was like, if the patient says they went to Burning Man, they probably inhaled a bunch of dust. And you might want to know that as you're trying to engage with them. So do you have a point of view on that? This general pre-training does seem like it maybe have pros and cons. What do you think?

Vivek Natarajan: 1:16:15 So I think it again goes back to how we did this work, right? I mean, we did not decide to initialize our large language model from scratch and train only on medical domain data. And the reason for this is the fact that by training on internet scale data, even though that objective seems very simple at a high level, by doing this at this scale you're basically imposing a very complex multitask objective. And so that means to do well on this task and predict next words accurately, you need to not only understand syntax and semantics, but also medicine, physics, biology, chemistry and everything. And as a result of doing that, with models that have a sufficient number of parameters and a good enough architecture, we do see emergence of these reasoning capabilities in these models. And so you want to build on top of that substrate. It's clear that as you do that, filtering becomes important as well. Data quality is important. And you can filter out a lot of the stuff that we think of as toxic or harmful, and still end up with a sufficient number of tokens to train these models. It's not a perfect process. But I would say that you do need that to have this emergence of reasoning and common sense reasoning in these models. That is the substrate that you want to build on top of. For example, even engaging in dialogue and conversations: if you're training purely on biomedical text or scientific text, I don't think you're going to be able to have a normal conversation with an end user who's looking for a very simplified explanation. You're probably going to talk like a scientist who only understands complex terms. That's not going to be useful at all for an end user. So you want to build on top of that scale of things. And then the second thing about medicine is the fact that I think we should not underestimate the power of internet scale data and the amount of information that is contained in it. So if you're looking for rare diseases, conditions, symptoms, it's very likely that someone somewhere would have posted about it on some social media forum. And so I think it's useful to have that indexed and represented some way or another. And then the challenge is, okay, how do I elicit that information when needed, given the context? And so that is where all this work on fine-tuning and alignment and everything helps. But I generally do believe, and this is not just true with large language models, but over the period of four years that I've been working on medicine and health, we have seen that as you scale up these models with more diverse data, the reliability of these systems improves, the calibration improves, the out-of-distribution performance improves. And we've done very rigorous studies in many different modalities, imaging, records, and now large language models. And the alternative to not training on internet scale data is small scale datasets. And that is by definition biased. That is not going to perform well when you take it to a new setting where it has not seen that kind of data. I feel this is absolutely critical. But then the work on alignment, and ensuring that the model is performing in a safe manner, is also critical.
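As a rough illustration of the kind of quality filtering described here, the sketch below keeps web-scale breadth while dropping documents flagged as low quality or matching a blocklist, then reports how many tokens survive. The scoring heuristic, blocklist, and threshold are all illustrative assumptions, not the actual PaLM or Med-PaLM data pipeline.

```python
# A minimal sketch of corpus quality filtering: score each document, drop
# blocklisted or low-scoring ones, and track how many tokens remain.
def simple_quality_score(doc: str) -> float:
    """Crude stand-in for a learned quality classifier."""
    words = doc.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Very short docs and walls of symbols score poorly.
    return min(1.0, len(words) / 50) * (1.0 if 3 <= avg_word_len <= 10 else 0.3)

BLOCKLIST = {"lorem"}  # placeholder for a real toxicity / spam filter

def filter_corpus(docs, min_score=0.5):
    kept, kept_tokens, dropped_tokens = [], 0, 0
    for doc in docs:
        n_tokens = len(doc.split())
        blocked = any(term in doc.lower() for term in BLOCKLIST)
        if not blocked and simple_quality_score(doc) >= min_score:
            kept.append(doc)
            kept_tokens += n_tokens
        else:
            dropped_tokens += n_tokens
    return kept, kept_tokens, dropped_tokens

corpus = [
    "Patients presenting with chest pain should be evaluated promptly. " * 10,
    "lorem ipsum " * 30,
    "$$$ !!!",
]
kept, kept_tok, dropped_tok = filter_corpus(corpus)
print(f"kept {len(kept)} docs, {kept_tok} tokens; dropped {dropped_tok} tokens")
```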

Nathan Labenz: 1:19:27 AI tools that you use, products that you would recommend that the audience check out, what are you using?

Vivek Natarajan: 1:19:33 Yeah. I think this is where, just because I am so closed up within the Google ecosystem, I don't actually get a lot of exposure to other AI tools necessarily. So my response is not going to be great. But I do enjoy these new apps that can do this avatar generation and that kind of stuff with images and all those things. I think that stuff is all over the news feed. I would also put in a plug for MusicLM from Google, which can do interesting stuff. I've been playing around with it and been pleasantly surprised. And as someone who has had a keen interest in music, but is not particularly skilled at it, I find those sorts of tools to have the potential to allow me to explore the space more broadly.

Nathan Labenz: 1:20:22 I just got my hands on that for the first time earlier today, and we might even have a futuristic classic reggae track about AI, that was my prompt, as kind of the theme music for this episode. So that's a very timely recommendation. Okay, second one. So let's imagine a hypothetical situation. A million people already have the Neuralink implant and the safety profile is generally looking good. I usually say, imagine it's kind of like the COVID vaccines, where, by and large, most credible sources seem to agree that it appears to be safe. That doesn't mean, of course, that nothing could go wrong or that there's no doubt about it. But if you get one, then you can communicate directly from your brain to all your devices. Would you be interested in getting one?

Vivek Natarajan: 1:21:22 Yeah. I think it's a good question. I need to think more carefully about the potential applications and what benefits it would have for me. Again, I think this is one of those trends which is obvious in terms of how our interaction with technology and AI has evolved over the last few decades: we're making it simpler and simpler, more efficient. So we've gone from GUIs to now language. And I think the next obvious step is neural interfacing. Yeah, I think the question is, can I keep up with that pace of interaction that might happen once I have this high bandwidth channel? And what are the kinds of new things that I can do now with that, which I couldn't do before? I haven't maybe given deep thought on that one to give you a direct answer right now. But as someone who likes to explore new things, I think it could be fun. And I see that happening. But I do think that the invasive ones are probably not going to be what becomes popular. I think we're going to make progress towards non-invasive neural interfacing technologies as well. And that may still be a longer time out than what Neuralink has right now, but I think that is also very obvious. The trends are obviously there.

Nathan Labenz: 1:22:39 Yeah. There's been pretty visible progress on that just in the three months that I've been asking this question. So may have to update it to a V2. Okay, so last one then. Just zooming out as big picture kind of wide lens as you can. What are your biggest hopes for and fears for society at large as we enter this seemingly likely to be transformative AI era?

Vivek Natarajan: 1:23:07 It just feels like the opportunity of a lifetime to have impact in the world in a safe and beneficial manner, and to leverage technology and AI to improve the health of millions of people and help people reach their true potential. I think there's a very good chance of that happening, given a sufficient enough timescale. My hope is just that the wider community, everyone who's in a position to influence this, people who train these models, but also policymakers, regulators and users of these systems and so on, kind of work together in a collaborative manner, and we ensure that everyone pushes together in the same direction in a fast and bold manner, but also in a safe and responsible manner. And if we are able to achieve that, I think the future is very bright. I guess I don't know what the odds of that are; it could be that that doesn't happen. The arc of technology is such that we've generally had technologies that are dual use, fire or electricity or automobiles and so on and so forth. Humanity as a whole has, until now, always managed to make safe and beneficial use of these technologies, while being able to constrain elements of society that want to use them in a negative way. So I remain very optimistic about humanity's ability to make use of AI, and to just use it to do incredible things.

Nathan Labenz: 1:24:49 Vivek Natarajan, thank you for being part of the Cognitive Revolution.

Vivek Natarajan: 1:24:52 Thank you so much for having me, Nathan.
