Revolutionizing Patient Care with Neal Khosla of Curai Health
Neal Khosla discusses AI's future in healthcare, patient experiences, and the global impact of AI medicine with Nathan and Erik.
Video Description
Nathan and Erik sit down with Neal Khosla, founder of Curai Health, a venture-backed virtual care startup using AI to provide low-cost primary healthcare. Prior to his current role at Curai, Neal was a machine learning researcher at Google and Stanford. In this episode, they discuss the current state of AI in medicine, what the future patient experience may look like, and how developments in AI healthcare may interact with different regulatory and social forces across the globe.
FEEDBACK
We'd love to hear your feedback or answer any listener questions live on an upcoming episode. DM @labenz on Twitter or email us at info@turpentine.co with "TCR" in the subject line. We'd appreciate it immensely if you left us a review on Apple Podcasts or a rating on Spotify, which helps others discover the podcast.
RECOMMENDED PODCASTS:
Upstream: @UpstreamwithErikTorenberg
TIMESTAMPS
(00:00) Preview
(04:32) The future of AI in medicine
(12:50) Patient experience in AI-driven medicine and the current state of AI in medicine
(15:04) Sponsor: Omneky
(20:34) Building the LLM architecture for medicine
(25:25) Current state of the art for AI in medicine
(28:54) Evaluating LLM performance in medicine
(31:49) Benchmarking LLMs in medicine
(39:17) Using the Socratic method in training LLMs
(43:57) Multimodal systems and deploying multiple models in medicine
(51:30) The future of LLM creation and usage
(59:47) The interaction of AI medicine with social and regulatory forces
(01:11:00) AI adoption in countries with centralized healthcare systems
(01:13:37) How should we regulate the usage of AI in medicine?
(01:16:57) AI-first systems in medicine
(01:18:19) Is there an obligation to deploy AI in medicine in its current state?
(01:20:48) Neal’s favorite AI products
(01:22:21) Would Neal get a Neuralink implant?
(01:24:54) AI hopes and fears
TWITTER:
@CogRev_Podcast
@nealkhosla (Neal)
@labenz (Nathan)
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
More show notes and reading material released in our Substack: https://cognitiverevolution.substack....
Music License: CRDIPZTVJ4XMPB9U
Full Transcript
Neal Khosla: (0:00) If you imagine that for your particular condition, there's one doctor in the world who's the world's expert on it, that person should be available to you around the clock. It's a little bit alarming to me that in 2023, as a patient, I am still living in a world where a couple of smart people sat down and talked it out, as opposed to we've had millions and millions of people who go through all these medical conditions. Why can't we understand what happened to them, what was done to them, and really use that data to help improve our interventions? There was survey data that came out that showed a direct correlation between optimism on AI and the wealth level of the country. Basically, poorer countries are just abundantly optimistic about this stuff. It's very clear to me that in the next 10, 15, 20 years, whenever it happens, we're reaching a point in society where there's going to be so much infrastructure that cognitively can do so much for us that you have to decide to understand what spiritual fulfillment means to you.
Nathan Labenz: (1:01) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, we continue our exploration of AI in medicine with our guest, Neal Khosla, founder and CEO of Curai Health, online at curaihealth.com. As you'd expect for an entrepreneur who's raised more than $50 million in venture capital, even before the current AI moment, Neal is a polished communicator. I thought he did an excellent job of describing Curai's vision for the future of medicine in just the first few minutes of our conversation, so I'll keep this introduction relatively brief. Of course, we do cover a variety of topics and angles, including the impact of GPT-4, Curai's recently published research, which uses multiple instances of GPT-4 to improve performance, how Neal personally uses AI for medical advice, what's missing and still needs to be built in order to ensure consistent quality of AI medical advice, Curai's go-to-market strategy, how the medical establishment is reacting to AI progress and potential, whether poor countries are likely to leapfrog rich countries when it comes to AI adoption, an inconvenient truth about today's LLM landscape, how medical use of language models should be regulated, and plenty more. Before getting into it, though, I want to take just a quick moment to, again, thank everyone for listening and share a few quick updates. First, we've heard your feedback about sound quality and have recently begun offering to send guests an external microphone should they need one to ensure that you can hear them as clearly as possible. 
This change will come online over the next couple of weeks, and sound quality issues should be a thing of the past. Second, if you have any other feedback or questions for me, you can email us at info@turpentine.co, or feel free to DM me on Twitter where I am at Labenz. I really love doing these interviews, but we've also got great feedback on our Erik and Nathan discussion episodes, so we do plan to do more of those in the future as well. Please let us know what's on your mind. Third, I invite you to subscribe to our newsletter, which is online at cognitiverevolution.substack.com. We send out new episode updates and also cross-publish my AI megathreads there. Finally, for now, if you're enjoying the show, I'd ask you to help pay it forward by writing a review on Apple Podcasts, Spotify, or the podcast platform of your choice. I've received a bunch of private messages of thanks and encouragement, which has been super rewarding, but I'd love to see more of this posted online as well, as I'm told that this is the single best way to help others discover the show. Now, without further ado, I hope you enjoy this conversation with Curai Health founder and CEO, Neal Khosla. Neal Khosla, welcome to the Cognitive Revolution.
Neal Khosla: (4:13) Yeah, thanks guys. Thanks for having me.
Nathan Labenz: (4:15) Super excited for you to be here. We are, and I am, increasingly obsessed with the role that AI is going to play in the future of medicine. So really excited to get your take on it from all sorts of angles. In your role as the founder and CEO of Curai, I'd love to just start off with giving you a chance to give us your vision for the future of medicine. What is my experience as a patient going to be as AI starts to have an impact, maybe say two years from now, 2025, and then if you can see that far into the future, end of the decade, 2030?
Neal Khosla: (4:50) Well, there are a couple of things I'd say. If you actually take a step back and just think about medicine and what it means to practice medicine for as long as humanity has been around, it's basically been you go and spend 15 minutes with a doctor. And the only lever we have is how much time you spend with the doctor. If you're wealthy, you get more time with the doctor. If you go back to the Middle Ages, the kings and queens would get a lot more attention, obviously, than anybody else. But now it's still not that different. People pay for concierge doctors, but it all comes down to doctor time. And really, the idea is that we have the doctors, they're basically oracles and sages. They're supposed to be well-read on all the biomedical knowledge that's most up-to-date that we have, and they spend 15 minutes with you and they give you a recommendation. And if you think back over the last 60 years, the most amazing thing about the computing revolution is it fundamentally has not changed that at all. It's probably the only profession in the world where if you take a human being today and drop them into the same profession 60 years ago, they would function just fine. The only thing that's changed is that back then there wasn't an MRI or a CT machine; those were invented in the '70s. Outside of that, you have some therapeutics, but otherwise it's the same job, which is pretty mind-boggling to think about. And so when we say, what's your vision for the future of medicine? I think we have to start with this: fundamentally, what we do in medicine today is a very old thing. It really has not changed at all. And so our notion from the beginning has been we should fundamentally reimagine the way that a physician practices with data and computing at their center and at its core.
You should be able to pull out your phone and talk to your doctor, or the data from your wearables and your other devices should bake into your health. But I think the main thing that we've really focused on is that we think that AI can be a great equalizer in terms of the ability to make healthcare broadly available to many, many more people. And the way I always explain it is, if you imagine that for your particular condition, there's one doctor in the world who's the world's expert on it, that person should be available to you around the clock. And so our conception is that in the future, every human being is going to be able to talk to that person, maybe not directly, but at least somebody who represents the same set of knowledge. And a lot of that starts with building AI systems that can scale more like software does. That's how we think about it at a large scale. I know that's a 10,000-foot view, but in the future, you should be able to pull out your phone and have basically best-in-class, super-personalized knowledge about whatever issue you're going through or your particular health that's available to you basically at zero marginal cost. And I think if you build that future, not only is it going to be really meaningful here in the US, but then broadly across the world where you've got 8 billion people who are never going to get access to the kind of care that somebody going to Mayo Clinic, for example, gets here in the US. So I don't know if that answers your question to start, but I'll pause there.
Nathan Labenz: (8:39) Yeah, that's great. I mean, there's a couple of themes there that we've been increasingly developing and obsessed with. I've got this notion of zero marginal cost expertise in general, and you have a much more developed, particular version of that for medicine. And then your comments about the difference between the level of access and therefore the level of impact that developments like this would have in the US versus much of the rest of the world is something that I also harp on every chance I get, because I think that should not be lost. There's a lot to worry about with AI in general and with turning over medical decision-making to AI. People are certainly understandably cautious and concerned. But boy, when you think about the impact globally, it certainly gets me extremely excited. So it's cool to hear you talk about that right off the bat in your first run-through of the vision.
Neal Khosla: (9:42) Yeah, I would say the one other thing that's popping out, just hearing you talk, was prompting some of these thoughts. One of the things that's most amazing about medicine is that it is not a particularly data-driven science. It's much more a judgment-based art. Some of the studies on this are actually pretty fascinating. So if you look at clinical guidelines, basically how the medical establishment says that doctors should treat an issue, there was a review done, now probably about five or six years ago, that basically looked at them and asked, how many of them are based off of Grade A clinical evidence? And the answer was about 11%. So that means roughly 90% of the time, when you are getting a clinical best practice guideline, you're getting something that's based off of what really comes down to expert opinion. And then when they look deeper and ask what percentage of the time doctors actually follow these guidelines, it's about 50% of the time. And so if you take a step back, as a human being, roughly 95% of the time you're not actually getting a very data-driven recommendation. Now, there is some data that informs these things, and there are usually expert panels that sit down and talk through it. But it's a little bit alarming to me that in 2023, as a patient, I am still living in a world where a couple of smart people sat down and talked it out, as opposed to: we've had millions and millions of people who go through all these medical conditions. Why can't we understand what happened to them, what was done to them, and really use that data to help improve our interventions? And that is one thing that we've worked on at Curai as well that I think is a much longer path. That's not something that's going to have an effect in the next three years, or probably even five.
But if you look on a 20-year timeframe, I think the ability to create the infrastructure to collect that kind of longitudinal data on what happens to patients, and then use it to surface decision support right at the point of care so that we can make decisions based on that data, has never existed in humanity. It's kind of interesting if you go to a state like Utah, which has mostly had a homogeneous population because of the Mormon population, many of whom have lived there for multiple generations, and they were very progressive in getting into electronic health records and genealogy and some of these things. It's one of the few places that tends to have some of these more longitudinal and genealogical databases on patients, and they're just starting to figure out what they can do with that. But the rest of the world doesn't have that, and it doesn't even have a glimpse of that. And I would argue what they have in Utah is probably insufficient to really get insights. That's one other thing. I do think medicine over the next 20 years really needs to start thinking about how we lay the foundation so that care decisions can be made in a super data-driven way, as opposed to where we are today, which is, I'd say, still incredible. Modern medicine is a miracle in many ways, but it's not all the way there for what I would want as a patient.
Nathan Labenz: (12:50) So let's maybe spend another second just filling in a little more detail, giving a little more color on this future of medicine in an experiential way, from the patient's point of view. And then I want to talk about where the state of AI for medicine is today. A number of exciting things have been published recently, including work out of Curai, which I definitely want to dig into. And then we can maybe take a step back and talk about, okay, now where are you today and how are you going to get to that future and some of the barriers that might arise, including regulatory, et cetera, et cetera. So if I'm a patient, I'm just trying to imagine my experience, right? And this is just a few years from now, potentially. The idea would be that I have 24/7 availability, that when I begin an interaction with the system, there is a seamless, presumably embedding-backed database of all of my previous interactions that have the semantic representation of issues I've had, conversations I've had, notes that the doctor took in the past. Presumably, all of that is dumped in and can be used for retrieval. Presumably, the medical literature is also deeply baked into such a system. And then what do I do? Am I interacting with a doctor in the same way that I would today, except it's just a language model? How does that then play out into care?
Neal Khosla: (14:25) Yeah, so I think it's important to root in what patients do today. There are basically two ways that patients access medical expertise. One is they talk to their doctor or a doctor if they don't have their own doctor. And the other is they go online and do self-research. Ultimately, I think these two things probably need to become one. There are a lot of ways in which self-research is maybe suboptimal for the patient in terms of coming to the right conclusions because patients are self-directing and they may not always know the best path, and the information online is not always vetted or good. And on the other hand, a lot of those pieces of information also should inform what your doctor is doing.
Nathan Labenz: (15:03) Hey, we'll continue our interview in a moment after a word from our sponsors. I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description.
Neal Khosla: (15:32) I'm sitting here. I take fish oils every morning. My doctor, most likely, if I'm an average American, has no idea that I'm taking fish oils or that I'm considering taking fish oils because most Americans don't have the time to talk to their doctor about those things. Most people are going on Google, they're going on Reddit, they're going on these other places to figure out what they need to do given whatever their goals are. At the end of the day, we basically think that the appropriate interface does look something like a text box, which is I, as a patient, have a way to say, this is what I'm thinking about, this is what I'm wondering about, this is what I'm dealing with. And on the other end, there's a system that helps me figure out what exactly that means I should be doing for my health. So if I'm having symptoms or if I'm considering taking a new supplement or starting a new diet or exercise regime, or if it's something more medical, I'm managing my diabetes long-term or I'm trying to get pregnant or anything like that, it all starts with a text box. And on our end, it's important to build the magic, as you alluded to, to be able to parse through that, understand what the patient's looking for. It almost always starts by asking more questions, and this is something that's completely not present in any kind of online research, the ability to have a dialogue with a system that says, okay, you're considering taking this supplement. Why are you considering taking it? What other things have you tried? How long have you been thinking about this? What are your goals? And let's dig into your biomarkers. Do you actually have high triglycerides? Is that why you're thinking about taking fish oils? And so for us, it always starts with what is the patient's goal or intent. And then how do we drill down and understand more? And so that's always an automated questionnaire system. 
I think to get into the technical side of things, it inherently does have to tap into a large amount of patient history and data. And depending on the patient, we have a very different amount of data. We are finding increasingly that even with simpler patients, you start to extend beyond the context window pretty quickly. And so you need to find other ways to index that prior data you have on them, as well as what you're collecting in conversation with them right now, and use that to intelligently feed that material into your context window so that you can help the language model make a decision. I'd say the other big thing that we think a lot about is that these problems are not solvable entirely with a language model. You've got to build a lot on the guardrails and safety side. So especially for where we are today, just from a regulatory perspective, you can't have a language model giving a patient direct personalized advice like, go take this medication. That has to be done by a doctor. That's a regulatory requirement. You have to build guardrails around that. You have to build guardrails around when patients have mental health issues, handling those things appropriately. I don't know if you saw, there was a news article that came out the other day about a person who, sadly, killed themselves, and their significant other felt it was because they were talking to a chatbot. And so these kinds of situations are real. And so for us right now, there's a lot we're building on the safety side. Yes, from an experience perspective, you come in, you talk about what you want, and then it's our job to drill down with questions and then serve that biomedical expertise for you. And a lot of it is happening on the language model side, but there's a lot of scaffolding being built around these models, as well as support and checking, to be able to surface good recommendations.
And the thing that we always do before we give a patient a final recommendation, we're always connecting them to a physician. And that's where we are today, where things are getting escalated to a physician for final review and to give that advice. That also allows us to do things like have the physician give you a prescription or a medication or send you to get lab work and then further personalize the information or the recommendation we're making. So it's a really simple thing in concept. At the end of the day, we're trying to get people to talk to their doctors about the whole range of things that they're dealing with, and we're just trying to make it scalable and possible for the doctors to actually respond to those things by using tools that maybe do 80% of every visit that they're having.
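For technically minded readers, the history-retrieval step Neal describes, indexing prior patient data and selecting only the most relevant pieces to fit the model's context window, might look something like the following sketch. This is a toy illustration, not Curai's actual system: the bag-of-words "embedding" stands in for a real embedding model, and all names and the character budget are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_history(complaint: str, notes: list[str], budget_chars: int) -> list[str]:
    """Rank prior notes by similarity to the current complaint and keep
    the best matches that fit within a context-window budget."""
    q = embed(complaint)
    ranked = sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)
    selected, used = [], 0
    for note in ranked:
        if used + len(note) <= budget_chars:
            selected.append(note)
            used += len(note)
    return selected
```

In a real deployment the budget would be measured in tokens rather than characters, and the ranking would come from a proper embedding index, but the shape of the step, rank then pack, is the same.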
Nathan Labenz: (20:16) So a couple of things stand out to me from your description there. One is just how similar the architecture is between what you're building for medicine and basically everything else we're hearing about. Whether you're trying to build an AI assistant or something else, you have a lot of those same pieces. People are saying, I want it to be a text box. It needs to know my preferences, like whether I prefer the window seat or the aisle seat, and there's this retrieval component to it. It's striking the degree to which things are converging architecturally, both in terms of the nature of the models themselves and also the surrounding tooling. You can complicate that view for me, certainly, if you want to.
Neal Khosla: (21:06) I'd say the main thing, because quality and safety are so important in medicine, is one thing that I haven't seen a lot of people working on that we are: what I'll generally describe as unit testing or regression testing. Completely hypothetical scenario: a patient comes in and is dealing with some issue. They've got some sort of bacterial infection; they need an antibiotic. Great. We can have a language model interview them, then the doctor can review the summary of what the language model gathered, and then they can work together to come up with a prescription for this patient. We want to have visibility into how the language model is handling a broad range of these kinds of scenarios. And so instead of just releasing this product, we've been working on a large set of test cases, where you can run a set of regression tests that say, for a common range of clinical cases, how does the language model handle these things? And because these things are non-deterministic, you want repeatability, much like in building software, where unit testing is a great way to know that things work reliably. And so we haven't seen a lot of people working on robust unit testing suites, but at least in medicine, it feels like a must-have to know that there's predictability. It's even more true as you use language models that update over time. If you're doing any kind of fine-tuning or human feedback or any of these things, the model is changing. And you wouldn't want a scenario where we do really well with antibiotics today, but then tomorrow, all of a sudden, the thing goes haywire. And so that is one area where I'd say our stack maybe is starting to diverge from what I'm reading about on the internet. You really want a large set of test cases to understand how the model is thinking and reasoning as it changes over time.
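A regression suite of the kind Neal describes could be sketched as follows. This is a hypothetical illustration, not Curai's implementation: the model is a stand-in for a real LLM call, the keyword checks are a deliberately crude proxy for clinical evaluation, and the case content is illustrative, not medical guidance.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClinicalCase:
    # One regression case: a patient presentation plus checks that any
    # acceptable model response must satisfy.
    prompt: str
    must_mention: list[str]      # terms an acceptable response contains
    must_not_mention: list[str]  # terms that would be unsafe here

def run_regression(model: Callable[[str], str],
                   cases: list[ClinicalCase], runs: int = 3) -> dict[str, bool]:
    """Run every case several times (model output is non-deterministic)
    and record whether all runs satisfied the case's checks."""
    results: dict[str, bool] = {}
    for case in cases:
        ok = True
        for _ in range(runs):
            response = model(case.prompt).lower()
            if not all(term.lower() in response for term in case.must_mention):
                ok = False
            if any(term.lower() in response for term in case.must_not_mention):
                ok = False
        results[case.prompt] = ok
    return results
```

Rerunning the same suite after every fine-tune or model swap is what gives the "today it works, tomorrow it goes haywire" failure mode a chance of being caught before patients see it.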
Nathan Labenz: (23:00) Yeah, that's super important. A little glimmer of that type of thing has come out from OpenAI recently with their Evals library, but I'm sure you are going comfortably 10x deeper in the use cases of interest. So maybe let's talk for a second about just what the state of the art is in terms of what AIs can do in medicine now. Devoted listeners of the show will have heard me talk a little bit about my experience with the model we now know as GPT-4. Just for your knowledge, I was an early red team member, and it hit me hard immediately. One of the very first things that I tried was setting up a dialogue between me and GPT-4 as my doctor. What you said is so true: evaluation of language models in general is hard, but it becomes particularly hard when you don't have expertise in the domain of interest. As a non-doctor with no real training myself, it's very difficult to do any sort of evaluation of the model's performance as a doctor. What I did was just attempt to recreate episodes from my own life. And I'm fortunate to have been pretty healthy, so I haven't had a ton. But I went in to see a doctor about this, saw my dentist about that, whatever, and I kind of recreated those little episodes. I found that for my probably pretty routine, certainly not the most difficult, cases, the performance I got from early GPT-4 was almost indistinguishable from my actual real-life conversations. And I was kind of immediately like, whoa, this is going to be a huge deal. It was really that, maybe more than anything else, that caused me to become even more obsessed and just drop everything else I was doing and focus on that red teaming thing for a while. But that's just an anecdote. So can you kind of characterize for us, I think people are probably going to be pretty surprised by what you'll say, but I'd love to just hear you characterize broadly: what are the state-of-the-art things? What have they achieved?
And then we can obviously get to your recent publication as well.
Neal Khosla: (25:13) Yeah, so there's the quantitative and there's the qualitative. We've also had access to GPT-4 for quite some time, and I'd say our anecdotal experience, as well as that of most of the doctors on our team, mirrors what you're saying. At this point, and I would not recommend this to anybody at home who doesn't understand the risks associated with it, that is primarily how I'm getting my medical advice. I hurt myself the other day working out, and the first thing I did was go on ChatGPT, pull up GPT-4, and type it in. With the understanding that it's not always right, it's easier and faster than trying to get in front of a doctor, and it's pretty damn good. We have a doctor who moonlights at Stanford, and he really insists on taking these tools into clinic with him. He's found that it's made him a better doctor, and he has said repeatedly that it does an incredible job. So I would say, yes, anecdotally, these things are really, really capable. In terms of the quantitative, there have been a number of folks testing these large language models on the US medical licensing exam, the USMLE. And I believe the paper that came out in January suggested GPT-3 was basically on par with an average physician. And now it exceeds the average physician, if I have my data correct. And we just published a paper a week or two ago that exceeded the state-of-the-art performance, which we had tweeted out, but I think Microsoft was the one who had formally reported it post GPT-4 launch. And so what we're seeing is that these systems are, by any objective measure, as capable as any physician. I think there's still a long way to go in terms of investigating. One of the challenges, as everybody in AI is seeing, is that these benchmarks don't really work or make sense anymore. They start being based off of assumptions that maybe don't hold for the world today.
One of the interesting things we've seen, for example, is that USMLE questions tend to be short and have short responses. They're little vignettes and cases that are more easily answered. And so some of the cases where these models struggle more may be on longer-form, more drawn-out reasoning. And that's where these things don't work as well out of the box. And you've got to build a lot of scaffolding, some of the things that came up, memory retrieval, reasoning systems that help them work over time. The short answer is that what we're seeing already is that, out of the box, they're pretty damn good and they've pretty much exhausted the current state-of-the-art benchmarks. And qualitatively, what we're seeing is that we believe these models can do as well as or better than the median doctor here in the US, but there's work to be done on building the scaffolding and proving that out in a more robust and rigorous format before you make any kind of egregious claims at scale.
Nathan Labenz: (28:36) Can we maybe unpack the nature of these questions a little bit more? I find that these numbers get thrown around so much, right? Most of our listeners will have taken the SAT presumably at some point, but I would venture that most have not taken, and I have not taken, the USMLE. So as part of your recent paper, Dialog-Enabled Resolving Agents, which we can unpack more as well in terms of how it works and why it works, I noted that you said you had exceeded the previous state of the art on these open-ended questions. Could you just give us a little bit of a sense of what those open-ended questions are like, just to ground the audience, and even me, in terms of what sort of questions we are evaluating language models on right now?
Neal Khosla: (29:32) So the type of question you get is, I'm reading one right now: a 67-year-old man with transitional cell carcinoma of the bladder comes to a physician because of a 2-day history of a ringing sensation in his ear. He received his first course of neoadjuvant chemotherapy 1 week ago, blah blah blah blah blah. The expected beneficial effect of the drug that caused this patient's symptoms is most likely due to which of the following actions? Inhibition of thymidine synthesis, inhibition of proteasome, hyperstabilization of microtubules, or generation of free radicals. And then the fifth option is cross-linking of DNA. That is the kind of question that is in this open-ended dataset.
Nathan Labenz: (30:21) So it's safe to assume nobody is lucking their way through the exam. Yeah, I think it's worth just taking an extra second here to reflect on the fact that, in some really important ways, these systems are superhuman. It's a weird shape, right? I'm always interested in the ways that reality diverges from our expectations or our shorthand. And I think one key way is that we're seeing superhuman things in some ways, but not in all ways. In all of my obsessive GPT-4 testing, I never saw anything where I thought, that is more brilliant than anything I've ever seen a human do. I never saw any single insight that was superhuman. But then you look at breadth, and you're like, man, this same thing can answer that question, and it can also do comparably well in law, and it can do comparably well in many of the professions, if not most of them. That is superhuman in and of itself. So that's just worth not glossing over, in my mind.
Neal Khosla: (31:31) One of the other things I'll say about that dataset, and the interesting step here, is that we took the step of basically stripping out the multiple choice answers. Historically, when these things have been tested, it's pick one of these 5 answers. We said, answer this question. And the model can actually answer it as an abstract concept, which makes the problem statement significantly harder: it doesn't just get to pick from a list. And then we introduced this metric, actually a few papers ago, called GPT Recall or GPT Precision, which is basically a way of asking the model whether an open-ended answer was the same as the reported answer. So if the model says high cholesterol and the answer is cholesterol over, I don't know, 200, those should be noted as the same thing. So you actually have to do this reverse mapping problem to figure out if the open-ended answer captured the spirit of the answer. And so this is one of the ways where these previous benchmarks start to get a little bit broken by these models, which are just becoming so capable at answering these questions.
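The judging step Neal describes here, asking a model whether a free-text answer matches the reference answer, can be sketched in a few lines. This is an illustrative reconstruction, not Curai's actual code: the prompt wording is invented, and `answer_model` and `judge_model` stand in for real chat-completion calls (deterministic stubs here so the sketch runs offline).

```python
def equivalence_prompt(question, gold, candidate):
    # Yes/no judging prompt: does the open-ended answer match the
    # spirit of the reference answer? (Wording is illustrative.)
    return (
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Candidate answer: {candidate}\n"
        "Do these express the same clinical conclusion? Reply YES or NO."
    )

def gpt_recall(examples, answer_model, judge_model):
    """Fraction of open-ended answers the judge model deems equivalent
    to the reference answer."""
    hits = 0
    for ex in examples:
        candidate = answer_model(ex["question"])  # free-text answer, no options
        verdict = judge_model(
            equivalence_prompt(ex["question"], ex["gold"], candidate))
        hits += verdict.strip().upper().startswith("YES")
    return hits / len(examples)

# Deterministic stubs in place of real model calls:
answer_model = lambda q: "high cholesterol"
judge_model = lambda p: "YES" if "cholesterol" in p else "NO"
examples = [{"question": "What does this lipid panel suggest?",
             "gold": "cholesterol over 200"}]
print(gpt_recall(examples, answer_model, judge_model))  # prints 1.0
```

Swapping the stubs for real model calls and averaging over a full benchmark split yields a recall-style score of the kind described.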
Nathan Labenz: (32:46) Yeah, I'm glad you mentioned that, and it's something that's been on my mind quite a bit recently as well. Going back to the red teaming thing, there were at least 2 instances of papers that were published during my personal 2 month red teaming window, which was September and October, where the conclusion published was basically language models still can't do X, for whatever X was. And at the time, I was like, well, I'm pretty sure GPT-4 can do it. So I spot checked, and sure enough, in the 2 that I checked, it was able to do the thing.
Neal Khosla: (33:24) I know exactly what you're talking about.
Nathan Labenz: (33:26) That sent me down a little bit of a benchmarking rabbit hole in that I started to think, well, how good are all these different benchmarks? And I found exactly what you found also, which is that if you set up your benchmarking script with a slightly dated paradigm, if you take a 2021 Big Bench script, for example, and you just run it, first of all, it's set up on a few-shot basis. And second, the structure of that few-shot setup is such that the model is basically forced to give you a multiple choice answer straight away. And as a result, its performance really suffers compared to what it would do, not even if you get really creative and do amazing prompt engineering, but literally if you just take away the few-shot structure, take away the multiple choice, and just present it with the question, you'll get way better answers than you do by using the established structure of the benchmarks. So as an aside, if there's any listener that wants to sign up for a little project, I think there are a number of recently published papers where I would like to go dig in and re-try some of the experiments with, honestly, just a more naive prompting strategy. So reach out to us if you want to do that. And I'm glad that you are on top of that and not falling victim to that pitfall.
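To make the contrast concrete, here is a hypothetical side-by-side of the two prompt shapes: the dated few-shot multiple-choice harness that forces an immediate letter, versus simply presenting the question. Function names and formatting are invented for illustration; neither is the exact Big Bench script.

```python
def legacy_mc_prompt(question, options, shots):
    """2021-style harness: few-shot demos whose structure pushes the
    model to emit a single letter right away, with no room to reason."""
    demos = "\n\n".join(f"Q: {q}\nAnswer: {a}" for q, a in shots)
    letters = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{demos}\n\nQ: {question}\n{letters}\nAnswer:"

def naive_open_prompt(question):
    """The 'naive' alternative: no demonstrations, no options."""
    return f"{question}\n\nThink through the case, then give your answer."

q = "Which drug action explains this patient's symptoms?"
print(legacy_mc_prompt(q,
                       ["Inhibition of proteasome", "Cross-linking of DNA"],
                       [("Example question?", "B")]))
print(naive_open_prompt(q))
```

The point Nathan makes is that scoring the same model through the first shape can understate what it does when given the second.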
Neal Khosla: (34:57) Well, I think it goes in both directions, right? So on the one hand, it can hurt the performance of the model. It can also inflate the scores by shrinking the problem. There's no scenario in real life in medicine where the doctor is given 5 options. If you want to compare these things to how a doctor would do in real life, is it ever the case that you have a patient in front of you and you have to guess one of 5 answers? No, that's never the case. You have a patient in front of you and you have to come up with the answer from the depths of your imagination. I don't know where it comes from if you're a human practitioner, but what we've noticed and felt is that measuring these things on multiple choice is just a bad benchmark because it makes the problem so restricted in domain. And especially for areas like biomedicine: biomedicine is a very open-ended thing. I mean, if you think about real patient cases, they're never straightforward clinical things. And we can talk more about the implications of this for language modeling, but you never have a scenario where you get a list of incredibly straightforward symptoms from the patient and then you just map it to some diagnosis. If you actually look at the history of AI in biomedicine, that's how people started trying to do it. They would come up with these big lists, these big graph structures: these symptoms map to these findings, these clinical findings map to these diseases. And the problem is the expressivity of those models, right? You have somebody who's lactose intolerant. What is the finding there? If you want to diagnose lactose intolerance, the findings have to be "ate milk" or, in some cases, "ate cereal," because they never tell you they ate milk. They just tell you they ate cereal. And this is how the human brain works. It can generalize these concepts to make the diagnosis. In real life, the world is really messy and open-ended. 
One of the things that language models have really unlocked in medicine is the ability to understand this broader context and represent a lot of clinical findings in very abstract conceptual terms and still be able to reason on them. So one of the things that we're a big fan of in these benchmarks is removing this artificial structure because medicine never has artificial structure. Patients come in in really messy scenarios and doctors have to adjust and treat them with very, very messy data.
Nathan Labenz: (37:25) So the upshot there is, in addition to my complaint that the multiple choice benchmark, when presented the wrong way to the language model, can lead to understating its performance, you're also making an equally important point in the opposite direction, which is that just giving it multiple choice answers is a far cry from the actual challenge in practice. And so that's why you've created this additional elaboration where you remove the multiple choice and then you do another language-model-mediated assessment to say, did it come up with the right answer on its own?
Neal Khosla: (38:05) Yes. I'll just leave it at that. Yes. I think that's very, very much true. I think the problem statements themselves, this is where we're starting to hit the ceiling of the benchmarks, the problems are controlled, right? What I read was a very controlled vignette and patients are messy. One of the things we found really messy about diagnosis and decision support in medicine is that it's an evolving thing. A patient comes in and you have one differential diagnosis. As you learn more about the patient, it changes. You're not just dealing with a snapshot in time. You have a moment and then you talk to the patient, you get more information and it changes. And then a week later, the diagnosis changes again and so on. So any model you build has to be able to dynamically reason and change over time. These benchmarks maybe don't do that. But anecdotally, the language models do really well.
Nathan Labenz: (38:59) Well, first, let's go to your research because you guys just published this paper and this is the perfect time to talk about it. So you're digging in on this benchmark. You've got GPT-4 access. Anecdotally and even quantitatively, we're finding that GPT-4 can do a lot straight out of the box. And now you've added this layer of dialogue enabled resolving agents, which reminds me of a couple of different things. The Socratic models paper was maybe the first one that cracked my consciousness that has a similar paradigm. But tell me how it works in this case. You guys have brought multiple models together or multiple instantiations maybe of the same model, and you're getting better results.
Neal Khosla: (39:42) So for this paper, we have 2 instantiations of the same model. And the concept is you give these cases to one of the models, which we call the decider, and then you have another model, which we call the researcher, that goes around and pokes holes in the conclusions that the decider is making. And it becomes this Socratic-style dialogue. Originally, before we published the paper, we called it student-teacher, and then for reasons, we moved away from that terminology. But it's a really clever way of getting 2 models to work together to come to a better set of conclusions. And I think the paper basically shows, across a variety of tasks, that this works to deliver state of the art performance in medicine. I mean, candidly, I'd love for somebody to try this kind of model and take it to things other than medicine, because I suspect it will work. It's a recurring theme that we're seeing right now in the AI world, which is GPT-4 is great, and it's great to create a prompt and give it a problem and see how the prompt does on a problem. But what seems to be even more powerful is setting up multiple instances of different agents and having them interact in complex ways. And we're still unlocking where that can take us as a paradigm. And in medicine, I think it makes a lot of sense. As I mentioned, so much biomedical knowledge is embedded in the latent knowledge base of a large language model. The question is how do you elicit it? And I think what we find in this paper is that skeptical questioning of the deductions or the conclusions made by the model can push you to a better resolution. And I just think this is a super, super exciting paradigm, and we're continuing to explore this as an area of research.
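The decider/researcher loop Neal outlines can be sketched in a few lines. The role names follow his description, but the prompt wording is invented, not the paper's actual prompts, and the two callables stand in for two instantiations of the same chat model (deterministic stubs here so the sketch runs offline).

```python
def socratic_dialogue(case, decider, researcher, rounds=2):
    """Decider proposes an answer; researcher pokes holes; decider
    revises. Both callables can be two instantiations of the same
    chat model prompted into different roles."""
    answer = decider(f"Case: {case}\nGive your best answer with reasoning.")
    for _ in range(rounds):
        critique = researcher(
            f"Case: {case}\nProposed answer: {answer}\n"
            "Challenge this conclusion: what is missing or inconsistent?")
        answer = decider(
            f"Case: {case}\nYour previous answer: {answer}\n"
            f"A skeptical colleague objects: {critique}\n"
            "Revise your answer if the objection has merit.")
    return answer

# Deterministic stand-ins so the loop runs without an API:
demo_decider = lambda p: ("revised diagnosis" if "objects" in p
                          else "initial diagnosis")
demo_researcher = lambda p: "Did you rule out alternatives?"
print(socratic_dialogue("67-year-old with ear ringing after chemo",
                        demo_decider, demo_researcher, rounds=1))
# prints "revised diagnosis"
```

With real model calls behind both roles, each round spends additional compute at inference time on critique and revision rather than on a single forward pass.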
Nathan Labenz: (41:40) You said it's the same model. Is it basically GPT-4 with different lenses on prompt engineering that you're then just bouncing back against one another? For this paper, yeah. That's amazing. Another thing that really reminds me of is one of the authors of the diplomacy paper that came out of Meta. Cicero. Yes. Cicero, the model that played Diplomacy, the game. He talked about the very general strategy of trying to bring more compute to bear at runtime. He was talking about how, if you went back to Deep Blue in the original chess days, basically what you had a lot of was deep search and a ton of compute running at runtime for each individual move. Just crunching through the trees of possibility. And some smart heuristics around where to truncate search and which trees are worth exploring and which not. But it was just a ton of compute at runtime. And then he contrasts that to today where he's like, by and large, the compute is all done in the training. And then at runtime, you're just predicting one token at a time, and that's maybe a million times less or something, maybe even more than that. And so his paradigm and what they did with the Cicero paper was they tried to figure out ways to bring more compute into the picture at runtime. They had a multi-part approach that included a constellation of models more than purely a language model. And you're doing something similar here where it's two summonings of the same language model into different roles that you can then place into dialogue. But effectively, you're multiplying the compute with a certain flavor on it. I wonder if you would add anything to that. And then I also really wonder about other kinds of models that might be added into this system. I imagine people have been talking forever. Well, the radiologists will be the first to go because it should be easy for AI to read a scan. We haven't seen that, but we also haven't seen GPT-4 multimodal deployed widely at all either. 
So, yeah, I don't know. Any thoughts on bringing computation to runtime and different kinds of models working together?
Neal Khosla: (44:12) Yeah. I mean, I'd say there's a couple of things. One, yes, I think this is a really powerful paradigm. In the old classical world, you'd call this ensembling, right? I think one of the really interesting intuitions, and why using two instances doesn't make sense to a lot of people, is that two instances of GPT-4 shouldn't have independent failure modes. And so theoretically, the model shouldn't get better. But what I think a lot of people are finding in research is the state space of the model is so large that what you're really trying to do is figure out how to elicit the right knowledge out of the model. And there was a paper that came out the other day, which I haven't actually read in full, but I'm very excited about it. People are starting to research this. It was about chain of thought prompting. Basically, they argued that reasoning is emerging from what I think they called the locality of experience. So it's sort of local clusters of variables in the model that influence each other. It's a super cool concept. I'm curious to see where this takes us, but I suspect that we will go really far with even multiple versions of one language model as these models get bigger and bigger, just through prompting. Obviously, you're seeing the Auto-GPT stuff. There's some differences there in that it's doing more coordination and orchestration, I would argue, but I think it's a lot of the same phenomena. Hey, one version of the model that's coordinating and orchestrating can prompt the model to do other things, and that is a very powerful paradigm. I think the other thing you're getting at is other modalities and other kinds of models. To date, what I will tell you is we don't feel that there is a tremendous amount of value in taking a worse performing model over the better performing models. Or I should say I don't. 
From my vantage point, these models are so large that you haven't hit the performance limit of using X versions of the same model in conjunction. That seems to be better than using one GPT-4 and one GPT-3 and one Bard and what have you, which is very counter to the intuition, I think, of a lot of machine learning scientists. It's counter to what my intuition was, but that's not anecdotally what I'm seeing. And then I think on the multimodal stuff, I think it remains to be seen how powerful this can be. But my suspicion is it's a really powerful paradigm. I actually was looking at a patient conversation this morning where the language model asked the patient to upload a picture. I was sitting there thinking about this. My suspicion is that GPT-4 multimodal, which I have not touched, is going to be able to synthesize information across these modalities, which will lead to a net improvement in the performance of these things. If you think about having a picture of a rash where the patient describes it as itchy and flaky, and you can also see that it's red, that is a lot of information that, if it can be synthesized and combined, is much more powerful than handling each channel independently: a model that can distinguish from the image that it's red, and then separately has the description in language that it's itchy and flaky. This is pure speculation because we haven't played around with these models, but I'd say what we see is the more diverse and larger the data you give any of these models, the better they seem to generalize, and we don't seem to have hit the ceiling there. So I'm very excited for throwing imaging data, for throwing whatever kinds of other multimodal data we can get. I mean, a human being theoretically listens to sounds to do diagnosis. One of the interesting things we found, though, that you might find surprising, Nathan, is that these models already generalize to other kinds of data really well. 
One of the things that really surprised us with GPT-4 is that it can interpret continuous glucose monitor data out of the box. So you take a glucose monitor and basically you can think of it as a graph of glucose versus time. And I suspect that the model never saw actual time series data of glucose, but it probably saw time series data and it probably read things about glucose. And so it can generalize to say at 1 PM, the patient's glucose spiked to 170, which was a sign that they ate a carb-heavy meal. And you sit there and you go, holy shit, it's never seen this kind of data. And all of a sudden, it can generalize to it and do a pretty damn good job. We didn't do super rigorous evaluations, but anecdotally, what we saw was it did pretty well. And I think this is a scary proposition for a lot of people who are talking about their data advantage. There are companies who have glucose data and say this is our advantage. I believe that these models are generalizing and as we feed them more multimodality, they're going to generalize even better to the point where out of the box, they're going to be able to do a lot of these things that people historically have thought of as their secret sauce. So that maybe took a different turn than what you were expecting, but I think it's a fascinating area and we're seeing it in medicine and I'd love to see where people are seeing it in other disciplines as well.
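The CGM example suggests how little plumbing that kind of generalization requires: render the readings as text and hand them to a general model. A hypothetical sketch of that rendering step follows; the function name and prompt wording are invented, and in practice the returned string would be sent to a chat model rather than printed.

```python
def cgm_to_prompt(readings):
    """Render raw continuous-glucose-monitor readings as plain text.
    The observation in the conversation is that a general model has
    seen time series and text about glucose, so a textual rendering
    is enough for it to spot events like a post-meal spike."""
    lines = "\n".join(f"{t}: {mgdl} mg/dL" for t, mgdl in readings)
    return ("Continuous glucose monitor readings for one afternoon:\n"
            f"{lines}\n"
            "Summarize notable events (spikes, lows) and likely causes.")

readings = [("12:00", 95), ("13:00", 170), ("14:00", 120)]
print(cgm_to_prompt(readings))
```

Given text like this, the behavior described is that the model can volunteer something like "at 1 PM the glucose spiked to 170, suggesting a carb-heavy meal" despite never having seen CGM streams as such.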
Nathan Labenz: (49:48) Yeah, that's a great one. Surprises have not stopped just yet, which shouldn't be too crazy because we're only on, I've started counting time from GPT-4 release. So we're on GPT-4 week 4 and day 1 in the new calendar. So there's still, I think, probably quite a bit in the vast surface area of these things that is yet to be explored and will continue to surprise us for a while. I guess I'd love to hear how some of this, you're starting to touch on some of the business questions, right? Moats, where do they come from? Does anybody have them? I'd love to get your sense for where you're headed there. It sounds like you're partnered with OpenAI to at least some degree where you had a preview of GPT-4. Do you expect that OpenAI is, to borrow a phrase, all you need for the foreseeable future? Do you think that will turn into becoming a Foundry customer? Foundry being their, as yet unconfirmed, but I think credibly leaked enterprise offering with robust fine-tuning that's coming soon. Do you think at some point you create your own models and go a totally different route? Maybe it's all of the above. But what do you think is the future of how you and Curai will use language models over the next couple of years?
Neal Khosla: (51:19) Yeah, so my general critique of everybody in this space right now is that everybody's trying to believe the things that are convenient. And I think the first inconvenient truth right now is that OpenAI is way better than everybody else, and it's not particularly close. I've played around with Google's models. I've played around with other models. I won't name other companies because I don't care as much about insulting Google, but they're not anywhere close. And that doesn't mean that can't change. And I really, I think everyone should hope that it changes, that there are good competitive forces. But as it stands, if you're talking about pure intellectual horsepower and capability, you are sacrificing pretty much no matter what, unless you're using OpenAI. I think it's really important for people to be specific about their problem statement and the problem that they're solving. I would argue for some of the content creation use cases, something like a Jasper, that the performance may be at a point where it's sufficient enough that you don't really need the latest and greatest model, and from a cost-benefit trade-off, it's probably not worth it. And in that case, it may make sense to train your own model or rely on open source or other models and combined models. We don't have a problem like that, so I can't speak to it at length. But that's my high-level understanding. For something like us, performance is absolutely critical in medicine, and so we definitely need to use GPT-4 and we need to build on it in pretty robust ways to be able to get the performance we want. And that's everything from, as I mentioned, safety and rigor and unit testing to good prompt engineering to guardrails to all sorts of algorithmic improvements. I'd say everything from there's people working on how do you increase the context window or the memory of the language model to papers like what we're publishing, which is how do you utilize agents to get better reasoning? I think those are all areas of research for us where we continue to push the envelope. But in many cases, we're relying on OpenAI as the base model. I think we are also really interested in this topic called goal-oriented medicine for us, which is much like Auto-GPT. Often in medicine, you have a goal. A patient wants to lose weight, they want to control their diabetes, et cetera. It's an open problem to figure out, can you direct a language model with that goal to then interact with the patient proactively in some cases to say, how do we work together to accomplish this thing over time? So for where I stand, the lowest common denominator still, from the language model perspective, continues to be OpenAI. Everything else is reasoning, safety, improvements, memory. There's so much work to be done to build an advantage on the non-core language model stuff. I think the idea that you're going to train a language model for a specific use case, depending on the use case, can be anywhere from correct but insignificant to completely delusional. 
I would say companies that are thinking about, hey, we have a moat. I just brought up the CGM example. I think that's a great example of how these models are starting to generalize in ways that your data probably isn't that valuable. And most importantly, you know this Nathan, these things are incredible few-shot learners. And so you can give them 10 or 100 or worst case, 1000 examples of a certain cognitive task and they tend to generalize incredibly well to that task. And so the idea that I have a million data points of thing X and therefore that's going to prevent other people from doing it, I don't know that that's a particularly robust viewpoint when if I can just have somebody, whether it's Scale or otherwise, manually label 1000 of these things and I can get 90%, 95% of your performance. It really is going to depend on you needing 99% performance for that to be a substantial advantage. And I'd argue for many use cases, that's not the case. And it'll end up coming down to your UX and your distribution and what have you.
Nathan Labenz: (55:46) I appreciate the candor. There's a lot of inconvenient truths, I'd say, right now in the AI space and a lot of denial going around in a lot of different directions. So I think the dose of realism is always welcome.
Neal Khosla: (56:01) One other thing I'll say is I would not underestimate for your use case how much generalization really matters. Domain specific models are another thing that I think are a little bit overblown. Unless you're worried about cost or latency, GPT-4 is probably going to beat your domain specific model. I brought up the example of lactose intolerance, but we had another patient case where the patient went to Burning Man and their lungs started hurting. How do you actually make a hypothesis about what's going on with this patient? Imagine if you're trained only on medical records or biomedical data, you don't have any idea what Burning Man is. It turned out the patient got dust in their lungs, and that's what caused the issue. But unless you understand Burning Man, the Nevada desert, dust in the lungs, or maybe there's also a rare fungal infection in the desert, unless you know that kind of world knowledge, you're trading off performance there. Domain specific models are another thing that I think are a little bit overstated in terms of their potential. I'd love for them to pick up, but I think unless it's data that the models never can or will see, a completely left field thing, I suspect your generalized models are going to be able to beat you or at least meet you on performance.
Nathan Labenz: (57:17) Life is big. The world is big. The world is messy. And especially for something as complex a system as our own bodies, the clues that are sprinkled into conversation can be so meaningful. I think what you're saying makes a lot of sense. It's very hard to imagine how you could interact with a patient effectively without that broader context. You might be able to score well on the USMLE with a domain specific model, perhaps, but it does seem in those real interactions where the context matters so much and these little hints, these clues, there is a lot of value to all of that. You might even call it a world model. I don't want to get into trouble for using that term out of order, but it does seem there's something very powerful there. So let's reel back into the present. We've outlined your big picture vision. It makes a ton of sense to me. It's incredible to realize that for the most part, it sounds like the core tech that we have today is able to support that vision and that there are some refinements, some engineering, some integration, savvy usage, guardrails, unit tests, all that stuff that still needs to be improved and vetted out. But if it boils down to the question of, do we have the core discoveries necessary to realize that vision? Sounds like the answer is basically yes. Complicate that for me if you think that's wrong.
Neal Khosla: (58:48) I would just say it's a very complex problem, but I agree with you. The raw natural language understanding toolset and toolkit, I think it's there. There's a lot of complexity in the problem that needs to be solved for.
Nathan Labenz: (58:59) So tell us where you are today. I have, by the way, gone to Curai and signed up and become a patient. My own very fortunate privilege is that I honestly didn't have enough medical needs to really get too deep into how it can help me, so thank good fortune, providence, whatever, for that. But I'd love to understand how much of this vision already exists. And then how are you thinking about getting there? This is a topic that Eric is definitely super interested in as well. How does the introduction of this technology begin to play with the social, regulatory, governmental, and legal systems? Medicine touches everything, or maybe everything touches medicine. So where are you today and how do you navigate a path through the thicket of current structure to get to that future vision?
Neal Khosla: (59:50) We have a direct to consumer product. You can go online, you can try it out. It's curaihealth.com. Patients can come online, they download our app, they can match with a doctor, they get ongoing access to care. It's $14.99 a month. But a lot of our focus now as a business is more on our enterprise customers, and that's health plans and provider systems that we're working with. The concept there that I basically say to folks is at this point, the cat's out of the bag. Over the next three years, most patients are going to start with ChatGPT for medical advice. I genuinely believe that. We saw it with Google. Everybody starts with Google for their medical information and advice, and we're going to see that exacerbated with ChatGPT. I think generally this represents an opportunity for existing health systems and health plans to get on this trend and in front of it instead of being reactive where we were 20 years ago, where people started coming in with printouts from Google. In this case, it's even more dangerous because you're going to have a lot of scenarios where people are going to self serve on ChatGPT and never go to the doctor. And that information may or may not be correct. It's really hard to guarantee reliability. I think there are other implications in terms of the business of these institutions where, for a health insurance company, if ChatGPT says to go to a neurologist, that's an expensive thing. And so we really want to make sure that you really needed to go to the neurologist before the patient goes and self books or self assigns themselves to a neurologist. We're spending a lot of our time and effort scaling up our partnerships with folks who say, we want to create our own consumer centric, AI centric version of accessing care that has humans in the loop, that has doctors who can provide oversight and actually close the loop in terms of providing convenience for the consumer. 
If the consumer needs medication or they need a lab test or what have you, we see that as a key role we can play. We can actually deliver medicine in this virtual format instead of just giving you information.
Nathan Labenz: (1:02:00) Does that mean today, if I have, say, the app of one of your health system partners, do you have this deployed where I can go talk to GPT-4 and have that whole interaction? And then that gets kicked off at some point to, okay, now you're going to talk to the human doctor that's going to review all that information. Is that all live today?
Neal Khosla: (1:02:21) Yeah. Our first health system partners are going live this year, and that is exactly the conception. We don't work with them, so I can say their name as a fake example. But like Stanford Healthcare, which is right here, you download the Stanford Healthcare app, you go on their website, there's a button that says Get Care Now or Talk to a Physician and you click on it and really you get GPT-4 first. And I shouldn't say GPT-4 because it's really this set of models, this system we've built on top of these large language models. And then we built the system such that it can appropriately triage you to the right kind of provider depending on what you need at the right moment. If you're having a medical emergency, the doctor can jump in and give you guidance or close the loop. So it's that concept of we give the patient 80% of their care and then the last 20% is coming from the clinician, especially in terms of the active decision of what to do. And we've built a bunch of tooling on the back end that speeds up the clinicians, providing them with automation and decision support. So putting together notes for them, automating the follow-up process of checking in with the patient after a visit, automating putting together a care plan for them as well. And those are some of the problems we've worked on. We're in an interesting state right now, Nathan, where we're actually, with the public launch of GPT-4, releasing and aging out old models and putting new ones in. So right now, if you actually download the app today, unless you're in the five percent of patients, you probably won't get much interaction with the language model because we're doing a slow rollout. But presumably three months from now, if you're listening to this (it was recorded in April), 95 to 100 percent of people will be receiving that kind of interaction directly with the language model.
Nathan Labenz: (1:04:15) Yeah. Again, it is amazing. Just the timelines are so short. It only came out a month ago. And I always remind people because of course the hype has also come up very quickly. But in a pre GPT-4 world, I think it was still reasonable, if not necessarily the right conclusion, to say, well, I don't know, I tried GPT-3 and it was pretty dumb still. And you're telling me this is going to change the world. And now we're in a moment where it's like, here's the real deal. It really is going to change the world. But it's only been available and still API access waitlisted, all that kind of stuff for not even a full month yet. So the timelines are just insane. So that's good to know. You've got the five percent deployed in your direct model, and then you're working your way up on that. What are you hearing from the establishment? What are the regulatory barriers that you think you're going to have to deal with? How big of a problem is HIPAA for you? Every time I feel like I do anything with a doctor, it always falls down on, I can't get the information out wherever I want it. It's always a pain. I'm sure that's a challenge for you, but how big of a challenge is that? And what do you think are the most interesting or difficult parts of that overall challenge?
Neal Khosla: (1:05:33) Yeah, I mean, one of the challenges with using GPT out of the box is that it's not HIPAA compliant. And so we've had to be intelligent about where we can utilize it, where we can utilize other models, and where we have to build our own stuff. Yes, it is a challenge. I know that Microsoft and others are working on this, and long term, they all know that healthcare is a high-value use case, so I'm sure this problem will get solved. But for now, it continues to be a bit of an obstacle. From the establishment perspective, what's amazing is how much ChatGPT seems to have changed everything. I never thought I would have CEOs of health systems and large health plans, some of whom we work with, asking, tell us about your work with ChatGPT and how is this going to change our business? We've seen a pretty massive acceleration in our business just because ChatGPT has made it digestible. I used to have to explain AI and what it can do and what it can't do. And now people just assume that everything can do everything because of ChatGPT. It's so abundantly evident that it's hard to argue against. You definitely see some more thoughtful critiques in the medical establishment pointing out where there are errors and hallucinations. And even though GPT-4 is the most powerful model, you still have to solve for all of those challenges. And things like explainability are really important when you're deploying to doctors. So how do you combine GPT-4 with other models so you can get interpretability and explainability? These are the types of questions we're getting from the establishment, but I'm not really seeing anybody say, hey, these models can't do it. And that's remarkable, because two years ago, a year ago, 95% of the establishment was saying these models can't do it. And that all seemed to change in a period of three to six months.
Nathan Labenz: (1:07:33) Are there countries that you think are best suited to take advantage of AI in medicine?
Neal Khosla: (1:07:39) I think there will certainly be a leapfrog effect. I don't know if you folks saw the survey data that came out showing an inverse correlation between optimism on AI and the wealth level of a country: poorer countries are just abundantly optimistic about this stuff. If you think about a country like India, there's a billion-plus people and just no ability to service their patients. One of the practical challenges in these developing countries is that the ways medicine is practiced, even coming down to things like the drug supply chain, are not necessarily embedded in these models and are actually quite nuanced and different. So you can ask your favorite large language model what you would prescribe this patient, and it will generally give you an answer that is acceptable to the Western world, highly indexed on the US especially. But when you go to India, it turns out they really only have a very limited drug supply chain, and so your answer becomes irrelevant. You can try to ask the model what you should prescribe if you're in India, and sometimes it can do okay, but other times it doesn't really have the knowledge of what that entails. And then there are other challenges like basic health literacy and just the ability to articulate what's going on with you as a patient. Those are all practical challenges that will get solved, and I think the regulatory environment will be a lot more favorable in those countries. But we operate primarily in the US, and I would say we're really optimistic about the environment in the US. Right now, we operate under tight doctor supervision. And so our model creates an opportunity for us because it's complex. It's like the difference between brewing a Starbucks coffee and building an entire Starbucks. Right now, we have to build the entire Starbucks. In India, you can just brew the coffee and hand it to everybody.
And so it's a little more complex, but building the Starbucks is higher value, and I think longer term it allows you to optimize everything: be able to collect our own data, do our own human feedback, do our own fine-tuning. Those are all things we're working on that allow us to gain pretty significant advantages. So I am still bullish on the US, because the establishment and the regulatory environment haven't done the thing that some of these other countries are starting to do and come out and say no. I think the mentality continues to be, let's see and let's be careful and let's have safety, but we're open to this stuff. China, I think, is a different thing. I have to imagine the Chinese government is just going to decide that, in six months, everybody's going to have a large language model in their hand and there are going to be no doctors. I'm only being partially facetious there. It's a little bit alarming, but I think they'll do really well with it long term. I think short term, there'll be some serious damage done if they do something like that.
Nathan Labenz: (1:10:42) You mentioned China and the ability to make executive decisions on something like this. I was wondering if you also had a point of view around systems like, say, the UK with the National Health Service or Canada, where there is this much more centralized decision making structure. Do you think those countries could be super early adopters or super late adopters, just depending on maybe some very idiosyncratic factors?
Neal Khosla: (1:11:15) I have to imagine that's true. We've looked a little bit at the UK; I haven't looked at Canada. My general concern is that these are such large bureaucracies that it's hard to move. The one good thing about the US is that if you want to get distribution in the US, yes, eventually you have to work with Medicare and Medicaid, and you have to work with UnitedHealth Group and Anthem and some of these super large players. But there are a lot of places where you can start, get some evidence, and prove out the model. That's a lot harder in Britain, where everything is under the flag of the NHS. Now, there is a private healthcare system that has cropped up in response to some of the shortages, and the need is really acute. Pretty much everything you read about the NHS says it's crumbling from the inside, so they need to figure out how to make it economically sustainable. There's no question in my mind that large language models can probably solve a lot of the challenges they're dealing with. But governments have to move slowly and carefully, and most large nationalized health systems are careful by nature. They might set up some innovation pockets, but it's not like this is the core of what they do. And so I'm less optimistic that this is going to break through in Britain. Britain does have some interesting examples on the mental health side. There's a company called Limbic that's doing a bunch of mental health triage using large language models. That's a cool model. And so I think there's some opportunity. The challenge comes as you move beyond areas where there's a clear shortage. Mental health is one of those areas where nobody is ever going to have the access they need, so you just need to find a solution. The problem is when you start to move more into core medicine, you get into territory-fighting and elbow-jostling that I worry is going to be harder. But I'd love to be proven wrong.
Nathan Labenz: (1:13:19) What do you think is the right way to regulate this kind of new paradigm? I've heard a little bit that the FDA is maybe going to think about a device regulatory paradigm. Does that make sense to you, or how would you think about that?
Neal Khosla: (1:13:36) I think very clearly the right way is: so long as there's doctor oversight, these things should be regulated as decision support, and the doctor is ultimately responsible for making the right decision. Longer term, if you want one of these things to operate autonomously, it needs to be done much like any other medical device. There's a company called IDx, and their IDx-DR product is a diabetic retinopathy screening system that has full-scale approval to take retinal images, diagnose the patient autonomously, and come up with a care plan, if I recall. And I think it's a good case study: you should be able to show the FDA evidence that you can handle certain kinds of clinical cases effectively and get approval to do them autonomously. But there should be a high bar for evidence, because I do think it's dangerous right now. In its current state, given what we see, these models are incredibly powerful. But releasing them to the average consumer and just letting them practice medicine would be a mistake. So I think right now the way to do it is to start with doctor supervision and then graduate to, hey, we can handle certain kinds of clinical cases autonomously when we get evidence, and then we can go from there. Right now, we don't really know what the ceiling of performance is. How close to perfect can these models get? What does it even mean to be perfect in a probabilistically uncertain environment where we don't really have perfect knowledge of the human body? Those are all open questions. And the interesting thing is that today, the way that's measured in humans is based on what's called the standard of care. Basically, if you get sued for malpractice, the question is what would the average doctor have done? And the issue here is that these models are incredibly good at doing the average thing. And so right now, they probably already meet that bar of what the average doctor would have done.
And so we're going to have to figure out some ethical questions about whether that is acceptable. To me, it should be. It should be the same standard because, unlike in the autonomous vehicle world, there's no perfect driving, right? There's so much ambiguity that I do think we have to compare it to what's standard of care and what's average. And so I think long term, that's going to be the question when you think about regulating these things: what is sufficient? And I suspect there's going to be a bunch of lobbying in all directions about where this ends up. But from my vantage point, these things have a lot of potential to improve access to care. And if they can replicate what your median physician is doing, that is something that the FDA and others should take seriously and say, hey, we can totally invert the supply-demand curves of medicine in this country.
Nathan Labenz: (1:16:39) How do you think about competing against AI first systems that just do their best and let the patient decide what to do from there? We'll have a couple of them coming on.
Neal Khosla: (1:16:48) The main thing, Erik, is that those systems can't actually provide the utility to the patient. And the utility is in actually giving them a prescription, modifying their medication, giving them a lab test, interpreting those results. Those are all things that need to be done by a doctor in this country, at least from a regulatory perspective today. A lot of people will just go to ChatGPT and get a recommendation, and if it says rest and ice for the next week, great. But if it says start taking rosuvastatin, it can't do that for the patient. So ultimately, I think the advantage comes in completing the job to be done for the patient. And then you can stick with them over time and keep doing the job, because, like I said, medicine evolves. You're not solving one problem at one moment in time.
Nathan Labenz: (1:17:37) You mentioned there's going to be this regulatory battle. I'm honestly surprised by how slow that has been to ramp up. And we also talked a little bit about the possible leapfrogging. One really interesting claim that is starting to become credible now, though I think it's right on the border, is that there's maybe an obligation to deploy these kinds of systems. And you could say, yes, certainly in the US, you can't get a prescription filled without a human doctor's signature or whatever. But there are a lot of places where that's not so true, a lot of places where you can buy anything you want at a pharmacy. If these systems are good enough and the alternative is so weak, do you see an argument for just saying, hey, this stuff really ought to be deployed, even if it's still imperfect, even if we haven't figured out all of the guardrails, even if we maybe haven't gotten to a level of safety that would pass an FDA review? Just internationally, for the billions of people that don't have the standard that we have, I can see a pretty compelling case for saying we should put that out there now and accept that there will be some downsides and some harms, but ultimately the benefit may dramatically outweigh them. What's your take on that argument?
Neal Khosla: (1:18:58) I certainly think in parts of the developing world where there's basically zero access to care, the answer is absolutely. There's a guy named Raj Panjabi who has a project called Last Mile Health. If I recall, what they do is train high school graduates in how to do community-based primary care. So you're talking about parts of Africa where you have people who are nowhere close to a doctor being trained along the lines of: these are the ten most common things you run into here, it's malaria, it's whatever, and we're going to teach you how to practice medicine for those ten things. And you don't have to be particularly trained at all. In this way, they can expand access to care pretty massively. There was also a story about Bloomberg training high school graduates to do C-sections at one point, I think. It's just an example: you're talking about parts of the world where access to care is so poor that these kinds of systems are an absolute must. It does go back to my point that the real problem to be solved is how you adjust to the cultural and biomedical norms of a region. That is one thing that will need to be solved to do this. But yeah, I think there is a moral imperative to get these things to a lot of people ASAP.
Nathan Labenz: (1:20:18) I think that's a great concluding note. I can just give you three kind of real quick rapid fire questions and you can give me as brief of answers as you like on these. First, any AI products that you use beyond the obvious ChatGPT that you would recommend that the audience try out?
Neal Khosla: (1:20:39) Oh, man. I've been playing around with a couple of these Sheets-based things for spreadsheets, but I'm not advocating for any of them super strongly. I've been playing around with some of these YC companies doing what I call RPA, Zapier clones using AI. I think those are super cool. There's one called Layup that I liked. I don't have a good answer. I'm more playing around than having anything embedded in my habits right now.
Nathan Labenz: (1:21:10) Honestly, this did not start as a trick question, but you're in very good company with that answer. One of the biggest takeaways I've had from this series of conversations has been that there just aren't that many applications adding much value on top of the core model right now. I do think integration with Sheets, as you said, will make a ton of sense. It'll be way more convenient when it's in the sheet directly than flipping over to ChatGPT.
Neal Khosla: (1:21:38) The sad thing is I'm trying to learn how to use these things too. I have to teach myself. I know there's some way to look at what I do in a day and be like, this can be GPT-ized and this can't. But the only thing I can do is put it into GPT right now. That's all I do, and everything else is, I'm trying to learn these new products.
Nathan Labenz: (1:21:59) You're in good company. We're all learning in real time here together. All right, second quick hitter. Hypothetical situation, and I'm especially interested in your take on it through your medicine lens here. Let's imagine a future world where a million people have the Neuralink implant, and now it's general availability. If you get one, you have thought to text or thought to UI control. Essentially, you can use your devices and transmit information to your devices straight from your thoughts. Would you be interested in getting one?
Neal Khosla: (1:22:37) I have long said that the product that I want is the thing that records my thoughts while I'm falling asleep because that's when I have all my best thoughts. There's a particular term, it's a hypnagogic state that your brain gets into. I forget the exact word, but you can actually coerce your brain to be in this state, which is a separate thing. But the direct answer to your question is, I don't know. And I think at a certain point, it's very clear to me that in the next 10, 15, 20 years, whenever it happens, we're reaching a point in society where life is going to be more about understanding what spiritual fulfillment means to you. And it's not going to be about optimizing your performance or whatever. Or maybe that will be what it is. But there's going to be so much infrastructure that cognitively can do so much for us that you have to decide what do I want my experience in life to be? Do I want to live in VR or the metaverse? Do I want to just go out in nature and hike? Do I want to have some neural implant? I just don't know. And I think the sad thing about being an entrepreneur is I probably don't have time to think about those things, even though I'm working directionally on some of them. So my sad answer is I kind of need to know more about spiritually what I want out of the next 70 years of my life, or maybe it's 170, I don't know, with this longevity stuff before I can say, yes, I need the implant now. On its surface, I want anything that enhances my life. But I think what we're finding is these things have such, even with our phones, they have these very unpredictable second order effects on how we live. And it's a little scary in that regard, but I'm also super excited about it.
Nathan Labenz: (1:24:28) Well, you kind of anticipated the last question there, which is just zooming out as much as possible, and you're just starting to do that. What are your biggest hopes for and fears for society at large as we begin to feel the impacts of AI over the rest of this decade?
Neal Khosla: (1:24:45) Yeah, I'm a pretty staunch capitalist, but I do believe that we are going to have to figure out how to redistribute wealth, or at least redistribute prosperity. It's not even clear to me that classical economics is going to totally hold up the way it has historically. I do think there's going to be an abundance of wealth created. I'm super optimistic, but I think this goes wrong if we turn into a dystopian, a-few-people-control-the-world kind of thing. Or if we try to stop these things, because you won't stop it. Countries like China and Russia will build these things, and I think the world gets dystopian maybe in a different way. And so for me, if I look out 20, 25 years, I think the big question for society is, how are we going to get people to shift from a scarcity mindset to an abundance mindset as we build it? If we can do that, then we can build a world where people are really happy. Then I think there's a separate question of, how do we as humans finally shift to a world where people don't have to work, they don't have to do stuff just to get by? I remember listening to Bill Gates talk at one point, and he said if the purpose of a human being is to be a hamburger chef, that's a pretty depressing existence. A lot of people worry about work displacement and stuff, and I think there are good reasons, because we need everybody to share in economic prosperity. But I don't think anybody aspires, SpongeBob aside, to be a fast food chef. And so I think it's a net good thing. The problem is we don't have answers for what it means and what the implications are for how people should spend their time. Until we figure those things out, and if we don't, we're sort of in trouble. So that's what I'm hoping for over the next 25 years: people realize there's so much good that's going to come. There's a version of the world in 25 years where people can live forever.
Energy is free, intelligence is free, there's just abundance created everywhere. And the problem we're worried about solving long term is how we then expand and colonize more and more places and extend our presence in the universe. That's a very exciting version of the world, where people are free to play music and games and socialize and do all the other things that give them joy and meaning in life. But getting there is going to require a lot of societal restructuring. So personally, I'm really optimistic, and compared to the average person in Silicon Valley, I do believe that we have to work with existing governments, and I'll call it the public side of the world, to make this thing happen. Otherwise, we're just going to build these things in isolation, and they're going to be more destructive than anything, which is why I have the perspective I do on regulation: yes, I think if we let AI run wild in medicine, it's going to be a problem, but otherwise we could really benefit from this stuff if we collaborate. I just talked for way too long. I rambled a lot today, but this was awesome.
Nathan Labenz: (1:27:59) We love it. Neal Khosla, thank you for being part of the Cognitive Revolution.