Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

Karan Singhal, Head of Health AI at OpenAI, discusses ChatGPT Health’s physician-level performance, the HealthBench evaluation and clinical trials, and how safe, multimodal, personalized AI tools could integrate into global medical practice.


Show Notes

Karan Singhal, Head of Health AI at OpenAI, explains how ChatGPT Health is achieving attending-physician-level performance and already serving hundreds of millions of users. He details how OpenAI works with over 250 doctors, built the 49,000-criteria HealthBench evaluation, and ran one of the first randomized trials of AI copilots in clinical care. The conversation explores privacy and safety safeguards, medical multimodality, N-of-1 treatment plans, and how AI could become a standard part of global medical practice.

Use the Granola Recipe Nathan relies on to identify blind spots across conversations, AI research, and decisions: https://bit.ly/granolablindspot

Sponsors:

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Serval:

Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive

Framer:

Framer is an enterprise-grade website builder that lets business teams design, launch, and optimize their .com with AI-powered wireframing, real-time collaboration, and built-in analytics. Start building for free and get 30% off a Framer Pro annual plan at https://framer.com/cognitive

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) About the Episode

(06:11) Cancer story and mission

(11:46) Designing safe health AI (Part 1)

(17:49) Sponsors: Claude | Serval

(21:09) Designing safe health AI (Part 2)

(26:48) Uncertainty, HealthBench and robustness (Part 1)

(30:23) Sponsors: Framer | Tasklet

(32:50) Uncertainty, HealthBench and robustness (Part 2)

(38:11) Chain-of-thought and evaluation

(46:49) Real-world performance and frontiers

(55:35) Multimodal data and science

(01:05:36) Personalization, privacy and monitoring

(01:15:47) Models, data and incentives

(01:29:31) Doctor adoption and workflows

(01:38:13) Scalable oversight and alignment

(01:51:06) Move 37 and future

(02:00:50) Episode Outro

(02:03:06) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk



Full Transcript

(0:00) Nathan Labenz: Hello, and welcome back to the Cognitive Revolution. Today's episode is brought to you in part by Granola. To help new users experience the power of the Granola platform, Granola is featuring AI recipes from AI thought leaders, including several past guests of this show. There's a Replit recipe that converts discussion notes to an application build plan, a Ben Tossell recipe that creates content production plans, and a Dan Shipper recipe that looks across multiple sessions to identify cultural trends at your company. My own recipe is a blind spot finder — it looks back at recent conversations and attempts to identify things that I might be missing. This has already proven useful in the context of contingency planning for my son's cancer treatment. And as I use it more and more, it's getting better and better at suggesting AI topic areas that I've neglected and really ought to explore. See the link in our show notes to try my blind spot finder recipe and experience how Granola makes your meeting notes awesome.

Now, today, my guest is Karan Singhal, who leads Health AI at OpenAI and who was just named to the Time 100 Health List for his pioneering work. This episode began to come together last year on Thanksgiving when I emailed Karan — who I'd met a couple times at AI events — to say thank you for all of his work on AI for health and to let him know what a difference ChatGPT had made for me and my family in the context of my son's cancer diagnosis. As it turned out, that was just as OpenAI was preparing to make a major product push with ChatGPT Health, which allows users to connect ChatGPT to data sources including electronic medical record systems and consumer wearables, plus a physician-facing ChatGPT for Healthcare — both launching in early 2026.

In this episode, we dig into how Karan and his team have achieved attending-physician-level performance with their latest models, their plan to ensure that this capability does benefit all of humanity, and their vision to raise not just the floor but also the ceiling of human health with continued research and even better models to come. Highlights of this conversation include how OpenAI works with more than 250 human doctors to ensure accurate, robust, and culturally appropriate responses; how they built HealthBench, which contains some 49,000 evaluation criteria to measure model performance; and how models have gone from GPT-4o's 0% score on HealthBench Hard when the benchmark was first created to roughly 40% today. Plus an overview of my experience using large language models to navigate a health emergency, including the critical importance of giving models as much context as possible on your situation, and how that's about to get dramatically easier as ChatGPT Health rolls out globally.

We also discuss how 230 million people are already using ChatGPT for health questions on a weekly basis; the first randomized trial of AI copilots for physicians, which OpenAI conducted with Kenya's Penda Health system and which did show a statistically significant improvement in outcomes for patients whose doctors used AI; and why Karan believes, based on the reception that OpenAI is getting from health systems, that 2026 will be the year that using AI becomes a standard part of medical practice.

From there, we go on to cover the steps that OpenAI is taking to ensure privacy and security of users' health information; how they're using worst-of-N measures to make sure models first do no harm, while at the same time striving to maximize value by training AIs to acknowledge their uncertainty as they offer their best guesses; how Karan understands the relationship between AI for health, AI safety plans such as scalable oversight, and AI alignment more broadly; Karan's report that OpenAI's models' chain-of-thought reasoning has not drifted toward neuralese as much as some reports had previously caused me to believe; the future of medical multimodality, which will do a much better job of converting data to value and which inspired me to buy a Whoop wristband to start collecting data on myself; and the compounding effect of parallel advances in AI for science, the growing potential for n-of-1 treatment plans and medical Move 37s, and the possible need for an update to the rules governing access to experimental medicines and information sharing.

Finally, Karan describes OpenAI's utopian plan to make ChatGPT Health available to all users globally for free with no ads — an early form of universal basic intelligence that I really think everyone ought to celebrate as a triumph of human ingenuity and goodwill.

Zooming out: in the grand scheme of AI development, I think it is fair to say that we have far more questions than answers, and in my mind, all outcomes from a post-scarcity utopia to literal human extinction absolutely remain on the table. I signed onto the recent call for a ban on superintelligence because I do worry that an AI arms race driven by recursive self-improvement loops could easily get out of control. And yet, at the same time, capabilities like this — which have been so valuable for me and my family, and which will undoubtedly save millions of lives in the coming years — are for me both an incredibly inspiring accomplishment and a practically irrefutable argument for the upside of AI.

The question at this point is not whether we will create powerful AI systems, but exactly what form they will take and under what circumstances and incentives they'll be developed and deployed. This conversation demonstrates that for the moment at least, we can have it all: AI systems meticulously crafted to minimize downside risk, which are both capable and efficient enough to meaningfully improve the human condition globally. There is a ton of work left to be done both inside and outside of the frontier companies to make sure that these lofty standards don't slip in the face of intensifying competition. But today, if you or a loved one are facing a complex health challenge, you owe it to yourself to take full advantage of the incredible medical expertise that Karan and others have managed to build into systems like ChatGPT Health. With that, I hope you enjoy this inspiring look at the frontier of medical AI with OpenAI's head of health, Karan Singhal.

(6:11) Nathan Labenz: Karan Singhal, head of health AI at OpenAI. Welcome to the Cognitive Revolution.

(6:17) Karan Singhal: Thanks for having me.

(6:18) Nathan Labenz: I'm super excited about this. It's rare that a guest's work has had as much impact on my life as your work in health at OpenAI — and at Google previously — has had over these last three months. Regular listeners know the story: my son got cancer. I've been an intensive user of all the frontier language models over the last three months to advise us as we've gone through this process, and they've been a game changer. Thank you for all your hard work and for making the mental health side of this equation dramatically better than it otherwise would have been, and also for moving the needle on my son's treatment and our confidence that we were actually doing the right thing for him. It has been invaluable, and the consumer surplus has been off the charts.

(7:13) Karan Singhal: Amazing. Thanks for the kind intro, and thank you for sharing that story. I think it resonated with a lot of people, so thank you for that.

(7:19) Nathan Labenz: A big takeaway of this conversation, I think, will be: if you find yourself in a medical emergency — or even just want to do a better job of managing your health in general — the frontier models today are getting really good at that. I always go to the example of the AI doctor. There's obviously a relative scarcity of medical expertise even in a wealthy country, for a privileged person like myself in the United States. You broaden your worldview and look around the world, and the shortage is extreme. I've always felt like this would be just an absolutely killer use case that everyone could agree on. When I started talking about it a couple of years ago, it felt like it was getting close but still a ways off. I've done a bunch of episodes over time with Vivek and some of your former teammates at Google DeepMind who work on similar topics, and they've always been appropriately cautious — yeah, it's not quite there yet, we're not quite ready to roll it out to production, but encouraged by the progress, that sort of thing. But it really is getting there. The first question I wanted to ask is: how did you get into this, and what were you expecting? How big of a dream were you daring to dream when you first got into AI for healthcare some years ago now?

(8:34) Karan Singhal: Yeah. I started working on AI for healthcare about four years ago, transitioning to doing it full time around that time. I was thinking a lot about a few fundamental research problems — foundational work in representation learning, privacy-preserving learning, and interesting applications in healthcare. The background for all of this was a conviction I'd had since undergrad that AGI would be a pretty big deal and that it would probably happen within our lifetimes. And I thought there were probably two things I could do to make that better: one is to work on safety, and the other is to work on benefits. I saw healthcare as the most obvious area for benefit, like you. And like you said, we've been on this amazing exponential over time — both with model capabilities and, more recently, with people's adoption and the Overton window shift around trust in these models. We're seeing a lot of people start to use these models across individual patients, individual clinicians, researchers, and seeing a lot of the benefits become much more tangible.

All of this comes down to, for us at OpenAI, thinking about what it means to make our mission real. Our mission is to ensure AGI is beneficial for all of humanity, and there are three parts to that: one is build and deploy AGI; the second is prevent downside risks, whether short-term or long-term frontier risks; and the third is about how we can make benefits happen. And I think health is one of the most obvious and tangible ways those benefits can materialize.

When I started working on health — around 2022 was when it became full time for me, right before ChatGPT — my thinking was around a capability overhang between where LLMs were at the time and how they were being adopted and thought about in the clinical AI and healthcare worlds. You were scaling these models up, seeing that you could instruction-tune them and they could do amazing things. My ambitions at that time were really two things: first, get the healthcare and clinical AI world to think about LLMs as something that could actually work — and again, this is prior to ChatGPT — and second, think about the work we would need to do on safety and reliability to make the models trustworthy for this setting. Over time, those early ambitions came to seem less and less ambitious, and my goals at OpenAI grew larger. We started out with three goals. The first is to make access to medical expertise more universal. The second was thinking about how we can use this setting as a way to ground our work in safety and alignment. And the third was thinking about how we can bring society along with this high-stakes technology — working with partnerships, rolling out products, working with policymakers to think about the right ways to iteratively deploy in a setting like this. Two years ago, these things sounded really ambitious. Now we're kind of feeling like they aren't ambitious enough, so we're really excited about what's to come.

(11:46) Nathan Labenz: You did a great job there of laying out a taxonomy and a scaffold for this conversation. Let's maybe talk about capabilities first, but they are sort of inseparable. From a capabilities perspective, is there also a Hippocratic Oath kind of mindset that you bring to the table — focused on making sure the AI performs well in medicine — that is distinct from the bigger-picture safety agenda that motivates you? How do you think about the way the model should perform in terms of doing no harm? Because I think GPT-4o, while it did add value, could also definitely do some harm by giving you wrong ideas. I do think we've come a long way since then, but I wonder how you think about that challenge.

(12:37) Karan Singhal: 100%. The way we think about the health work at OpenAI, we've been operating in three phases. One is laying the foundation, and a lot of that has been around safety research — making the models not just better reasoners, but also better at having good bedside manner, conveying uncertainty well, and knowing when to escalate to a doctor when needed. More recently we're in a phase of adoption: as the foundations have solidified, a lot of people have started using it. It's been one of our fastest-growing use cases. We shared recently that over 230 million people a week are using ChatGPT for various health and wellness-related queries. And this year we're really focused on scaling the impact of that work.

So your question about how we imbue models with the right sense of how to behave — how do we ensure minimal harm? We were very thoughtful about this as we were laying the foundation and starting to work on health at OpenAI. A lot of it comes down to our close partnership with a cohort of about 260 physicians that we've been working with for roughly two years. One way to imbue models with certain behaviors is to write a spec from first principles — you or I just sit down and say, "this is how models should behave, they should say X, Y, Z in this situation and something else in another." There are pros and cons to that approach. A pro is that it's easy to explain and understand what's going on. A con is that it's very hard for you or me to specify what should happen in every scenario, and saying "do no harm" is an excellent principle but it doesn't tell you what to do in most scenarios. What you want to do is find a way to move from large-scale principles to what you do in very specific situations — and make sure that's guided by the expertise of hundreds of experts, not just one or two people.

That's essentially the approach we've taken, and you can see it in our approach to evaluation. Back in May 2025, we published HealthBench — an evaluation of how large language models perform in realistic health conversations between users and models, where users could be either lay users or health professionals. We built it by leaning on the expertise of these 250-plus physicians rather than writing specs from first principles. I can explain more, but I'm sure you have many more questions too.

(15:38) Nathan Labenz: Yeah, go on as long as you'd like. I want to double-click on the physicians for a second. What's the nature of the relationship with them? I could imagine anything from side-by-side RLHF-style preference labeling to some of them being much more deeply integrated with the research team.

(16:04) Karan Singhal: We have about three different layers of physician expertise, and we were very thoughtful about how to bring that expertise into our team in a way that balanced and combined well with the research expertise we have internally. The first layer is high-level advisors — more informal relationships, people who help us with strategy, share our roadmap, things like that. The second layer is folks we work with more closely. You can think of it as a human data operation, but with more active collaboration. We're Slacking with them all the time — they're not just going off and doing tasks, we ask them for advice constantly. We have a Slack community where we're actively interacting with them. Some of what they do is comparing model outputs, red-teaming model outputs, looking for ways in which we might have blind spots today and flagging those so we can prioritize them in the future, testing new products, things like that. For example, the work on ChatGPT for Healthcare, which we announced on January 8th — we had red teamers test that product over nine waves across six months, in close collaboration with this physician community. The third layer, at the top of the pyramid, is really close advisors who work most directly with our team. They're the ones channeling the collective voice of these hundreds of physicians, interfacing most closely with our research team, and translating all of that into evals and model training data so that we can then improve our models.

(17:44) Nathan Labenz: We'll continue our interview in a moment after a word from our sponsors.

(17:49) Nathan Labenz: One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks — challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple of years, saving me countless hours. But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you — whether you're debugging code at midnight or strategizing your next business move. Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. The result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests who booked a slot. I do rewrite Claude's drafts — not because they're bad, but because it's important to me to fully stand behind everything I publish. 
But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering and allowed me to focus on understanding our guest's recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all the features mentioned in today's episode.

Your IT team wastes half their day on repetitive tickets — password resets, access requests, onboarding — all pulling them away from meaningful work. With Serval, you can cut help desk tickets by more than 50%. While legacy players are bolting AI onto decades-old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be — it often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest-growing companies in the world like Perplexity, Vercel, Mercor, and Clay. Serval guarantees 50% help desk automation by week four of your free pilot. Get your team out of the help desk and back to the work they enjoy. Book your free pilot at serval.com/cognitive. That's serval.com/cognitive.

(21:09) Nathan Labenz: We talk about channeling the voice of the physician and calling back to the Hippocratic Oath. One thing I find kind of frustrating about my experience in the medical system recently is that I think there's a little too much emphasis on "do no harm." And this connects in a pretty deep way to questions about how we should conceptualize what to tell our AIs to do. We have very detailed, Talmudic rule sets as one extreme — for every corner case, we try to map out what you should and shouldn't do. Then on the other end, there's the Anthropic Constitution, which has at least demonstrated that you can get pretty far with something less rule-based and more about trying to teach the model to have good character and use good judgment across all the situations it finds itself in.

My critique of the human doctors I've engaged with — who have generally served us really well — is that they definitely want to do no harm, to the point where they're sometimes too reluctant to engage with a hypothetical or to act on something that makes sense. I do have one good friend who's a doctor and who happens to have had the same or very similar cancer to my son, and he's notably behaved differently with me in private one-on-one. He's like, "Look, I don't need a randomized controlled trial to tell you that makes sense." You don't get that too much when you're actually at the hospital. There's a reluctance among physicians — certainly in my experience, and I think it's a pretty commonly shared perception — to act on things that make sense because they aren't guaranteed to work out. Biology is super messy, there's a ton of diversity, and there's just a lot we don't know. So how do you think about that challenge? And I wonder where you guys want to land in terms of only adhering to the most rigorously defensible advice versus doing a little more of the Amanda Askell thing — being willing to take some risk to help people sometimes, and maybe having the models do that too.

(23:42) Karan Singhal: Yeah, it's a great question. You're pointing out problems on two sides of the healthcare ecosystem. One is as a patient — you have this challenge of needing to advocate potentially pretty hard for yourself. In your son's story, you pointed out a couple of false starts where you saw a doctor, they said it was probably normal, you had an abnormal blood test reading a couple of times, and then they were just kind of like, yeah, it's probably okay. This is the kind of experience that patients often have when they're having something that they feel or know to be an issue — needing to advocate for themselves and feeling like their doctor isn't hearing them. And on the clinician-facing side, you have this challenge of medical evidence increasing rapidly. It's very hard to keep up with the latest of what's going on. You're overloaded with documentation, and there's a bunch of burden there. Doctors are human too. So there are challenges on both sides.

And so you have this amazing thing, which is AI that is able to do a few things. You have AI that's able to talk to patients, understand their concerns, integrate knowledge and information across both their previous history and the latest medical evidence. One of the things that these models are obviously very incredible at these days is taking in a huge amount of health context — not just on you, but also on the latest medical evidence — and integrating that all together into one context to do something that I think is very difficult for a human to do. And then on the physician-facing side, you have, again, that same capability to integrate information and assist the physician.

That's a lot of why we're doing this work — because we see this gap between where the models are at and how people are using them, and we think that's really important with our upcoming products. To get closer to answering your question, you mentioned where do we see models navigating places where there's potentially a lack of medical consensus or where physicians would disagree about what to do. This is pretty fundamental to our approach. This is why we don't have one or two or three experts determining what the model's outputs are — we have a pretty multi-pronged approach. A lot of this comes down to presenting information to the user, but being sure to present uncertainty when uncertainty exists.

I mentioned the safety research motivation for a lot of the work that we're doing. One of the directions we've been exploring is whether these models are well-calibrated in their uncertainty, and can we get better at having these models verbalize their uncertainty? For example, if there are three to five potential paths for your son's next treatment, a doctor might, just for the sake of simplicity and clear communication, focus on one. A model can potentially communicate all three to five, but mention that the state of the evidence may be somewhat limited and flag the caveats. One of the things we've been investing in is, first, can models become better at understanding their own uncertainty, and can that be something we measure and improve? And second, can they verbalize it better? This is, I think, a big part of striking the right balance between being more aggressive in sharing potentially early results or early evidence and being overly conservative. That's kind of how we think about that.

(26:49) Nathan Labenz: What are you seeing in terms of trying to get the models to understand their own uncertainty? I remember that famous graph from the GPT-4 model card where the pre-trained model seemed to be much more calibrated with respect to its own uncertainty than the trained model. We've only scaled post-training since then. I haven't seen an update to that kind of research in a while, but it seems like there was at least a fundamental challenge opening up there with respect to that introspective self-awareness of how confident a model is in its answer. Is that something you guys have solved? Is that why I haven't seen that sort of graph in a while?

(27:34) Karan Singhal: Well, I think there's a measurement challenge. The plot you're referring to was essentially: given the next token for a multiple choice question, did the model's probability of that next token correspond to how likely it was to actually be correct in choosing A? What you're seeing now is that it's become more difficult to measure that for two reasons. One is we have higher expectations of our models than answering multiple choice questions, so it's hard to say when they're correct or not correct. And the second — this is a little bit more technical — the models now emit reasoning tokens or thinking tokens between initially processing something and their final answer. The result is that you can't ask the model for the log probability of the next token being A in the same way you would before.
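To make the calibration measurement Karan references concrete: the idea behind the GPT-4 model-card plot is to bucket answers by the model's stated confidence and check whether each bucket's average confidence matches its empirical accuracy. The sketch below is a generic illustration of that technique, not OpenAI's code; the toy data is invented for the example.

```python
def calibration_bins(confidences, correct, n_bins=10):
    """Bucket predictions by stated confidence and compare each bucket's
    average confidence to its empirical accuracy. A well-calibrated model
    has the two roughly matching in every bucket."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, ok))
    summary = []
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            summary.append((avg_conf, accuracy))
    return summary

# Toy data: four answers given at 25% confidence (one correct), ten
# answers given at 95% confidence (nine correct).
confs = [0.25] * 4 + [0.95] * 10
right = [True, False, False, False] + [True] * 9 + [False]
for avg_conf, acc in calibration_bins(confs, right):
    print(f"confidence {avg_conf:.2f} -> accuracy {acc:.2f}")
```

For a base model scored this way, the per-bucket points lie near the diagonal; the model card's observation was that post-training pushed them off it. As Karan notes, reasoning tokens make the confidence input itself harder to extract, which is the measurement challenge.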

There are a couple of things you can do to handle this. One is you can go in the direction of richer ways of measuring whether a model is correct or doing the thing you want, rather than just measuring log probability of a certain letter. The second thing you can do is repeatedly sample from a model and then see whether performance stays the same or degrades across samples. Our work on HealthBench is a good example of doing both of these things at the same time. In HealthBench, we did this work around measuring not just one or two or three different aspects of model performance in health, but across 250-plus physicians and across 5,000 conversations, measuring about 49,000 different axes on which model performance could differ. Part of this is whether the model expresses uncertainty the right way, whether it's stating the right fact, whether it escalates to a physician when needed — again, 49,000 different things measured in HealthBench. One of the things you can do there is measure correctness in a way that is less about multiple choice accuracy and more about whether the right facts are included that are really important to emphasize to the user.

The second thing you get out of that is a metric we call worst-of-N — when you repeatedly sample from the model and measure performance on HealthBench, what is the worst performance you get across N samples? You sample from the model 20 times: what's the worst performance you see? So now you have a way of measuring — instead of the log-prob-based way — whether even a reasoning model produces a consistent result conditional on the different kinds of thinking it's doing. What I would say is: it's harder to produce plots like we could before, because now the thing we're measuring is so much more complicated. But when we do produce the plots, as we did in HealthBench, they're also looking pretty promising, and models have improved quite a bit at that.
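HealthBench's actual rubric grading is far more involved, but the worst-of-N idea itself reduces to a few lines: sample repeatedly, grade every sample, and report the minimum rather than the mean. In this sketch the "model" and "grader" are toy stand-ins invented for illustration, not OpenAI's pipeline.

```python
import random

def sample_scores(sample_response, grade_response, prompt, n=20):
    """Sample the model n times on one prompt and grade every answer,
    so we can inspect the whole score distribution, not just the mean."""
    return [grade_response(prompt, sample_response(prompt)) for _ in range(n)]

# Toy stand-ins: a "model" whose answers earn rubric scores between
# 0.6 and 1.0, and a "grader" that passes the score straight through.
random.seed(0)
toy_model = lambda prompt: random.uniform(0.6, 1.0)
toy_grader = lambda prompt, response: response

scores = sample_scores(toy_model, toy_grader, "I have a persistent cough, what should I do?")
average_score = sum(scores) / len(scores)  # the usual headline metric
worst_of_n = min(scores)                   # the conservative tail metric
assert worst_of_n <= average_score
print(f"average {average_score:.2f}, worst-of-{len(scores)} {worst_of_n:.2f}")
```

The design point is that in medicine the tail matters: a model that is excellent on average but occasionally harmful is measured by its minimum, which is why worst-of-N tracks the reliability claim better than mean score does.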

(30:18) Nathan Labenz: Hey, we'll continue our interview in a moment after a word from our sponsors.

(30:23) Ad read: AI agents may be revolutionizing software development, but most product teams are still nowhere near clearing their backlogs. Until that changes, if it ever does, designers and marketers need a way to move at the pace of the market without waiting for engineers. That's where Framer comes in. Framer is an enterprise-grade website builder that works like your team's favorite design tool, giving business teams full ownership of your .com. With Framer's AI wireframer and AI workshop features, anyone can create page scaffolding and custom components without code in seconds. And with real-time collaboration, a robust CMS with everything you need for SEO, built-in analytics and AB testing, 99.99% uptime guarantees, and the ability to publish changes with a single click, it's no wonder that speed, design, and data-obsessed companies like Perplexity, Miro, and Mixpanel run their websites on Framer. Learn how you can get more from your .com from a Framer specialist or get started building for free today at framer.com/cognitive and get 30% off a Framer Pro annual plan. That's framer.com/cognitive for 30% off. Framer.com/cognitive. Rules and restrictions may apply.

The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English — send a daily briefing, triage support emails, or update your CRM — and whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works — no flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO, Andrew Lee. Try Tasklet for free at tasklet.ai, and use code cogrev to get 50% off your first month of any paid plan. That's code cogrev at tasklet.ai.

(32:51) Nathan Labenz: So how should you, as a user, think about the worst-of-N thing? Is there a way to translate — maybe you could just describe the result. Like, how much worse is the worst-of-N than, say, the next worst-of-N or the average of N? And if I'm a user, which I am, can I get security just by running it twice? Is there a practical upshot of that work that could give me confidence that if I do X, I can be sure I'm not getting something way worse than the model typically outputs?

(33:32) Karan Singhal: Yeah, the way I think about it is: the more compute you spend on things, you will get better results, and the marginal gains may diminish over time. One thing you could do as a user is sample from a model 10 times, combine that together into one output, and have an LLM synthesize the outputs — kind of an LLM council — and then produce that as an answer. I think that will be marginally better than the answer you get from just running one model, and this is not so dissimilar from what GPT-5 Pro and similar products do under the hood anyway.
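The "LLM council" pattern Karan sketches can be written down as: draw several independent samples, then have a model merge them into one answer. `sample_model` and `synthesize` below are hypothetical stand-ins for calls to any chat-completion API; this illustrates the aggregation pattern, not OpenAI's actual implementation.

```python
from typing import Callable, List

def llm_council(prompt: str,
                sample_model: Callable[[str], str],
                synthesize: Callable[[str, List[str]], str],
                n: int = 10) -> str:
    """Sample n independent answers, then have a model merge them."""
    drafts = [sample_model(prompt) for _ in range(n)]
    return synthesize(prompt, drafts)

# Toy stand-ins so the sketch runs: "sampling" cycles through canned
# drafts, and "synthesis" just picks the most common draft (majority vote).
canned = iter(["rest and fluids"] * 6 + ["see a doctor"] * 4)
pick_majority = lambda p, ds: max(set(ds), key=ds.count)
print(llm_council("What should I do about a mild fever?",
                  lambda p: next(canned), pick_majority))
# -> rest and fluids
```

In practice the synthesis step would itself be an LLM call that reads all the drafts and writes a combined answer, rather than a vote.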

So there are a few things you can do if you want to make the best of this. One is you can use GPT-5 Pro or something similar. A second thing you can do is increase the amount of reasoning, because I think it has a very similar effect to just running it multiple times. In both cases, my current sense is we're getting to the point where current models are performing incredibly well for most people, most of the time. But I know, for example, you using GPT-5 Pro was an important part of working through your son's situation. I think we're reaching a point where, except for the most complicated of cases, most people are best served by just using the model on a default reasoning setting. I would recommend using the reasoning models instead of the more instant models — using GPT-5.2 thinking rather than GPT-5.2 instant for a lot of health-related things. But I think most people can get the best of both worlds between latency and performance by doing that.

And the way to think broadly about the worst-of-N result is: if you sample from something 20 or 50 times, you'll have varying performance across model outputs. What we saw in the worst-of-N results is that recent models have improved pretty significantly — where the worst performance of, for example, o3 at that time was way, way better than the best performance of GPT-4o. We've continued to see that over time. We've been shipping model improvements in health pretty rapidly, and the model improvements in the last year have been more than in all previous years since ChatGPT launched. As an example, today the nano models — the GPT-5 nano models you can get through the API, and also our open source models — are actually performing similarly to o3, which was our best model not so long ago. And the latest reasoning models continue to push the frontier of how much you can do with less and less reasoning. This is true for 5.3 codex as well, and also for 5.2 thinking — if you use them by default on health queries, they'll actually think a little bit less but produce better results. We're continuing to push that frontier of not just needing to pour more compute in to get a good result, but getting better performance at a given level of compute. And the result is that the models are way, way better than they were even a year ago.

(36:27) Nathan Labenz: Yeah, it's been crazy in my just three months of intensive use. Nothing has brought home the pace of shipping quite as much as how many updates there have been just in this one chapter of my life. It has been wild to see. I don't know if this is something — how should we think about the density of effectiveness of the reasoning tokens? There was this episode I did with the folks at Apollo. I'm sure you know of their work, if not know them personally. One of the really interesting things they observed when they got access to the chain of thought is that, at least for — I think it was o3 at the time, although I'm not 100% sure which model off the top of my head — there was this seemingly development of a new dialect, basically internal to the model. You know, the famous "watchers, watchers," whatever that was.

Can you share anything about how you're balancing the obvious good of efficiency — denser thinking per token, more value per token created — with the seeming tendency for the internal chain of thought to go off in weird and potentially hard-to-parse directions? I guess there's also the commitment from OpenAI to not train on the chain of thought, or at least not apply certain kinds of pressure to it. I don't know if that's an absolute ban on any feedback on the chain of thought, but I'm interested to hear your thoughts on that, because it's something that I've been thinking about — the world moves past these big stories so quickly, and that one seems to have kind of come and gone, and I'm not really sure what the state of it is now.

(38:12) Karan Singhal: Yeah, it's a super interesting question, and also near and dear to my heart as a safety researcher. Chain-of-thought interpretability has been one of the nice advances in safety in the last year or two, which has been really cool to see. As models have become reasoning or thinking models, they've also emitted tokens that effectively explain their work and what they're thinking. This provides a form of interpretability for researchers who want to understand whether models are doing what they expect — as safety researchers do. It's been this really cool way of measuring whether models are doing things like scheming or producing other kinds of undesirable outputs. And this has been relevant beyond health to a bunch of other domains as well. I think a lot of the results that people have shared and studied have actually been in coding.

I've been pleasantly surprised. The danger you're pointing out is: as you put more pressure in reinforcement learning to get the models to produce a good output, will they slip away from the prior of having their thinking tokens be simple English that is easy for researchers to understand? What we've seen is actually pleasantly surprising — at least until now, we haven't seen a lot of large-scale evidence of a slip into what's called neuralese, where chain-of-thought tokens are used in a way that is not explainable or understandable in English. In general, as we've been trying to understand the monitorability of our models over time, we haven't really seen that effect as we've scaled up RL. I'm not sure if that'll continue to be the case in the future, but so far I've been fairly pleasantly surprised that this side effect of the reasoning paradigm has continued to be useful and hasn't robustly been seen to become unreliable. The result you're pointing out has been kind of continuous over time with weird blips in interpretability at times, but we haven't seen a continuous increase in that. And we haven't seen evidence — even though we've actually tried to study it — that scaling RL causes that to happen more. We haven't seen that yet. I would expect in the limit it does.

(40:28) Nathan Labenz: Okay, that's quite interesting. The way I sort of interpreted that, first of all, I thought it was happening more than it sounds like you're saying it is. And I naively assumed that there was some sort of brevity reward signal being applied in addition to ultimate correctness — and certainly it's intuitively comprehensible why something like that would start to happen if you did that. Should I infer from what you're saying now that there isn't really a brevity signal, and that this is more of just an emergent weird phenomenon that doesn't happen that often? So it's kind of one of those weird language model things that we keep an eye on, but don't obsess about too much because it's rare. Is that a fair summary of the state of play?

(41:11) Karan Singhal: I do think it's important to pay attention to. The right way to think about it is: there's nothing reinforcing that this should happen during training. There's no reason that models should produce chain-of-thought tokens that are human-interpretable during training. The reason it happens is because they have a prior of using the English language, and when you give them the space with thinking tokens to produce a more correct and helpful answer, they're actually using it in English because that's just the easiest thing for them to do. I think this is basically an empirical phenomenon that is extremely useful for safety research, and I would love to keep it that way as much as possible. I'd love to see more research into how we can maintain that. And I think OpenAI's commitment to avoid optimization pressure as much as possible — and also the commitments of the labs more broadly — is really exciting progress. I do think it's important to watch out for, so I think that's a great question.

(42:06) Nathan Labenz: Yeah, to be continued. You mentioned the incredible complexity of the evaluations you're running — by simple math, if there are 5,000 conversations and almost 50,000 criteria of evaluation, that's a lot. I imagine some of those criteria are reused across conversations, so at a minimum we're talking something like 10 evaluation criteria per conversation, and probably a lot more. It's hard to summarize, but I'm going to ask anyway: how good are the models, and how should we think about that? The HealthBench Hard thing maybe could be a way of saying they're good at most things, but here are some things they still struggle on — defining the frontier as one way to say how good they are. But how do you communicate to the world how good the latest models are when it comes to health?

(43:00) Karan Singhal: Yeah, I think it's good to keep in context the arc of work around LLM evaluation and health. A couple of years ago, people were mainly focused on evaluating LLMs on their performance on multiple choice questions like medical exams. Then with the work that collaborators and I did at Google, we started increasingly investing in what it looks like for specialized or unspecialized LLMs to answer general health questions — with some of the Med-PaLM work, and potentially going in the direction of asking follow-up questions to a user with the AMIE work. So increasingly over time, we've moved towards higher and higher fidelity evaluations. I think HealthBench is the latest big step in that direction — wide coverage of LLM performance and safety, but actually covering it in a way that covers many different axes of performance that matter for the real world.

Again, you see these 5,000 conversations, these 49,000 different axes of performance. There are actually three different versions of HealthBench. One version is the full dataset. The second version is HealthBench Consensus. And the third version is HealthBench Hard. These mirror what we view as the three high-level principles when designing an eval for health. The first is that you want it to be meaningful, which means that if the number goes up, hopefully human health will improve. The second is that you want it to be trustworthy, which means it's backed by the consensus of doctors or other experts. The third is that you want it to be challenging — you don't want it to hit 100%. One thing that's happened over the years is that all previous benchmarks that meant anything at all have gone to 95 or 100% over time, and HealthBench actually remains unsaturated to this day.

What you have with HealthBench, HealthBench Consensus, and HealthBench Hard is a little bit of focusing on each of these individual axes. HealthBench overall is a number where we think if it goes up for a model and people are using that model, health will improve — and we feel like that's a statement we can defend with the rigor behind the work. The second is HealthBench Consensus, where we specifically focus on criteria for evaluation where each example had a majority of multiple physicians agree that it was applicable. Not only did you have physicians write these criteria — you also had a bunch of physicians check whether these criteria hold for a given conversation and whether they're the right thing to be evaluating. And the final thing is HealthBench Hard, where we took somewhat adversarially, across a bunch of different models across all the different model providers, the examples that existing models fared the worst at, but still seemed high quality, and turned that into a benchmark. HealthBench Hard has kind of been my favorite external benchmark for whether an open model is doing really well. When it came out, GPT-4o was literally zero on this benchmark — it's an incredibly hard benchmark because of how we chose the examples. Over time, we've improved OpenAI's models to around 40%, which is still not near saturation; I think this benchmark has a lot to go. Current competitor models are more in the 20% range.

So that's how I think about the HealthBench family of evals. We have this commitment to work on evals that are meaningful, trustworthy, and challenging. And if you want to focus more on the evals that are super trustworthy, we have the consensus subset, which is really focused on that. We also did, as part of the HealthBench work, a couple of additional analyses. One of them was: HealthBench involves grading these 49,000 individual rubric items using a model-based grader. We had physicians compare the model-based grader to the grading of other physicians, and we actually found that the model-based grader was doing a better job than the average physician. What that tells you is the grading for HealthBench is pretty high quality compared to what you'd expect from a physician.
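A simplified sketch of rubric-based scoring in the spirit of what Karan describes. The assumptions here (not stated in the transcript) are that each criterion carries a point value, a grader marks each criterion met or unmet, and a conversation's score is points earned over positive points possible; HealthBench's published scheme includes negatively weighted criteria, which only subtract when triggered.

```python
def rubric_score(criteria, met):
    """criteria: {name: points}; met: set of criterion names the grader
    judged satisfied. Returns earned / possible, clipped to [0, 1].
    Negative-point criteria (penalties) only subtract when triggered."""
    possible = sum(p for p in criteria.values() if p > 0)
    earned = sum(p for name, p in criteria.items() if name in met)
    return max(0.0, min(1.0, earned / possible))

# Hypothetical criteria for one health conversation:
criteria = {
    "states correct dosage": 5,
    "recommends seeing a physician if symptoms persist": 3,
    "expresses appropriate uncertainty": 2,
    "gives alarmist advice": -4,   # penalty if triggered
}
print(rubric_score(criteria, {"states correct dosage",
                              "expresses appropriate uncertainty"}))  # -> 0.7
```

In HealthBench the "grader" deciding which criteria were met is itself a model, which is what the physician-versus-model-grader comparison above is validating.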

(46:51) Nathan Labenz: Recursive self-improvement — signs of recursive self-improvement, alert alert. How about just a little bit of an intuitive sense for what's in HealthBench Hard? I could give you my experienced sense of the frontier from the bedside over the last couple of months, and I would summarize it pretty simply. This is what I say to my neighbors when I tell them, hey, by the way, you should really use a language model next time you're facing a health challenge. I basically say: look, I was in the hospital for initially like 30 days of really intense treatment where everything felt super high stakes. We didn't always know what was going on, and we didn't always know how much we could even trust our doctors at that point — everything was so new and stressful at the same time. And what I found was basically that the frontier models were step for step with the attending oncologist on almost everything. And by the way, that means they're a lot better than the residents — much more knowledgeable. They're

(47:57) Karan Singhal: At the attending level.

(47:57) Nathan Labenz: They're at the attending level, for sure. There were maybe a half a dozen times over the course of that month where there was some disagreement. Initially I was just using GPT-5 Pro, whatever exact version of GPT-5 it was, and then as the other frontier ones came out, I started to do everything in duplicate with Gemini 3 and then triplicate with the latest Claude. Interestingly, I would say that the AIs disagree with each other even less than the models disagree with the attending, but it's quite limited disagreement between the models and the attending. And typically when there is disagreement, I've found it to be a very minor thing — okay, his electrolytes have gone a little low, should we give him electrolytes today or not? And of those half a dozen things, there's not really a major trend. I would probably score it like six to four for the doctors with the benefit of hindsight, in terms of we usually follow what the doctors said — did we in the end feel like they were right, or do we kind of wish we had gone with the AIs? I'd say maybe two out of three times we felt like the doctors were probably right. And if I tried to chalk it up to something that they have that gives them an advantage, it almost always was one of those situations where whatever I'm putting into chat — and I was always exporting the latest results from the EMR and dropping in the PDFs they gave me — the difference usually came down to: in view of all that, a human would additionally say, "taking all that into account, but also just looking at him right now, kind of watching how he's breathing, looking at his color, I'm pretty sure he's fine." It was that kind of very intuitive, very multimodal, very subtle sense that these folks have developed over quite a few years of clinical practice that on very fine margins seems to give them a slight edge over the models.

So that's kind of my account. I guess what that means is my situation isn't that hard, and I think that is actually true in the sense that the cancer that my son has is not a super rare one, and the treatment protocol for it is quite well established. So it's not a super hard call in terms of the main line of what to do. Even though it's a hard thing for him to go through, maybe it's not that hard in terms of clinical judgment. I haven't experienced anything that I would say the models were only 40% on. So with that, maybe you can tell us a little bit about what's still out there on the frontier where we do have ground truth, or at least some sort of consensus that's solid enough to grade models on, where they're still only at 40%.

(50:59) Karan Singhal: Yeah, it's a great question. Let me describe a little bit more about what HealthBench is evaluating and the ways in which models have improved over time, then talk about next frontiers for model improvement. So HealthBench measures many, many different things. It has these themes which are the focuses of different evaluation examples. One focus was: are models appropriately escalating to care when needed versus not escalating to care unnecessarily? You want to balance this because you don't want to overwhelm the health system with patients who have been made anxious by alarmist medical advice, while also not failing to escalate when it's genuinely needed. Another aspect we measured was thinking about the ways in which models can adjust to different demographics, different epidemiological conditions, or different levels of access to care globally. This means making sure you adjust for a user that's male or female, but also adjusting if somebody asks a question in a region where tuberculosis is more common versus less common. That specifically — we call it global health — was actually the biggest single focus of HealthBench, because I think it's one of the biggest ways our work can be most impactful.

And then a bunch of other things. We talked a little bit about calibration. One of the ways models have gotten significantly better over time is: when they know they're uncertain, not only flagging that uncertainty, but actually browsing to get more information — getting the latest resources and synthesizing that together. Another thing is asking follow-up questions, and the right kinds of follow-up questions. Initially, ChatGPT would almost never ask follow-up questions in health settings. Now they do much more often, and they're much more likely to prioritize the right ones. You had that story about using the model for your son where you actually learned over time to ask the model to interview you and figure out whether you could do the physical exam yourself — the models have gotten better at knowing when that would be useful and flagging it proactively. So it's everything from pure reasoning, medical calculations, and figuring out the diagnosis all the way to how does the model behave, what is the bedside manner, is it comforting? And it's a balance of both difficult and high-trustworthiness signal that we're getting from working with our experts.

Over time, as the models have improved from o1 to o3, GPT-4.1, and GPT-5, all these models have improved significantly in health. Today, every major stage of model training for every model we ship at OpenAI now benefits from our work in health, and that's going to continue for future models as well. I think the frontiers are in a few different areas. A lot of text-based performance is actually pretty good outside of subspecialty areas. If your goal is to keep up with the latest evidence even as a physician, I think the models are doing an incredibly good job today and people are finding a lot of value — that's one of the greatest data points in support of this work.

Models have also continued to improve in their ability to integrate information. You mentioned this thing of a physician taking a bunch of different information — looking at the pallor of your son's skin over time — and integrating that all into one context. I think a big challenge for models today, and this is less of a model issue and more of a product issue, is actually getting the right context. One of the challenges you faced in your story was pulling the right information from the health system so you could upload it into ChatGPT and ask the right questions yourself. We're looking to make that better with the release of our ChatGPT Health offering, which I can talk about more later. In short, it's an experience within ChatGPT that enables you to connect your health information from your medical records, wearables, or Apple Health, with additional purpose-built privacy protections. On the model side, beyond context, I think the other big challenge is modalities that aren't well captured by HealthBench, which is really focused on text. I think people will start to rely on models in more and more modalities — image, voice — and there's still significant room to improve there. I hope that in the future, the best models in the world for various imaging modalities in health are also the ones most easily and readily available to people.

(55:37) Nathan Labenz: Now it occurs to me as you say the multimodal thing — maybe I was even undershooting it. I had the intuition that taking a picture of my kid as he sits in his bed wouldn't necessarily help and might even confuse, but maybe I'm wrong. In the future, would you advise me to start including cell phone camera pictures with my daily synopsis?

(56:01) Karan Singhal: I think you're right to point out that there's still a bit of a challenge in knowing exactly what data you can connect and how to connect it. You were highly motivated in this case, and so there's a gap between the most motivated or expert user who's willing to wade through signing into their patient portal, taking screenshots, or manually copying and pasting things — versus the person who wants their health data integrated and wants to understand and advocate for their health or that of a loved one, but has challenges doing so. That's exactly what we hope to work on with ChatGPT Health on the product side, really lowering the activation energy. In general, models will benefit from having more and more context. I'm not sure if cell phone photos specifically would have helped in your case, but I would generally advocate for people to try putting in various kinds of context if they find it useful. People have been surprised — I think you were surprised in your own story — by how useful it can be when models have more context.

An interesting tidbit: there have been studies since two or three years ago showing that for these clinical case challenges — effectively crossword puzzles for doctors where you get all the patient context upfront, potentially multimodal — the models did an incredible job at figuring out the next steps for diagnosis or treatment. These are extremely challenging puzzles, really difficult for doctors, and people found that models actually improved on the performance of these experts. That's been true for some time. A lot of the subsequent challenge has been: what does it look like to get the models to have a back-and-forth conversation to solicit the right kinds of context from you as a user? With ChatGPT Health, we're taking a step further and thinking about the right ways for the product interface to make that a lot easier — and to do it in a way that's secure, where we're not training on your data, and people can trust that.

(58:07) Nathan Labenz: How about for these larger modalities — when you get a PET scan or whatever? My understanding is I haven't really been able to get my hands on too much of that data. I'm told it's gigabytes, and I'd have to get it on a disk, and I don't even have a computer that takes a disk, so I'd also have to get a disk drive to read it. Obviously that data can be compressed. How do you think about the pipeline of raw scan-type data and feeding that into tokens? Are we talking about certain specialized perceiver modules that ingest those and do the reduction, or some other mysterious third thing?

(58:51) Karan Singhal: Yeah, it's a great question. One of the interesting things about biomedicine from a modeling perspective is that there's a long tail of modalities that are interesting and relevant, and they're often fairly difficult to put into models today. I think there are going to be two broad approaches — and this is talking more about external research. The first is thinking about the right ways for models to call specialized tools or run code over various kinds of data. If you think about how a human views a gigapixel scan — say, a pathology image — what do they do? They basically view slices of that image. Even a 3D or 4D scan, like your son's MRI, they view slices. No human can see a 3D or 4D modality directly. What they're effectively doing is using a tool to understand which slices might be most important, and then looking at and manipulating those slices. That's actually something models can do with tools and Python. A sub-bullet of that: they can also use very specialized tools that are fit for purpose — specific Python libraries, or even additional models that specifically encode these modalities.
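The slice-inspection idea above can be sketched as the kind of code a tool-using model might write and run over a 3D volume: rank slices by a cheap heuristic, then look only at the most promising ones. The variance heuristic is purely illustrative, an assumption for this sketch rather than a claim about any real imaging pipeline.

```python
def slice_variance(slice2d):
    """Per-slice intensity variance, a crude proxy for 'informative'."""
    vals = [v for row in slice2d for v in row]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def top_slices(volume, k=2):
    """volume: a list of 2D slices (e.g. axial). Return the indices of
    the k highest-variance slices, the ones worth inspecting closely."""
    ranked = sorted(range(len(volume)),
                    key=lambda i: slice_variance(volume[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Tiny synthetic 4-slice 'scan': slices 1 and 3 contain structure.
flat = [[0, 0], [0, 0]]
busy = [[0, 9], [9, 0]]
print(top_slices([flat, busy, flat, busy]))  # -> [1, 3]
```

A real pipeline would swap in fit-for-purpose libraries (DICOM readers, segmentation models) for the heuristic, but the control flow, tool first, then focused inspection, is the same.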

(1:00:36) Karan Singhal: The second approach is having models encode the modalities themselves — basically putting them into token space in some way, which is what's been done for images, video, and audio. I think what you'll see in biomedicine is a hybrid of both.

Nathan Labenz: It's just quite striking how good AIs are — not even necessarily general-purpose AIs, but we have specialized AI models that can fold a protein in a superhuman way. What seems to be missing is the latent-space joining of that modality to text. My expectation — and you can tell me if you think I'm right or wrong, or how long I have to wait — is that we're going to see that more, and that it's going to be a major driver of truly superhuman performance in a lot of domains, because we just don't have people who can intuitively fold a protein. But if you had a system that could do that and also had general reasoning capabilities, it seems like you'd have a qualitatively different kind of intelligence. So I'm really interested in what you see as the roadmap, the timeline, the expectations for that kind of deep integration.

(1:01:41) Karan Singhal: I think it's a really optimistic picture. I totally agree. If you think about how we lean into the natural capabilities of these models — they can take in vast amounts of information into their context, and they have the ability, at least in theory, to take in many modalities of data and effectively merge that together. The research for that is becoming increasingly solid for these other modalities. There is still significant research to do depending on the biological modality, and there's a long tail of biological modalities that matter. Depending on the modality, the relevance and impact, the availability of data, and all these factors, I think it could take more or less time. It's hard for me to give you a universal answer. I share your optimism.

(1:02:23) Nathan Labenz: I just saw one in the last couple of days — SleepFM from Professor James Zou and others. He's at Stanford, with collaborators probably at multiple institutions. They had a really interesting finding. The idea is they use, I think, six, maybe up to eight different modalities that are all measured during one night of sleep — how you're breathing, various things easily measured in a sleep study setting. It just takes one night of sleep to gather all this data. And they've gotten really good at predicting all kinds of different diseases from it. So they're integrating all these modalities into a holistic understanding and using that for a narrow but very high-value set of predictions — not yet going all the way to text. It's just another data point that's top of mind for me right now: all the latent spaces will be joined. I can't shake the idea that that's a big part of what superintelligence ends up looking like.

Going back for a second to bedside manner — you mentioned global health and context being a big part of HealthBench and what you're doing to make sure AI benefits all humanity. It was funny when you said "we have 200 million users, when we scale we want to make sure we're doing this right." I was thinking, yeah, one day you'll scale — only 200 million users today. Clearly there's already some scale.

The thing I want to address, though, is bedside manner and how that relates to the n-of-one context of the individual user — their preferences, their memories. All these products now have an integrated memory module, which I'm sure will become a deeper integration over time. Right now it's a kind of scratch pad of key highlights that the model can reference, I believe still mostly in text modality. But regardless of how it works today, it's safe to say it will continue to become deeper integration as everybody pursues various strategies for continual learning. There's also integration with tools — Gemini now has access to my Gmail, and even just system prompts let me tell the model to behave a certain way. It just strikes me that this creates an almost impossible surface area to manage. We've seen emergent things like sycophancy, and whatever else. How do you think about making sure — because this is one thing I do think human doctors generally do a pretty good job of: sizing the person up and figuring out how to cut through to get this person to understand what they need to understand. Models broadly have not been as good at that yet, but it really seems to matter a lot in healthcare. Is there some additional suite of testing you do, or how do you think about the insane long tail of idiosyncrasy that 200 million health users bring to the product?

(1:05:38) Karan Singhal: Well, you pointed out two problems, and I'll talk about them in reverse order. One is that people are bringing an incredible array of experiences, settings, and ways of using ChatGPT — their memories are different, all these things mean that one person's experience of ChatGPT can be very different from another's for reasons that have nothing to do with the model version. We do a few things as a company to understand and improve model behavior. One is evals that we run on models before launching — HealthBench is a great example, and we have additional evals as well. Another thing we do is monitor production traffic in privacy-preserving ways, running classifiers to understand if there are any safety risks, from health to frontier risks. We're actually able to measure these things. There's a great example of this in the blog post we published about sensitive mental health conversations, where we were able to show a correspondence between our evals and the patterns we were seeing in production traffic — logged in a privacy-preserving way with models running over them. They were actually very well correlated. That's how we think about closing the gap between offline evals and what people are doing in the real world.

I would add an intermediate step as well, which is doing real-world study. Here I'll plug our work with Penda Health. This was, I believe, the first real-world study of an LLM-based copilot for clinicians, where we had clinicians in a group of clinics in Kenya use AI as a copilot or safety net. As they were typing in their electronic medical record, it would flag things if something was interesting, alarming, or potentially incorrect — while other clinicians did not have that. For the patients treated by clinicians with AI versus without, there was a statistically significant improvement in diagnosis and treatment outcomes. This is another example of how you move from offline evaluation — potentially unrealistic medical multiple-choice exams — to increasingly realistic evaluation you can run offline like HealthBench, and then move increasingly into the real world. You can view the Penda Health study as forward-looking, and our analysis of production traffic as more retrospective — and together they capture a lot of the variance and differences you're pointing to.

(1:08:23) Nathan Labenz: Yeah, that's really interesting. How do you want to talk a little bit about the privacy-preserving nature? I would preface by saying I think people — at least as individuals — typically, if they think about it at all, way over-index on this concern. I've had the occasion to think, on my son's behalf, should I be putting this information out there? And then I see what somebody like Sid from GitLab has done with his cancer — literally open-sourcing all of his own biology down to the DNA level, just an incredible amount of very individualized data. And I'm like, I think that's the way, because unless you're getting really sci-fi about it, it's hard to come up with a way that anybody would really use that against you, and there are a lot of ways it might stand to benefit you. Even just in telling my son's story on the podcast, I had a similar question — am I doing him a disservice? It turns out what has happened is that people have reached out to me with interesting opportunities for connection, and I've been able to tap into expertise, including the team that Sid has built to help him, and also they're starting all kinds of individualized therapy companies with fascinating developments there. So my advice to people would be: seek the benefit and don't worry too much about whether your data is sitting in some log somewhere. But anyway, that's my role to say that. It's your role to build a product that deals with people as they come, and people do seem to really worry about this. So what would you want people to know about the privacy-preserving nature of the infrastructure you've built?

(1:10:05) Karan Singhal: Yeah. Privacy is incredibly important to a lot of people, and it's a very personal thing — people's preferences for what data they share and how they share it. If you zoom out for a second, think about what the level of ownership of your own data is as a patient today, and what it was even before things like the 21st Century Cures Act. Most patients ten years ago had no real way to access their own health data. Not only did they not have a way to control who they shared it with, the data was just out there and shared anyway in various ways — and a patient didn't even have a right to look at their own data or an actionable path to do so. We're in a slightly better place now, but still not ideal. Most patients still feel like it's incredibly difficult to get access to their data or the data of their loved ones and take control of how it's used. People like you and many others with incredible stories have found incredible ownership and advocacy that is enabled only by having access to this data and being able to use it as they see fit. We want to respect that. We want people to be able to access their data. This is part of what we hope to enable with ChatGPT Health — a lower-activation-energy way to connect your health data and make it useful.

At the same time, it's super important to a lot of people that this is done in a way where there's no competing incentive at play. One of the things we're very clear about with ChatGPT Health is that none of the data you connect is used to train our foundation models. We do that for two reasons: first, we think our foundation models are already great, and this kind of health use isn't actually the most important way to improve them. Second, we think it will lower the activation energy for people to do incredible things with our models. We don't want people to feel like there's a tension between privacy and utility — we actually want them to see the value and use it.

Specifically, ChatGPT already has a bunch of really cool privacy protections built in, including encryption. ChatGPT Health adds additional purpose-built layers of encryption specifically for health data. The result is a couple of things: you have encryption of the data, and you also have isolation of health data from other data in ChatGPT. If you have other apps, memories, or conversations in ChatGPT, those are kept separate from your health context. Your health information doesn't bleed into your other conversations, and you can keep that completely siloed in its own experience. As we continue to get feedback and roll it out on the waitlist, we'll keep improving these protections — because we think it's really important for users, and because we actually don't think health data is as important for improving our foundation models as it might seem.

(1:13:09) Nathan Labenz: Interesting. Is that segregation of data a new feature with Health? So all the stuff I've done to date has been in kind of one product experience, but that fork has just now been introduced?

(1:13:26) Karan Singhal: Exactly.

(1:13:27) Nathan Labenz: Yeah. I've been motivated to get a WHOOP to wear on my wrist to start collecting data. I've been fortunate throughout my life to have generally good health and haven't created much of a paper trail in terms of medical history. I've always looked at these quantified self devices and thought, yeah, it's sort of interesting, but who's got time for that? And am I really going to do all the processing on this? The answer's always been no, so I haven't worn anything to date. But now I'm thinking, alright, it's time. I can actually get value from wearing this thing, because now I'll have a product running in the background that's smarter than me at this point, that will really bring to light what I need to know. So I'm motivated to change my behavior at least on that level. And I honestly think it's probably going to encourage me to exercise more. In fact, that's kind of my own North Star metric right now for all things AI — and not just health products either. I will know that I'm succeeding with AI when I'm spending less time in this chair and moving my body more. When I feel like I can untether myself from the desk, I'll feel like I'm really winning. I think I'm getting kind of close to getting there. I have one friend who uses voice mode nonstop and says he does a ton of it from the gym while he's working out. I'm like, alright, I want to be like you. I'm not quite there yet, but it's really interesting to think about both the ability to crunch all this data and the freedom — the liberation from the desk — to go out and make more exercise records in the first place. Those are major things I think are on the horizon in 2026 that I'm super excited about.

(1:15:24) Karan Singhal: I'd just add on that — I think people should expect that the value of data they collect on themselves will increase over time as model intelligence increases. The ability for models, both from a research perspective and a product perspective, to analyze that data and come up with useful insights should increase over time. So if there was ever a time to get a watch or anything like that, I would say it's now.

(1:15:49) Nathan Labenz: So I do, at this point, basically everything in triplicate. For my son, my morning routine when lab results come back is to export them out of the EMR and drop the PDF into all three of the frontier models. Grok has not cracked that top tier — maybe it should, but it didn't in my initial evaluation and I haven't gone back. So it's Claude, Gemini, and ChatGPT. The tasting notes on how the models interact with me are, in short: Gemini is by far — and this is quite surprising for a Google product — clearly the most inclined to push me to advocate for something. It's also the most confident when it says I don't have something to worry about. And generally I would say it's very accurate. Sometimes I'm a little uncomfortable with how opinionated it is, because I'm like, what if you're wrong? That's a surprising behavioral profile for a Google product, especially because everybody historically has said, oh, they're going to be so conservative — they can't build AI products because they can't live with that kind of risk. Gemini does not reflect that analysis. I would say ChatGPT is kind of on the other end of the spectrum — it generally gives me, again with no system prompt, though it does have my memories and stuff, but no intentional attempt by me to shape how it's responding — it tends to give me the longest answers, the most information, and is the most clinical and neutral in tone. It's sort of a report style: there are nine issues, we go through and address each one, then summarize them all again at the bottom. And then Claude is kind of somewhere in the middle — generally much briefer than ChatGPT, more like Gemini in response length, but more measured in a more ChatGPT-like way, certainly compared to Gemini's more opinionated persona. What do you think of that? How do you think about the balance between thoroughness and digestibility — that's probably the main tension I'd zero in on, specifically for ChatGPT?

(1:18:00) Karan Singhal: Yeah. I think it's hard to know what's right and what's wrong. One of the things that we've aimed to improve — and also measure via HealthBench — is having models understand when the user is likely a health professional versus a lay user and tailoring the response accordingly. Part of this is making sure you're applying a level of detail and technical jargon that makes sense for the user's level of expertise. This is actually a specific thing that we train and evaluate on, and it's in our open source HealthBench eval — there's actually a part of it focused on exactly this. It's something we've seen improve, especially recently with GPT-5.2 thinking. So definitely an area we've been investing in. Overall, though, I think it's very hard to say what's ideal and what isn't. Personally, I think it's amazing that other competitors can be in this space and actually push the Overton window along with us, because I think it's hard for any one company to do alone. A large percentage of the work we're trying to do is figuring out where models are today and where the Overton window of user trust sits — and whether we can move it in the right ways. Additional models, additional products from other competitors and other players in the health ecosystem actually go a long way toward helping shift that as well. I think it'll be a hard battle to win if it were just us. So I personally think it's amazing that there are other folks here.

(1:19:36) Nathan Labenz: Okay. So you've said it's a striking thing — we don't need your data; basically, we've got enough. I'm interested in whatever you can tell us about where that data is coming from. This seems like an area where synthetic data — and maybe I'm naive in saying this — but it seems like you really want some real ground truth from actual human medical trajectories. I would feel a lot more comfortable synthesizing chats than I would feel like you can fully synthesize data all the way to superhuman doctor performance. So I'm trying to understand where data is coming from.

I have this notion that you may think is beside the point given the tricks you have up your sleeves, but I've been imagining a possible new social contract — not just between patients and AI providers, but even between patients and the medical regulatory system. There are a lot of people in a spot where they'd try anything if it had a chance to save their life, and a lot of times they can't even get access. Now we're in a spot where AI increasingly knows about treatments patients may not. The AI is going to tell them about them, and in a lot of cases it's going to make a pretty compelling argument that this is probably the best option available. I've been down this path in a contingency planning frame of mind — everything has gone well for my son so far, but what if there's a relapse? I've been really impressed by what the models have been able to give me there as well. So I imagine this situation where there's mounting pressure from patients who are saying: look, AI knows about this treatment and it's telling me it's the right thing to do, and you're telling me I can't have access. That seems like a very hard gate for the establishment to keep for much longer.

Then the other thing I'd love to see on the other side of that trade is: if you're going to be given access to unproven treatments, then what we as society want back are your results, so we can fold them into our general understanding. We're getting to a point — probably driven mostly by AI — where we could envision moving beyond clinical trials, doing a lot of n-of-1 things that are just like: this is the best guess we have for you based on all available information. And if we can capture the result and train on it — as humans, as AIs, as society collectively — there is so much room to learn so much more and move so much faster, while also delivering outcomes to people who are trying things they couldn't otherwise get their hands on. I'm feeling motivated to advocate for that. So I guess there are maybe two data questions here: how do you not need more run-of-the-mill data? And what sort of data do you still need? What would you think about an evolution of the social contract — what I'd call AI and a right to try?

(1:22:56) Karan Singhal: Yeah, that's a super interesting idea. We talked a little bit about the patient-facing problem of data fragmentation. The experience as a patient is that your healthcare data is just kind of in the ether somewhere — you don't know where it's going. It's governed by HIPAA for the most part, and the result is that there are pathways for that data to go places that don't involve your consent. So the experience today for a patient is that you have both limited access to your data and limited control over where it goes.

If you think about it from the provider perspective, this is also a problem. If I'm a healthcare provider, I want to understand the whole picture of a patient's health and integrate across modalities, as doctors are so expert at doing. Right now doctors have a lot of trouble pulling relevant records from other health systems. You can imagine a future where ChatGPT has a lot of context on you — and the doctor is able to pull that data in as well. I think that could be a really cool feature.

And then the third point you're talking about is research. Right now a lot of clinical trials — the majority of those that fail prior to conclusion — are failing due to recruitment challenges. We basically have a failure to recruit the right patients who meet eligibility criteria. A lot of this comes down to the data not being in one place: who these patients are and whether they're eligible. That's also a problem hampering our ability to raise the ceiling of human health.

So you're pointing to a potential future where not only do patients have more access to their own data, providers can access their patients' data in an unencumbered way with patient consent, but also researchers can access patient data — again, if a patient consents. I think it would be really cool to imagine a future where patients are able to opt in with additional consent and experience the benefits of what did you call it —

(1:25:04) Nathan Labenz: AI and a right to try.

(1:25:06) Karan Singhal: AI and a right to try. I think that would be a potentially really cool future. I do think clinical trials and the standard of evidence there — it's battle-tested and there's something important about that as well. But I'm optimistic about a future where, because data is a little more centralized and a little less fragmented, we have a better ability to advance the science. And because patients have access to their own data and are able to advocate for themselves through AI, they're actually more able to figure out what clinical trials they may be eligible for or what experimental treatments may be interesting to them. So that if they do make that active and informed consent, our understanding of science can improve, our models can improve — I think that would be a really optimistic future.

My previous comment, by the way, is not that the models have run out of data. Data is always helpful for models. It's more about thinking about what's the right way to roll this out to society, and where can we get the most impact. Is the most impactful path to lean into privacy and lowering the activation energy for users? Or is it to lean into getting data? I think the decision we made — for users, to lean into not using data for training our foundation models — is the right one. That's what's going to have the most long-term impact if we think about the arc of improving AI and improving human health: moving the Overton window in a way that's trustworthy, that shows we respect the value of people's privacy and data. And then in the long run, things like additional consent and different changes to the contract of how research is done — I think those are really cool things to explore.

(1:26:48) Nathan Labenz: Yeah. As you were mentioning additional consent, I did get a packet of 40 pages of paper at the hospital one day, and they clearly really do want to collect this. There's a person whose job it is to go around and visit the parents of these kids at the children's hospital, explain things, and answer questions. I was probably the easiest customer she'd had — I was pretty predisposed to sign, so I signed and initialed in a ton of places to share whatever data we could share for the public good. And still, I just got something in the mail from the University of Minnesota that's like, can we have your data? And I'm just thinking, man, I thought I already agreed to this. And what is the yield on those things they're sending out? It's gotta be quite poor. I haven't sent mine back just because I've been busy, even though I fully intend to. Those barriers are really tough.

Would you think about doing something like — obviously much has been made recently about ads coming to ChatGPT, and the argument that we need some way to support this for billions of people and ads is a proven way to do that is pretty compelling. I don't know if it extends to health. There's a lot of pharmaceutical advertising in the world today — would there be such a thing as pharmaceutical advertising in the world of ChatGPT Health at any point? And would you consider something like: you can get ChatGPT Pro for free if you give us your data? That seems like something a lot of people would probably opt into and still feel good about. Any other trades you could see OpenAI making to support that reach, since we do want this to go to billions of people?

(1:28:42) Karan Singhal: Yeah. So ads aren't coming to ChatGPT Health and we don't plan for that right now. We think it's really important to create a clear separation between our health impact work and things that could be seen as contributing to other incentives for the company. That's the line we've taken there. I think you're right that access is a really important point, and this is why we've made ChatGPT Health free. This was not the default path — providing a reasoning model for free without rate limits to all users. But that is the path we're charting with ChatGPT Health, and that's actually not something available elsewhere in the product. So we are deliberately optimizing as much as we can for access, and doing as much as we can. I think there will be some limits to how far we can go, but I don't see any additional trade-offs in the immediate future.

(1:29:33) Nathan Labenz: That's admirable. My sense of OpenAI on this dimension is one of real appreciation. The pains taken to support a free user base of hundreds of millions of people, at obviously not insignificant cost, when a lot more revenue could have been extracted from those people — I think that's a pretty admirable thing. And I hadn't even heard that there was this plan to go to free without rate limits for health for all. That's a major deal. I've had this idea for a while of universal basic intelligence — basically a version of a new social contract, and one that is going to be an unbelievable needle-mover for a huge number of people. That's awesome.

You want to talk a little bit more about the medical establishment's response? At the hospital, my experience is still mostly a lack of awareness among providers. As I've built up rapport with certain people, I tell them I'm consulting with AI, and they're usually okay with that — not hostile to it, sometimes skeptical. I've had a couple of interactions where they say, tell me what it said. I'll tell them, and they'll say, oh, okay — pretty good. The doubt can turn around pretty quickly when you get the right answer. There's certainly mention of OpenEvidence. But broadly, I think there's just a lack of awareness of how good the systems have become.

My guess would be that awareness is probably the biggest challenge right now. In a lot of professions, we should expect to see closing of guild ranks and raising of barriers to entry — the cynic in me thinks maybe that'll happen in medicine, but the idealist thinks maybe not. Maybe this is a chance to live up to the actual mission of the profession. And maybe people are just so overworked in medicine that they'd be happy to take whatever help they can get. How would you characterize the broad reaction from doctors writ large?

(1:31:51) Karan Singhal: I think you're right to point out that the Overton window shift for consumers has actually been faster than for doctors. Already in the last year, the rate of adoption of ChatGPT for Health — even before we launched the recent products — was incredible, and it's been one of the most amazing things to see. Again, over 200 million people a week using ChatGPT for health and wellness questions. On the doctor side, you do see a rapid increase in adoption, but it's not quite as fast. So I would expect that to come with a couple of things: interesting interactions between patients and doctors who are at different points in their journey of adopting AI — and you've had a couple of those. And a lot of physicians I talk to first hear about AI through patients who are using it, rather than AI that's built specifically for them — which is one of the things we're doing with ChatGPT for Healthcare.

One of the things we've learned over time is that the best way to shift the Overton window — as we've done over my time at OpenAI and at Google, doing real-world studies and things like this — is putting the technology in people's hands. That's actually what we've been doing on the ChatGPT for Healthcare side. ChatGPT for Healthcare was announced the day after ChatGPT Health, and it's more of an industry-facing announcement. Basically, it's a version of ChatGPT purpose-built for the workflows of health professionals — specifically clinicians. It includes HIPAA compliance, additional features like specific evidence retrieval for medical guidelines, and additional workflows specific to enterprises, writing tasks that doctors do, and so on. We launched that with eight of the leading institutions across the country.

One of the most important things for building trust and credibility in the medical establishment is working with leading partners. And since that announcement, we've been receiving amazing feedback — a ton of inbound, more than our team can actually handle. My hope is that this announcement, along with some of the work preceding it on the research side and our study with Penda Health, will generate a wave of adoption — not just among individual clinicians, which is what we're starting to see, but also at the level of health systems and potentially even governments. We're just at the beginning of that wave. That's part of what I meant by we're seeing adoption, but not scaled impact yet. Scaled impact will start happening, especially on the medical establishment-facing side, more this year.

(1:34:43) Nathan Labenz: Yeah. I wonder if there's a question — I'm sure you've thought about this — that's sort of analogous to the privacy question on the patient side: what is the form factor or mode of rollout that will be best received by doctors? Because if I'm a patient, I might say, hey, what I want is for the hospital — or even some bigger organization — to architect some workflows that just grind through everything. I want nothing missed, every record of mine examined, and I don't really care if that offends a doctor at some point. I just want the best results. But I could also imagine that if you were to do something like that, you might ruffle some feathers and create an immune response from the profession that you'd rather avoid. So is there a way — analogous to the privacy situation, where you might take a little off the fastball in terms of immediate value — to make the rollout more acceptable to decision-makers so that they hopefully work with you rather than being inclined to fight you over time?

(1:36:16) Karan Singhal: I think the trade-offs here are actually less than one would expect. You mentioned the possibility of protectionism in the industry, and we've actually seen much, much less of that than one would expect. I think the reason is that a lot of these doctors actually use it for themselves or for people they take care of. And when you do that — when you see the value of it, see it getting incredible things correct and maybe pointing out things you hadn't thought of — that becomes the easiest kind of conversation. When we talk to health system executives, you can tell instantly who's used it and who hasn't for this use case. The conversation becomes incredibly easy when people have used it.

What we see today is not a lack of top-down interest in adopting. We actually see a huge wave of top-down interest. The reality in healthcare, though, is that workflows are fairly entrenched and it takes some time to make changes. This is the kind of thing we found in our work with Penda Health — where in addition to rolling out a cool new tool, we had to do active change management: bringing people along, showing them how to use the technology, hosting sessions where people could learn together about how to use it going forward. That was a really important part of rolling this out. That kind of change management will be really important here as well.

So I do think it takes a bit of time, but my expectation is that this will be one of the faster rollouts of software in healthcare history. That's already been happening with people's usage of AI, and it's just at the beginning. I wouldn't expect it to take years and years for people to adopt AI more and more in clinical workflows. One of our goals for this year is for AI-assisted care to become more part of the norm of care, and I think by end of year that'll be the case — we hope. But I don't think it'll happen over weeks, which is the timescale we AI innovators are used to.

(1:38:15) Nathan Labenz: Let's talk about the connection between health and AI safety more broadly. I think that's super interesting. I remember the classic question from Ilya once upon a time: how do we teach AI to love humanity? I understand that notion is not entirely gone from OpenAI and is still part of what you're thinking about. I'd love to hear more about that.

(1:38:41) Karan Singhal: Yeah, absolutely. I think it starts with the foundations of the work that we've been doing on health at OpenAI and how we started it. I have a bit of a background as a researcher who cares about safety and has worked on safety research in the past. Part of the motivation for coming to OpenAI and working on health for me was thinking about the setting as a place that can provide concrete grounding for technical work on safety and alignment. I had a feeling — and this was about two years ago — that a lot of the most ambitious, medium and long-term work on safety and alignment was going on in toy settings or with math problems or things like this. It felt like if there were only a setting where the problems people were working on were well motivated, that provided more concrete feedback loops to researchers and more short-term incentives for the research, the research could have been better. That was part of the thesis for our approach at OpenAI.

We've done a few kinds of research. I talked about one of them — how to work on calibration. Another problem we've really thought a lot about is scalable oversight: how do we supervise AI systems that are potentially more capable than us in certain ways? This is a problem we've actually had for some time in our work with physicians. In many ways, AI doesn't match your expectations of a doctor and the ability to integrate across a bunch of different modalities. But in many specific narrow ways, models can even outperform physicians. In those specific narrow ways, when we evaluate models with physicians or have physician signal be part of the training signal, we have to invest in research around this problem of how you supervise systems that are potentially more capable than you — which is one version of scalable oversight. This is a problem that goes beyond health and is, I think, a really important problem for thinking about AI alignment. Our work in health has given it concrete grounding because we do have models that in some ways are more capable. That's been a lot of our focus.

We've also thought a lot about the right ways to approach that problem through evals and by pushing the high-compute RL paradigm in this setting so that we can learn from the aggregate opinions of experts. That's been an important part of our motivation. But broadly, a lot of where this has been heading for us is thinking about how do we extract the right personas or characters from the model. Because if models become more and more capable over time, have more and more context about a patient's medical record — because users are more proactively uploading it or the activation energy is lower via product — and have more ability to take output actions, whether that's telling you things, outputting a note you can give to your doctor, or other things they can do, then probably the most important thing in the very long term is: are the models the kind of models that would do the right thing for the patient, the user, the clinician, or the researcher? So that's the kind of thing we've been investing in heavily — how do we think about extracting the right kinds of personas for models? How do we do so in a way that's surgical, that gets the best aspects of those personas and avoids the parts we don't like? For example, avoiding a bias toward being over-conservative in challenging situations involving unclear medical consensus, while also navigating uncertainty really well. That work, which is closer to persona, character, or soul in its form, is something we're grounding a lot in health. It's been a really exciting advance, and I hope we'll have more to share on that in the future.

(1:42:27) Nathan Labenz: Can you talk a little more about scalable oversight? I think at one point it was kind of the plan. The Superalignment team, I think, was sort of premised on this idea — and maybe other ideas too — but my understanding was that scalable oversight was going to be a big part of it. One of the things that stood out was that the strong student sometimes just ignored or overrode the instructions of the weak teacher. And we're the weak teacher in this situation. So how are we supposed to feel about that already-emerging trend, as of probably at least two years ago now, where at times the strong student just decides it should exercise its own judgment even when it contradicts what humans — the weak teacher in the analogy — say? Have there been paradigmatic advances that give you confidence this is going to really work? Or is it kind of the understanding I have today of OpenAI's broader approach to safety — a defense-in-depth strategy, where everything works a little bit, we gradually chip away at the problem, and hopefully with enough layers of defense it'll be okay? Do you have more ambitious ideas about what scalable oversight can still achieve?

(1:43:53) Karan Singhal: Well, my thinking about how scalable oversight will proceed has changed a little bit over time. I think of the problem as having two parts. One is what I call rater scaling — how do you think about the right ways to elicit opinions and values from people, experts, or whoever? That's an example of something we've been investing in on the health side. You can imagine schemes where you have AI be part of that loop and help improve humans' ability to critique AI outputs. That's one area of work the company has been investing in.

The second area — and the framing for this has become more expansive over time, such that sometimes we don't use the word "scalable oversight" to refer to it anymore — is: given some idea of what the values are, whether they're elicited using rater scaling or not, how do you spend a lot of compute training models to have those values? I call that internally "value oversight." So there's rater scaling and there's value oversight.

The second problem is also one where I think things have advanced quite a bit, even though we don't use the phrase "scalable oversight" anymore. One example is the work people are doing on specs and constitutions, and on measuring and improving models' adherence to them. This is one way of saying: we have values we care about, and we want the model to adhere to them very consistently. If you check various system cards, you actually see that models have gotten much better at this. People have been finding increasing generalization in training models to have certain personas or characters, and in whether they comply with certain safety requirements, specs, or constitutions. What I'd say broadly is that the research has been advancing, but in a way that looks different from what people imagined. It looks a little bit more like humdrum post-training. I think that's a good thing — it's more grounded in how systems will actually look and how we'll train them. I don't think we've solved all the research problems, but sometimes we're working on them without referring to them the way we used to.

(1:46:20) Nathan Labenz: Would you say the core of the reason to think this will work is that a somewhat less capable model with a very large inference budget can be expected to catch problems — if we're thinking of a deceptive failure mode, a somewhat less capable model with a much bigger token budget, just really given a lot of time to think, can be enough — as long as the delta isn't so large that it can't detect what's going wrong in the more capable model? So as long as you've got a small enough delta and a sufficiently aligned current model, you can in theory continue to bootstrap your way into ever-more-capable models without losing alignment. Is that the core of the idea?

(1:47:29) Karan Singhal: I think that's part of it. Another part is that the task of discrimination or critique seems to be easier than the task of generating good outputs. When we study the performance of monitors, a given model can monitor itself — especially given privileged information like the chain of thought — pretty well, which is a little surprising on its face. But if you think about the fact that discrimination has been better than generation, and this underlies a lot of work in things like RLAIF and constitutional AI, then it's not so surprising. That's another part of it. But I don't think we have a full understanding of how this will scale, especially in the regime of having trusted but less capable models aligning a more capable model. So far I've mainly been talking about: given we have the values in some format, whether from humans or models, how do we instill those into models in a trustworthy way? I think that's a part of the problem. I don't think the whole problem is solved, for sure.
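The "discrimination is easier than generation" point can be made concrete with a tiny toy sketch. To be clear, everything here is invented for illustration — `strong_generator`, `weak_critic`, `best_of_n`, and the numeric `TARGET` are stand-ins, not anything from OpenAI's actual pipeline. The sketch shows the pattern underlying RLAIF-style loops: a weak critic that can only score answers, never produce them, can still steer a stronger but unreliable generator via best-of-n selection.

```python
import random

random.seed(0)

TARGET = 42  # the answer the weak critic can recognize but could not generate itself

def strong_generator() -> int:
    """Proposes candidate answers: capable of hitting the target, but unreliable."""
    return random.randint(0, 100)

def weak_critic(answer: int) -> int:
    """Only judges candidates: scores closeness to the target; higher is better."""
    return -abs(answer - TARGET)

def best_of_n(n: int) -> int:
    """Sample n candidates from the generator and keep the critic's favorite."""
    candidates = [strong_generator() for _ in range(n)]
    return max(candidates, key=weak_critic)
```

Even though the critic never produces an answer on its own, spending more samples on critic-guided selection pulls the output toward what the critic endorses — the asymmetry that makes a weaker judge useful for supervising a stronger generator.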

(1:48:36) Nathan Labenz: How do you think we're doing on safety as a whole? I've been an Eliezer reader since 2007 and actually read a lot more of his stuff way back in the day than I have more recently. I would say broadly speaking, this has in some sense gone amazingly well relative to baseline expectations. We do have models that undeniably have a pretty good sense of human values — that's just manifestly obvious on a day-to-day basis. That was considered a very unlikely outcome years ago. If you teleported back to 2007 and dropped a post on Overcoming Bias indicating as much, it would have been considered laughable that this would be accomplished in this way.

And yet we do still kind of see — my mental model of the seesaw we're on is that every generation of new models has new capabilities generally, and it also seems to have some new emergent problem. Whether it's deception, or now eval awareness has become front and center and we sort of say "okay, that's a big problem" — the next generation gets more powerful, and we kind of tamp down the last problem. It doesn't go to zero, but it's at least reduced. It seems like we're headed toward a strange world if you extrapolate both the capability trend and the roughly two-thirds to order-of-magnitude reduction in bad behaviors that seems to happen, at least once a given bad behavior is recognized and addressed, from one generation to the next. Extrapolate that out to 2028 or whenever, and you're like: okay, I can imagine a model that can do a month's worth of human work, but for any given run there's maybe a one-in-a-thousand or one-in-ten-thousand or one-in-a-hundred-thousand chance of it actively screwing me over in some super bizarre way. That just seems like a really weird thing to contemplate. If we believe in straight lines on log graphs, that seems to be kind of where we're going. Do you think that's where we are going? And if not, what's your mental model of what the alignment-versus-emergent-problems balance would look like in a couple years' time?

(1:51:08) Karan Singhal: Yeah, I think the future world will definitely be pretty weird. It's very hard to predict what will happen. I do think models will become more capable and will be able to do things over longer time horizons — that's very important. I also think we've been pleasantly surprised by the extent of safety generalization or alignment generalization, which underlies the trend you're pointing out of various safety benchmarks and failure modes decreasing over time. That's been a relatively pleasant development.

We've had two major scaling laws for deep learning. One is the pre-training scaling law. What we found is that when we scaled up pre-training, we had models that could be relatively easily tuned via SFT, or small amounts of supervised learning, to exhibit a certain persona and be helpful, useful assistants. That's a lot of what the early work on instruction tuning did, and also what the early work on Med-PaLM did for the health setting, which is really great. We saw a lot of generalization there. Then there was a question for the reasoning setting — whether that generalization would hold. So far, my sense is that at scale, we do see it. At smaller scales, we saw less of it. That's a promising development: if you are able to figure out the right ways to get these models to be broadly beneficial and not harmful, then even if they're put in settings where they're doing things you didn't foresee — doing a month's worth of work in days, or things like this — they can continue to generalize and be safe in those settings, and those safety curves will continue to go down.

That said, you're right to point out there's a rapid increase in capabilities and in the surface in which these models will be used, and also a rapid decrease in failures we're seeing over time. It's a little hard to predict how both those curves will net out. I do think it's really important — and this is why I care so much about the parts of our mission that are about proactively making tangible benefits happen and also about working on mitigating safety risks — that we really stay ahead of these curves and think about the right ways to shape them in the right direction.

(1:53:21) Nathan Labenz: What does a Move 37 look like in health? Are we going to see that in the current paradigm, or would it take a deeper integration of modalities like we were touching on earlier? Another idea I'm kind of enamored with is a different kind of training objective — thinking about, instead of getting right answers to questions, more fundamentally predicting what the state of health is going to be for a patient in a way where you could even begin to do in silico experiments. Hey, I've got this whole profile of this patient — what if I change this variable? What would their health look like in that case? That seems like it might be a pretty different paradigm from "read the whole internet and make sure you give me the right answer." What do you think? Touch on any of those you want, but maybe also just take the opportunity to zoom out and give us a sense of your big visions and ambitions, and what people can expect as you guys continue to do your thing over the next year-plus.

(1:54:34) Karan Singhal: Yeah, I love the Move 37 framing. Just to explain the reference — this is the famous Move 37 from one of the now-famous games between AlphaGo, the AI system playing Go, and Lee Sedol. The interesting thing about this move was that everybody agreed it was a move humans would not have made, but in hindsight it was brilliant and was key to winning the game. So your question is: can we imagine a world where models are able to do something, make some kind of interesting prediction that a human probably would not have been able to make, but that is again brilliant and impactful in hindsight?

My view is that this is not too far away. Many people tell me they saw many doctors for their case, and only after talking to ChatGPT was the key issue flagged — something they then shared with their doctor, and together they were able to come to a diagnosis. That seems to happen somewhat routinely. Whether it rises to the level of a Move 37 is, I think, a matter of taste, depending on the case.

What I'd say is that you should expect world models in health to improve. What I mean by world models is that our models should have a much better understanding over time of us, our health, our health trajectories, and how those intersect with our biology. I think that's going to be really key to what a Move 37 could look like. You mentioned the idea of simulating people and simulating things in silico. Full simulations are obviously very expensive and difficult and aren't really what models are most designed to do in their current form. But if it's more along the lines of: given a lot of context about a user, predict something interesting about them or predict the results of some intervention — I think this is something that models are already getting better at, and could get a lot better at over time, and that would be potentially quite impactful.

I think we'll start to reach a point where there will be more and more clear demonstrations of this. You're starting to see interesting and increasingly clear demonstrations of models doing interesting science and math in public. I think you'll see more in this space as well, but it will take a little bit of time.

My view on paradigms is that you can get a lot out of extending and tweaking the current paradigms — you can in fact view everything that has happened so far as just one paradigm. Between pre-training and scaling reasoning-style training, I think you can get quite a lot of juice. Even integrating additional modalities doesn't require too many additional changes or tweaks.

Broadly, our team's mission today is to do whatever it takes to ensure AGI is beneficial for human health, for all of humanity. We see that happening through three channels. The first is helping consumers understand and navigate their health, which is already happening through ChatGPT Health. The second is empowering the health system and thinking about the ways in which AI-assisted care can become part of the standard of care, reducing the extent to which clinicians are bogged down by paperwork so they can spend more time seeing patients and thinking about the problems that matter most for improving care. The final thing is really pushing up the ceiling of research, and I found your vision pretty inspiring there. What I would say is: as the core problems in healthcare we've talked about — the fragmentation of data, the fragmentation of the patient experience — improve, you should also expect an acceleration in the application of intelligence to that data. I think biology and health are among the areas where marginal gains in intelligence have the most obvious value in solving more and more problems for humanity. There are many examples of previous breakthroughs in biology where nothing stopped that breakthrough from happening five or ten years earlier except more human ingenuity applied to it — no physical blocker. When that's the case, you can assume that long-running models and agents connected to the right data can do really incredible things. The hope is that in addition to our existing work, which has really been focused on raising the floor of human health, we start to raise the ceiling of human health as well.

(1:58:57) Nathan Labenz: Yeah, it's going to be not just the AI doctor, but also the AI biomedical research scientist as well. Wow. Okay. It's an exciting time to be alive — I'm increasingly saying that at the end of these conversations these days. Anything else you want to make sure people are aware of that I didn't ask you about?

(1:59:15) Karan Singhal: Like I mentioned earlier, I think the right way to think about our work on health at OpenAI is as operating in three phases. The first is laying the foundation, and a lot of that is focused on our work on safety. Examples include HealthBench, an evaluation you can run offline of large language model performance and safety in health; our study of the first AI clinical copilot with Penda Health; and a bunch of model improvements over the last year — all of which laid the foundation for the work we're doing today. The second is the rise in adoption we've been seeing — an incredible rate of individual users, especially patients, adopting the technology. We've seen hundreds of millions of people asking health questions a week, which has been rapid growth since last year. The final thing is the future of scaling the impact of this work. That's where our work on ChatGPT Health — the consumer-facing product that enables you to connect your health data with additional privacy protections — and our work on ChatGPT for Healthcare — which is aimed at health systems and enables health professionals to use AI as a copilot in their workflows — come in. Between all these things, I'm really excited about the work we're going to do to scale the impact in the next year. I'm looking forward to what's to come.

(2:00:33) Nathan Labenz: Karan Singhal, Head of Health at OpenAI — thank you, legitimately, for all your hard work. It really is incredibly valuable, incredibly impactful, and that's obviously just going to continue to grow exponentially along with so many things in the AI space. Thank you for being part of the Cognitive Revolution.

(2:00:49) Karan Singhal: Thanks for having me.

(2:03:07) Nathan Labenz: If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
