Cracking the Medical Code: Why Cleveland Clinic Doctors Love Their Ambience Healthcare AI Scribe


Brendan Fortuner from Ambience Healthcare and Ben Shahshahani from Cleveland Clinic discuss how AI is transforming medical documentation and coding in a healthcare system that spends $1 trillion annually on administrative tasks.




Read Episode Description

Brendan Fortuner from Ambience Healthcare and Ben Shahshahani from Cleveland Clinic discuss how AI is transforming medical documentation and coding in a healthcare system that spends $1 trillion annually on administrative tasks. They explore Ambience's technical breakthrough using OpenAI's Reinforcement Fine-Tuning to achieve medical coding accuracy that exceeds human doctors by 12 percentage points, including their specialty-by-specialty approach and solutions to reward hacking behavior. The conversation reveals key insights about AI deployment strategy, including how Cleveland Clinic achieved 75% voluntary adoption across 4,000 physicians after requiring just a single use of the AI scribe. This case study demonstrates what it takes to successfully implement AI tools in complex, high-stakes healthcare environments where user skepticism and regulatory requirements create significant deployment challenges.

Sponsors:
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

The AGNTCY: The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org

NetSuite by Oracle: NetSuite by Oracle is the AI-powered business management suite trusted by over 42,000 businesses, offering a unified platform for accounting, financial management, inventory, and HR. Gain total visibility and control to make quick decisions and automate everyday tasks—download the free ebook, Navigating Global Trade: Three Insights for Leaders, at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(04:05) Introduction and Backstory
(05:20) Ambience Healthcare Overview
(07:53) AI Adoption in Healthcare
(11:11) Documentation Pain Points (Part 1)
(16:11) Sponsors: Oracle Cloud Infrastructure | The AGNTCY
(18:11) Documentation Pain Points (Part 2)
(19:00) Product Architecture Deep Dive
(26:23) Technical Evolution and Specialization (Part 1)
(32:05) Sponsor: NetSuite by Oracle
(33:28) Technical Evolution and Specialization (Part 2)
(33:50) Healthcare Coding Challenges
(48:37) Reinforcement Fine-Tuning Implementation
(58:13) Task Prioritization Framework
(01:08:40) Adoption Strategies and Culture
(01:12:12) Cost Lessons and Grader Selection
(01:18:33) Future Directions and Patient Products
(01:24:53) Closing Thoughts and Opportunities
(01:27:58) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


TRANSCRIPT

Introduction


Hello, and welcome back to the Cognitive Revolution!


Today my guests are Brendan Fortuner, Head of Engineering at Ambience Healthcare, and Ben Shahshahani, Chief AI Officer at Cleveland Clinic.


Did you know that the US healthcare system spends $1 trillion per year on administrative tasks? Or that doctors spend hours each day, during what they call "pajama time," documenting their patient interactions after hours? Or that human doctors are only 45% accurate when it comes to translating their understanding of patient conditions into the ICD-10 codes used in billing, and that these codes are then painstakingly reviewed by coding specialists employed by both healthcare providers and insurance companies?


I knew there was a lot of room for improvement in the US medical system, but I honestly didn't realize the magnitude of the opportunity. And so… when the CMO at Ambience reached out to suggest an episode, I checked the Ambience website, saw that they offer, among other things, an AI Scribe for doctor-patient interactions, and – remembering a recent chat with a doctor friend of mine, who had been complaining about the inaccuracy and general uselessness of the AI Scribe deployed in his clinic – initially failed to recognize what an interesting conversation this might be.


That changed when I happened to see that Ambience was featured as a successful early adopter of OpenAI's Reinforcement Fine-Tuning product. Having once earned such a feature myself at Waymark, I know they don't come easy, and when I saw that RFT had allowed them to outperform human doctors on the ICD-10 medical coding task by a full 12 percentage points, I knew I wanted to dig in and learn as much as I could.


And in the end, this conversation – to which Brendan also invited his customer & friend Ben from the world-class Cleveland Clinic Medical Center – turned out to be an excellent one, spanning both technical implementation and practical deployment strategies.  


We get pretty deep into the weeds on the details of how Ambience has achieved such strong results, including their specialty-by-specialty approach, how they used RFT to optimize the model's ICD-10 coding F1 score, and the instances of Reward Hacking behavior they observed and how they addressed them.  We even get into the patient-facing products they are developing to improve outcomes while further reducing burden on medical staff by automating the follow-up calls that nudge patients to get tests done and take their medicines as directed.  


As an aside, I confused the F1 score with "Pass@1" for a moment, so it's probably worth mentioning that the F1 score is a way of balancing Precision, the percentage of the system's outputs that are correct, and Recall, the percentage of all correct outputs that the system actually produces, by taking the harmonic mean of these two numbers.
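To make that concrete, here is a generic sketch (not Ambience's code) of precision, recall, and F1 for a set-prediction task like ICD-10 coding, where the model proposes a set of codes and is scored against a gold set:

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 = harmonic mean of precision and recall for set predictions."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)  # fraction of outputs that are correct
    recall = true_positives / len(gold)          # fraction of correct outputs produced
    if true_positives == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. the model predicts 3 codes, 2 of which are correct, out of 4 gold codes:
# precision = 2/3, recall = 2/4 -> F1 = 4/7 ≈ 0.571
print(f1_score({"E11.9", "I10", "J45"}, {"E11.9", "I10", "Z79.4", "R73.03"}))
```

The harmonic mean punishes imbalance: a model that sprays every plausible code gets high recall but low precision, and vice versa, so neither degenerate strategy scores well.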


On the deployment side, I think Ben's account of how users develop mental models about whether AI tools are worth using, which considers both the success rate and the effort required to recover from any errors, is a brilliant distillation of things that I and many others have learned the hard way, but perhaps never articulated so clearly. 


And I was also fascinated to learn that Ben and team required that Cleveland Clinic doctors use the Ambience Scribe just once – and that single experience was enough to achieve 75% voluntary utilization across 4,000 physicians spanning some 60 specialties.  


For operational leaders wondering how to think about AI adoption mandates, and for AI product builders wondering what level of reliability is required for success, this is absolutely something to chew on.  The medical scribe company serving my friend's clinic clearly wasted a precious opportunity. 


There's a lot more here as well – including a discussion of what happens to the people currently employed as medical coding specialists – but without further ado, I hope you enjoy this outstanding case study on where the rubber of AI product development hits the road of deployment in complex, high-stakes, regulated environments full of understandably skeptical users, with trillions of dollars at stake and the potential to transform American healthcare as we know it, with Brendan Fortuner of Ambience Healthcare and Ben Shahshahani of Cleveland Clinic.





Main Episode

Nathan Labenz: Brendan Fortuner, head of engineering at Ambience Healthcare, and Ben Shahshahani, chief AI officer at Cleveland Clinic, welcome to the Cognitive Revolution.

Ben Shahshahani: Good to be here.

Brendan Fortuner: Yep.

Nathan Labenz: I'm excited for this conversation. It's been, uh, a number of weeks in the making and, uh, just to tell a super brief backstory, I got a- an inbound pitch from, I believe, the CMO at, uh, Ambience, and I just did a quick look and saw the, sort of, AI scribe notion for the medical context. As it happened, I had just talked to a friend who's a doctor who was complaining about his medical AI scribe, and I was like, "Oh, well, how do I evaluate this? Some of these things might suck out there and others could be good, but I don't really know." And then, days later, popped up a case study on the OpenAI website, which is, uh, a strong signal of knowing what you're doing. And so then I... Immediately having seen that, I was like, "Right. I'm in. This is the... You guys are the- the AI scribe for the medical context that I wanna talk to and learn from." And then you also, you know... Thank you for bringing an additional guest, which is, uh, which is incredible. We'll have the chance to talk about both the technology side, the implementation side, the social context in which all this is actually where the rubber hits the road. Maybe for starters, give us the, uh, quick intro to Ambience Healthcare.

Brendan Fortuner: Yeah. For sure, for sure. Well, yeah. Again, thanks for having us. We're- I'm super excited. I think this is gonna be a ton of fun. Um, Ambience, um, was founded about four years ago. We're building an AI platform for hospitals. You could think of us as this clinical intelligence layer that sits on top of the system of record that's like the EHR, like Epic and Cerner, and we help augment, enhance, automate both clinical and administrative tasks to make the overall hospital system more efficient, um, and actually improve the quality of care. There's, like, three different... You know, platform is- is often, you know, an ambiguous word. What is a platform? But I can make it concrete. There's three different product lines at Ambience. Um, the first is products for clinical workflows. This is the, you know, flagship Ambience Scribe, right? Uh, the one that- that we first deployed at Cleveland Clinic where it helps doctors take notes. The second, though, however, is- is, like, products that help the revenue cycle teams. Hospitals also have to make money. They have to be compliant. So Ambience is building out a suite of point-of-care products for- for coding and billing, which we can talk about. And very recently, actually, we're moving into patient-facing products. So I think patient engagement, we have patient instructions that will write for the patients, but also more advanced things like voice agents, which will call you on the phone and check if you've taken labs and medications, et cetera. So those are our three different, kind of, tenets of the company, and- and we're just growing really fast.

Nathan Labenz: So many things are going vertical in the AI space right now. It's really quite something to behold. Ben, do you wanna tell us, kind of, the... How long you guys have been working together? But this is all coming at the healthcare industry very quickly. Actually, one of the things that I expected, two and a half years ago now, that I've been very pleasantly surprised by... When I first tested GPT-4, I was like, "Everybody is gonna unionize like crazy, and we're gonna see the most protective, you know, moves that we can possibly imagine, whether it's, you know, taxi drivers or lawyers or doctors. Everybody's gonna be trying to protect their turf and keep the AI out of their environment." And actually, a lot less of that has happened over the last two and a half years than I would have expected, and I wonder what... You know, o- one kind of candidate idea that I have had for that is like, maybe the doctors are just all so burned out that they'll take any help that they can get. But how would you describe the last, kind of, couple years of growing awareness, growing adoption, and- and just the reception and the lack of, at least from my perspective, the apparent lack of hostility to AI that has, um... That seems to have been the norm in the medical world?

Ben Shahshahani: Yeah. I think... First of all, let me just say a little bit about Cleveland Clinic. We're a, uh, academic medical center in Ohio. We have basically three me- uh, charters in our mission statement. One is providing healthcare. We have about 25, 27 h-... Actually, a little bit unusual in the sense that we have international footprints. So we have our hospitals in Ohio, in Florida. We have a hospital in London and in Abu Dhabi. And we also have outpatient facilities in Toronto and in a few other states, including Maryland. And I joined, actually, Cleveland Clinic just about nine months ago, and my background is in tech. So... And yeah. I mean, one of the reasons, actually, to come into healthcare is applications and implementation of AI in healthcare. And what you said is interesting because, A, in terms of the clinical usage of AI, if you think about the work that the doctors and nurses are doing, there is such a shortage of caregivers and so much demand that I think the issue of, "Hey, is this thing going to take away my job?" is really not something that we see maybe even in our lifetime 'cause there is so much demand for- for healthcare. And, um... And doctors are already using it. I mean, we know from the surveys and the publications that come out that a lot of doctors are using ChatGPT or other AI systems. So what we wanna do is do it in the right way, actually, rather than them thinking that, "Hey, you know, I'm using it. Am I using it the right way? Am I not using it the right way or, uh, the wrong way?" Or mistakenly put some confidential or PHI sensitive information in... You know, in the chat bot. We wanna bring those things forefront and actually put it... Make it as part of their standard workflow. Obviously, there are a lot of challenges in this, but it's, uh... Definitely is a... We think of it as a huge and important productivity tool. And the implementations that we've had so far with AI, most of them, I would say, are... 
You would consider them productivity tools. And in, you know, one... If you look at it in that way and look at the demand and supply, um...Productivity tools don't necessarily lead to reduction in force, right? Because it dema- it, it depends on how much demand there is for that service. If the se- if there is more demand, it just means that you can, uh, you can serve more people because now you're more productive. And, and there may even be situations where if you have more productivity tools, then actually the demand may actually go up because you can, uh, reduce the cost, for instance. And that actually generates more demand. So the fact that something automates and helps you be more efficient doesn't necessarily mean that, hey, you're in trouble and it really depends on the entire ecosystem.

Nathan Labenz: Tell us a bit more about the pain points. I'll circle back later to some of the more aggressive AI doctor experiments that are starting to emerge. But staying within the productivity tool perspective for now, where do doctors need the most help? Tell us about the products that you're working on.

Ben Shahshahani: Documentation is definitely at the top of the list. It's well known that we have a shortage of caregivers, especially doctors and nurses, and they are burnt out. Many surveys identify the primary cause of burnout as administrative and non-clinical documentation, or documentation overall. That's the part that takes time away from their clinical work. It also cuts into their personal time, often referred to as 'pajama time.' On average, a doctor may spend up to three hours a day doing documentation, often after patient visits and in the evening. When we started working with Ambience and developing AI Scribe, we focused on improving the caregiver experience. Our goal was to reduce burnout and give doctors time back to focus on the important clinical work that drew them to medicine. Along the way, we discovered additional benefits, like how better documentation improves coding, as Brendan mentioned, which impacts financial outcomes. That wasn't our initial focus, but it's clear that the benefits of AI extend beyond reducing burnout to enhancing documentation quality and even revenue.

Nathan Labenz: One of the stats Brendan shared is mind-blowing—the math checks out: a trillion dollars is spent annually on administrative tasks in the US healthcare sector. You might wonder how that's possible, but with a $27 trillion GDP and nearly 20% going to healthcare, that's about $5 trillion. So a trillion of that for administrative overhead is roughly a fifth of healthcare spending, which, when put that way, doesn't sound so crazy. Still, a trillion dollars is astonishing. My guess is that doesn't even include all the time doctors spend on this. I wonder—does the accounting really matter, or are there other factors?

Nathan Labenz: Are pajama hours being factored into that trillion dollars?

Ben Shahshahani: I don't know. I doubt it. But I've seen another interesting report—if you're interested, I can find the source. It showed that from 1975 to 2010, the number of doctors in the US increased by about 150%, which is similar to the population growth. However, the number of administrative workers in healthcare increased by more than 3,000% during that time.

Nathan Labenz: Yeah, those are university-like numbers.

Ben Shahshahani: What's striking is that having more administrative people hasn't reduced the administrative workload for doctors. In fact, doctors now have even more administrative tasks. For various reasons, the number of non-clinical staff in healthcare has dramatically increased.

Nathan Labenz: This might be a bit off-topic, but why is that? Is it due to increased complexity? I'm familiar with the rise of the hospitalist, but that's still a medical role. What are all these administrative jobs?

Ben Shahshahani: Partly, it's due to new regulations that require additional staff to ensure compliance. Most doctors are now employed by hospital systems, which adds complexity to managing these organizations. There are many factors, like insurance and billing, that require extensive documentation. Altogether, these demands have added up over time.

Nathan Labenz: Yeah. Fascinating. I mean, that is a lot. I've often asked the same question about the universities. Brendan, do you wanna get a little bit more in depth into the product offering and you know, the before and after? I mean, we know what the pajama time, uh, looks like-... what is the post-appointment reality for the doctors? You know, how are they saving time? Are they... You know, are they also saving money? And then I d- I really wanna also get into a lot of the technical detail on, like, how you made this work, because I think right now, obviously reinforcement learning is a mega trend. Reinforcement fine-tuning from OpenAI, relatively new. I would say kind of underutilized relative to what it... The potential that it seems to have. So yeah, give us kinda the, the setup, but then we'll really dig in on how we actually make this happen.

Brendan Fortuner: Yeah. No, I think maybe it's helpful to start with the scribing product, 'cause I think, you know, that's the bread and butter of Ambience and really becomes the epicenter of all the other product lines. So Ambience scribing is describing the clinical workflow. You and I, we go into our doctor's office, you know, that's called an outpatient care setting. Doctor has an Ambience mobile application. We're also natively embedded with Epic. Epic has a mobile application, you can use that for recording with Ambience. But visits are typically 15 minutes, 30 minutes. You'll bring the phone in, they'll record that conversation. After the visit, they'll click "end recording". Ambience will then go transcribe that audio into text using fine-tuned models, you know, trained for medical speech recognition, and then we'll call a bunch of different language models, all tuned for different medical tasks, to then generate these clinical notes in different sections of the clinical note. We then automatically write that back into Epic, into the other EMRs. So when the clinician is done with the visit, they can just go back to their desks, they refresh what's called their progress note and boom, all the Ambience documentation is just sitting there for them. And then that turns out to save them, like, two to three hours per day, just the task of kind of automating that. But getting into that coding component, Ambience realized, I think very early on, "Wait a second. Doctors went to medical school to practice medicine. They did not go to medical school to, like, select the right billing codes." And there's two cases which we can talk about with RFT a little later, but basically, they make a lot of mistakes and that leads to a lot of downstream pain for insurers and, and these revenue cycle teams.
So Ambience, at the point of care, before they finish th- that note, we will start to assist them and we'll make suggestions on the right ICD-10 codes, for instance, and then we'll automatically file them back into Epic. Again, saving them time and also increasing what we'll call, like, the quality of the documentation. So that's a little bit about the workflow, um, and it's different for every care setting and, and even every specialty, so there's some nuances there.
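As a rough sketch of the per-visit workflow Brendan describes, with every function, section name, and data shape an illustrative assumption (and the actual model calls stubbed out), the pipeline might look like:

```python
# Hypothetical sketch of the scribe pipeline: transcribe the visit audio,
# then generate each note section with a specialty-tuned model call.
# None of these names are Ambience's actual API.

SECTIONS = {
    "primary care": ["HPI", "ROS", "Physical Exam", "Assessment & Plan"],
    "cardiology": ["HPI", "Cardiac History", "Physical Exam", "Assessment & Plan"],
}

def transcribe(audio: bytes) -> str:
    # Stub for a fine-tuned medical speech-recognition model
    return "Patient reports intermittent chest pain on exertion..."

def generate_section(transcript: str, section: str, specialty: str) -> str:
    # Stub for a section-specific, specialty-tuned language model call
    return f"[{specialty}/{section}] drafted from transcript"

def process_visit(audio: bytes, specialty: str) -> dict[str, str]:
    transcript = transcribe(audio)
    # Each note section gets its own tuned model/prompt for the specialty;
    # the finished note would then be written back into the EHR.
    return {s: generate_section(transcript, s, specialty)
            for s in SECTIONS[specialty]}

note = process_visit(b"...", "cardiology")
print(list(note))  # the section names composing a cardiology note
```

The point of the sketch is the decomposition: rather than one end-to-end generation, each section is a separately tuned, separately measurable task, keyed by specialty.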

Nathan Labenz: Yeah, dig in a little bit on the nuances. One thing that jumps out to me is, obviously in the AI space in general right now, there's a lot of talk and a lot of confusion about agents. What is an agent? You know, the nature of agents. How agentic things should be.

Brendan Fortuner: Yeah.

Nathan Labenz: It seems to me that you are doing what I think most people who are actually realizing major value are doing, which is not letting the AIs choose their own adventure through this, uh, problem, but rather, like- ... really decomposing the task, setting up intricate workflows. Maybe not super intricate in some cases, but like-

Brendan Fortuner: Yeah.

Nathan Labenz: ... you know, step by step where you have decided exactly how you want this thing to go, and then measuring and optimizing every node in that workflow until-

Brendan Fortuner: Yep.

Nathan Labenz: ... you get to something that is deployable, and then beyond that, presumably, as well. Can you... And you mentioned, like, fine-tuning of, of speech models, which is interesting.

Brendan Fortuner: Yeah.

Nathan Labenz: Fine-tuning of, like, multiple different models. I'd love to just get as much detail as you can offer in terms of the breakdown of... And, and maybe also, like, how you work with the subject matter experts, 'cause I've, I've seen that-

Brendan Fortuner: Yeah.

Nathan Labenz: ... also be a real stumbling point, even in just all sorts of Main Street businesses that wanna do stuff.

Brendan Fortuner: Yeah.

Nathan Labenz: There, there's often, like, the AI person who has, like, some know-how, but then there's the, the disconnect of, like, how do you want this task to be done and can the person articulate how they want it done, and... And obviously that's gonna be pretty important to get right in the medical context. So-

Brendan Fortuner: Yeah.

Nathan Labenz: ... with that long prompt- ... let's hear your chain of thought on how you, uh-

Brendan Fortuner: Maybe, maybe too long for my context window.

Nathan Labenz: Uh, I- I doubt that. Yeah, go for it.

Brendan Fortuner: Maybe we can start, we can start with specialties and some nuances there, 'cause that really informs the architecture of Ambience, you know, and then, you know, we can talk about particular components of the architecture, how we power them, and maybe go from there. How does that sound?

Nathan Labenz: Yeah, that's great.

Brendan Fortuner: I first wanted to ask, you said some of your doctor friends, you know, they weren't seeing some benefits of scribing. Do you happen to know what specialties they were in, or...?

Nathan Labenz: My... Yeah, the one friend who I was talking to, to about this, interestingly, is a family doctor, so it's a pretty, um-

Brendan Fortuner: Family care.

Nathan Labenz: ... diverse, uh, you know, set of reasons people come to him. It is in a highly Spanish-speaking community, so that's one wrinkle. He probably speaks 50% English, 50% Spanish on his day-to-day basis. Beyond that, I don't know too much, but I think it's a broad array of things he sees on any given day.

Brendan Fortuner: I ask because I think this is something we also observed. Early generations of Ambience were designed for primary care clinicians or internal medicine. These are typically the providers you and I see once a year. We have a rash, they go through our problems. What we found when we tried to expand to more complex specialties, right? There are over 100 different medical specialties in three or four different care settings, if you include telemedicine. The product didn't work. Users weren't using it. We tried to figure out why, and as we dug deeper, we realized, wait a second, there's a tremendously rich heterogeneity across specialties. If you don't bake that into your fundamental architecture, product design, and the actual models you're fine-tuning, you won't get adoption. We had these weird charts where primary care was great, or telemedicine urgent care was great, but oncology, cardiology had very little adoption. This was earlier generations of Ambience. So we stepped back and re-architected our system to allow us and our team to think, instead of Scribe, we think about cardiology Scribe, oncology Scribe, and inpatient Scribe. Each of those, you'd think on the surface, the first step is different models, and we do. But it's more than that. It's the user experience itself in the product. There are different buttons, different elements of the chart that we pull into the note from Epic between primary care, cardiology, and oncology. We can get into some of the nuances. Then you have the emergency department and inpatient. These are multiplayer settings. It's more than one clinician. They see the patient multiple times over the course of multiple hours and multiple days, right? So your whole workflow in the mobile app, like record and then stop, totally breaks. That was the first insight and really grounded Ambience's philosophy around development. 
The benefits are tremendous, and maybe Ben can speak to some things we saw at Cleveland Clinic, but by tailoring it this way, you can get two to three times the utilization. That has downstream effects for ROI, value capture, and clinician satisfaction. That's the bread and butter of how we approach it.

Nathan Labenz: What would you say has most moved the needle? I don't know exactly what calendar time you were describing when primary care worked reasonably well but other things weren't working yet. I wonder what thresholds or big unlocks, like I would imagine... I don't know if that was pre-Whisper or post-Whisper, but the open sourcing of Whisper might have been one notable unlock for you. Once upon a time for me, we're starting to maybe move into a new era now, but I once found a huge unlock just in training on chain of thought for certain tasks.

Brendan Fortuner: Yes. Yes.

Nathan Labenz: So, yes, I'm curious as to what the big step-change moments were for you along the way.

Brendan Fortuner: There's some interesting ancient history here I feel comfortable sharing. I joined the company three and a half years ago. At the time, this was sort of pre-inflection point for generative model capabilities. We were actually using architecture based on BERT. We were still doing scribing, focused on telemedicine. The particular architecture was actually ColBERT. So instead of just using a generative model, an autoregressive model to generate tokens immediately, we actually broke the task down into this extreme classification task over transcript chunks. Basically, the chunks would flow in, and then we'd classify them based on this huge ontology of medical concepts, like what were the symptoms? So we'd get this big list of concepts, and then at the very end, we'd have this procedural compiler which kind of glues everything together into a templated, very templated note. It was actually based on BERT and ColBERT, and I think that got us through the early generations of the company. But the big inflection point, of course, was like, wow, okay, hold on, GPT-3, when they finally released that instruction following version. I remember, like three years ago or so, nobody really knew what was happening, but when you saw that and started playing around with it, you realized this is a different era. Everything has changed. I think that was the first inflection point in our industry. I do think some of the things with Whisper, on the ASR side, were another kind of inflection point. Since then, we've had other step-function improvements over time. But if I were to say, it's one thing to have that technology, but you can still see even now in the landscape with many products, it's still not working, right? Like you said, in many specialties, in certain products, it's not working. I think what Ambience did is we took these generative models. 
Instead of having just one pipeline where we tried to have one-size-fits-all, you know, generate for all specialties, we built this system where we could plug and play, like, in a very granular way. Which note sections do you want as a clinician? Do you want a history of present illness, followed by an ROS, followed by a physical exam and assessment plan? We found cardiologists want it a little bit different, so we'd create a separate model for them, or a separate prompt for them, and we let them select. With our users, there are probably 500 different permutations of outputs we'll give just for a Scribe product. Then the clinicians can choose them, compose them, and we then give them styles so they can style them. Do they want them concise or bulleted, narrative? Do they want to use layman's terms? Do they want to use military time, clock time? I think that level of customization, even if it's built on the same models, is what really unlocked the utilization.

Nathan Labenz: How much simpler has the architecture become as the models have evolved? In my case, with Waymark, we have a similar history. People have heard me describe this before, so very briefly, for images, we used to compile a bunch of content for a small business off the web and then try to put that into a video form for them. Around the same timeframe, three years ago, you could caption an image, but the captions you would get would be very generic. For example, you might have an image that was clearly a doctor and a patient in a medical setting, and the caption would be...

Brendan Fortuner: Yep. Yep.

Nathan Labenz: That was the best image understanding we had, so we had all these hacks and whatever you name it. It was pretty gnarly. It was actually fun to build. I look back on that era of hacking-

Brendan Fortuner: Yeah.

Nathan Labenz: on different models and trying to figure out how to make clip embeddings useful to sort things for what I was looking to bring to the top. I look back on that fondly. But now we can just, as you'd expect, dump 100 images into Gemini Flash or Haiku or 4o mini or whatever, and just pick these images, should I pick, and it does a much better job than we used to do.

Brendan Fortuner: Yeah.

Nathan Labenz: So I imagine that you have a similar simultaneous simplification, but then also expansion of possibility. Could you tell the story of simplification but expansion as it's unfolded over the last, say, three years?

Brendan Fortuner: Yeah. I want to hear your story too. It's a great question. I would say...

Ben Shahshahani: Certain things have gotten easier, but as we've learned more about healthcare, we've uncovered new use cases and things we didn't anticipate that require additional levels of complexity, right? For example, in primary care, if you want to summarize a note, off-the-shelf models with a prompt in primary care can get you so much further than it was three years ago. It's incredible. It's an incredible inflection point. But if you want an interactive agent that can safely, in a safety-critical environment, speak with the patient and collect intake, that actually does require a still high level of craftsmanship, guardrails, and complexity that looks as much or more complex than what I was describing three years ago. So I think it is so use case dependent. Something simple, like predicting ICD-10 codes based on what was discussed, the complexity has moved from maybe modeling with the architectures and tweaking your parameters and your loss, to actually, what is the right way to formulate this task? What are the right grader semantics? And how can we actually annotate a dataset that's high enough quality that it models the subjectivity, and then deal with all the repercussions like reward hacking, et cetera? So, complexity is still there; it's evolving, but it really depends on the actual specific task.

Nathan Labenz: Yeah. I want to hear more about the patient-facing stuff as well. But let's go down the coding rabbit hole for a minute. Maybe, Ben, do you want to tee up again from the provider side? Like, what is this whole coding thing? Why does it matter? I was also really amazed to see the baseline numbers. It's a good reminder, I'll let you tell it, but it's a good reminder that human baseline does not mean perfect.

Ben Shahshahani: Mm-hmm.

Nathan Labenz: That's always a common refrain, like Biden used to say,

Ben Shahshahani: Yeah. So this-

Nathan Labenz: So tell us more about the challenges of coding in the real world.

Ben Shahshahani: Yeah, this coding is part of this larger revenue cycle management in healthcare, which is fairly complex. And back to your question as to what changed that caused so many administrative roles in healthcare, I would imagine that's part of it. It starts from early on with things like getting pre-authorization, then going to mid-cycle, which is this coding, and after that, once the coding is done, you submit it for reimbursement. In a lot of cases, they may get rejected, and then you have the denial processing. Each step of it is actually human and manual. There are people calling from hospitals to insurance payers, asking for pre-authorization. Believe it or not, a lot of this happens over the phone. Healthcare has not really adopted technology the way that we think they should have. So coding is basically the process of classifying diseases into a standard set of codes, which is based on a taxonomy, and that taxonomy is adopted internationally. I think right now the latest is ICD-11, but most organizations use ICD-10, and you are talking about something in excess of 70,000 different codes. So every disease and condition is coded based on the evaluation that the doctor does. Then there's another set of codes that classify procedures. They combine these things and say,

Nathan Labenz: To describe this process, I think it's helpful for people to consider as they think about automating business processes. Our primary audience profile is the AI engineer, and their main challenge is often automating a business process. First, they have to thoroughly understand and map out the existing process; that's critical for effective automation. Anyone can create a workflow, but it won't necessarily achieve the intended result if the original process isn't well understood. So, let's take an extra moment to discuss the pre-AI scenario. Tell me if I'm getting any of this wrong: The doctor has a visit, and may or may not record audio for later reference. Afterward, either from memory or from notes, they sit down and write their notes. Presumably, these notes are relatively brief compared to what AI might produce in the future. They then have to code both the diagnosis and the services rendered into a taxonomy.

Ben Shahshahani: Yeah.

Nathan Labenz: Are they doing that through some kind of type-ahead search where, for example, for common illnesses—like with my kids, we get things like pink eye—they start typing 'pink eye,' and a numeric code pops up? Is it similar when prescribing drops, where they type it in, and a code appears to select? Then that information gets sent off, and I'm imagining a sort of back office with coders...

Nathan Labenz: At the hospital, coders receive the doctor's notes, review these codes, and try to spot errors or improve the documentation before sending it to the insurance company, which has its own review process. Sometimes the insurance company approves the claim, other times they reject it and send it back for correction. This exchange can go back and forth. Am I accurate about the pre-AI process, or am I missing something?

Ben Shahshahani: Yeah.

Nathan Labenz: What's the naive effect?

Ben Shahshahani: I think that's generally correct. I should probably sit down and go through the whole process myself, as it takes a long time. There are professional coders trained in ICD codes, for example. Starting from the doctors' documentation, they finalize the codes, add procedure codes if applicable, and group everything together before sending it to insurance companies for payment. If the claim is denied or questioned, a separate team manages the denial process. They may have to pull the doctor back in to provide further evidence and documentation. It's a fairly manual process that requires domain knowledge of the codes. There are tools that make it easier—like the search functionality you mentioned—but nothing comprehensive. There are companies now focusing on processing the documentation and coding after the doctors complete their notes. In theory, you could train a system for this, but it's hard to define ground truth. However, we do have historical logs of documents and associated codes, as well as records of what was challenged, changed, and finally approved. That gives us a human baseline to compare against AI. To evaluate AI, we can have it do the coding, then have multiple human coders review whether the AI coding was comprehensive and correct, and compare with what the humans did. Often, the AI may find things the human coder or doctor missed—or vice versa. Overcoding is problematic, as it carries penalties, so we don't want the AI to add incorrect or unnecessary diagnoses or procedures, since that has serious legal consequences.

Nathan Labenz: In reading the case study, the baseline human doctor correctness rate on the OpenAI website was 45 percent, which was shockingly low to me. I came into this naively, so I'm curious—what does that actually mean? Does it mean that 45 percent of the time, what the doctor said didn't need to be changed by the coder and was approved by insurance? If that's correct, how much can coders improve that rate? Do they take that 45 percent up to 90, or are we seeing more than one in ten things that get—

Brendan Fortuner: Yeah.

Nathan Labenz: ...sent to the insurance company coming back?

Brendan Fortuner: Yeah.

Nathan Labenz: You start to see where that trillion dollars goes.

Brendan Fortuner: No, that's a great question. I think there's actually a specific number I was reading. I think it's $20 billion each year that we waste as a country just in—

Nathan Labenz: Wow.

Brendan Fortuner: ...codes that are not substantiated or are incorrect, specifically these ICD-10 codes. Maybe it's helpful if I start with what Ambience does. I can explain why this might be challenging for clinicians and non-intuitive. As Ben mentioned, after the visit, once they're done talking with the patient, or sometimes even during the visit, clinicians fill out these ICD-10 codes using a search engine, typically powered by something called IMO, Intelligent Medical Objects. They'll type in things like "left ear pain."

Nathan Labenz: Mm-hmm.

Brendan Fortuner: Then a big list of codes appears, such as H65.195 acute non-suppurative otitis media, recurrent, left ear. You have to get that right, but there are a lot of others that look very similar, and it's easy to make mistakes. Ambience doesn't do any diagnosing. We're actually just extracting what was discussed in the visit and normalizing that onto the standard set of codes for clinicians. Doctors know the diagnosis, but with 70,000 codes evolving twice a year, that's where things can break down. We don't want to say doctors are bad coders; it's just a non-intuitive task. That's to frame what we're actually trying to do. We used F1 score, and that's where the 45 percent number came from. The big question is, every time you're fine-tuning or using RFT, what's the actual maximum you could possibly achieve? It's not 100 percent—almost never—and the reason is that this task has some level of subjectivity. You have to model that subjectivity and harness it, which we do with a gold panel. We first established a baseline for this task, mapping transcripts to correct codes, giving people the search engine in an environment similar to real life, including clinicians and auditors from AAPC. We created two gold panels and calculated inter-annotator agreement to see the maximum possible, and we maxed out around 80–85 percent F1, inter-annotator agreement. So that's our ceiling. It's a huge hill to climb. The models are at about 35 percent F1, clinicians at about 45, but there's still a lot of progress we can make with RFT. That's how we got there.

Nathan Labenz: So is that 85 percent still aspirational, even in the pre-AI workflow, after the coder process following the doctor's initial coding? Do you know how that number compares to the theoretical maximum?

Brendan Fortuner: Here's the trick. Some of this is actually—

Nathan Labenz: Nobody knows.

Brendan Fortuner: ...hard to even determine. If you think about it, Ambience has a unique artifact: the transcript of what was discussed during the visit. RCM teams and coders in the insurance companies are operating on a much less detailed artifact, which is the note. As you know, clinicians aren't fond of taking notes, and the quality varies, so a lot of rich information is lost. There's a gap in reality. As a community, we're only starting to discover what 'good' looks like and what's possible to extract from this transcript artifact. It's hard to know for sure, but we do know there's a gap.

Nathan Labenz: Yeah. Okay, that makes sense. Well, I think that's a perfect setup for the fine-tuning process that not only closed the gap to the human doctor, but also took the AI above it and takes us to a whole new world. I want to learn everything about this—in terms of what the data looks like, whether you need to collect chain of thought from people, how much we're trying to get a human reasoning prior versus this new paradigm of just having the right answer plus a scoring rubric used in reinforcement learning. Maybe in some cases you want to do both. We know frontier models combine human SFT datasets and reinforcement learning. How much data do we need? What kind of data is needed? What tricks work in terms of graders? I want to learn as much as I can from your hard work.

Brendan Fortuner: Sure. So, to start, what technique did we use? We used a tool from OpenAI that they call RFT, reinforcement fine-tuning. It's a proprietary technique. However, there are some parallels in the open source community, which we could talk about, maybe DeepSeek. We used this API, so our interface for this project is a dataset and a grader. I'm sure the folks here are familiar with RFT, but the basic gist is that this technique was used to get these extremely capable reasoning models. Gemini, Claude, and o3 are all using this technique in one of the steps of the post-training process. Now, OpenAI has made it available so all developers can give this a go. I think it's very powerful and novel for a few reasons. In the past, we've been doing supervised fine-tuning our entire careers. You collect a big dataset of input and output pairs, and you're trying to get the model to mimic the outputs. But with RFT, they actually flip it. Instead of needing that annotation, you can give it a programmable grader that provides a reward. You can use any different type of grader; there are many kinds. You could use a string match, a unit test, or even a language model to grade. You can even use an ensemble of multiple techniques. It's extremely powerful and allows you to guide the model in very subtle ways to what you want. It also allows you, interestingly, to optimize for the end objective. Whenever you can optimize a machine learning model for the end result you want, that's a good place to be. Oftentimes, we're just optimizing for proxy metrics, like loss or F1, but you can actually optimize for the real-world objective. And I think the last one is this technique, using RL, is extremely sample efficient. During the training process, the model generates multiple candidate answers, say four or 64. The grader then scores the answers. The model takes one example and expands it through the sampling process into 64 examples.
So, your dataset, in theory, gets a lot more signal from every example. You can actually get state-of-the-art results with hundreds or low thousands of examples, when in the past it might have taken tens of thousands to get something similar with SFT. So, that's the technique. It is a proprietary OpenAI technique, but that's what we're building on.
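
The sampling-and-grading loop Brendan describes can be sketched in a few lines. This is a toy illustration only, not Ambience's pipeline or OpenAI's actual RFT API; the function names and the stand-in "model" are invented for the sketch:

```python
import random

def string_match_grader(candidate: str, reference: str) -> float:
    """Simplest programmable grader: exact string match, binary reward."""
    return 1.0 if candidate.strip() == reference.strip() else 0.0

def rft_step(reference: str, sample_fn, n_samples: int = 4):
    """One training example expands into n_samples graded rollouts.
    A real RFT run would use these rewards to update the policy;
    here we just return the (candidate, reward) pairs."""
    candidates = [sample_fn() for _ in range(n_samples)]
    return [(c, string_match_grader(c, reference)) for c in candidates]

# Toy "model" that sometimes emits the right ICD-10 code.
sample_fn = lambda: random.choice(["H65.195", "H65.194", "H65.195"])
rollouts = rft_step("H65.195", sample_fn, n_samples=4)
```

The sample-efficiency point is visible in the shape of the output: one labeled example yields `n_samples` reward signals per pass.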

Nathan Labenz: One huge spectrum, obviously, with this reinforcement learning is how verifiable the answer is.

Brendan Fortuner: Mm-hmm. Yep.

Nathan Labenz: I think everybody has heard plenty of talk at this point about math and coding, and then, on the far end, would be poetry or fiction writing.

Brendan Fortuner: Yep.

Nathan Labenz: Or something.

Brendan Fortuner: Yeah.

Nathan Labenz: I'm interested. And the more verifiable it is, the simpler your grader can be, right? In theory, we've seen with things like R1 that just a straight-up binary signal

Brendan Fortuner: Yes.

Nathan Labenz: ...of you got it right or you got it wrong.

Brendan Fortuner: As well.

Nathan Labenz: ...can work.

Brendan Fortuner: Mm-hmm. Yeah.

Nathan Labenz: If you are evaluating poetry or whatever, you have no binary signal, so then you have to start getting into reward design.

Brendan Fortuner: Right.

Nathan Labenz: So where are you guys on that spectrum? There is a sort of correct code answer, where you could have F1, exact match, yes/no. But I'm getting the sense that there is more to it than that, something that sits between 'here's the raw transcript of the visit' and 'here's the exact code, did you exact match it or not?' What does that answer look like in terms of what else is there, and then how does that feed into your reward design challenge?

Brendan Fortuner: It depends on the task. We can talk about ICD-10. It can be different. We've explored different tasks here. But actually for ICD-10, it's pretty simple. We have this dataset of transcripts in and these codes out. It's multiple codes for every visit, and we were able to get gains just using string matching. I think ours was fairly vanilla string matching. We were optimizing the F1 score directly, trying to find a balance of precision and recall on the actual codes, which has really interesting downstream impact on the revenue cycle process. We were trying to model that. This is nice because string matching is actually fairly clean. It's less hackable, which is always a good place to be. I always prefer it. It's cheaper to run. But because there are all these different codes, the model naturally gets partial credit. It gets one right, but it doesn't get another one right. That actually gives it a richer reward that can lead to faster learning. So yeah, in ICD-10, it was actually fairly simple. We did iterate and try some stuff, but we got away with string matching.

Nathan Labenz: So there's a variable number of codes for any given visit.

Brendan Fortuner: Exactly. Exactly.

Nathan Labenz: Visit. How do you think about, if there are, let's say, five ground truth codes, you can have both false positives and false negatives in the AI output, right? Do you treat those differently in your reward process? Or, let's say I got four right, missed one, and then had one that wasn't actually there. How do you give me credit for four, and then are those other two both minus one? Is there a need to do something different between the false positives and false negatives?

Brendan Fortuner: That's a really good question. In this case, we just performed a straightforward precision and recall calculation, then merged them into the F1 score and actually optimized F1 directly. So it wasn't as sophisticated as we initially thought it needed to be. However, there's an interesting opportunity to extend that further. For example, if the model gets the code almost correct but is missing one sub-component, could we use a semantic grader or combine different grading approaches? That's an interesting direction to explore, though we didn't need to pursue it for this particular experiment.
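
The precision/recall/F1 calculation Brendan describes, where false positives hurt precision, false negatives hurt recall, and the two merge into one scalar reward, can be written out directly. A minimal sketch (the code strings are placeholders, not real ICD-10 codes):

```python
def code_set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over ICD-10 code sets: partial credit per correct code,
    with false positives and false negatives penalized symmetrically."""
    tp = len(predicted & gold)   # codes the model got right
    fp = len(predicted - gold)   # extra codes not in the ground truth
    fn = len(gold - predicted)   # ground-truth codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Nathan's scenario: 5 gold codes; model gets 4, misses 1, adds 1 extra.
gold = {"A", "B", "C", "D", "E"}
pred = {"A", "B", "C", "D", "X"}
score = code_set_f1(pred, gold)  # precision 4/5, recall 4/5 -> F1 = 0.8
```

Because each correct code contributes to the score, the model gets the richer partial-credit reward Brendan mentions rather than an all-or-nothing signal.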

Nathan Labenz: Yeah, interesting. So you haven't stopped progressing as of the case study on the-

Brendan Fortuner: Oh, yeah.

Nathan Labenz: ... OpenAI website.

Brendan Fortuner: The hill is steep. There's a lot of room to grow. It's really exciting.

Nathan Labenz: Could you unpack a little more, either about that task or related tasks? What more do you do to make progress? Semantic matching, as opposed to just string matching, has many variations. I'm not sure how soon you would need to worry about reward hacking, but I do worry about it in the broader context of what could happen.

Brendan Fortuner: Yep.

Nathan Labenz: ... super intelligence.

Brendan Fortuner: Yep.

Nathan Labenz: Did you need to take that into account, or did you encounter any unusual issues in your experiments within this narrower domain?

Brendan Fortuner: That's a great question. We definitely learned a lot from ICD-10. It's a good time to introduce the physical exam, because that's where things got strange. When you move away from string matching to an LLM-based grader, you have to be careful—you're entering dangerous, highly hackable territory. For us, this started as a learning experiment. Could we take RFT and apply it somewhere more subjective than string matching, like a more open-ended generative task? Now, for listeners, during a physical exam, when you're in the doctor's office, they check you out—listen to your heart, lungs, look in your mouth—then document findings in the notes section. It's typically a structured section, and for each body system, like respiratory, they write some prose; maybe sentence fragments or keywords like 'respiratory, normal effort, no audible wheezing.' We chose this because it's not fully open-ended—there's structure. It's a JSON file with short, fragmented sentences, so maybe we could use a basic regular expression or semantic grader. But this is where we started moving into model-based graders using rubrics. We began training models right away and ran into lots of hacking. We optimized for F1 score, and the model started gaining precision by inflating findings—outputting the same finding in different ways. So, it got points for being right, but it was redundant. We fixed that by penalizing redundancy. Then, despite high scores, we noticed degeneration in tone—the model stopped using clinical or professional terms and said things like, 'Grandpa's heart sounds good' instead of 'Normal heart sounds.' That's the risk with models. Our rubric checked for clinical quality but not for style, so we adjusted it, weighting 75% clinical quality and 25% clinical style for proper language. Eventually, we managed to guide the model, and clinicians preferred the final version. That was our first experience with more open-ended tasks, and that's where things can get tricky.
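
A composite reward of the kind Brendan describes, 75% clinical quality, 25% clinical style, minus a redundancy penalty, might be sketched like this. The sub-scores here are trivial stand-ins; in Ambience's actual setup each would come from an LLM grading against a rubric, and the duplicate check would be semantic rather than token-based:

```python
def redundancy_penalty(findings: list[str]) -> float:
    """Penalize restating the same finding in different surface forms.
    Approximated here by duplicate token sets; a real system would
    use a semantic similarity check."""
    seen, dup = set(), 0
    for f in findings:
        key = frozenset(f.lower().split())
        if key in seen:
            dup += 1
        seen.add(key)
    return dup / len(findings) if findings else 0.0

def composite_reward(clinical_score: float, style_score: float,
                     findings: list[str]) -> float:
    """75% clinical quality, 25% clinical style, minus redundancy."""
    base = 0.75 * clinical_score + 0.25 * style_score
    return max(0.0, base - redundancy_penalty(findings))

# The same finding repeated with different casing counts as redundant.
findings = ["normal heart sounds", "no audible wheezing", "Normal heart sounds"]
reward = composite_reward(clinical_score=1.0, style_score=1.0, findings=findings)
```

Splitting the rubric this way is what let the team penalize "Grandpa's heart sounds good" on style while still rewarding it for clinical content.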

Nathan Labenz: How do you approach the art of AI automation? How do you decide which tasks to prioritize? You have to consider feasibility, demand, opportunity size, and risks if something goes wrong. That’s difficult for many Main Street businesses, and in a medical setting, there’s also a trust factor. People need to be open to using AI for a task, even if everything else checks out.

Brendan Fortuner: Right.

Nathan Labenz: Who participates in those decisions? How do you decide what to tackle and in what order?

Brendan Fortuner: This is something Ben and I think a lot about. Maybe, Ben, you can explain what’s important.

Ben Shahshahani: Yeah.

Brendan Fortuner: ... key, I guess?

Ben Shahshahani: I can explain how we determine our priorities, especially those where technology and AI can play a role. We usually categorize them into three areas. The first is patient outcomes and overall patient experience, which is core to our mission. Anything that improves patient outcomes, helps with diagnosis or treatment plans, is a major opportunity in healthcare. The goal of precision personalized medicine is now reachable because we have a lot of data in the EHR, the electronic health record system—data that was never available to commercial organizations or those training large language models. By bringing AI in, we can offer more personalized treatment and diagnoses. That's one area: clinical outcomes and patient experience. For example, helping patients navigate the VA—should they go to the hospital, emergency, urgent care, or just make a doctor’s appointment? The second area is caregiver experience. With caregiver shortages and burnout, anything we can do to make their work easier and reduce non-clinical tasks is valuable. The third area is overall cost and efficiency: scaling operations, using resources better, and reducing the cost of administrative tasks like coding or optimizing operating rooms. These priorities align with our main stakeholders: patients, caregivers, and the organization. For each use case, we consider risk, technology maturity, and ROI. We need a clear line of sight and also try to reduce risk by running pilots, like we did with Ambience, where we ran a five-month pilot with different providers.

Ben Shahshahani: The technology is there, and the product market fit is in place. We gain the value we expect, and then we gradually roll it out. In this case, with collaboration with Ambience, it made sense to do a phased roll-out based on specialties. In other cases, we may decide to start with one hospital to see how it works and then expand to the next hospital, and so on.

Nathan Labenz: How is the reception? Are you in an environment where people are saying, "Hey, my colleague in the next specialty has this," and how soon does it—

Ben Shahshahani: Yeah. Our chief clinical officer keeps calling it a magical experience. By the end of the pilot, it was so successful that some doctors were threatening to leave if we took it away from them.

Nathan Labenz: Oh.

Ben Shahshahani: It just worked out. That's not very common, especially in healthcare. Finding a product with perfect product-market fit is challenging, and this is a great example. I'm not sure if it's an exception, and other things may not be as straightforward. I see it as a productivity tool. In other industries, I've seen some productivity tools get adopted and others not. If you're building something to improve productivity and you want adoption, users form a mental model about how difficult a task is. They also imagine how much more efficient they will be by using the tool. That mental model forms before people start and continue using it. Often, with AI systems, companies get that calculation wrong by only considering when it works, but you have to think about the edge cases when it doesn't and how that impacts overall expectations. It's a calculation. Imagine two scenarios: the expected effort if the product works, with no error, multiplied by the probability it works perfectly. Then add the probability it doesn't work as expected, multiplied by the effort required when there's an error. Sometimes errors are rare, but recovery effort is so high that overall it's less productive than not using it. For example, when I was at Yahoo, we tried to add voice search. It didn't succeed because users' mental model was, "Is clicking the microphone faster than typing?" With errors at even 10 or 20 percent, if you have to correct mistakes, it just doesn't work. Especially in hands-free cases like driving, where voice seemed ideal, there was background noise, and the microphone wasn't close enough, so errors increased. With the AI scribe, I shadowed a few doctors and observed a 15 to 20 minute patient conversation. In one case, an elderly couple chatted with the doctor. By the time we returned to the office, the clinical note was already generated. The doctor made a few small edits for errors of omission and said he probably saved 10 to 15 minutes, even when it wasn't perfect. 
If it were perfect, no edits would be needed. If your mental model is that using the tool increases productivity, that's promising. For roll-out, our team had a clever strategy: we didn't mandate use, but made training compulsory. Doctors must watch a video, review training material, and try it at least once. We believe that even one encounter helps form a positive mental model that encourages adoption. We haven't seen similar adoption in another application—automated responses to clinical emails. Doctors receive hundreds of emails daily, and some products auto-generate responses, but adoption rates remain low. We're working with Ambience to improve this. The challenge is greater because doctors say they have to check the auto-generated response, edit it, and the whole process takes almost as long as writing it themselves. Since these emails are quick to handle manually, doctors don't see a productivity gain and don't adopt the tool. For the AI scribe, however, we're seeing high adoption because the expected effort drops, making it a perfect product-market fit.
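
Ben's mental-model calculation is a simple expected value. A sketch with made-up numbers (none of these figures come from the conversation; they just illustrate how a modest error rate plus expensive recovery can lose to the manual baseline):

```python
def expected_effort(p_success: float, effort_success: float,
                    effort_attempt: float, recovery_effort: float) -> float:
    """E[effort] = P(works) * effort_when_it_works
                 + P(fails) * (effort_of_the_attempt + recovery_effort)."""
    return (p_success * effort_success
            + (1 - p_success) * (effort_attempt + recovery_effort))

# Hypothetical voice-search numbers: typing takes 10s; voice takes 5s
# when it works (80% of the time), but a misrecognition costs 30s to fix.
typing_effort = 10
voice_effort = expected_effort(p_success=0.8, effort_success=5,
                               effort_attempt=5, recovery_effort=30)
# 0.8 * 5 + 0.2 * (5 + 30) = 11.0 -> voice loses to typing on average
```

This is the asymmetry Ben points to: even a 20% error rate sinks the tool if recovery is expensive, which is why the scribe (cheap edits, big time savings) wins while voice search didn't.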

Brendan Fortuner: I want to give Cleveland Clinic some credit. We've been working with them for about a year, and it's an exceptional organization. It takes more than just a product like Ambience with good AI; it requires a lot of organizational effort to operationalize and scale this technology. We were impressed by what we saw. We onboarded about 4,000 monthly active users from zero in 90 days across 60 specialties and seven languages. Utilization is around 75%, which is the percentage of visits using Ambience. While I want to take some credit, it's incredibly important to recognize great health system partners, and Cleveland Clinic has been truly exceptional. We work with around 40 organizations, so we know what excellence looks like.

Ben Shahshahani: Yeah.

Brendan Fortuner: Yeah.

Ben Shahshahani: I think the technology is part of it. Workflow integration and change management are critical.

Brendan Fortuner: Yeah.

Ben Shahshahani: That's huge. The last mile is probably the most challenging part.

Nathan Labenz: I think your mental model commentary is really insightful and important for people to understand. It also helps me understand why so many people are still in the mindset of, "I tried ChatGPT when it first came out and didn't think it was that impressive, so I haven't been back since." I wonder if you have any other strategies—'try one' is a really interesting suggestion. Many organizations could generalize that. It's a small ask to say, "You have to try this one time." Do you have other thoughts on how to create the right culture of adoption? Maybe another strategy could be, "Try one once a quarter," because things are constantly improving and you don't want your first impression to outlast its usefulness. I'd love to hear your thoughts on how—

Ben Shahshahani: Yeah.

Nathan Labenz: What the right level of mandate is, the right expectations, management, and culture that helps drive adoption, because that's a major challenge in many organizations.

Ben Shahshahani: Yeah. I think one key difference between a place like Cleveland Clinic and others that might not be academic centers or as forward-looking is that our doctors are passionate and willing to try new things. Not everyone, of course, but that's an advantage. There aren't many off-the-shelf solutions you can just buy and immediately use; it needs a lot of iteration. You have to work with vendors and technology partners. We provided a lot of feedback during the pilot phase, and that feedback was incorporated into Ambience for product improvements until we reached the point where the doctors were satisfied and the pilot metrics showed it was working. Then we did a phased rollout. Some industries can buy solutions off the shelf more easily, but in healthcare, it's much more difficult.

Nathan Labenz: One thing we might have glossed over, but was a standout for me, was the story of how you accidentally burned a lot of money on the o1 grader at some point. This goes back to the physical exam, and I forgot to cue you on this. Do you want to share how you ended up spending so much in the learning process of RFT?

Brendan Fortuner: For builders, especially those using OpenAI, this relates to the cost issue. It's definitely worth discussing. For SFT, on their platform, you'll have a few thousand examples; the job takes a few hours and costs about $100. For RFT, you might have a few hundred examples, a job could take a day or two, and it could cost thousands of dollars right away. But it depends on the grader mechanism you use—string matching is much cheaper. We made a mistake and used o1 as the grader and ran a small experiment with 100 examples, and quickly burned through $25,000 just on the grader. So watch out. Whenever you can, use something simple like string match or a unit test. That's often a better fit for RFT anyway.
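
The reason a model-based grader gets expensive is multiplicative: every sampled rollout gets its own grader call, every epoch. A rough back-of-the-envelope (every number below is hypothetical, not OpenAI's actual pricing or Ambience's actual configuration):

```python
def grader_cost(n_examples: int, samples_per_example: int, epochs: int,
                tokens_per_grade: int, price_per_million_tokens: float) -> float:
    """Grader spend scales as examples x samples x epochs x tokens."""
    calls = n_examples * samples_per_example * epochs
    return calls * tokens_per_grade * price_per_million_tokens / 1e6

# 100 examples, 64 samples each, 10 epochs, ~4k tokens per grading call,
# $60 per million tokens (all hypothetical) -> tens of thousands of dollars.
cost = grader_cost(100, 64, 10, 4_000, 60)  # $15,360
```

Even a "small experiment with 100 examples" sits in front of four multipliers, which is how the bill outruns intuition.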

Nathan Labenz: How do you recommend people approach that? If the number of data points isn't high and you want the best grader you can get, like o1 back then or o3 now, is there a way to validate your techniques before risking $25,000? It doesn't sound like that was a misconfiguration.

Brendan Fortuner: Yeah.

Nathan Labenz: It just sounds like that's what it cost.

Brendan Fortuner: This is great. The OpenAI team recommends a good approach. It starts with evaluations. Before jumping into RFT, make sure you have a strong evaluation that's representative of the full distribution—this is where you refine your grader and check if it's correlated with human judgment. This helps you see what progress you need to make. For example, if models are only getting things right 35% of the time, you understand the challenge ahead. You need a grader that's well correlated with human judgment.

Brendan Fortuner: Absolutely. Now we have more confidence that we can actually use this without reward hacking in an RFT loop.

Nathan Labenz: That's very interesting. What's going to happen to the coders? At the beginning, I mentioned there's been less closing of ranks in professions than I might have expected. It feels like this could be an area where jobs might be thinning out or possibly going away at some point.

Ben Shahshahani: Within healthcare, there are obviously many different job functions. Some jobs may change, and those people may work on different types of things or focus on reviewing the work of AI systems and ensuring the final verification and decision is made by a human. But yes, some areas will obviously be more impacted by AI than others.

Brendan Fortuner: Yeah, Nathan, this is the question we're all asking across different industries.

Brendan Fortuner: If you have super capable AIs, are they going to take our jobs? There are early indications in fields like coding that this might actually be true and happening now. In healthcare, it's a bit different for several reasons. One, I don't think the demand for healthcare will stay fixed; it will likely grow with the population. Around 10,000 people enter Medicare every day. Patients are getting sicker, and new technologies are being developed that people want. Demand will likely increase rapidly as AI agents improve what can be done with pharmaceuticals and surgeries. So, the need will grow, not shrink. There's already a projected shortage of 125,000 physicians in the next ten years. There will be plenty of jobs across coding, administration, and clinical roles.

Nathan Labenz: I totally believe demand will grow. No doubt about that. I don't anticipate much slack in the system anytime soon, especially with regard to actual physical treatments and procedures involving patient care. There will be plenty of people wanting care. Honestly, one of the great hopes I have for the whole AI revolution is that we'll cure all diseases and—

Ben Shahshahani: Right.

Nathan Labenz: Live longer than past generations. I'm all for that. On the coding side, if I spent all day coding and saw these results, I might feel the same way as a taxi or truck driver seeing autonomous vehicles getting better. Reports from San Francisco suggest Waymo rides now cost more and are in high demand because people prefer the experience. Personally, I think podcasting might not last long, since NotebookLM can generate any topic and—

Ben Shahshahani: Right.

Nathan Labenz: Immediately produce a custom podcast that's also interruptible and interactive, and I struggle to see how I can compete with that. When I explore this domain, I'm not under any illusions about how long I'll be doing this. It does seem like that's probably an area that's—

Ben Shahshahani: Yeah. If you believe in AGI, then presumably all jobs will go away at some point, right?

Nathan Labenz: Yeah, I think—

Ben Shahshahani: That's what it is.

Nathan Labenz: I think it's coming for all of us at some point. That's definitely my default position. What else are you looking at in healthcare more broadly? Earlier you said, 'We don't do diagnostics ourselves; that's the doctor's job.' There has been interesting research out of Google showing, if you trust their results, that AI systems are outperforming doctors in diagnosis these days in certain contexts. There's also the issue you mentioned earlier—the enormous amount of data currently locked up in systems where it hasn't been used effectively. I think there's a bit of a mess—

Nathan Labenz: That's because maybe we didn't know how to use it, and partly because certain—

Ben Shahshahani: Institutions.

Nathan Labenz: Institutions or entities believe it's in their interest not to share it freely. What do you see—

Ben Shahshahani: Yeah.

Nathan Labenz: —as the next big unlocks, whether that's liberating data or something else, that could change the landscape in even more profound ways going forward?

Brendan Fortuner: Yeah, we were talking about this before the episode. I do have thoughts here. You mentioned Google's excellent AMIE results on diagnostic reasoning. One thing I want to highlight is that base models are becoming increasingly capable, but there's a difference between research results and deploying in the real world at scale—like processing millions of patient records daily. Data in EHRs isn't as clean as what you see in evaluation sets. So, what we see internally and what's reported are two different worlds, though that doesn't take away from the significant progress being made. Yes, models are getting better out of the box, but we're seeing challenges that will shape future directions. There's a robustness problem. Ben alluded to this: often, academic benchmarks feature the best result—'We tried it 64 times, here's the best one'—and that's what gets shown. In healthcare, we're much more concerned about the worst-case scenario. In internal evaluations, we sample multiple times and take the worst result as what's likely to reach users. The industry, especially teams at OpenAI with benchmarks like HealthBench, is thinking this way. In the real world, though, there are still gaps. This data is out of distribution. A big reason for this is the walled garden—about 80% of patients have records in Epic's database, and for regulatory and privacy reasons, that's not on the internet or in the model's training data. Also, going to medical school is just the start; doctors do four to eight more years of hands-on training in residency, learning in clinics. Much of that learning isn't online. We're bumping into this limit, and then you add on complex, changing billing rules and country-specific medication names. This research-to-production gap is something we're actively working on as a company and as an industry.
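The worst-case evaluation Brendan describes can be sketched in a few lines: instead of reporting best-of-k (common in academic benchmarks), sample k completions per case and keep the minimum score, since the worst output is the one most likely to hurt a real user. `generate` and `score` below are stand-ins for a real model call and grader, not any actual system.

```python
import random

def generate(case: str, seed: int) -> str:
    # Placeholder for a model call whose output varies with the sampling seed.
    random.seed(hash((case, seed)))
    return random.choice(["good answer", "ok answer", "bad answer"])

def score(output: str) -> float:
    # Placeholder grader.
    return {"good answer": 1.0, "ok answer": 0.6, "bad answer": 0.0}[output]

def worst_of_k(case: str, k: int = 8) -> float:
    """Report the minimum score over k samples, not the maximum."""
    return min(score(generate(case, seed)) for seed in range(k))
```

The only change from a best-of-k harness is `min` in place of `max`, but it reverses the incentive: a system only looks good if even its worst sample is acceptable.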

Nathan Labenz: So, tell me about the patient-facing aspect. Obviously, that's a set of solutions and accessibility requirements, and I'd love—

Brendan Fortuner: Yeah.

Nathan Labenz: —to hear, especially as it compares and contrasts to everything we've discussed so far, what challenges and lessons have you experienced in actually creating something that interacts directly with patients?

Brendan Fortuner: This is a great question. It also relates to agents, which is another interesting topic. Imagine if you had ChatGPT inside Epic, with access to all the HL7 FHIR private APIs and high-quality medical reasoning. What could you do? What would medicine look like in that world? We developed very early iterations of the patient-facing solution about a year or two ago. It makes for a great demo, but there's real potential here. Taking notes isn't the only thing that stresses doctors out; answering in-basket messages does too. Hospitals often have entire staff, sometimes nurses, dedicated to calling patients and answering questions, which can be really inefficient. If you've ever used MyChart and asked a question, you know that's a lot of time and money. Some of these tasks are manageable even for current generations of models, especially with proper guardrails. I believe there's a way to do this safely. For example, after a visit, patients often don't follow their care plan. If they don't, they get sick, return to the hospital, and it costs the health system more. Why not keep following up? Traditionally, it's too expensive to do. But with AI, we can. In early development, our Ambience product will call the patient after the visit and ask if they've completed their labs or taken their medication. We then synthesize that information into a note and put it back into Epic for clinicians or nurses to review. They can see whether a patient is on track and intervene as needed. We're not providing medical care or decision-making, but we're automating a difficult and time-consuming task for the system.

Nathan Labenz: That's interesting. And that’s a voice modality, right? You're calling and having an interactive conversation, or is it—

Brendan Fortuner: It could be both, yes. Absolutely.

Nathan Labenz: Any architectural lessons? I think a lot of people are trying to build voice products right now. I don't think there's a consensus yet on the best approach. I've heard of various methods, including background agents that monitor in real time and provide coaching to the main voice agent. What have you experimented with, and has anything emerged as a clear winner?

Brendan Fortuner: That's a great question. Like other medical domains, interacting with patients raises the stakes. Hospitals are risk-averse; they value their reputation and want great patient experiences, so the safety bar is much higher. You have to consider guardrails early on. Even if a fully end-to-end voice model performs well, is that the right approach? Or do you want a model with interpretable stages? My background in self-driving cars influences this—should guardrails be built in at checkpoints during agent interaction? Off-the-shelf voice-to-voice solutions aren't ready. I don't think they're interpretable enough. There will need to be some fine-tuning of the underlying model, plus guardrails. Whether we break it into ASR transcription, followed by language model processing and then voice synthesis, is still under exploration. My guess is the final system will be a bit more heuristic and decomposed than what you see in consumer apps online.

Nathan Labenz: Yeah.

Brendan Fortuner: More auditable, with guardrails built in. Absolutely.
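The decomposed, auditable pipeline Brendan argues for can be sketched as separate ASR, language-model, and speech-synthesis stages with an explicit guardrail checkpoint between them, instead of one opaque voice-to-voice model. Every stage implementation and the blocked-topic policy below are placeholder assumptions for illustration, not Ambience's actual system.

```python
BLOCKED_TOPICS = ("diagnosis", "dosage change")  # illustrative policy only

def transcribe(audio: bytes) -> str:
    return "did you get your labs done"           # stand-in for an ASR stage

def guardrail(text: str) -> bool:
    """Checkpoint between stages: block anything resembling medical advice."""
    return not any(topic in text.lower() for topic in BLOCKED_TOPICS)

def respond(transcript: str) -> str:
    return "Thanks for confirming. A nurse will review your answers."

def synthesize(text: str) -> bytes:
    return text.encode()                          # stand-in for a TTS stage

def handle_turn(audio: bytes) -> bytes:
    escalate = "Let me connect you with a member of the care team."
    transcript = transcribe(audio)
    if not guardrail(transcript):                 # check the patient's input
        return synthesize(escalate)
    reply = respond(transcript)
    if not guardrail(reply):                      # check the model's output too
        return synthesize(escalate)
    return synthesize(reply)
```

Because each stage produces inspectable text, every turn leaves an audit trail, and the guardrail can be tested and tightened independently of the models on either side of it.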

Nathan Labenz: I know we're just about out of time. Any closing thoughts?

Ben Shahshahani: Thank you. This was really fun. It was a great conversation, and I see tremendous opportunity in healthcare. For your audience—people in tech, machine learning, and AI—consider working in healthcare. There's real potential to make an impact in a field we can all relate to. We're all patients or know patients, and even roles that seem back-office make a difference. Over half the hospitals in the country are losing money.

Brendan Fortuner: Yeah.

Ben Shahshahani: Organizations like ours—the Cleveland Clinic—are nonprofit. When we focus on reducing costs, it's not about increasing a stock price; the savings go directly into serving more patients. There's enormous opportunity, the technology exists, and it's a way to make a real impact and improve the world.

Brendan Fortuner: I had a lot of fun. Incredible host. I really appreciated it. Loving the episodes. Thank you. Please keep it up. Hopefully this was interesting to some folks out there. I agree with Ben. I've worked in self-driving cars and had so much fun analyzing millions of bounding boxes on cars. But healthcare is even more exciting to me now. There's this vision of driving the cost of healthcare to zero in the long term and scaling it around the world. That's incredibly cool. But also, it's still in a very early stage—a field that hasn't seen real innovation for decades. They've just kept adding buttons and dropdowns to meet regulatory requirements. Now, AI agents are becoming capable enough to handle these tasks. It opens up a vast design space, a whole candy store of opportunities. That's why we're rapidly growing this AI platform and hiring machine learning engineers, researchers, clinical scientists, and people from every discipline. We're creating these small pods to tackle new use cases that keep emerging as we go deeper. This is the time to build in healthcare. It's one of the best product-market fits of this new generation of agents and generative technology. I really want more builders in this space. I think it's exciting.

Nathan Labenz: Yes, a call for technologists to move into healthcare, and for doctors to become the medical experts that technology companies need. May we all live to be 500 and beyond.

Brendan Fortuner: Sure.

Nathan Labenz: I used to dream about it and feel crazy for doing so. Now I feel a little less crazy. Keep up the great work. We're all counting on you. This has been great. Thank you for taking the time. Brendan Fortuner,

Brendan Fortuner: Thank you, Nate.

Nathan Labenz: And Ben Shahshahani, thank you both for being part of the cognitive revolution.

Brendan Fortuner: Thank you. Have a great day.
