Watch Episode Here
Listen to Episode Here
Show Notes
Join Nathan Labenz and Erik Torenberg as they analyze the latest developments from OpenAI on GPT-3.5 Turbo fine-tuning, compare GPT to other live player models like Llama 2, and discuss the state of AI in coding, education, and healthcare. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive
RECOMMENDED PODCAST:
@TurpentineVC Delve deep into the art and science of building successful venture firms through conversations with the world’s best investors and operators. For audio lovers, listen wherever you get your podcasts: https://link.chtbl.com/TurpentineVC
TIMESTAMPS:
(00:00) Episode Preview
(00:01:00) GPT-3.5 Turbo fine-tuning
(00:06:36) Llama 2 vs GPT
(00:11:24) How much inference is needed to justify self-hosted fine-tuning?
(00:13:40) OpenAI’s moat
(00:14:41) OpenAI’s privacy consideration for data
(00:16:06) Sponsor: NetSuite | Omneky
(00:17:46) Encouraging the usage of instructions during fine-tuning
(00:19:19) Live player consideration of AI safety
(00:22:35) DaVinci-002: a new fine-tunable completions model
(00:24:59) ChatGPT usage in decline
(00:30:03) Getting on-demand tutoring on ML papers
(00:31:00) Code, education, and healthcare
(00:31:42) AI applications in coding
(00:38:12) AI revolution in education
(00:42:35) AI revolution in healthcare
(00:52:17) Call for feedback
LINKS:
Replit episode with Tyler Angert: https://www.youtube.com/watch?v=uEMN51ko7mk
Replit episode with VP of AI, Michele Catasta: https://www.youtube.com/watch?v=u4l6GgFaJmQ
AI Revolution in Education with Khan Academy's Director of Engineering, Shawn Jansepar: https://www.youtube.com/watch?v=ati_ACj1Dic&feature=youtu.be
Google's Multimodal Med-PaLM with Vivek Natarajan and Tao Tu: https://www.youtube.com/watch?v=2RMpqheYKlw&t=289s
X:
@labenz (Nathan)
@eriktorenberg (Erik)
@cogrev_podcast
SPONSORS: NetSuite | Omneky
NetSuite has spent 25 years providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Music license:
EXZCAII87NMYXR1B
Full Transcript
Nathan Labenz: (0:00) It's the best model that's available for fine-tuning to the public today. It's 90% cheaper than their previous offering. It's your classic 10,000x. Right? If I needed to go find an expert and I was going to pay some sort of market-ish rate, ad hoc AI consulting is easily into the couple hundred dollars an hour plus. And what I get for 2 to 5 cents is easily $200 plus to go get somebody to explain concepts to me if I wanted to go hire a grad student or a PhD to do that. Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg. OpenAI 3.5 Turbo fine-tuning, the big news, the big release of the week. It's almost boring analysis, right? It's kind of like, why are the best still the best? And why is all of this noise around no moats basically just distracting noise? I mean, I always emphasize everything everywhere all at once, right? So it's not to say people aren't going to be fine-tuning. It's not to say Llama doesn't matter. It's not to say anything super absolutist or extreme. But basically, if you look at the impact of Llama 2 so far, I think this might be the biggest impact of it, which is that it seems like it may have accelerated OpenAI's release of their fine-tuning offering. And this really addresses the biggest weakness that they've had in their product offering for the last year at this point. Not that there wasn't something better to go for all of that time, but definitely for customers of OpenAI, the fine-tuning that they were offering had clearly fallen behind other aspects of the product. So just to give you a sense of where it was before and what it is now. Previously, as a retail customer, and they do have special enterprise deals where if you want to buy in at a higher level, you could get higher quality stuff. But at a retail level, the fine-tuning was limited to the original generation of GPT-3 model. And that had a couple of downsides. One was just straight quality, right? The GPT-3 model was not even instruction-tuned. So this goes back far enough in time that real classic prompt engineering was still necessary, the sort of stuff where you have to think, how do I set this up to make the problem look like an autocomplete? Or how do I provide a couple of examples to show and not tell what I want it to do? And that's ancient history at this point. On top of that, when you fine-tune, the price goes up 6x and even a little more than that because they charge for the fine-tuning process itself as well. But even just inference, it's 6 times more expensive. So the fine-tuned price with your original DaVinci, which was the best thing that you could fine-tune as a retail customer until yesterday, was 12 cents per thousand tokens, which with a company like Waymark, I generally figure 1,000 tokens per video generation. It's usually probably a little less than that. Depends on how much information the user loads in or whatever. So it can vary, but call it 1,000 each. Call it even a little less. I usually just round to order of magnitude 10 cents per video generation was the old price. And that was feasible for us because we have a really high value use case. 
These TV commercials that we generate are often attached to multi-thousand dollar campaigns and they're being presented to a client. So it's not, our success rate is pretty high, and the value is pretty high for a generative AI use case. But still, that's starting to add up to nontrivial money if you imagine that we're doing, say, even just 1,000 of these generations a day. It's $100 a day, it's a few grand a month. Now, the new model is based off of the GPT-3.5 Turbo, which the price has dropped down to 0.15 cents per thousand tokens. And so they still apply an 8x price multiple on top of that, but it basically still rounds out to be just a little bit more than 1 cent per thousand tokens, which is basically a 90% price drop relative to the old version. I think there's been some confusion because people online are like, oh my god, it's 8 times more expensive. And I don't think that's necessarily the right comparison. First of all, if you're using the base model successfully, you're probably padding it out with quite a bit of instructions, maybe a number of examples. With the fine-tuning, you shouldn't have to do that. So you should have some comparative savings there anyway. But for most people who were not getting the performance that they needed or just not getting the accuracy, the consistency, whatever, not getting the desired behavior with a few-shot approach to 3.5, this just opens up the possibility that that's likely now going to work. And as they say, it performs better than GPT-4 in many cases once fine-tuned. So it now becomes the best thing that's out there for fine-tuning. The 3.5 Turbo base is better than Llama 2 base. It's comparable performance on some of the highest end tasks like your MMLU, which is your comprehensive, graduate level, undergrad graduate level exam battery. So that's a significant achievement for a Llama 2 model. It's hitting at a high level. But product-wise, there was another paper that just came out that showed how many false positives Llama 2 has on refusals. So you ask it things like, where can I buy a can of Coke? And it says, sorry, I can't help you acquire illegal substances or whatever. And it has a lot of that stuff going on. And the difference there is just OpenAI is making a product. And when their stuff goes wrong in that way, they hear about it and they've applied a bunch of iterations to fix it. Whereas Llama, they kind of did whatever they did and they got to their publishing point and they let it go. And they don't really, they never really cared that much whether it refuses the wrong thing. And so it does. So now you have that problem. You have to figure out what you're going to do with that. So basically, again, it's the best model that's available for fine-tuning to the public today. It's 90% cheaper than their previous offering, which addresses a significant pain point, obviously. And now you have just way less reason to feel like you're going to go off and do this on your own. Trying fine-tuning Llama 2 had been on my to-do list for Waymark because it was a little expensive and, hey, maybe we can even get better performance. I think Llama 2 is better than the original GPT-3 pretty clearly. So it was kind of on my to-do list, but there's a hurdle there to get over. Certainly, you can do it. But it's not just about doing the fine-tuning. It's also about how are you going to do your inference. 
If you actually want to serve this as part of an application as opposed to just doing some experiments, then you're going to need to serve it, and you're going to need to have some scalability, and you're going to need to have some reliability. And certainly people are popping up to offer that type of managed inference with your own fine-tuned models. But those products are still relatively immature compared to just the ease of use that OpenAI can provide. Even Mosaic, they have a former guest that got the Cognitive Revolution bump with the big outcome. Just kidding. But they have an inference product now. But when I inquired about it, it was like, yeah, we don't actually have auto-scaling. So you kind of deploy your model, one instance of it. It can then handle traffic. But if traffic kind of starts to back up, it's up to you to spin up a second version of your model and you manage that kind of complexity on your own. OpenAI doesn't require any of that complexity. Similarly with Hugging Face. Hugging Face has the inference endpoint product. It's real easy to set up an inference endpoint, and they do have some auto-scaling. But for the workloads I've been looking at, it's like, I kind of wish it were a little bit more responsive in the auto-scaling. It takes a couple minutes to ramp up that second instance and then a couple minutes to ramp up a third if it's still needed and then a couple minutes ramping back down as well. So that could be fine. But I always think about for our use case when we sell this we don't have huge usage, right? But we sell this to cable companies, whatever. Sometimes we do these little demos where there might be 100 people in a room and we're like, right, let's all do it. Let's all do one together. And obviously, you don't want your product to fail under the relatively minor load of 100 users using it in one minute. So that's a problem. The Hugging Face inference can't quite support that. Mosaic doesn't support that out of the box. I've been hearing really good things lately about Baseten. John Durbin in the Gamma episode specifically called that out, and I still need to go look into that. But it's like you're kind of on a safari to go figure out who has what, who can scale in what way. This stuff is kind of buried in documentation. And meanwhile, my experience with the OpenAI fine-tuning product, even in the last generation where it was a much bigger model in terms of parameters that we were fine-tuning, was basically that it just worked. They kind of handled that. It seemed pretty smooth. We've never really had big runtime, rate limit issues with our fine-tuned models, at least at the scale that we've worked. And it's been super convenient. So I kind of trust that even though this news is 24 hours old, so it's a little bit early to be giving API product reviews, it seems like if it just matches what they've had previously, then it's going to be not only more performant in terms of the just overall quality of language model, but probably a lot more convenient as well. And so then it's like, again, is it worth the squeeze? How much inference do I need to be doing to feel like it's worth it for me to go from a 1 cent per use to a 0.2 cent per use. Because that's the kind of savings I might be able to achieve if I went and fine-tuned my own Llama 2 and did all that deployment or whatever. I've got to be saving I've got to be running some significant scale, right, to just to get to let's say it's that 1 cent versus 0.2 cents. 
If I'm doing 10,000 calls a day, then I get up to a $100 OpenAI spend, and it would be $20 on my own. So I can maybe save $80 a day if I'm doing 10,000 calls. Is that worth it to get over those humps? For many organizations, it's not. It's going to take however many developer hours. The developer hours add up real quick. And I just don't see most people, everybody's got plenty of stuff on their to-do list. And they're also kind of mindful that if I do this, there's going to be another new, better open source model in the next two seconds or it's going to be something else and I'm going to end up kind of chasing my tail on this to a certain degree. Whereas if I just kind of go with OpenAI, they'll probably have another update. We're currently fine-tuning off the June update and they'll probably have a September update. And whatever kind of little goodies are in that, I'll get those. So right behind this too, they also have, in terms of the next goodies, they're going to have function calling coming soon. That's not included in this one yet, but nobody else has the function calling at the level they do anyway. So again, that's just another distance they're going to put between themselves and the competition. And then with GPT-4 fine-tuning, even more so, nobody can really match that. So I don't know. It feels like this is the least hot take possible on the release. It's basically it's very strong. And I think it kind of just sustains the narrative that there are moats and they're not totally insurmountable and they don't freeze all competition out for all use cases by any means. But I do think a lot of people kind of crossed off explore fine-tuning Llama 2 from their to-do list yesterday. Certainly, I did. I don't really need to do that anymore. I'm still very interested in what's out there, but I'm pretty confident that this is going to be the best performance and the best total cost of ownership ROI for us.
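As a rough illustration of the break-even math above, here is a minimal sketch in Python. The per-call costs (about 1 cent per generation through OpenAI's fine-tuned 3.5 Turbo versus roughly 0.2 cents self-hosting a fine-tuned Llama 2) are the ballpark figures from the conversation, and the one-time engineering cost is a hypothetical placeholder, not a number from the episode.

```python
# Break-even sketch: managed fine-tuning (OpenAI) vs. self-hosted Llama 2.
# All figures are illustrative estimates, not measured costs.

OPENAI_COST_PER_CALL = 0.01        # ~1 cent per generation (roughly 1K tokens)
SELF_HOSTED_COST_PER_CALL = 0.002  # ~0.2 cents per generation, self-hosted
ENGINEERING_COST = 10_000          # hypothetical one-time dev + deployment cost

def days_to_break_even(calls_per_day: int) -> float:
    """Days of usage before self-hosting savings cover the up-front cost."""
    daily_savings = calls_per_day * (OPENAI_COST_PER_CALL - SELF_HOSTED_COST_PER_CALL)
    return ENGINEERING_COST / daily_savings

for volume in (1_000, 10_000, 100_000):
    print(f"{volume:>7} calls/day -> {days_to_break_even(volume):.0f} days to break even")
```

At 10,000 calls a day the savings come to about $80 a day, matching the figure above; whether that ever pays back depends entirely on how much the assumed engineering and operations work actually costs.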
Erik Torenberg: (14:14) Yeah. So just to summarize some of the major points, it's 8x more expensive than the base model, but 90% cheaper than previous fine-tuning options. And this puts OpenAI at a much better position than Llama 2. And people who had fine-tuning Llama 2 on the to-do list maybe don't need to bother at the moment.
Nathan Labenz: (14:39) Mostly, yeah. And notice too in their announcement, they really emphasize the "we don't use your data" point. I think the main thing is that people, if they'd heard everything I said so far, might then say, well, but I don't want OpenAI using my data or training on my data or whatever. And so, I mean, you still have to trust that they're telling the truth, of course. But they basically lead their announcement with, just as a reminder, we don't use any of that data for any purpose. It doesn't train our models or anybody else's models; it's your data, whatever. And now let's get into the features. So I think they are very aware that people want to not be co-mingling their data with other people's data, and the policies have been updated to reflect that, and presumably, they're following those internally, and they're definitely really emphasizing it. The illicit use case remains a concern, right? I mean, if you're trying to do something that OpenAI doesn't want you to do, they do have moderation in place on this. They're using a GPT-4 powered moderation tool to look through the data and try to identify, are you training a spear-phishing bot or whatever? And that is an area that I'm really going to be watching because I kind of doubt it's going to work, or at least I think it's going to be pretty easy for people to get around.
Erik Torenberg: (16:06) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (16:10) Most of these classifiers work pretty well on a naive use case. Meaning if you just set up a dataset of super egregious behavior and try to run it through, it'll probably catch that. But I think you can be clever. I think you can kind of disguise what you're doing. And they interestingly too, they said this time that they encourage you to use instructions. So again, going back to that earlier comment where previous fine-tuning was on a model that didn't have the instruction training. Well, now you do. So in the past, they used to say, no need to include instructions. It's just show us a bunch of examples, and that's that. Now you can kind of benefit from the instruction and show the examples at the same time. So they encourage that inclusion of instructions. And that also probably is something that they would like to see you include because it might help them understand what task it is you're trying to do and kind of moderate your tasks more effectively. Right? If you wanted to train a spear-phishing bot or whatever, it would be maybe hard to do that with pure examples. Certainly, you'd need quite a few examples to just get across to the model that this is what we're trying to do here, especially if it was subtle in the conversation. If you wanted to do that much faster, you could include the instructions, but then those instructions telling the model what to do would likely set off their moderation filters. But I still kind of think that either just by using a lot of examples and not using any instructions or just otherwise kind of being clever and hiding the true intent, that my guess is people will find it not super hard to get around those controls right now. But it's like, from a safety standpoint, I think OpenAI is not just, what's their loss function, so to speak. Right? They're not just minimizing the harm done through their own platform at this point. And that is, again, part of the effect of Llama 2. I think they are really taking this kind of stuff super seriously. They don't want to see people do harmful fine-tunings on their platform. But when there is now a true open source, ungovernable alternative, now they don't really have to worry about that quite as much. Right? Because it's like, yeah, somebody might sneak some stuff through our filter, but they can also just go do Llama now for many things. And therefore we kind of don't have to worry about it as much. We'll try, obviously, we'll catch what we can. We can put the best policies in place that we can. But our incremental we're not unlocking something that people can't do qualitatively given that this is out there, all those conveniences or whatever. That's nice for the scammers to have too, but that's maybe not the key question or criteria that they'll be making their decisions on. That'll definitely be very interesting to watch. Yeah. And that's why they're also, in terms of why 3.5 before 4, probably multiple reasons, but I think the safety one is a big reason, keeping in mind that they had GPT-4 done training for 3 months before they launched ChatGPT. They definitely kind of take this approach of, we want to have well, I'm not sure if this is exactly how they think about it, but even if they have a next generation model and product ready to go, and clearly we know that people are fine-tuning GPT-4, right? Starting with Bing and plenty of others, Cursor, the GPT-4 native coding environment that's been blowing up recently. They had early access to GPT-4 as well. 
I guess I don't know if they're fine-tuning for sure, but there are organizations that are doing this GPT-4 fine-tuning. That product is probably mostly ready to go, and they're releasing this one first. Maybe there's still some issues to be worked out or some scalability or they need more GPUs or whatever. But probably also part of it is, let's see what people do here. Let's give these filters a chance to work. People do have a, roughly speaking, a GPT-3.5 equivalent in Llama that they can go use. So we're not really from a harmful use case standpoint, we're not really unlocking anything super major here. But if we did jump straight to GPT-4 fine-tuning, we might be. So let's kind of take the step and see how everything settles and try to catch any flagrant stuff before we roll out the next version.
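To make the "instructions plus examples" point from a few minutes earlier concrete, here is a minimal sketch of one training record in the chat-style fine-tuning format, written in Python. The system instruction and the ad-script example are hypothetical placeholders, not actual Waymark data.

```python
import json

# One hypothetical training record: an instruction (system message) plus a
# worked example (user input -> desired assistant output).
record = {
    "messages": [
        {"role": "system",
         "content": "You write 30-second local TV ad scripts from business details."},
        {"role": "user",
         "content": "Business: Joe's Bakery. Known for sourdough. Open since 1985."},
        {"role": "assistant",
         "content": "Since 1985, Joe's Bakery has baked sourdough the slow way..."},
    ]
}

# Fine-tuning data is uploaded as JSON Lines: one record like this per line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Including the system instruction is exactly the combination discussed above: the examples teach the behavior, and the instruction both speeds that up and makes the intended task legible to moderation.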
Erik Torenberg: (20:49) Yeah. Are there other questions that you have on your mind here that we haven't yet discussed or things that you'll be looking to watch out for that will be determinant of something or worth noting?
Nathan Labenz: (21:03) Yeah. I mean, the big thing to keep in mind is just, we've got to use it. Right? Exactly how good is it? What are the best practices? I thought their release was very clean. They also introduced a new completions fine-tunable model, which is I mean, this is really inside baseball. But for the app developers, there is a little bit of friction between the old version, which was just input output, formatted in the classic autocomplete, whatever you put in, it just kind of continues that. The new APIs, the GPT-4, these are chat modalities, so everything is structured as messages. It's the human message and the assistant message, and you can kind of set up even a back and forth to begin your interaction if you want to, or you can continue one that's already happening. And that's the thing they're really trying to push folks toward. But I was interested to see also that the completions version, which for us, having used their earlier fine-tuned models, that's a drop-in replacement. I don't even have to reformat my data. I don't have to convert everything to chat. I can just literally take the exact same input output pairs that I've been using and use it on this new version. That's probably the biggest thing that's not super clear because that's not exactly 3.5 Turbo. They call it DaVinci 002, and it's not super well specified what that is, and that model hasn't got a ton of attention. But basically, I think it's probably equivalent to 3.5. It should work fine. But that's something that I will personally be looking at over the next week or so as we see, hey, maybe we can save 90% cost. It'll probably be faster as well. That may open up additional use cases for us where we might generate two at the same time or whatever. So but that is kind of one little wrinkle. Notably, that also was not marked as they've been pushing this chat stuff pretty hard, but that DaVinci 002 completion style was marked as legacy in their presentation, not as deprecated, suggesting that they do intend to support it for the foreseeable future even though they encourage you to use the chat version, especially if you're just starting out. But there are a lot of folks who have been customers of the former versions that now they can just seamlessly upgrade, and then they can kind of roll over to a chat modality at their convenience. Another topic that I do think has been kind of going around lately that's a little weird that might be worth addressing is the ChatGPT's traffic numbers are in decline. This whole thing isn't really panning out. I just saw an article today where it was like, scamming, spamming, and shaming or whatever is all AI has proven to be good for. And I mean, I guess if people are listening to this show, they're probably not buying that narrative. But I looked into that article, it's like, no mention of code, no mention of education, no mention of medicine. I mean, good God. There's a lot going on beyond the scamming. And I'm someone who is concerned about the scamming. But yeah, the traffic numbers being down, I think, are also probably a bit of a red herring. Products are being rolled out everywhere where the language model integration is kind of coming with you. So it's not shocking to me that people might be going to ChatGPT a little less. I would say I probably also go to it a little less than I did when it was new because now I'm also going to Perplexity, and I'm also going to Claude. And the code models that are built into my native development environment have gotten better. 
And in some context, I'm using a tool that is, officially speaking, an API tool, even though it's just kind of a thin wrapper that I'm using to organize templates or collect usage data. At Athena, we have this. We use HumanLoop, from another former guest, and everybody can access GPT-4 through our company level account, so we don't have to buy ChatGPT, product accounts for every single person. So just paying for usage saves a lot of money. So a lot of that stuff is kind of matured where organizations are like, I could access this this way for $20 a month per person, or I could access it over here for just whatever tokens we're going to use. And that's a lot cheaper, so maybe we'll do that. And there's other upsides as well that those kind of auxiliary products bring. So I don't think that if you were to look, they don't publish it. But if you were to look at the tokens served graph, I would be very confident that it continues to grow at a very healthy pace across all the leaders. And I just think people are very eager for sort of a counter narrative in some of these cases. There was another good data point on this, Dario from Anthropic, CEO of Anthropic did, he keeps a pretty low profile. I think he's starting to change that now as, yeah, you can only meet with so many heads of state before you kind of have to do some media as well. But so he recently just did a first big interview in a while, and there were a few things in that interview that definitely caught my attention, not necessarily even they were surprising, but he said about their usage that it's basically exponential. And he's like, we don't even really try that hard to commercialize. Obviously, we're kind of moving that direction, but it's like hasn't been our focus, and yet the number just keeps going up. And he's like, people are just they're still kind of just figuring out what to do with it, but we're not even really trying, and the number just keeps going up. So then he also goes on to say, I can only imagine what that looks like at organizations where that is the primary focus. And so it sounds like he's also pretty confident that the OpenAI number continues to go up and up. Just this morning, I was doing I'm participating in a charity evaluation process, and a lot of the charities, organizations are related to AI safety in some way, shape, or form. That's why I'm involved. That's what the funder is really interested in. And you get into some pretty technical stuff sometimes. Like, we're doing this mechanistic interpretability research. Here's our last 3 papers. Here's our GitHub library where we're sharing our work. Okay. Cool. Well, now I've got to assess that. So I've got to understand it first, and I can get on the phone with them. But I end up preparing quite similarly to how I prepare for a podcast episode where I really read the work in detail, try to figure out how does it actually function in a technical sense. And GPT-4 is awesome for that. Just bringing in notation, copying method sections, asking it I've started asking it to explain sections of papers to me like I'm a I think I said new grad student this morning. And you could ask for it to explain like you're five as well. That might be a little too simple. But the idea that it's not finding use cases to me is just kind of farcical. I think we've got work to do to figure out how best to use it in our individual lives, but the ability to get tutored on a machine learning paper on demand, I mean, for me, that's just absolutely huge value. Right? 
I mean, and I get it for cents, it's pennies. The relative cost here is your classic 10,000x. Right? If I needed to go find an expert and I was going to pay some sort of market-ish rate, ad hoc AI consulting is easily into the couple hundred dollars an hour plus. And what I get for 2 to 5 cents would easily be $200 plus if I wanted to go hire a grad student or a PhD to explain those concepts to me. Hard to beat that ratio of savings. That's for sure.
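For comparison with the chat format shown earlier, here is a sketch of what a record in the legacy completions-style format looks like for the DaVinci-002-type model Nathan mentions above, assuming the older prompt/completion JSONL convention; the separator and stop text are common conventions rather than requirements, and the content is again hypothetical.

```python
import json

# Legacy completions-style training data: plain prompt -> completion pairs, so
# existing input/output pairs can be reused without converting to chat messages.
pair = {
    "prompt": "Business: Joe's Bakery. Known for sourdough. Open since 1985.\n\n###\n\n",
    "completion": " Since 1985, Joe's Bakery has baked sourdough the slow way... END",
}

with open("legacy_train.jsonl", "w") as f:
    f.write(json.dumps(pair) + "\n")
```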
Erik Torenberg: (29:45) You mentioned code, education, and healthcare. Why don't you give one or two use cases for each of the things that you're most excited about or seem the most promising in the somewhat near term? I know we've had a number of guests on and had long-form conversations. Why don't you kind of summarize?
Nathan Labenz: (30:02) Yeah. We need to do more with code, actually. So much of my guest selection is based on my personal curiosity and just the things that I see that I really want to learn more about. So if there's a reason that we've done less on the show with code, it's probably because that's what I have done the most with on a day-in and day-out basis. Replit definitely still counts there, although that's a grander vision than just today's coding use cases. But GPT-4 is a very good coder. It's able to follow directions extremely well. Andrej Karpathy has noted, and Michele noted in the interview, that they've had product managers win hackathon competitions because what the product managers are really good at is specifying exactly what they want, and the models are pretty good at following those instructions and converting them to working code. And so if you are really good about your instructions, you can get pretty good code back in most cases.
Transforming code from one use case to another—I've done a little bit of coding by analogy that's pretty interesting where you say, okay, here's some working code that I either found online or I have from a previous project or whatever. Here's what it does. Now I want you to make a different version that does X a bit differently. That can really help because, especially with GPT-4, if you're using it raw without the benefit of connecting it to the internet or whatever, a lot of the things you might be using are potentially new since GPT-4's training cutoff date or are updated since then. And so exactly how is that API call made or exactly how is that library used today? It may not have great command of, but it does a great job of using the example. And so the formula is: here's a working example, here's what it does, here's what I want to do that's different, use that working example for inspiration. And that is a super effective approach for me.
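The "coding by analogy" formula described above maps naturally onto a single prompt. Here is a minimal sketch using the pre-1.0 OpenAI Python client; the reference snippet, the requested change, and the rest of the details are placeholders for illustration.

```python
import openai  # pre-1.0 openai client assumed; adjust for newer versions

# Coding by analogy: give the model working reference code, say what it does,
# then ask for a variant that does something slightly different.
reference_code = """
def fetch_weather(city: str) -> dict:
    ...  # working code you found online or wrote for a previous project
"""

prompt = (
    "Here is some working code:\n"
    f"{reference_code}\n"
    "It fetches the current weather for a city from an HTTP API.\n\n"
    "Using it as inspiration, write a version that fetches a 5-day forecast "
    "instead, keeping the same structure and error handling."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```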
I think that's only going to improve with things like Cursor and Replit launching these true pair programmer-like experiences where they can even set up file structures, set up configuration, set up a whole environment, do hierarchical type work where, okay, now having set up all those files, now go write into each file. I mean, you can get pretty far on a lot of projects with GPT-4 coding assistance. It is a bit of a skill in and of itself. It's not the kind of thing that is killer on first use, but it doesn't take long, honestly, to get to pretty high-value use in code. And then you can also debug.
I think it's fascinating in general with these language models: the more expertise you can bring, the more value it can in turn give you back. So if you're a real newbie and you're trying to learn to code, then it can happen pretty easily where you can go off the rails and get confused. And now once you're confused and it's also confused, then everything can spiral off into confusion. So some of what I've been trying to teach the executive assistants is how not to let that happen to them. But if you're pretty savvy, then you have a pretty good sense of what to bring it, and it will typically reward a well-structured query with usually a pretty good response. So it can be really good with debugging.
Andrej Karpathy from OpenAI has recently—it's been a couple weeks, but was doing some stuff in C where he was just trying to see how, with just the simplest code, simplest possible, lowest level, cleanest, most efficient code, how hard is it to run inference with a model on a CPU? Obviously, everybody's been talking about the GPU shortage, but the CPU can do any computation that you want it to do, and that does include all the matrix multiplication of a model forward pass. So it's not optimized for that, but it is super optimized in general. How much can you push that to work just with a CPU resource?
And he was finding great success with it between a mix in that case of, first of all, he's super sharp, knows what he's talking about, has very deep command of the concepts. So he can say with high specificity what he wants. But as he noted, pretty rusty coding in C, which is a pretty gnarly language where it's easy to make mistakes and you can have memory leaks. I'm not an expert in that by any means. But it's low-level stuff where the details really matter, and it is easy to make mistakes. So with the help of GPT-4 helping him code in C, which he hadn't really done—he just felt rusty on—now he's just breezing through it. He's able to focus at the conceptual level of what he really wants and thinks matters, and it's driving all the low-level stuff. And then bringing that on top of that to Twitter and just sharing with the community, then all of a sudden people are like, what if you change this little thing? What if you change this little thing? And he's actually getting to the point where midsize, your sort of 7B-type models, are running on a CPU at a decent clip, which is a pretty big deal.
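The underlying point, that a CPU can run every operation in a model's forward pass, just without a GPU's speed, comes down to ordinary matrix multiplication. Here is a toy Python/numpy sketch of one feed-forward step of the kind that projects like Karpathy's C experiments implement at a much lower level; the dimensions are arbitrary and nothing here is taken from his code.

```python
import numpy as np

# Toy transformer-style feed-forward step on CPU: a couple of matrix multiplies
# and a nonlinearity, all ordinary arithmetic that any CPU can execute. The open
# question discussed above is only how fast, not whether it can run at all.
d_model, d_hidden = 512, 2048
x = np.random.randn(d_model).astype(np.float32)             # one token's activation
W1 = np.random.randn(d_model, d_hidden).astype(np.float32)
W2 = np.random.randn(d_hidden, d_model).astype(np.float32)

def feed_forward(x: np.ndarray) -> np.ndarray:
    h = np.maximum(x @ W1, 0.0)  # matmul + ReLU
    return h @ W2                # project back to the model dimension

print(feed_forward(x).shape)  # (512,)
```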
People aren't generally going to want to do that on their laptop because you're really using your CPU to do that generation. So it's going to be resource intensive still. But the fact that you can do that at all—I mean, there's certainly an unbelievable amount of just CPU resource sitting out there. And if those can now run a 7B model at an acceptable pace for many use cases, I mean, that's huge. And again, this is a leading thinker in the space who is just greatly accelerated by the ability to have GPT-4 write most of the code. I think that stuff only intensifies the—he's the 10-to-100x developer just on his own, and then you add on another 10-to-100x. Yeah, that's maybe a little hyperbolic, but you can start to see your way to the 10,000x developer that Omneky alludes to. So that's code.
Education—I think the Khan Academy episode is really good with Sal. I think there, it's really just the scalability more than anything else. We're ironing out all these rough spots, and it definitely can—if you're looking at AP Physics-level questions where you have to tease apart exactly how does this work and what's the relationship between things, it's not always perfect at that yet. Certainly, as you go higher level than there, it can sometimes struggle. But through elementary and most of high school, it's got pretty good command. They've figured out how to make sure it stays on task with you, and they've got some increasingly good checks for: is it getting confused? Are we going off the rails here?
In using that, it basically seemed like a similar experience to just GPT normal edition. But instead of just telling you the answers, it takes a much more Socratic approach and tries to lead you to the answers and coach you and tries to make things relevant for you. I think just the scalability—it's like the same thing as I was just talking about my own education, trying to read these papers in areas that I'm not super familiar with. The ratio of teacher to student is now basically one-to-one. And that has never really been possible. Right? I mean, I think it's as simple as that.
I don't think they're doing all that much, and this could come over time. I think one possible promise of AI in education would be more systematic experimentation, more ability to really isolate techniques that work. That's, I think, super hard to do in general when you have all the layers of: okay, we're going to teach this to the teachers, and then they're going to do it. And then eventually we're going to have, a month later, we're going to have a standardized test or whatever. But did they really do it? And was the kid there that day? I mean, there's just so much noise.
Bringing those experiments down one to two orders of magnitude and just looking at: let's say we take this tack or that tack on this kind of question, super narrow, which one works better? I think that kind of A/B testing probably does work and is just not, at least in the beginning—I mean, everything, you might have low-hanging fruit that could get depleted before too long. But I would suspect that there is still probably quite a bit of low-hanging fruit in terms of: how do we explain this concept? What's the best way to explain fractions to a fourth grader? Do we really know that? I don't think we really know that. And what's the second-best way? And are there correlations that we might detect between the way that they learned the previous thing and the way that they're most likely to learn this thing? That's definitely not happening systematically today. You have certainly teachers, of course, that are trying to understand how kids learn and trying to cater to that to the best of their abilities.
But it seems like where we are now is we're getting this massive benefit of scale where just you have immediate direct access to the tutor, and that's potentially a game changer, almost certainly a game changer in and of itself. But can the tutoring actually get better as people really drill into these specific skills and figure out how best to teach them and maybe make some actual sense of learning styles in a way where it can be delivered just the way it's going to be most helpful for you as opposed to a lot of learning styles in a classroom and kind of pay lip service to it, but what can you really do about that? I would bet that that is going to be pretty powerful. I guess we'll see. There's certainly a whole new class of experiments and a whole new kind of clarity of data that you can capture relative to previous regimes. So I guess we'll see if we have any theories that actually prove out on that data. But I'd be pretty surprised if that data doesn't unlock something.
And then let's see, medicine. I mean, boy, the multimodal Med-PaLM is kind of the latest thing on my mind, certainly there. The promise of that is—this is what I said in the intro to that episode—the promise of that is no less than the AI doctor on demand in your pocket, 24/7, pennies on the dollar. Now with the ability to understand images as well, take a picture of that thing on your skin or even get that X-ray that you got at the clinic that's now in your medical record and go take it to the thing for a second opinion. I feel like that's getting really close.
At Google, they're obviously going to be somewhat conservative and careful, as they should be, about not overpromising and not—they're not going to take an Elon approach and just go deploy first and ask questions later. But it does seem to me like they have to be getting really close to the point where it would be beneficial to health outcomes to have the thing. And that was what I was interested in seeing where they were going to go with this next. It sounds like there's maybe another generation or two before they're going to get real serious about it, but I was just kind of like: at what point do you just say, okay, we've benchmarked this thing every which way, now let's just go take a patient population, give it to some of them, and come back in a while and see what happens?
And I would bet that it would be beneficial. And you could measure beneficial in a lot of ways. Are people going to all of a sudden be living longer? I doubt that there'd be a very strong signal in that respect just yet. Maybe some things. It's a definite hassle to go do this kind of diagnostic stuff. We just had a thing where our baby, baby Charlie, who's four months old now, has a little dimple on his sacrum, which is—people can look it up, but it's this little thing where at the very base of the spine, the skin hasn't totally closed the way it should. And apparently this is quite common. Pediatricians see it all the time. The pediatrician's like, you got to go get an ultrasound for this because there's like a one-in-a-bazillion—I don't know, whatever, one-in-some-large-number chance that it could be a really serious bad thing, but it's almost for sure not a bad thing. And we eventually dragged ourselves to some hospital across town on a Saturday to get this thing done, and sure enough, okay, he's fine.
I wouldn't be shocked if there is some benefit there now just in terms of helping people understand when they really need to go seek care versus when they can just ignore stuff. And that could maybe lead to a somewhat moved needle in terms of the most important health outcomes. But I would expect that to be pretty minor for the short term. Nevertheless, when people get—this is kind of in the case in instances where they've tested Medicaid as well—it's a lot of times not exactly that you are objectively, quote-unquote, healthier, at least on a short time horizon, but people still report being overall much better off. They're financially better off because they're able to get care and decide when to get care effectively. I mean, in Medicaid, it's being paid for versus not paid for. Here, it might just be like, you can use your budget or your available resources more effectively. So if you don't have to waste time and money going and doing something, then that could be great. Just saving the time and money.
Peace of mind is another one. I think Medicaid people traditionally find or typically find: people just report having better peace of mind, better mental health, better outlook, more comfort in knowing that they're taken care of. I can imagine that also could be apparent in a Med-PaLM trial. So it does seem like we're quite close to that. It'll be really interesting to see. So far, I think the medical establishment has been really positively receptive to this kind of stuff. Appropriate discretion and sort of not rushing, but also not hostile to using AI to improve the overall system. I've been honestly quite impressed. I was probably a little bit more cynical in the past and would have expected more just outright resistance out of pure self-interest. And I've been pretty pleasantly surprised that not much of that seems to have happened.
So maybe it'll start as it's like, oh shit. Because by the way, Med-PaLM M radiology reports are preferred 40% of the time to the human radiologist. So it's still losing, but it's not losing by much. And especially if it were to cost one-one-thousandth, you might in fact—and by the way, there's probably also additional prompt techniques there. I'm guessing there may be issues where, or just simple techniques where, you might even be able to get that to parity if you really tried. But in any event, would you take something that's preferred 40% of the time if it costs one-one-thousandth? Yeah, probably. In a lot of cases, you would.
So do the radiologists start to unionize around this in the near future? Or do they just continue to say, hey, this is great, we're overburdened? I think that'll be very interesting to find out. And I think the other thing that'll be interesting to find out is, much like with the self-driving cars, what standard do we ultimately want to set before we can actually adopt a system like this? Seems pretty clear at this point that the self-driving cars are roughly as safe as humans. That doesn't seem to be enough for adoption for some reason. It seems like we're maybe more headed for a world where it needs to be 10—an order of magnitude safer, like a 90% risk reduction before people will actually want to adopt it.
Is the same standard going to prevail in medicine? I suspect that the standard for self-driving cars is going to be a little higher just because when that fails, it can fail spectacularly. And you're in an accident and that's a big problem. Whereas with the medical interaction, it's not doing surgery on you, right? It's advising you at most at this point. So you have much more of a buffer to double-check things that don't make sense or weave that into existing structures in a way that's not so either-or as self-driving or not.
But it definitely will be interesting to see where that bar gets set and if protectionism pops up. But the march of progress from those guys is just incredible. Every couple months to add on just a massive leap in performance or a massive addition of capabilities. It's like, at the current pace, it's really hard to imagine that it's more than two years away before they're done-ish. The work is never done in some sense. You can always go make them superhuman. You can teach them to read genetic information in a way that humans can't and don't. But it does seem like just a methodical, continued, disciplined adding of use cases, adding data types, fixing certain things that were obviously not great—the low-hanging fruit that they cited in that episode, some of it just around the bottlenecks that they put images through, the low resolution at which—I think it's 256x256—that the model sees the image just due to prior structure and not wanting to rework that at the time of doing the project, also because they sort of know that this is not the final form anyway, so they're not even trying to max out. They're just trying to prove various theories.
Seems like it's going to be pretty damn far in the not-too-too-distant future. So time to start that conversation, right? What's that clinical trial look like? What's the standard? They said regulation isn't exactly blocking. It was an interesting answer because I said, is regulation blocking you? Like, do you feel like there's a law saying you can't do this? Or is it more of, you want a law that you can be definitely following before you would want to do something like this? And they're like, well, we have probably a couple generations to go anyway. And so they're just focused on: hey, we're just going to keep making this thing better. But meanwhile, that question is becoming pretty live, I think.
Erik Torenberg: (50:43) Maybe that's a good place to wrap. Was a fascinating overview and, yeah, always a pleasure. Until next time, Nathan.
Nathan Labenz: (50:50) Appreciate it, Erik. Talk soon. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.