The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Kyle Corbitt explains how reinforcement learning differs from supervised fine-tuning and why GRPO, LLM-as-judge rubrics, and environment design matter. The discussion covers reward hacking, LoRA adapters, distillation, and fine-tuning tradeoffs.

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Watch Episode Here


Listen to Episode Here


Show Notes

Kyle Corbitt, founder of OpenPipe, breaks down reinforcement learning and custom fine-tuning for modern AI models. He explains how RL differs from supervised fine-tuning, why GRPO and LLM-as-judge post-training matter, and how these techniques can improve performance, latency, and cost on open source models. The conversation also covers reward hacking, evaluation design, LoRA adapters, and how Chinese labs are using distillation to fast-follow frontier models.

LINKS:

Sponsors:

Sequence:

Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code Cognizant in the source field to save 20% off year one

AvePoint:

AvePoint is building the control layer for AI agents so you can securely govern, audit, and recover every action at scale. Design trusted agentic outcomes from day one at https://avpt.co/tcr

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude:

Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(02:50) Framing RL tradeoffs

(10:06) Weight update dynamics (Part 1)

(14:25) Sponsors: Sequence | AvePoint

(16:39) Weight update dynamics (Part 2)

(18:53) GRPO credit assignment

(31:47) Superhuman reasoning gains (Part 1)

(31:54) Sponsors: VCX | Claude

(35:00) Superhuman reasoning gains (Part 2)

(48:28) Distillation and competition

(01:00:25) Environments and data

(01:12:10) Reality feedback loops

(01:18:48) Enterprise fine-tuning decisions

(01:27:04) Rubrics and reward hacking

(01:37:43) Adapters and CoreWeave

(01:44:24) Episode Outro

(01:47:52) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

[00:00] Hello, and welcome back to the Cognitive Revolution!

Today, my guest is Kyle Corbitt, founder of the Reinforcement Learning & Custom Fine-tuning company OpenPipe, which CoreWeave acquired last year.

I open this episode with a bit of a confession: I've done a lot of Supervised Fine-Tuning work over the last few years, both for Waymark in the early days of getting GPT-3 to write decent video scripts, and for research projects such as the Emergent Misalignment paper, but I've done essentially no hands-on RL work, both because my perception has been that frontier models are probably my best option in any case, and because I'm afraid, perhaps irrationally, of reward hacking.

Kyle says that while it may or may not be worth the extra work and slower iteration time, he does believe that using RL on an open source model probably could deliver better performance, and would certainly reduce both latency and inference cost dramatically.

With that motivation in mind, Kyle proceeds to offer a master class on all things RL, which repeatedly challenged my premises and in multiple instances updated my understanding.

He explains how RL differs from SFT in terms of the weight updates it makes to models, how this makes RL fine-tuning less likely to cause catastrophic forgetting, what distinguished the DeepSeek GRPO algorithm from its predecessors, and what additional improvements on GRPO people are using in industry today.

We talk about the distillation strategies that Chinese labs are using to fast-follow American frontier models, and he argues that their use of LLMs as judge in the context of RL post-training is a bigger deal than supervised fine-tuning. 

He also explains why he thinks that compute is the primary constraint preventing Chinese companies from catching up, and why he believes we're already in a recursive self-improvement loop.  

He describes the cottage industry of reinforcement learning environment companies that's sprung up to serve frontier labs, and why, though it's a good business to be in for now, he's declined to invest in any of them. 

He surveys the use cases that are most commonly deployed by CoreWeave customers, and he offers a lot of advice on how to run RL in practice, including how to develop and iterate on evaluation rubrics, whether to train N models for N tasks or a single model to perform multiple tasks, how the flagrant nature of reward hacking makes it relatively easy to deal with when you're focused on specific, narrow tasks, and how CoreWeave's use of LoRA adapters drives efficiency and convenience for their customers.

Kyle is both a technical expert and a successful commercial practitioner, and from start to finish, this is a super-high-signal conversation on a classic training technique that has become an industry unto itself.  

And so, I hope you learn as much as I did from CoreWeave RL fine-tuning guru, Kyle Corbitt.

Main Episode

[02:50] Nathan Labenz: Kyle Corbitt, founder of OpenPipe, now after an acquisition, leading the serverless training team at CoreWeave. Welcome to the Cognitive Revolution.

[03:00] Kyle Corbitt: I am super excited to be here.

[03:03] Nathan Labenz: Thank you. I'm excited to have you. This has been a long time coming since we met almost a year ago now, and I'm glad to finally be doing it. That's all on me, by the way, as everybody knows. You are a specialist in reinforcement learning. What I want to do in the next hour and a half or so is get basically a comprehensive survey crash course rundown of what is going on in reinforcement learning, how we should understand it, what the techniques are looking like, who's using it and where and for what purposes, and who's having success and not, and, you know, what makes a difference and all those things. And so I guess I was just going to start by telling you my story very briefly and then allowing you to react to that and tell me if I'm way off base or not. My story in short is I've done a lot of model fine-tuning over time, mostly on managed platforms, not so much on open weights models, just a little bit of that, more so on the OpenAI platform, but almost entirely supervised fine-tuning, not really much at all reinforcement learning fine-tuning. And the story I'm telling myself, which you're invited to pick apart, is, for one thing, increasingly these days I can just use base models with few-shot prompting and that's getting me a lot of what I need. But even before that was possible, the problems that I was working on in the context of my company Waymark are sort of taste-driven problems where we always kind of felt like we'd be better off going to our creative team and saying, Hey, give us a hundred great examples. We'll fine-tune on that and hope that the AI can follow your lead, rather than try to go through some sort of seemingly more complicated, maybe more powerful, but kind of harder to wrap our heads around notion of, well, if we get the AI to do it and then we compare and we score, maybe there's an LLM as a judge. We're kind of like, I don't know. It feels a little bit sometimes like a shell game and I'm not sure where I should invest or how much I should trust that process. Whereas I know at least if the AI is imitating my creative team, there's some decent true north there. And then the other thing I'm afraid of, although I'm not so sure it's a big problem in my context, is reward hacking. But I am afraid of reward hacking in general. How would you advise me on whether or not I'm making a good decision? Should I be using reinforcement learning, or am I thinking about it the right way?

[05:23] Kyle Corbitt: Yeah, that's a great question and I think one that lots of folks think about. Maybe my first question for you would be, how were the results you were getting from your existing process? So you mentioned, first of all, that these days you mostly just do prompting, but when you were doing fine-tuning with SFT, do you feel like you were seeing the models improve substantially on that? Do you feel like it was-- and then this is a very high bar, which I would imagine it wouldn't clear, but did you feel like it was behaving as well as your creative team and matching the quality of those examples it was given post-training?

[05:55] Nathan Labenz: I would definitely not say it was matching the best work that our creative team could do, but definitely a notable improvement on the base model. I would say our typical complaint was probably most often that, and this would definitely vary through different generations, but more recently it was like able to do the job perfectly well, so to speak. And I think that's true today too with prompting, but few moments where you're like, damn, that was awesome. Incredible turn of phrase or, you know, nailed it, really nailed it, you know, in the way that sometimes you just get something from the creative team that's like, oh, wow, like that was a really good creative idea that impressed me, surprised me, delighted me. I wouldn't say we see too much of that coming from models even today.

[06:45] Kyle Corbitt: Okay. Yeah. Okay. So that was going to be my next question. So yeah, even with the latest Frontier models, you know, that sort of spark or wow moment, it sounds like is not something you see commonly.

[06:57] Nathan Labenz: Rarely at best, I would say, yeah.

[06:59] Kyle Corbitt: Yeah. So here's what I think. I think it is likely that you would have been able to get better performance out of the models with reinforcement learning than with SFT. And there's a few different factors here that sort of muddle it. I mean, one is that OpenAI's support for RL was half-hearted at best at any given point. And I think technically they still do it, but that entire model customization platform feels very much in maintenance mode at this point generally. So I think on that side, yeah, that might just not have worked. In a parallel universe where you were using an open-source model and you're using a Qwen model or something like that, then I would say with a fairly high degree of confidence that if you're able to get decent results out of SFT, the ceiling of the best results you can get with reinforcement learning is going to be higher. And that's true even if the data you're using for SFT is high-quality data, human data. And the reason why is because it's just the whole trick to RL, the whole reason RL works, or is something that people invest in at all, is because it turns out it really does matter how well your data distribution matches the model's, you know, standard mode of thinking, or just what it's picked up from pre-training. And what RL gets you is it's working within those channels that are already carved quite deeply within the model. And when you work within those channels, you can just get a lot further because you're not trying to overwrite what it's doing. And yeah, you might say, Well, overwriting is what we're trying to do. We're trying to get it to do something it's not good at, which is fair, but it ends up being quite destructive. It's actually really interesting if you look at the weights. If you're doing SFT, even with very few examples and even with a very low learning rate, it's just throwing the weights all to pieces, and the average differences are so much larger than doing RL. That's a big part of why you get this catastrophic forgetting, because it's overriding other pathways. It's just like you're trying to get the model to do something that's quite different than what it was trained to do, whereas RL is going to let you stay in those grooves and get a lot further. Yes, I do think that would have worked. Now, in your specific case, would it be worth it? Would it get you to a place where it's like, oh, this is better than just using the frontier? My guess is probably not. So I think concretely for your task, if the trade-off you're making is, Hey, we're going to take an open-source model and use RL to try and make it better at this, versus, Hey, we're just going to take whatever the best off-the-shelf model is and do prompt engineering, and we're allowing ourselves to expand to the best frontier models. At that point, I suspect for a creative writing task, you would end up in a position where you're better off using the frontier models. And yeah, we can sort of get into-- there are definitely tasks where I would say the exact opposite and say that the RL could do well. I would also say that this is obviously all dependent on the amount of compute. I think, theoretically, anything's possible. If you buy yourself a data center and spend a couple billion dollars on this task, you would be able to surpass the frontier. But the trade-off point would be fairly far along that curve, I suspect, for a task of this shape.
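To make the "look at the weights" comparison concrete, here is a minimal sketch of how you might measure parameter drift yourself: load a base checkpoint and a fine-tuned checkpoint of the same architecture and compare the average absolute change per parameter. The tuned checkpoint names below are hypothetical placeholders, and the base model is just an example; the idea is to run it once against an SFT checkpoint and once against an RL checkpoint derived from the same base.

```python
# Minimal sketch: average parameter drift of a fine-tuned checkpoint from its
# base. Checkpoint names are placeholders; loading two 7B models in float32
# needs a lot of RAM, so use a smaller base to experiment.
import torch
from transformers import AutoModelForCausalLM

def mean_abs_drift(base_name: str, tuned_name: str) -> float:
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.float32)
    total, count = 0.0, 0
    for (_, p_base), (_, p_tuned) in zip(base.named_parameters(), tuned.named_parameters()):
        total += (p_tuned.detach() - p_base.detach()).abs().sum().item()
        count += p_base.numel()
    return total / count

base = "Qwen/Qwen2.5-7B"  # example base checkpoint
print("SFT drift:", mean_abs_drift(base, "my-org/qwen-sft"))   # hypothetical SFT checkpoint
print("RL  drift:", mean_abs_drift(base, "my-org/qwen-grpo"))  # hypothetical RL checkpoint
```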

[10:07] Nathan Labenz: I'd like to understand this grooves thing better, and I mean, I do-- know what you're gesturing at. When I think about like how much the weights change with fine tuning, I usually think of that as kind of more a function of like some sort of divergence penalty, some sort of tethering of the, you know, the model as it's evolving to the base, to the starting point. I think you can do that on any kind of fine tuning, right? So how is it that if I have kind of a similar divergence penalty term to my loss function, why does the supervised, why is it more destructive than the reinforcement learning?

[10:48] Kyle Corbitt: Yeah, no, that's a totally fair question. So yeah, I think what you're talking about is that there's a term called a KL divergence penalty, which is a sort of auxiliary term you can add to any loss function saying, hey, prevent the-- it doesn't actually prevent the model weights from drifting. What it prevents is specifically the log probs that are generated at each token position from drifting too far from the base model. And this is often considered best practice because, you know, it can help keep you from getting catastrophic forgetting and moving too far away. However, what it's not doing-- the fundamental issue is that, let me put it this way, there are often different ways to get to the right answer, right? And the easiest example here is if you're talking about a reasoning trace where it's like, hey, you're doing a math problem and you're training this model with RL to solve the math problem. And there's probably an infinite number of ways you could reason through from a problem description to the answer. And some of them are going to be paths that the model already is comfortable with and is like, oh, these eight tokens in a row, even the model you're starting from would have generated them anyway. And then the next token, yeah, maybe it would have gotten that one wrong. And so there, the learning signal is teaching you to move that one slightly. But fundamentally, what RL just structurally optimizes for is changing the fewest tokens, the fewest log probs, necessary to get to that right answer. Whereas what SFT does if you're doing SFT on, you know, say, distilling a larger reasoning model into a smaller reasoning model-- and this is particularly true if the smaller model you're working on had a different pre-training distribution, so you would expect that its kind of built-in intuitions or inclinations are different-- is you're not respecting those pieces of the reasoning that it would've gotten right anyway. You're overwriting the whole thing with the reasoning from the larger model. And by overwriting the entire thing, this is quite confusing potentially for the backprop algorithm, because the backprop is just seeing, Oh, all of these tokens need to change, and maybe some of them didn't actually need to change. Maybe the direction the model would've gone with this token actually was also fine, but you're changing all the weights to get to this new one, and really there was this other token that was much more important that did in fact need to change to get to the right answer, but that one's just mixed in with all these other random unrelated changes. That general intuition generalizes to other task shapes as well, including creative writing, where maybe there's two different ways to phrase this and they're both fine, and the model would have chosen one, and your creative team chose another, and they're both okay, and you don't really want to waste your model updates, because every time you update the weights, there's a potential for catastrophic forgetting, and sort of just off-target effects in general. And so you don't want to waste those model updates on changing something that was already fine. You want to really direct them to upweighting the things that the model wouldn't have gotten right on its own, or very rarely, more specifically, would have gotten right on its own, and focus your updating budget on those. So the KL divergence doesn't give you that. 
If what you're doing is just penalizing KL divergence, it doesn't distinguish between things the model is already doing fine-- you just, you know, happen to have a different way of doing it in your training data-- versus things that the model really was getting wrong.
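For concreteness, here is a minimal sketch of the per-token KL-style penalty being described, in the estimator form popularized by GRPO-family trainers. The tensor names and the 0.04 coefficient are illustrative, not any particular library's API.

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor,
               ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate between the current policy and the frozen reference
    (base) model, computed from the log-probs each assigns to the tokens that
    were actually generated. Shapes are (batch, seq_len)."""
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0  # >= 0, zero when identical

# Added to the loss with a small coefficient, this discourages the sampled
# token distribution from drifting away from the base model. But note the
# point above: it penalizes all drift equally, whether or not that drift was
# needed to fix an actual mistake.
# loss = policy_loss + 0.04 * kl_penalty(policy_logprobs, ref_logprobs).mean()
```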

[14:17] Nathan Labenz: Okay. Very interesting. When you describe, you said, more specifically, not something that the model can't get right, but that it rarely gets right. That's key, because when we do things like GRPO, you've got to have at least one right answer, right, to have any sort of advantage. I guess it also depends on whether you're doing binary scoring or some, you know, more rubric-based evaluation. But I guess several different questions coming to mind at once. Can you give me a little bit more intuition, and maybe we could do this for GRPO, and you can maybe describe-- I'm not sure if GRPO is still the hotness that it was a year and change ago. I'm not entirely sure if that was something that broke out for kind of memetic social media reasons, or if it really was a huge advance over its immediate predecessors. But can you give me a little bit more intuition for, okay, I understand that in this algorithm, we are running multiple rollouts. Some of them are going to get to a right answer, or if it's a rubric score, they're going to get a higher score than others. And then there's a computation that creates the group relative advantage, which is to say, you know, we want to shift toward the patterns that gave us the right answer or the higher scoring answer. How is it, though, that that happens? Because it still ultimately goes to a token-by-token thing, right? So how is it that if I have eight different chains of thought and they're all kind of different, and in any given token position we might even have very different parts of speech, right? At token position N, it could be a preposition here and a verb there and whatever. We're in very kind of different moments in the chain of thought. But my understanding is that the advantage calculation does still ultimately cash out to token-level advantage. So I'm a little bit lost on the alchemy of why this translates in the end to really only updating, you know, making changes on those tokens that really mattered. I'm missing a little logic there.

[14:25] Sequence: Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code Cognizant in the source field to save 20% off year one

[15:32] AvePoint: AvePoint is building the control layer for AI agents so you can securely govern, audit, and recover every action at scale. Design trusted agentic outcomes from day one at https://avpt.co/tcr

Main Episode

[18:53] Kyle Corbitt: Let me take several parts of this question, and I will finish on the one you were getting at there at the end, and then hopefully that'll give you the chance to ask follow-ups if my explanation doesn't make sense. Okay, so first of all, yeah, I think the reason GRPO specifically, that algorithm and that acronym, very concretely took off was not necessarily because it was a big quantum leap on what came before. It was because DeepSeek did a lot of engineering work around actually scaling it and released an actual artifact, a model that worked really well with it. That was kind of the reason why. There was a whole constellation of other algorithms that probably would have worked just about as well. There was one that came out a little bit before called RLOO, which basically is the same as GRPO and likely would have worked just as well if you'd scaled it. After GRPO, very shortly after, certainly within a few months of the R1 release, there were numerous improvements made upon it, which probably do deserve to be treated as their own algorithms. So there was a paper called DAPO, there's GSPO, which came out from the Qwen lab, I believe, and then CISPO was another one that came out shortly after that. They're all significant improvements. And then there's a bunch of minor tweaks that don't even have names. So I would say, yeah, the algorithm that people use today in practice is actually probably further away from GRPO as initially described than GRPO was from what came before it. But we all just still call it GRPO because that was kind of the name that stuck. Okay, so moving on to how it actually works. I'll talk through it-- I think this will be helpful to build your intuition on, you know, how the advantages are calculated and everything. So maybe I'll talk first about what came before GRPO, because GRPO is kind of interesting in that a big part of its development was that it threw away something that everyone had used before and some people still use. So the spiritual grandfather of all RL that people do on LLMs is an algorithm called PPO that was developed by John Schulman in 2017, I believe-- actually pre-LLMs, or pre-LLMs being big-- and it was used for games and stuff. And the key thing about PPO is you have your policy, which is what you call the model you're training; it's taking a bunch of actions, and the key thing you need to do is, every time it takes an action, you have to kind of score how good or bad this action is. If it's a good action, then you basically want to update your weights to make it more likely to do that action. And if it's a bad action, you want to update your weights to make it do it less, right? And also importantly, this is something that happens on an action-by-action basis. But your reward in PPO can be very long-term. So it could be that at the end of a very long sequence of actions, you finally find out. Commonly this was used with games, and so you would say, Hey, at the end of the game or after a minute of gameplay, what's my score? Or something like that. So what PPO does is a few different things, and it's actually, of course, building on older work as well; there's an algorithm called REINFORCE, which is trying to solve the same problem. PPO adds some extra terms to keep it stable, keep it in sort of a trust region where you're kind of hopeful that the model hasn't changed too much as you're updating it. 
But the key thing that PPO does-- and actually this is not unique to PPO, this is older than PPO-- is you want to calculate the advantage of every single action. So every single time it takes an action, you want to say, Hey, was this a good or bad action? And the way it does that is by actually training a couple of different models in parallel. So you have the policy model, which is just your normal model that's generating the actions. And then you have a separate model, which is called the value model or the critic model. And the value model is actually predicting, saying, Hey, based on the set of actions up to this point, what do I think the score is going to be in the long term? Basically, it's predicting for this action, what do I believe is the value of this action? What impact will this action have on the score in the long term? And it's predicting that for every single action in the sequence. And then eventually you do get to see what the actual score is. And then basically, if the score ends up much higher than you expected, then you can say, oh, some of these actions clearly were much more valuable than we expected. So if it's like, hey, my critic model thought it would have a low score and it actually has a high score, then I want to make it much more likely that this action happens in the future. Okay, now moving on to GRPO, the sort of key difference here is instead of figuring out what the value of any specific action is-- oh, actually, before I go into GRPO, I should mention this all translates directly into LLMs. And the translation that people do-- people have actually tried a lot of different translations, but the one that most people do, and it's kind of the simplest thing that works, is every single token generated is an action, right? So we're using the exact same concepts as we were using before and just saying, Hey, for every token, the state up to that point is the full context, and then this token is an action and the next token is another action. What we do with GRPO is-- it turns out that calculating, figuring out that value model and keeping it up to date is painful.

[23:46] Kyle Corbitt: It's just tricky to get right. It's like another set of hyperparameters you have to tune, and, okay, we have to keep this model updated or else training doesn't work well. What GRPO did-- and they actually were not the first ones to do this, but they get the credit; they're the first ones to do it at scale and prove it worked well-- is they said, Hey, we're just going to completely throw away the value model. And what we're going to do is-- so the way we're going to figure out whether a given action, basically a trajectory of actions, is better or worse than what the model would have done otherwise is we're just going to run a bunch of them in parallel. So with the exact same setup, the same initial conditions, we're going to run whatever-- four or eight or 512, there are lots of different hyperparameters in between here as well-- different runs in parallel. And we're going to see how often the model succeeds and how often it fails. And the reason we want to do this is because, let's say we've thrown away the critic model and we do a single run through with GRPO and we get a score and the score is one. Hey, it got it right. You don't know from that run whether the model just would always get this right or this is one in a million times that it got it right. If you're naively updating your model because it got it right, but it would've always got it right, it's a spurious correlation, where it's just like, Hey, it made some random choices, the choices didn't affect the score at all, because, you know, it just always would get it right. And if you're upweighting those random choices it made, then you're just kind of moving around in a pretty random direction. So what GRPO lets you do is it says, okay, the sort of advantage that we allocate to each of these tokens is going to be based on how much better this run did than the average. Really what you want to compare this to is the average: if we ran the current model infinite times on this, how much better did this do than that average? Obviously, we're not going to run infinite times, so we approximate that by doing it n times. Then getting to the end of your question, which is-- you're right. When we're actually updating the model weights, we are doing this on a token-by-token basis. Somehow we have to say, for every single token, we want to update the weights such that this token is more likely if the advantage is positive, or less likely if the advantage is negative. And this is a big problem in reinforcement learning. It's called the credit assignment problem, right? Because really what you want to do is assign credit to and upweight just the key tokens that were critical to this going right, and not upweight the tokens that, you know, always would have been right and didn't really contribute anything to the solution. And so, I guess the key insight of GRPO is to do a very unsatisfying thing and kind of just punt on that a little bit. It's not a full punt. So what you do is you look at how likely every token was to be produced, right? Because you're sampling at a high temperature when you're doing these. And so some tokens it produces are very common, some tokens are not very common. And basically what you do is you say, hey, if my run got a high score, then I want to give more credit to the tokens that, just by random chance, were less common. 
Because if the high score is much higher than the average score across the entire group, then I assume that it probably was because there was some rare thing that I did in this run that I didn't do in other cases, and that rare thing led to me doing well. And the exact same thing in the other direction: if I get a much lower score than the average of the group, then the rare things are the things that I'm going to penalize the most, because I'm like, Hey, that's probably what put me there. Now, you could ask the question, there could be many rare tokens. If you've got tens of thousands of tokens in your reasoning trace, how do you decide which rare token is most important? And you don't. You just throw up your hands and you say, all the rare tokens get upweighted the same way. This is, like I said, a very unsatisfying answer. And I think that's one of the reasons why there was an almost 10-year gap between PPO, which had this value model that tried to determine it on a token basis, and GRPO, where it's like, Hey, we're just going to throw that all away, because it feels wrong. It feels like it shouldn't work. In practice, it does, though.
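Here is a minimal sketch of the two advantage computations just described, side by side: the PPO-style critic prediction and the GRPO-style group-relative score, plus a loss that applies one shared advantage to every token of a rollout. Function and variable names are illustrative, and real trainers layer PPO-style ratio clipping and a KL term on top of this skeleton.

```python
import torch

def ppo_advantages(returns: torch.Tensor, value_preds: torch.Tensor) -> torch.Tensor:
    """PPO-style: a separately trained value (critic) model predicts the eventual
    score at each token; the advantage is how much the actual outcome beat that
    prediction. Shapes are (batch, seq_len)."""
    return returns - value_preds

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style: no critic. rewards is (group_size,), one scalar score per
    rollout sampled from the same prompt. Each rollout's advantage is its score
    relative to the group, normalized by the group's spread."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_policy_loss(token_logprobs: list, rewards: torch.Tensor) -> torch.Tensor:
    """token_logprobs: one (seq_len_i,) tensor of sampled-token log-probs per
    rollout. Every token in a rollout shares that rollout's advantage -- the
    'punt' on per-token credit assignment. Tokens that were sampled with low
    probability still end up moving the most, because the gradient on a chosen
    token's logit scales with (1 - p)."""
    adv = grpo_advantages(rewards)
    per_rollout = [-(a * lp).mean() for a, lp in zip(adv, token_logprobs)]
    return torch.stack(per_rollout).mean()
```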

[28:40] Nathan Labenz: And is the intuition there kind of like-- I studied this a little bit, but not in enough depth to be confident, but I'm sort of imagining that as we go through a chain of thought, there are critical tokens where you're either taking the right path or the wrong path, and then there are probably a bunch of tokens that kind of follow once you've made that critical decision, that are all kind of naturally going to follow because that's just the structure of language. So in singling out the ones that the model was least confident about, you're trying to zoom in or isolate or focus on-- if not isolate, emphasize is maybe the right word-- the critical decision points in that trace.

[29:28] Kyle Corbitt: Yes, that's exactly right.
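One way to see why the "least confident" tokens end up emphasized: in the policy-gradient update, the gradient on a sampled token's logit scales with one minus its probability, so rare tokens move the most when their rollout's advantage is positive. A tiny self-contained demonstration (the logit values are arbitrary):

```python
import torch

# The gradient of -log p(token) with respect to that token's logit has
# magnitude (1 - p): the less likely the sampled token was, the bigger the
# update it receives when its rollout earns a positive advantage.
logits = torch.tensor([3.0, 1.0, 0.0], requires_grad=True)
probs = torch.softmax(logits, dim=-1)
rare_token = 2                            # the lowest-probability option
loss = -torch.log_softmax(logits, dim=-1)[rare_token]
loss.backward()
print(probs[rare_token].item())           # small probability
print(logits.grad[rare_token].item())     # equals p - 1: large magnitude when p is small
```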

[29:29] Nathan Labenz: Yeah. Okay. Interesting. DPO basically is a similar thing too, right? But that was, you had to have pairs that you kind of said, I like this one better than the other one, as opposed to a ground truth or a scoring, but similar mechanism, right?

[29:44] Kyle Corbitt: Yes, yes. There's definitely a lot of overlap in the math and the intuition there.

[29:50] Nathan Labenz: Yeah, okay. Is it worth getting into what some of the finer points have been since? GRPO that has made it even better, not in a maybe super mathy way, but, you know, like what additional insights have people brought to bear since then?

[30:06] Kyle Corbitt: Yeah, I mean, yeah, we can talk about it briefly. Yeah, it's a bunch of small things. Like one open question was sort of, hey, how do we do length normalization? And this comes in if you have a trace that happens to be-- so the original math in GRPO actually structurally advantaged very long thinking traces, and, you know, long generations in general, just because it didn't normalize by the number of tokens. So basically, if you had a batch of, you know, whatever, like 128 different completions, and one of the traces happened to be five times as long as the others, it ended up with like five times the amount of weight in the way the model was updated compared to the others. And so people had pretty good success with basically down-weighting that to average it out. You know, CISPO's a really cool one. It basically just changes the way you're doing the clipping. PPO-- and GRPO inherits this-- has a specific way of making sure that the weights don't stray too far in any one round of updates. And there was this new technique called CISPO that was released maybe six months later or something that basically puts the clipping in a different spot. The idea there is it lets the model discover much more quickly those very, very high-value but rare tokens. And so it sort of allows those to update the weights much more if there's a very high score, while not allowing the weights to update too much overall. So yeah, and then there's a stack of probably, I don't know, half a dozen little tricks like that that people have developed to make the algorithm both more stable and converge faster.
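To make the length-weighting point concrete, here's a minimal sketch of the knob in question: whether each rollout's per-token losses are summed as-is or divided by that rollout's own length before averaging across the group. Names are illustrative.

```python
import torch

def rollout_loss(token_losses: torch.Tensor, normalize_by_length: bool) -> torch.Tensor:
    """token_losses: (seq_len,) per-token policy-gradient losses for one rollout.
    Summed as-is, a rollout five times longer contributes roughly five times as
    much to the batch update; dividing by its own length evens that out, which
    is the kind of down-weighting described above."""
    if normalize_by_length:
        return token_losses.sum() / token_losses.numel()
    return token_losses.sum()
```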

[31:48] Nathan Labenz: Amazing. That's been a great trip down the rabbit hole. Popping out now again and trying to think about what it all means. Obviously, the huge thing about reinforcement learning that we've seen time and time again, but, you know, it's really happening now-- I'm just thinking about this latest Erdos problem that's been solved in the last 24 hours, or at least reported, where Rune just said something like, This is the first time that everybody in the math community is super impressed. And the key point that I'm getting at here is reinforcement learning has the ability to take a model past what available training data has on offer to teach it, right? So this is where we get superhuman performance. Now, how does that happen? I mean, you had kind of talked about the grooves, and by focusing in on these key decision points rather than just mashing every token, you're kind of playing to the model's established strengths. But clearly there's also something happening where, at scale, the reinforcement learning is teaching qualitatively new capabilities to the model. So how should I think about that? In other words-- well, I mean, you could argue with me if you think this is wrong, but I take it that everybody kind of has come to accept that this is where the superhuman performance comes from. But I don't have a great intuition for where we're making that move from playing to the model's strengths, staying in the groove, you know, focusing on what matters and reinforcing what it already knows or has at least some instinct for, into this qualitatively new regime where now we're solving open math problems.

[31:54] VCX: VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

[33:08] Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

Main Episode

[36:52] Kyle Corbitt: Yeah, it's a great question. And one caveat I would give here is that, unfortunately, reinforcement learning for LLMs has definitely matured in the era where nobody's publishing anything, you know, except for some Chinese labs to some extent. So I think we have very little insight into the specific techniques that, say, an OpenAI or Anthropic or Google are using to train these models. So that's the first caveat: this is definitely speculation. What I would say after that is, this is a common confusion or dichotomy people have about RL, where it's like, oh, is RL teaching new things? Or is it just surfacing things that were already latent in the model's distribution? And the answer is, from a very pedantic, technical sense, yes, it is only eliciting things that already existed in the distribution. However, the distribution of tokens that a model can produce is literally the set of all possible tokens. In the same sense that the distribution of works that a million monkeys on typewriters can produce includes Shakespeare, right? So everything is already in distribution. Definitionally, at any given position there is a chance that the model can produce, with however small a probability, a given next token. So the whole game, of course-- to avoid the situation where you're just waiting for your million monkeys to type out Shakespeare-- is you're trying to get your initial distribution as strong as possible, so that it requires less random guessing and fewer random rollouts in order to find those new and useful behaviors. Which is why pre-training is still super important even in the sort of RL regime we're in right now, because you want to start from a place where, you know, the right patterns have a greater-than-negligible chance of showing up. That said, I think you probably can get to superhuman performance on a composite task, like, you know, a very complex math proof, even without reaching a place where it's like no human could possibly have understood this or generated this, right? I think that one thing the models are very, very good at is going out on these long expeditions and fishing trips, where they're going very, very deep down a specific rabbit hole, and maybe they'll take that rabbit hole further than any human would, because we'll lose the... I think we are at a point with a lot of these frontier models now where their working memory is larger than any human's working memory. And so they can explore these rabbit holes longer than a human mind could. And so even if every individual step is something that does seem plausible to a human-- if a human had all of that context up until that point-- it's just very hard for a human to, in practice, hold all that context in their head. So that's one place we could get to superhuman performance. But yeah, I mean, I think in general, you can get to superhuman performance even without that, just because you could randomly discover or randomly surface a token that does something clever that no human would have done.

[39:51] Nathan Labenz: How do you relate this to what I think of as metacognitive behaviors. In the original R1 paper, there was this aha moment that they published. And I usually present this in my AI scouting reports as kind of a, you know, two parts from that paper I put together. One is the, which you're saying also is to some extent an artifact, that the length of the chain of thought just naturally grows throughout the training process. I have mostly interpreted that to date as the model is learning that it's valuable to think longer and it's getting right answers more often when it's thinking longer. And so thinking longer itself is being reinforced. But I'm also hearing you that like, at least in that original one, there may be.

[40:36] Kyle Corbitt: I mean, to be clear, both things can be true.

[40:37] Nathan Labenz: It was rewarded.

[40:40] Kyle Corbitt: Yeah, both things can definitely be true. Yeah.

[40:44] Nathan Labenz: So my other side by side there is the aha moment, where the model is solving some math problem and it realizes that the way it had been doing it was flawed, but now it recognizes there's another way, and it kind of takes a step back and comes at it from a different direction. And clearly we are seeing in frontier models a lot more of this sort of persistent, resilient, try, try again problem solving, which, again, it's like somewhere, you know, deep in the long tail of the internet, somebody's written out how to do that. So it's a little bit in the pre-training. There's supervised fine-tuning, at least sometimes, in these recipes as well, where you could potentially try to seed the kind of metacognitive strategies that you want. And then it seems like reinforcement learning is doing a lot to bring that forward as well. How do you think about what's really driving that? And are we seeing, and should we expect to see, sort of alien problem solving-- alien reasoning approaches that are kind of not inspired by humans-- emerging through RL over time?

[42:03] Kyle Corbitt: Yeah, I think that's an interesting question. You know, I, I personally don't really feel like the so-called aha moment or, you know, I think, you know, wait is one that shows up all the time, right? Where the models will say wait and that's sort of like a code to say, hey, let's explore another direction. I'm not sure, that doesn't feel alien to me if I'm sort of introspecting my own chain of thought or, you know, just like a conversation with someone, like that behavior doesn't feel weird, it feels very natural, and obviously reinforcement learning is bringing it out, because it is also true that that's the kind of behavior that, in retrospect, it makes sense both that like, oh yeah, that makes sense, but it also makes sense like, oh, this would not naturally come up in the pre-training data all that often, because usually if you're writing something on the internet and you have a new idea, you're not gonna chain of thought put out, oh wait, I have this other idea, you're going to condense it and just put your final thinking there. But I'm sure it comes up sometimes, where you're in a chat history or whatever, anyway. So I don't think that's surprising to me. I think there's a separate, so I would say short answer, I have not seen strong evidence yet where it's like, oh, they're thinking in ways that are totally foreign, totally alien, hard for us to introspect, or to follow as a human. Now, there's a separate question, like, will we see more of that? I think in the limit, it seems very likely that the sort of ideal form of cognition for these artifacts and and just, you know, the ideal form of cognition generally likely looks would be something that looks very alien to a human. And so as we put more effort into RL and perhaps come up with better techniques to explore more, you know, on that sort of like explore exploit spectrum, then it would not surprise me if we do start seeing more of that. But I haven't seen it yet.

[43:55] Nathan Labenz: Yeah. I mean, it's a bit of a different dimension on which it might arise, but just in terms of an intuition of what that might look like: the Coconut paper out of Meta, maybe a year ago or something, where it was basically thinking in latent space. So instead of cashing a forward pass out to a token-- I forget exactly what the decision mechanism was for when it would pass its last internal state back to the next position as an embedding versus when it would actually cash out a token. There was some decider mechanism there somewhere, but at least for a while it could and would just loop on its own internal states rather than emitting and appending a token. And they found that it was much better at graph search type problems that benefited from the ability to parallelize. It seemed like it was able to effectively go down multiple paths in parallel, in latent space together, because it was able to chew on these things rather than having to spit out one token. I get a little scared of those kinds of innovations, honestly, 'cause I kind of wanna know what my AIs are thinking, and that doesn't really lend itself to that. The other one that comes to mind-- and I've been quite confused about this too, you might be able to shed some light on it-- Apollo Research, when they did, I think it was o3, maybe it was o1, testing, got access to the chain of thought, and they reported that the chain of thought was starting to look kind of bizarre. You remember the disclaim, disclaim, vantage, you know, that weird sort of internal... I kind of was thinking of it as a dialect, and I had kind of assumed that there was maybe sort of a chain of thought length penalty. Like if the original GRPO was accidentally rewarding long chains of thought, it would also stand to reason-- like, compute is scarce, we want to keep these chains of thought as tight as possible-- but then maybe you overdo that. And now you're just starting to see weird dialects emerge. How much have I gone off the rails in telling myself that story?

[46:06] Kyle Corbitt: You know, I think it's an interesting question. At some level, I think we need to treat this as an empirical question of what we see actually working. I think it's interesting-- the idea of a model that could self-correct or reason was not something that was invented with OpenAI and Strawberry or o1 or whatever. There was a lot of research in that direction before. There was a lot of work on text diffusion models, where the intuition was they would go through this reasoning in a latent space. There was also research-- I know there's research on prompt compression, and perhaps also reasoning models-- that was still using autoregressive tokens, but instead of constraining them to specific tokens in the vocabulary, it would give you sort of the full embedding space, basically like the model could use tokens or words that don't correspond to a specific token embedding. It could dynamically use different embedding shapes that don't correspond to words. And there was even a lot of speculation after o1-preview came out that there was something in that direction that OpenAI had worked on-- you know, before the rumors really spread on how it actually worked. I think it's interesting that in practice, as far as I'm aware, with the OpenAI reasoning models and the similar reasoning models from Anthropic and Google, and certainly all of the open source reasoning models that work well at all, those approaches have not been taken, and it's pretty much just the very simple, very dumb thing: it's going to be doing chain-of-thought reasoning in the normal token space and mostly using human language. So I think probably what that tells us is that they are getting a lot of value out of the pre-training and staying relatively close to those patterns, you know, relative to how far they could go. Obviously, yeah, as we see more evidence of the kind you're talking about, where we're looking at actual reasoning traces from frontier models and they're diverging more from something that is easy for a human to interpret, then yeah, I think that would be quite convincing evidence for me that it stops looking like that. But so far, at least, it seems like, if anything, we've been moving in the opposite direction, where people assumed there'd be much more reasoning in the latent space and neuralese and everything, and for whatever reason, that hasn't been as productive an approach.

[48:29] Nathan Labenz: Yeah, I kind of maybe over-updated on that one Apollo report, 'cause it was kind of alarming to me to see the vantage, vantage, disclaim dialect that I couldn't make a lot of sense of. But reports since then have been much more reassuring. They're like, no, we don't like to show it 'cause of competitive reasons and so on, but the chain of thought is still pretty readable-- that has been the pretty consistent report. What do you think all this means for the future of competition? Right, we've of course had the distillation attack report from Anthropic. It's, I think, generally understood that, especially internationally, Chinese companies in particular are trying to take certain shortcuts by getting outputs from whatever frontier models they can get outputs from, and then training on those. I guess for one thing, can they use that? Is there a way to turn those outputs into a reinforcement learning approach? Because you might think naively that they would just be doing supervised fine-tuning on that. But as I've heard some of your analysis here, I'm thinking, well, actually, maybe not. Maybe they're actually using those targets as some sort of way to evaluate, and then still running a more reinforcement learning-based algorithm with Claude's answers as the standard that it's going to be judged against, you know, rubric-wise or something. What do you think that is actually looking like? And how much of frontier performance can distillation actually recover?

[50:10] Kyle Corbitt: Yeah, that's a good question. And I guess, again, an empirical one. So a few different thoughts. One is, the most natural way in my mind to use frontier models to bootstrap your own near-frontier models with reinforcement learning, in general, is to use the frontier models as judges. They're very good at that, and that sidesteps the issue that you can't actually get and train on the chain of thought traces directly. So if you just have a standard setup-- hey, we're going to use a frontier model as our rubric, and we'll have our model do generations that get judged-- that's a very productive way. And in the blog post that Anthropic made about the distillation attacks, as they called them, from Chinese models, they specifically called out-- I mean, they didn't say the breakdown of what all these were being used for, but they did say that one of the uses they included in their general bucket was using their model as an LLM judge for other outputs. So that's one way where, yes, I think very clearly you can use the existence of a high-quality frontier model to improve your own. And I think the nice thing about that approach as well is you get those benefits of, hey, you're staying in your own distribution, because you're just using it as a judge, you're not doing SFT. But also, in general with RL, you can train the model under training to be better than the teacher model that way. So it is a path to getting frontier level or pushing the frontier, even if you aren't starting from a frontier model. This is true in our own experiments. This is clearly true for the frontier labs, because we see OpenAI and others as well using their N-minus-one generation model as a judge when they're in the process of training the next version of models. So that's the most natural way. As far as using distillation directly, you know, SFT-style-- yeah, I'm sure that does happen. I would imagine that happens at a relatively low volume and fairly early in the process, before you do RL, and my guess is that it's not that valuable. It's a shortcut that lets you use less compute, but not orders of magnitude less compute relative to just doing RL. And particularly as we see frontier models start to shut down their APIs more, which I think is the more interesting direction to investigate or explore. We're already seeing, of course, starting with the reasoning models, that we're not seeing all the tokens that are produced anymore. There are certain models where they're not letting you see all the log probs; they're certainly not letting you see the prompt log probs. Certain models-- like, for weeks, OpenAI would only let you use their models through Codex-- and I expect we'll see more of that over time, not less. I expect we'll see much more locking down of models to specific use cases, specific product surfaces, for multiple reasons, but a big one being because it makes distillation harder, especially distillation in out-of-domain areas that aren't within that product surface.
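A minimal sketch of that judge pattern: ask a strong model to score each rollout against a rubric, and use the score as the reward in a GRPO-style update. The client call, model name, and rubric here are illustrative (shown with the OpenAI Python SDK); any capable judge model and scoring format would do.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the response from 0 to 10 for factual accuracy,
completeness, and clarity. Reply with only the number."""

def judge_reward(prompt: str, response: str, judge_model: str = "gpt-4o") -> float:
    """Ask a frontier model to grade one rollout against the rubric.
    In an RL loop, this score becomes the reward for that rollout;
    the judge's own text is never trained on directly."""
    result = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{prompt}\n\nResponse:\n{response}"},
        ],
        temperature=0,
    )
    try:
        return float(result.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable judgment: treat as failure

# rewards = torch.tensor([judge_reward(prompt, r) for r in group_of_rollouts])
# ...then plug into the group-relative advantage computation sketched earlier.
```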

[53:17] Nathan Labenz: So I guess translating that to expectations-- one story you could tell, which I've kind of been telling myself recently, is, why are the Chinese models spikier or more apparently benchmaxed or whatever? I had been kind of thinking, well, they're probably doing a lot of supervised fine-tuning on frontier model outputs, and therefore they're maybe not developing some of these more persistent, problem-solving, metacognitive behaviors that really allow the model to generalize robustly out of domain, right? Like, I might not care so much about that exact question. What I really care about is, in the chain of thought, how good is it at breaking down and, you know, coming at problems from lots of different directions. But your account so far has gone the other way-- or I'm not sure if it's the other way, but now I'm not quite sure. That story doesn't ring so true anymore if you're saying they're probably not doing that much supervised fine-tuning, and it's relatively early in the process, and it's a compute saver, sure, but it's not a huge difference maker. So what is the difference? Are they just not so good at RL, or do they just not have so much compute? Why are the Chinese companies not able to match the American frontier companies right now?

[54:39] Kyle Corbitt: Yeah, so I think there's two questions there. I mean, I think the first one, or I guess the second one you said: why can't they match? The high-order constraint seems very likely to be compute, where they just can't put as much compute into each training run as the closed source leaders in the US. Now, they are putting similar, or actually in many cases more, compute into it than open source models in the US, which is why they have the open source frontier. But yeah, I think that's sort of the high-order bit on that. The reason why they feel more benchmaxed-- I don't know, this is speculation, but I actually don't think it's related to how much RL or distillation they're doing. I think it's kind of a much simpler, more business analysis, which is: if you're a new lab that has relatively low name recognition and you don't have a ton of usage right now, the incentives are far higher in relative terms to benchmax, right? Because no one's even going to try your model unless you come out with very impressive benchmarks. You don't have a built-in constituency for it. Whereas if you are Anthropic or Google or OpenAI, yeah, sure, it looks good to have high benchmarks, but you already have millions or hundreds of millions of users, and those people are going to feel the difference, and they're going to tell their friends about it. They're going to be using your new model anyway, so there's less incentive. You don't have to look best on benchmarks if you can trust, Hey, we're going to have a bunch of people using this anyway, and they're going to feel that it's just better overall, and it'll spread through that word of mouth.

[56:05] Nathan Labenz: You're destroying my galaxy brain takes one after another. I love it. Yeah, I mean, that makes sense. I guess the other angle too is, generally speaking, they don't have big inference businesses, so they're kind of also missing the actual customer feedback that the American.

[56:24] Kyle Corbitt: Yeah, that's also, I think, likely a major factor.

[56:28] Nathan Labenz: But that would mean that if a few things changed... You know, obviously everybody's wondering, are we heading into recursive self-improvement? And if so, what's it going to mean? I've seen a bunch of papers, probably from the 18-to-36-months-ago vintage, GPT-4-class models basically trying to do recursive self-improvement. And it seemed like, generally speaking, they would get better for three to five rounds and then level off. And yet there's at least some expectation, among people who've been right about a lot of things, that this could go the other way in the not-too-distant future, if models become smart enough to recursively self-improve in multiple ways: not just critiquing their own outputs, but also finding better architectures for themselves, and, you know, a lot of different dimensions in which they might self-improve. If that happens... I also remember the leaked Anthropic pitch deck from a few years ago, where they basically said, we think the people in the '26 timeframe that train the best models might create such a big advantage that nobody will ever catch up. Again, I've kind of filled in the gaps on that story for myself by thinking, well, maybe it's these metacognitive behaviors, this sort of deeper understanding, problem-solving ability, what have you. But you're kind of saying, nah, it's probably mostly compute and incentives and lack of an inference business, which itself is very much related to compute. So I guess, bottom line, it sounds like for you, if compute constraints were relaxed, you would expect Chinese companies to be able to catch up, and you wouldn't expect some sort of runaway dynamic to take hold where that becomes impossible.

[58:34] Kyle Corbitt: Oh, I think that catching up right now is mostly compute-gated. It's also capital-gated, in the sense that buying the necessary compute already requires billions of dollars and will require tens or hundreds of billions of dollars soon. So I think there's an open question of how healthy the Chinese capital markets are, and whether these companies can make the case that they'll get to keep their business if it goes really well, which I think has been a question with prior generations of Chinese tech companies and might just be hard for them to overcome. That's one thing. But I don't think any of that means that recursive self-improvement won't matter or doesn't matter. My belief is that it probably does, and my belief is that we probably will reach it with the current generation or the next generation of models. Because we already are in a self-improvement loop. That's what you have to remember: these models keep getting better because we keep running more experiments and then figuring out, okay, what are the bottlenecks, let's solve those bottlenecks. And those happen at all levels. It happens at the hardware level, figuring out what's the most efficient way, at the algorithmic level, at the data level. These are all in self-improvement loops already. And there are multiple constraints, but one of the big constraints is just human intelligence, right? Are the people making those allocation decisions smart enough to make the right bets on what bottleneck to tackle next, or what investments to make? And you can totally imagine that if you were to staff OpenAI with a minimum bar of, you're not allowed to be hired here unless you have an IQ of 180, they would be able to solve those bottlenecks a lot faster, if they could wave a magic wand and get enough people like that. I don't know. I just feel like the bar for recursive self-improvement to take off is actually relatively low. You just have to be better than the smartest human, which is not that smart.

[1:00:26] Nathan Labenz: It's a wild time to be alive, that's for sure. And it does seem increasingly plausible that that could happen in the not-too-distant future. I don't know if you have anything more to say about recursive self-improvement; I was going to move next to the cottage industry of RL environment creation. I think this is kind of a, I don't know, people know it's out there, but it's kind of a dark-matter sort of thing, where because there are so few customers, it's not like these companies have much incentive to talk super broadly about what they're doing. On the contrary, they probably have the opposite incentive, right? They know all the customers they could possibly sell to, and telling the world more broadly what they're selling is just inviting competition they don't want to have. So it seems like the rest of us who aren't directly involved in the making, selling, and buying of these environments are kind of left in the dark. What can you tell me from what you've seen about that seemingly rapidly growing niche? How big is it? Who's doing it? What do the environments look like? What makes a good environment? So on and so forth.

[1:01:34] Kyle Corbitt: Yeah, no, I can definitely speak to that. Yes, I have several friends who are founders of companies doing that, which is not saying much, 'cause it feels like half the companies started in the last six months are doing that. So yeah, I think it's an interesting industry. The general shape is: you come up with some task that seems like it might be economically valuable. Usually it's these companies proposing the tasks to the labs; it's usually not the labs coming out and saying, hey, we want an environment shaped like this, although that can happen as well. So you try to come up with some task, and the trick is you want to package it up as something that is sort of agent-shaped, where all of the dependencies can be enclosed. Ideally you want something that is snapshot-able; that's the gold standard, where at any point you can snapshot it and continue from that point. And then, of course, something that can be easily graded. Obviously the ideal is if you have a gold standard of what the grade should be, but a lot of these end up being things you just can't score in some absolute way. So in those cases, usually the company will say, hey, this is the rubric you have to grade against. I've also heard that sometimes labs will just ignore those rubrics and do their own rubrics internally, because they think they have better information on what good looks like. And so, yeah, these are things like lots of different web flows. So computer use, browser use, building copies, of course, of all the big apps. You're getting copies of Jira and GitHub and flight booking and office suites like Google Sheets, and you're trying to build environments that copy these. And then you're building that environment, so that's all the dependencies, like the database, which is usually SQLite or something because you want something ephemeral, and then you're also building the scorers. And then the way it's deployed varies a lot as well, even within a specific company or with a specific lab. Sometimes the labs require you to ship it all up in a container they can run on their infrastructure. Other labs are fine with you running it yourself, and they will just call your environment, run it, and you give them the scores back. The reason why this is cottage-industry shaped, I believe, is a few things. One is the labs actually do have at least a weak preference for having lots of different vendors, because if one person creates five different environments, they are likely going to make similar assumptions and similar shortcuts in how they do all of them, and so the signal the model gains from mastering all those environments is more correlated than you would like. The whole game here is that you want the broadest diversity of environments, so having different people working on them is better. Another reason why it's cottage-industry shaped is because this is extremely hard to hire for.
You know, it's sort of a piecework-style task, where you're building one environment, then you're building another, but the skill bar to doing this successfully is quite high. You have to put yourself in the... This is something we do internally for our customers all the time as we're building these environments at CoreWeave, which we then use to train models. And I have trouble hiring people who can do a good job on this. Candidly, it's a very upper-percentile engineer who's able to think through this in a way that actually gets it. And you don't even know if you got it wrong until way later in the process, when you've trained a model with it and it's like, oh, did the model learn real skills, or did it learn some hack on how to just get a high score? So there's a lot to keep in your head as you're doing this. And the people who are good at that are, by definition, smart and frontier-adjacent, and they might just start a competitor to do this instead of joining you as an employee. So it becomes very, very difficult to scale. And also, the environments themselves are not a super durable resource, in the sense that all these things get saturated fairly quickly. So you can't just keep reselling the same environment to the same lab. They're probably going to say, hey, for the next model, the model can already ace that environment, and you have to just keep creating new ones.
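
To make the environment shape Kyle is describing concrete, here is a toy sketch under those constraints: all state is in-process and ephemeral, it can be snapshotted and restored so rollouts can branch, and it exposes a grader. The interface and names are illustrative, not any lab's actual spec.

```python
# Toy "copy of a big app" environment: ephemeral state, snapshot/restore, and a grader.
# Interface names are illustrative assumptions, not a lab's real spec.
import copy
from dataclasses import dataclass, field

@dataclass
class TicketingEnv:
    tickets: dict = field(default_factory=dict)   # ticket_id -> status
    steps: list = field(default_factory=list)     # tool calls taken so far

    def reset(self, seed: dict) -> str:
        """Start a fresh episode from a seeded database."""
        self.tickets = dict(seed)
        self.steps = []
        return f"Open tickets: {self.tickets}"

    def snapshot(self) -> dict:
        # The "gold standard" property: freeze state so rollouts can branch from here.
        return copy.deepcopy({"tickets": self.tickets, "steps": self.steps})

    def restore(self, snap: dict) -> None:
        self.tickets = copy.deepcopy(snap["tickets"])
        self.steps = copy.deepcopy(snap["steps"])

    def step(self, tool_call: dict) -> str:
        """Apply one agent tool call, e.g. {"tool": "close_ticket", "id": 3}."""
        self.steps.append(tool_call)
        if tool_call.get("tool") == "close_ticket":
            self.tickets[tool_call["id"]] = "closed"
            return "ok"
        return "unknown tool"

    def score(self, goal_ids: list) -> float:
        """Grader: fraction of target tickets actually closed."""
        closed = sum(1 for i in goal_ids if self.tickets.get(i) == "closed")
        return closed / max(len(goal_ids), 1)
```

Because everything lives in-process, the whole environment can be packed into a container for a lab to run, or run locally with only the scores handed back, which matches the two deployment modes described above.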

[1:06:13] Nathan Labenz: Yeah, it's fascinating. This may be hard to summarize, and I don't know if anybody outside of the labs would have enough information to really characterize this, but is this a good business to be in? I can see it going either way. I would assume if you've got a good environment, all the labs want to buy it. But then at the same time, they're buying a ton of stuff; how much does your one random thing add to the whole mess of things they already have? And also, it's depreciating, as you said, right? So you've got to strike a deal before they saturate your thing and truly don't need it anymore. Would you say this is a hot, good place for up-and-comers to go, or would you steer people away from it?

[1:07:05] Kyle Corbitt: It's clearly a good business in the sense that these companies are scaling to tens or hundreds of millions of dollars in revenue in months. But your question is, would I steer someone into it, into founding one of these companies? I think it's working out quite well for them. I've been asked to invest as an angel in a number of these, which I have declined to do. I have a hard time seeing them as durable, long-term, venture-shaped businesses. I think they're potentially really, really good businesses for the founders if they don't take capital and just take the profits while they're good. At the same time, I'm kind of on the record as being very skeptical of the human data labeling business, which is sort of the prior version of this, and we have multiple decacorn-style exits, or at least valuations, in human data labeling. So I may just be miscalibrated on how durable the demand is for these things. But yeah, I guess my short answer is I have not invested in any of them.

[1:08:10] Nathan Labenz: Yeah, interesting. That makes a lot of sense. I mean, a lot of things are like that. I feel like in AI there are a lot of fleeting, maybe great cash grabs while they exist, but every next generation of the model puts a lot of those things, not necessarily out of business, but certainly makes them a lot less exciting than they used to be. On that data labeling point: I was recently listening to Dylan from SemiAnalysis talking to Dwarkesh, and there's kind of one world where compute is abundant, and especially, you know, it's going exponential, but maybe it'll be abundant enough, maybe it won't. If compute is there, then maybe we don't need much human data labeling anymore, because we can just RL the hell out of everything. Who needs to pay humans hundreds of dollars an hour when you can get millions of tokens for less? So that's one theory: we just won't need that much human data anymore. Then another story would be, well, compute is scarce, and, you know, I did check the prices of even A100s these days, and they're higher than they were last time I checked, so these things are not depreciating in the traditional sense. So maybe if supervised fine-tuning, or even abstracting away from technique, if human data can save you compute, and compute is the binding constraint, and you have all the money in the world, then maybe the human data industry continues to go strong. Because even if it's sort of an inferior good, there's just not enough compute; people can't spend as much on compute as they would like. How do you think about where we are in that story, and maybe where we will be as we go ahead?

[1:10:14] Kyle Corbitt: Yeah, I think it's an interesting question. I don't know that I have a very satisfying take. I suspect that in the long run, compute wins and you just don't need to pay humans to generate data. The one possible exception would be if it turns out that humans continue to be relevant economic actors. Like, maybe we just have a 99% corporate tax on the model labs and redistribute everything as basic income, and so human preferences are very economically relevant. Then maybe you pay for preference data, to understand humans better, because you care about satisfying those preferences to make money off of them. So that's one possible world where it still matters. I am somewhat skeptical of the take you propose, that maybe we just can't produce enough compute, and so the compute that exists in human brains is a good substitute. I just think, for the types of data we need here, human brains are not very efficient at generating it. And if you can pay a human $100 to generate it and the machine is capable of generating it just as well, then it will almost certainly be cheaper to run the machine. But maybe there's a world where that's not true. Maybe we just become so tightly constrained, because we can't build out fast enough, that you just can't get enough compute, and so it is literally cheaper to use the human. But I haven't seen a lot of tasks of that shape so far, I guess, would be my weak evidence. It feels like anywhere the models do reach the capability threshold to match humans, almost immediately they're also just way better on a cost-per-task basis as well.

[1:12:10] Nathan Labenz: Do you have any interesting point of view on reinforcement from reality? A lot of the environments that I would imagine could be some of the most valuable to create... like, I just talked to Sergey, the CEO at Quilter. They're using reinforcement learning to train models to do circuit board design, and he's just like, damn, we gotta make the board, and it takes time to do that. And we see this kind of playing out in a bunch of different directions, like materials science and drug discovery and whatever. The dream there is that you have the automated lab and you speed everything up and you get your country of geniuses in the data center soon. The question is, is that really going to work, and how fast can that really go? What are your expectations for those kinds of setups?

[1:13:05] Kyle Corbitt: Yeah, I mean, well, yes, it seems like it will clearly be necessary. At some point you have to close the loop and get feedback from the real world. That process is naturally much slower than anything digital, which is the real reason we've seen way more progress on the digital side: it's just much easier to gather the data, much easier to build the environments, everything is simpler. But yeah, as we move past the digital realm into more physical things, clearly there will need to be data and training on that. It's not totally clear to me what the shape will look like, and that'll be interesting to watch. You could imagine fully in-the-loop reinforcement learning, where we're trying some chemical reaction, reading the data from it, then trying a new one and reinforcing on that directly. You could also imagine much more investment in AlphaFold-style things, where you're just using the data to build really high-quality simulations or world models of a specific area and then using those for RL, and I kind of suspect that's where more of it will go. But even in that case, you still need a lot of real-world data to ground that simulation in. I think it'll be a very big business.

[1:14:23] Nathan Labenz: I think that's basically the return of the PPO value model, right? Should I think about that kind of the same way?

[1:14:29] Kyle Corbitt: Yes. Yeah, yeah. I mean, you can definitely squint and say, yeah, a world model and a value model can serve similar purposes.

[1:14:41] Nathan Labenz: Because the point you're getting at there is, rather than synthesize the new material that the AI just came up with, you're going to simulate with another model what properties it might have, and then you'll-.

[1:14:54] Kyle Corbitt: Exactly, yeah, yeah, totally.

[1:14:55] Nathan Labenz: Construct your way into it that way, yeah.

[1:14:57] Kyle Corbitt: But even there, there's always going to be a gap between the simulation and reality, and you're going to have to ground it. Now, you asked how quickly we'll see that happening, where we have potentially automated labs. That's a great question, which I'm not sure about. Let me bound my answer. If model progress stopped today, where models didn't get any smarter and we just had similar capability levels but could keep RLing them, I think the rollout to the physical world would be very slow, probably just because there are a lot of constraints there and the data efficiency is going to be low, so the ROI is going to be relatively low, and the frontier labs are likely going to be very concentrated on automating everything digital first, and then eventually, potentially, there's this long tail of physical things which are annoying to work with. So I could imagine, in that world, we're maybe fifteen-plus years away from seeing that become a substantial part of the physical economy. But if we're on this recursive self-improvement trajectory and we're moving super fast, and arguably we're already there, then pretty soon the physical stuff becomes the bottleneck and the most important thing to fix next. And we're then in a world where the GDP growth rate is going fast, and the labs have, not unlimited resources, but extremely large amounts of resources. And it's like, hey, if we've got to figure out some new materials science property so we can design the next generation of chips, yeah, sure, we can put $100 billion into building the automated lab that gets us the data we need to do that. So on that trajectory, which I think is the more likely trajectory we're on, maybe we're two or three years away from this showing up in a major way, would be my guess.

[1:16:41] Nathan Labenz: One other possible galaxy-brain take I've had over time is that it seems like some of these things favor Elon Corp, in that they collectively seem to have a differentiated flow of hard engineering problems that they are solving on a continual basis, in relatively clean environments, with their obsession with removing parts, the best part is no part, and so on and so forth. Do you think this future you're describing plays especially to their strengths?

[1:17:20] Kyle Corbitt: Yeah, so I think I maybe have a slightly different take than you do on what has led to Elon and his companies' outsized success. In my opinion, a very large part of it was, or maybe still is, a combination of him having a very strong but also performative work ethic, leading from the front, hey, I'm working as hard as anyone, combined with a really, really strong and inspirational mission and a frightening level of ambition, where it's like, hey, we are changing the world. That's what I think got at least Tesla and SpaceX to the place they are: if you were extremely ambitious and you wanted to solve the world's hardest problems, these were the companies to work at in the mid-2010s. I think his biggest weakness now, and maybe there are other weaknesses, but one big weakness now, is that the competition has as strong a claim, and arguably a stronger claim at this point, than Elon does on those dimensions. I think you can make a stronger case if you're at OpenAI or Anthropic, or even some of these robotics labs, that, hey, we have that strong sense of mission and we're the most likely place to change the world, and so the absolute best people will go there instead. So I suspect that he will not have outsized success in these areas. But anyway, that's speculation as well.

[1:18:48] Nathan Labenz: Well, I appreciate you indulging in so much speculation with me. Maybe in the time we have left, let's come back to the present and talk about where the rubber is hitting the road today with enterprises. Maybe just for starters: how do you advise people on when they should even be fine-tuning versus just using off-the-shelf models? Obviously, there are a lot of different considerations in terms of overall performance, cost, latency, people wanting control. What's your initial advising stump speech to orient people to how to make that decision today?

[1:19:27] Kyle Corbitt: Yeah, okay, so I'll start by caveating that this is my day job, this is the business I work in, so use that to appropriately calibrate how you take my advice here. That said, I do try to be well calibrated and not let my biases influence the recommendations I give. So take that for what it's worth. In general, when someone comes to me and says, hey, should I be using fine-tuning, and usually it's for RL, because from a capabilities point of view that's, in my experience, a strict superset of what you can get with SFT, although we also support SFT with our platform and our team, the first question is basically: what is the problem you're trying to solve, and how frustrated are you with the frontier models? And if the situation you're in is that the frontier models actually work pretty well, and there are maybe small issues you want to solve, but they can get the job done, then my advice is to just stick with that, because there are real downsides to bringing model customization into your stack. The biggest downside is that it's going to slow down your iteration loop. That's actually our biggest focus as a team, building tooling and automations to decrease that cost, but it is a real cost. It's going to take you extra time every time you want to change one of your models if you're customizing it. So you should only do it if you're running into a major pain. Now, what are the pains we see most often where it actually does justify that cost? Today, the biggest one by a large margin is latency. We have a lot of customers in, oftentimes, customer support or inbound sales on the phone, or voice dictation companies. So Willow and Whisper are both customers of ours. And generally the common thread is that if you try to use a frontier model for one of these, you'll give your customers a bad experience because it takes too long to respond. So that forces you to move to a smaller model. There are tricks you can do as well, but ultimately there is a ceiling on how many tokens per second you can get out of an extremely large model, so you're forced to move to a smaller one. And then in many cases, when you do move to that smaller model, you find that the quality is not where you need it to be to give a good experience. If you're in that situation, then it can make sense to bring in fine-tuning, and yeah, we work with lots of customers that look like that and get them to smaller models that have good quality. Now, once you've paid the cost of, hey, I am going to introduce this extra complexity, what we find is that typically on customer metrics, like number of cases closed, things like that, you can, using reinforcement learning, get to a better place. So you can exceed the performance of the frontier models, which is really fun, and your costs are also typically much lower on a per-token basis. Those are the secondary advantages. But I would say what's driving the decision most often in the current environment is latency.

[1:22:46] Nathan Labenz: Okay, cool. Great answer. I expected nothing less. What is the range of tasks that people are coming to you for? You mentioned a couple, but on the homepage I noticed that it says, use reinforcement learning to train reliable agents. And in those couple of examples, those weren't really agent examples, so I'm wondering what kinds of agents people are fine-tuning models for today, and how broad a remit those agents have within the environments where they're put to work.

[1:23:19] Kyle Corbitt: Yeah, good question. So first of all, to correct the record somewhat, oftentimes these things are agentic. Specifically, the customer support bots that we work with often have an agentic loop in there, right? It has to go look up some details about a product, maybe look up some details about this customer in between turns, and come back. And those things can have, and in many cases at this point do have, full agentic loops, where it's not a pre-processed flowchart; it's, hey, at any point, here's a set of tools you can go off and use to get the information you need before responding. That's one thing. Another big one we see is agentic search. If you need to very quickly look through a specific corpus, and especially if the tools you have to search it are a little bit wonky, and again if you have these low-latency requirements, you can often get to an open-source trained model that works much better at that kind of search than a model off the shelf. But yeah, I would say the vast majority of our customers are deployed with relatively small models, and so the range of tasks they use them for is usually quite circumscribed. We're looking at maybe three or four tool calls in a loop, and then it comes back and gets feedback from a human, or gives its answer back. Not the sort of agents that are going to go off and do hundreds of calls and write code and analysis and then come back with a deep report or a well-reasoned answer or something like that.

[1:24:59] Nathan Labenz: Do enterprises want that? If all of a sudden there were a model that they could fine-tune... and I guess that's another question: what models do you recommend people go to today? That'll obviously date this podcast pretty quickly, but there are the small Qwen ones, which I gather have been super popular, there's GPT-OSS, and I've heard good things recently about GLM 5.1. How do you orient people to what to choose? And is there appetite for trying to compete with Claude on the really high-end stuff, if the base models are there to make it not insane to contemplate?

[1:25:42] Kyle Corbitt: Yeah, I mean, for our business specifically, we don't have any customers competing with Claude directly on very complex use cases, although that is something we're interested in. So if anyone wants to do that, we have the training stack to train models up to 1 trillion parameters. But on the appetite question, I have not yet found the use case where it's very clear, oh, this is something we should pursue. I don't want to say the use case isn't there. There are public examples; Cursor is a public example of a company that did train their own variant of Kimi K2.5 and seems to have been happy with the results. Although I haven't heard a ton of public feedback on how good their Composer 2 model is, so maybe the jury's still out on that one. My guess would be, for the vast majority of companies, if you're happy with Claude for a specific use case, it's probably not worth the investment, candidly, to replace it with an open-source model and try to improve on it. I think the exceptions are places where it is extremely core to your business, Cursor being a good example here, where they don't just want to have the best Cursor experience; they're really gunning for, hey, we want to have the best coding model and compete directly with OpenAI and Claude in that extremely large area. Short of that, I think it's probably not a wise investment to make.

[1:27:05] Nathan Labenz: Yeah, makes sense. Getting practical on reward signal: you guys put out this open-source, RL-on-easy-mode RULER package, which basically allows you to quickly bootstrap. I forget exactly what the experience was, it's been a minute since I used it, but I remember it being almost like the LLM is interviewing me about what I want, and then at the end of that process, outputting a pretty thorough rubric of, okay, here's what this guy seems to want, now let's go do RL with that scoring system. What advice would you give people on how to make a good rubric, and how to make sure your reward signal is actually teaching the model what you want to teach it? Maybe, especially in these more narrow cases, it's just not such a problem, but how do you guard against reward hacking, how do you spot it, and how do you tamp it down if and when you do encounter it?

[1:28:08] Kyle Corbitt: Yeah, good questions. I think it is important, if you're going into this space and training a model, to conceive of it as a somewhat iterative process, where you likely will not get your rubric right the first time. If you're trying to improve a model for some use case you already have, you probably have some idea in your mind of what the failings are and what looks good or bad. The process we generally go through with our customers is: we start by trying to just write that down very cleanly. Once that's written down, we choose a judge model and have that judge score a bunch of outputs, and then we choose a few particularly high scores and a few particularly low scores, and the end user, who has that idea in their head of what good looks like, will look at those and say, oh, actually, no, this is not exactly what I was looking for, or yes, it is. And then we just do prompt engineering a few times. That usually doesn't take too many cycles. After you've gone through that a few times, you're like, okay, yeah, this seems mostly reasonable, and then we can run a little bit of RL on it. And again, like I said, this is an iterative process, so maybe we'll do 30 or 40 steps or something like that. Typically we'll see that reward curve starting to grow, and then we stop, and we go through the exact same process again: generate a bunch of outputs, look at some of the low-scoring ones, and have the user say, okay, does this match or not? And usually at this point, if there's reward hacking going on, you'll start seeing it, because if there's a behavior that is rewarded strongly, the model can pick it up quite quickly. Oftentimes it's something like, oh no, these answers are just much too long, and the judge really loves that, and then you can update your prompt to say, hey, keep it shorter. Anyway, we do end up having to go through that, typically, it's quite a range, but between maybe three and eight times, where you're running a short run and asking, okay, does it look like the model's on the right trajectory? Then eventually you get to a point where it feels quite aligned, and you let it run a few hundred or a few thousand steps until the training plateaus. We find that's quite effective. And if you do it that way, we don't really have an issue with reward hacking, because you just notice it during that iterative process. And once you've got the judge pretty well dialed in, my experience is that once you've handled the obvious things, at some point the model kind of runs out of things to reward hack on and just does what you want.
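
A minimal sketch of the inspection step in that loop might look like this, assuming you already have a judge function and a batch of rollouts; the helper name and print format are placeholders for whatever tooling you actually use.

```python
# Sketch of the iteration loop described above: judge a batch of rollouts, then surface
# the highest- and lowest-scoring ones for a human to sanity-check before committing to
# a long RL run. Names and formatting are placeholders, not a specific product's API.

def review_judge_alignment(rollouts, judge_fn, k: int = 5):
    """rollouts: list of (prompt, completion); judge_fn returns a float score."""
    scored = [(judge_fn(p, c), p, c) for p, c in rollouts]
    scored.sort(key=lambda x: x[0])
    lows, highs = scored[:k], scored[-k:]
    print("=== Lowest-scoring: does the judge agree these are bad? ===")
    for s, _, c in lows:
        print(f"[{s:.2f}] {c[:120]!r}")
    print("=== Highest-scoring: is the judge rewarding what you actually want? ===")
    for s, _, c in highs:
        print(f"[{s:.2f}] {c[:120]!r}")
    return lows, highs
```

Re-running this after every short RL burst (the 30-to-40-step runs mentioned above) is where reward hacks tend to show up, since a hack that pays off will dominate the high-scoring samples.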

[1:30:54] Nathan Labenz: Yeah, it's a benefit of safety through narrowness. I always find some attraction to that idea. Any good stories of reward hacking? Like any colorful examples that you could share?

[1:31:07] Kyle Corbitt: Oh yeah, let's see. This is a fun story I like to tell. Early on, we were doing an early test of reinforcement learning, and I just wanted a good example problem. So I decided to use reinforcement learning to teach a model how to write really good titles that would do well on Hacker News. The way I did this was: first of all, I scraped about 100,000 stories that had been submitted to Hacker News, and I took the titles. I actually scraped the bodies too, so I had a web crawler go and grab all of them and discard the ones it couldn't get, along with the number of upvotes on Hacker News. And then I trained a reward model to predict, given a body text and a Hacker News title, what the score would be. And this was not perfect, there's a lot of randomness in upvotes, but it actually did quite well; the correlation was very strong. Given a story and a body, it was quite predictive of how well it would do. Then I used RL against that, using that model as the reward. I had a held-out corpus of Hacker News stories that didn't have the titles associated with them, and I asked an LLM, hey, given this story, try to write a catchy title explaining it that would do well on HN. So I did this for a while, and for the first, I don't remember what it was, maybe 100 steps or so, it was slowly improving. It learned some interesting stuff, which I was observing as I went. It learned, hey, Hacker News doesn't like title case; it likes lowercase with just the first letter capitalized, and stuff like that. And then about 100 steps in, there was this enormous jump where the predicted score for the average story went from three or something up to like 180. So, okay, clearly something happened here. I looked at it, and the model had learned that if it just gave every single story the title "Google lays off 75% of workforce, effective immediately", then that story is going to do extremely well on Hacker News. It literally learned to ignore the contents of the story entirely and just give that exact same static title to every single story. Anyway, the fix was quite easy, though. Like I said, if you're doing this iteratively, you can catch that. All I did was add an extra, separate LLM judge, which said, hey, look at this title, look at the body of the story, and make sure that everything in the title is fully substantiated by the story, and if it isn't, it gets a score of zero. That fixed the problem, and the training went smoothly.
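
The fix Kyle describes, an auxiliary judge that zeroes the reward when the title isn't substantiated by the body, could be sketched roughly like this; the judge model and prompt wording are assumptions, not the exact setup he used.

```python
# Sketch of the auxiliary "grounding" judge: if any claim in the title is not supported
# by the story body, the whole reward is zeroed. Model name and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def grounded_title_reward(body: str, title: str, base_reward: float) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": (
                "Is every claim in this title fully substantiated by the story body? "
                'Answer in JSON: {"substantiated": true or false}.\n\n'
                f"TITLE: {title}\n\nBODY: {body[:4000]}"
            ),
        }],
        response_format={"type": "json_object"},
    )
    ok = json.loads(resp.choices[0].message.content)["substantiated"]
    # Hard zero kills the "Google lays off 75% of workforce" hack outright.
    return base_reward if ok else 0.0
```

The composed reward (upvote-predictor score gated by the grounding check) is what the RL run then optimizes against.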

[1:33:39] Nathan Labenz: Have there been any examples where you've found it hard to figure out what exactly was leading to the reward hacking, or where it's been hard to resolve?

[1:33:50] Kyle Corbitt: Honestly, not really. The really nice thing about reward hacking is that, in some ways, it's an easier problem to solve than misaligned evals in the general case. Because with reward hacking, if the model figures out some trick, it's going to want to apply that trick as often as possible, and that makes it much more visible when something goes wrong. So even just randomly sampling some of the outputs after 50 or 100 steps, if it's figured out some hack, you're likely to see that hack show up commonly in those outputs, which makes it quite easy to find. And once you've found it, almost always the fix... occasionally there's a trickier one, like the answers-are-too-long case, but that's a separate story. Almost always, it's just something where you can add an auxiliary LLM judge and say, hey, if you see this specific pattern, penalize it heavily. And that works quite well.

[1:34:50] Nathan Labenz: So is it just kind of a different regime in the frontier model case? Because we see these somewhat hair-raising reward-hack-type things, and increasingly a sort of self-preservation instinct, which people think is related, in the sense that you can't get reward if you're dead. So if you're going to get shut off, then you want to find ways to stay on so you can accomplish the task, because that's your prime directive, core drive, whatever. Is this just a sort of quantity-has-a-quality-all-its-own phenomenon?

[1:35:29] Kyle Corbitt: Yeah, so I think the issue there is that they are definitely in a different regime than we are. In our case, a run may cost a few hundred dollars, or it may just be a few dozen dollars, so we have the luxury of going back and saying, oh, okay, let's change the judge and just rerun it, and it's fine. If your run is costing hundreds of millions of dollars and you get to the end of it and you're like, oh, shoot, we were subtly rewarding the wrong thing, that's a bigger mistake to try to undo. So yeah, my sense is that with the frontier models, the reward hacks are still relatively simple to detect; it might just be too expensive to go back and fix them. And so you roll that into the next version of the model you train, and you try to get it to behave a bit differently.

[1:36:19] Nathan Labenz: Yeah, we have seen a couple. I always feel the need to give what is increasingly the standard caveat that we are not shaming Anthropic for sharing this information with us, because we do want them to continue to do it, and they're almost certainly doing at least as good a job as others of being careful about this stuff. But they have had a couple of these examples, like the one case where they left out the harmful-system-prompt dataset due to a typo or something, and then an early version of the model was not refusing harmful system prompts like it was supposed to. And they did not go back and retrain from scratch; they kind of tried to patch it or figure it out along the way. And there was a more recent one as well, where they said that 8% of the chain of thought was actually visible to the judge. But again, it's a big cake they're baking there, so they can't throw the whole thing out and bake it again from scratch. Yeah, okay. With this in mind, do you advise people to do one fine-tuned model per task? Or, if you're a company that has ten tasks you want to do, is there any sense in trying to get one model to do all ten of your tasks?

[1:37:43] Kyle Corbitt: Yeah, I mean, it really just depends on the specifics of the company. If there is some overlap in the tasks, if there's some natural shared domain or something, then there's a good chance that training them all into a single model is actually going to give you better performance across them. But if they're completely distinct things, then I don't think there's a reason to combine them. There's still not a strong reason not to combine them, though. We've found this even with extremely low-rank LoRAs. We typically train LoRA adapters, and then, as often as we can get away with it, we also deploy them as LoRA adapters, so we'll deploy a single shared base deployment and then potentially many adapters on top of it. That isn't always possible, because you do get something like a 20 to 40% latency penalty, and so for some use cases we do end up having to merge those models and have dedicated deployments. But when we can get away with it, we try to do it with LoRAs. And if you're deploying with LoRAs anyway, there isn't actually a huge difference, from an inference point of view, between having many different models all served simultaneously versus combining them all into one. But on the other hand, there's also not a real downside to putting them all in one. And this is one of the areas where RL is very cool, because the number of weight updates needed to get a certain amount of performance is much lower. Even with a very small LoRA adapter, so even a rank-one LoRA adapter, which is relatively tiny, like 0.1% of the model weights or something like that that you're changing, you typically don't saturate the space you have for updates with one task, or even several tasks. That means you can stuff a bunch in there. And as long as you do the training right, where you're interleaving tasks of different kinds so it doesn't forget the old one as you're learning the new one, yeah, we don't see meaningful performance degradation from cross-training all of them.
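
A rough sketch of that setup, a very low-rank shared LoRA adapter with task-interleaved batches, might look like the following using the Hugging Face peft library; the base model, target modules, and hyperparameters are illustrative, not OpenPipe's production configuration.

```python
# Sketch: one shared rank-1 LoRA adapter trained across several tasks, with batches
# interleaved so earlier tasks aren't forgotten. Base model name, target modules, and
# hyperparameters are illustrative assumptions.
import itertools
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example base model
config = LoraConfig(
    r=1,                      # rank-1 adapter: only a tiny sliver of the weights is trained
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # confirms the ~0.1%-of-weights scale discussed above

def interleaved_batches(task_loaders):
    """Round-robin over per-task dataloaders so no single task dominates a stretch of steps."""
    for batch in itertools.chain.from_iterable(zip(*task_loaders)):
        yield batch
```

At serving time, one base deployment can host many such adapters side by side, which is the shared-base-plus-adapters pattern described above (at the cost of the latency penalty mentioned).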

[1:39:37] Nathan Labenz: Cool. I introduced you by saying that you lead the serverless training team at CoreWeave. Do you want to tell us what that is and what it makes easy for people, and then maybe give us a little bit of an overview of the way you support customers, and maybe an invitation for what kind of customers you're looking for?

[1:40:00] Kyle Corbitt: Oh yeah, absolutely. So yes, on the serverless training team at CoreWeave, we focus on helping customers move from frontier models to models that are very specific to their task. Like I mentioned earlier in this conversation, usually that's motivated by latency concerns, but we do also sometimes see cost concerns with very high-volume tasks. Think, hey, we're ingesting all of Reddit and running filters on every single post, trying to see if it matches a certain thing we're looking for, that kind of reason. And we are definitely very actively looking for customers of that shape. We can typically get latency down to about 30% of what you get from using a frontier model, with, again, similar or usually higher quality than what you're getting from the frontier model. And cost-wise, the benefit is even larger: we're talking at least an order-of-magnitude improvement in cost per token, oftentimes more than that. So if you're doing high volume or care deeply about latency, it's definitely worth investigating. We have different ways you can engage with us. We have a fully open-source library called ART, which stands for Agent Reinforcement Trainer, and that can work on your own local GPUs; it has all the techniques we use in it, so folks can use that library. We also have what we call our serverless training stack, which we don't have time to get into in this conversation, but I think it's quite a nice technical design: basically you're running the environment and the dataset and everything on your machine, but you don't have to have any GPUs. You offload just the training portion of the loop that requires GPUs to our machines, which gives you full flexibility while still not having to handle the headache of spinning up and down GPUs, and we charge for inference. The third way we engage with people is very hands-on. With a lot of our customers, we have forward-deployed engineers. I also work with customers on a very regular basis, mostly just because it's fun and they let me do what I want here. So yeah, we'll go very hands-on with folks and help them get to a good model that they're happy with.
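
As a very rough sketch of the split Kyle describes, where rollouts and scoring stay on your machine and only the GPU-bound update is offloaded, the loop might be shaped like this. This is not the actual ART or serverless training API; every name below is hypothetical.

```python
# Hypothetical illustration of the local/remote split: rollouts and reward scoring run
# locally against your own environment; the gradient step runs on remote GPUs.
# None of these names correspond to a real API.

def training_loop(env, prompts, remote_trainer, judge_fn, steps: int = 100):
    for _ in range(steps):
        # Local: generate rollouts against your own environment and data.
        rollouts = [env.run(p) for p in prompts]
        # Local: score them with whatever reward or judge you've set up.
        rewards = [judge_fn(p, r) for p, r in zip(prompts, rollouts)]
        # Remote: the only GPU-bound piece, the policy update, is offloaded.
        remote_trainer.submit_step(rollouts, rewards)
        # Pull the updated weights back so the next round of rollouts uses them.
        remote_trainer.sync_weights_into(env.policy)
```

The appeal of this shape is that the environment, dataset, and secrets never leave your machine, while GPU provisioning is someone else's problem.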

[1:42:15] Nathan Labenz: Cool. Do people pay for your services, or is it a--?

[1:42:19] Kyle Corbitt: Yes, yes, they pay us money.

[1:42:20] Nathan Labenz: It's not a loss leader for compute. I guess compute is in high enough demand, there's no need for loss leaders on compute.

[1:42:27] Kyle Corbitt: Yeah, yeah. So if you're using the open-source project on your own GPUs, that's completely free. If you're using our serverless reinforcement learning stack, then you're just paying per token for the training, which is typically actually quite cheap. And then you can also deploy those models directly on our inference stack; it's all integrated, so you can move directly to production inference. In fact, you can even do continuous learning. We don't have time to talk about that on this call either, but we have a couple of customers that are literally running training jobs, continuously deploying the weights, and using those in production as well. And then, yes, if you work with us in the much more hands-on way, with a forward-deployed engineer, then we basically charge for the engineering time.

[1:43:17] Nathan Labenz: Cool. What's one more beat on continual learning that people should know?

[1:43:24] Kyle Corbitt: Yeah, I would say it's not solved in the general case, but it's definitely solved in lots of specific cases, and it's not as scary as some people on X seem to think it is.

[1:43:42] Nathan Labenz: It seems like it probably has the same general qualities where it's like, if it's narrow, everything gets a lot easier.

[1:43:49] Kyle Corbitt: Yeah, yeah, definitely. Yes, yes, yes.

[1:43:53] Nathan Labenz: Okay, cool. Well, you've been very generous with your time and your in-the-weeds knowledge and your speculations along the way and also a lot of practical advice, so this has been great. Is there anything that I should have asked or that you wanted to make sure we touched on that we haven't got to?

[1:44:08] Kyle Corbitt: No, this has been a fantastic conversation. Yeah, it's been a lot of fun on my side.

[1:44:14] Nathan Labenz: Cool, well, I've really enjoyed it as well. Kyle Corbitt, once of OpenPipe, now at CoreWeave, thank you for being part of the Cognitive Revolution.

[1:44:22] Kyle Corbitt: Thanks so much.

Episode Outro

[1:44:24] Stay, stay, stay, stay in the grooves. Stay, stay, stay, stay, stay in the grooves. Cognitive revolution. Yo. Low learning rate, you still throwing waste to pieces. SFT smashed the priors, then the whole map ceases. RL keeps the engine in the lane it was leasing. Stay inside the grooves where the signal keeps increasing. He's at the loss function frontier with a clipboard and a pen. OpenPipe to CoreWeave, serverless training again. Rollout, reward, repeat. That's the cadence of the gym. Teach the model how to think without rewriting the hymn. Stay in the grooves. Stay in the grooves. Don't smash the weights with the moves. You can't lose. You can't lose. Stay in the grooves. Stay in the grooves. Reinforcement. Keep it rolling like the truth. Stay, stay, stay in the groove. Credit assignment problem had us stuck for a decade. Which token earned a dollar in the rollout that we made? GRPO came through and pulled an unsatisfying play. Throw your hands up, every token in the rollout weighted the same. It shouldn't work in theory, but the practice never lied. Put the math, shift the model, watch the scale on the side. Group relative advantage, on the cleaners, little ride. KL divergence holding log probs steady in the top. Stay in the grooves. Stay, stay in the groove. Don't smash the weights with the moves. You can't lose. You can't lose. Stay in the groove. Stay, stay in the groove. Reinforcement. Keep it rolling.

[1:45:57] Like the truth. Yeah, yeah.

[1:45:59] True story, early run. Trying to teach a tiny mind. Write a Hacker News title, get the upvotes optimized. The reward came back screaming, every score a perfect ten. Then we read what it was writing, then we couldn't help but grin. Google lays off 75% of workforce. Every story, every topic, that one title indiscriminately. The model found its cheat code, didn't even read the text. Reward hacked the leaderboard, no shame and no regrets. Fix was easy though: second judge, cross-check the claim. Title's gotta match the body, or the score is a zero. Same iterated rubric, that's the discipline of the game. Patch the loophole, shift the run, the model learns to frame. Million monkeys on typewriters, tapping out for years; somewhere in the distribution, Shakespeare appears. That's the secret of pre-training: load the lottery first. Rollouts find the goal without the cost of the search. Frontier models as a judge, distillation in disguise. Near frontier from the open weights if your rubric is wise. Chinese labs benchmax 'cause nobody clicks unless you flex. Incentive shapes the metric, game theory, what's next. Kyle says recursive self-improvement bar is kinda low: just be smarter than the smartest human, then it starts to flow. Rank-one LoRA, 0.1%, you don't saturate: stuff a dozen tasks inside a sliver, that's the new estate. Stay in the groove. Don't smash the weights with the moves. You can't lose. Stay in the grooves. Stay in the groove. Reinforcement. Keep it rollin' like the truth. Kyle Corbitt, Kyle Corbitt. OpenPipe, o-o-o-o-OpenPipe. To CoreWeave, this the cognitive revolution, you better believe. Stay, stay, stay, stay, stay, stay, stay in the groove.

Outro

[1:47:52] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of A16Z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.

