The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Hello, and welcome back to the Cognitive Revolution!

Today, my guest is Kyle Corbitt, founder of the Reinforcement Learning & Custom Fine-tuning company OpenPipe, which CoreWeave acquired last year.

I open this episode with a bit of a confession: I've done a lot of Supervised Fine-Tuning work over the last few years, both for Waymark in the early days of getting GPT-3 to write decent video scripts, and for research projects such as the Emergent Misalignment paper. But I've done essentially no hands-on RL work, partly because my perception has been that frontier models are probably my best option in any case, and partly because I'm afraid, perhaps irrationally, of reward hacking.

Kyle says that while it may or may not be worth the extra work and slower iteration time, he does believe that RL on an open-source model could probably deliver better performance, and would certainly reduce both latency and inference cost dramatically.

With that motivation in mind, Kyle proceeds to offer a master class on all things RL, which repeatedly challenged my premises and in multiple instances updated my understanding.

He explains how RL differs from SFT in terms of the weight updates it makes to models, why this makes RL fine-tuning less likely to cause catastrophic forgetting, what distinguishes DeepSeek's GRPO algorithm from its predecessors, and what additional improvements on GRPO people in industry are using today.
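For those who haven't encountered GRPO before, its central idea fits in a few lines. This is my own toy sketch, not code from the episode: rather than learning a separate value network the way PPO does, GRPO samples a group of completions per prompt and scores each one against its own group's mean reward.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's sampled completions.

    GRPO's central trick: no learned critic. Each completion's advantage
    is its reward minus the mean reward of the group it was sampled with,
    normalized by the group's standard deviation.
    """
    mean_r = statistics.mean(group_rewards)
    std_r = statistics.pstdev(group_rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean_r) / std_r for r in group_rewards]

# Hypothetical rewards for 4 completions sampled from the same prompt:
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))  # the best completion gets the largest positive advantage
```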

We talk about the distillation strategies that Chinese labs are using to fast-follow American frontier models, and he argues that their use of LLMs as judges in the context of RL post-training is a bigger deal than supervised fine-tuning.

He also explains why he thinks that compute is the primary constraint preventing Chinese companies from catching up, and why he believes we're already in a recursive self-improvement loop.  

He describes the cottage industry of Reinforcement Learning environment companies that's sprung up to serve frontier labs, and why, though it's a good business to be in for now, he's declined to invest in any of them.

He surveys the use cases most commonly deployed by CoreWeave customers, and he offers a lot of practical advice on running RL, including how to develop and iterate on evaluation rubrics, whether to train N models for N tasks or a single model to perform multiple tasks, how the flagrant nature of reward hacking makes it relatively easy to deal with when you're focused on specific, narrow tasks, and how CoreWeave's use of LoRA adapters drives efficiency and convenience for their customers.
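On that last point, here's a minimal sketch of why LoRA adapters are so convenient to serve (my own illustration, with made-up dimensions, not CoreWeave's code): each adapter is just a pair of small low-rank matrices added on top of frozen, shared base weights, so many customers' adapters can sit behind a single copy of the base model.

```python
import numpy as np

d, r = 1024, 8  # model dimension and LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weights, shared across tenants
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection (per customer)
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # Effective weights are W + B @ A; only A and B (2*d*r parameters,
    # ~1.6% of d*d here) are task-specific and need to be swapped per request.
    return x @ (W + B @ A).T

x = rng.standard_normal((1, d))
print(adapted_forward(x).shape)  # (1, 1024)
```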

Kyle is both a technical expert and a successful commercial practitioner, and from start to finish, this is a super-high-signal conversation on a classic training technique that has become an industry unto itself.  

And so, I hope you learn as much as I did from CoreWeave RL fine-tuning guru, Kyle Corbitt.

Watch now!

Thank you for being part of The Cognitive Revolution,
Nathan Labenz
