Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Hello, and welcome back to The Cognitive Revolution.
The Cognitive Revolution is brought to you in part by Granola. Just yesterday, I happened to see Ramp's monthly report on the fastest-growing software vendors, and the #2 company, the one adding the most new customers right now, is Granola. Why? Aside from advertising on The Cognitive Revolution, I would chalk it up to an extremely smooth and easy-to-use product experience.
If you're listening to this show, there's a good chance you could, in theory, build AI workflows that capture audio, transcribe it, use it in downstream prompts and workflows. But … can your teammates? That's where Granola really shines. By delivering a polished product experience that anyone can immediately install and understand, and by introducing AI capabilities in the form of Recipes made by trusted thought leaders, Granola is making AI accessible to everyone.
See the link in our show notes to try my Blind Spot Finder Recipe and explore all the ways that Granola can make your raw meeting notes awesome – not just for you, but for everyone on your team, regardless of their relationship to AI.
Now, today, I'm speaking with Dan Balsam and Tom McGrath, CTO & Chief Scientist of mechanistic interpretability startup Goodfire, who, in less than 2 years since founding the company, have assembled an all-star research team, landed a first wave of blue-chip customers – including a couple that discovered Goodfire via Dan & Tom's first appearance on The Cognitive Revolution back in August 2024 – published a remarkable series of results, and most recently announced a $150M Series B fundraise at a valuation of $1.25 billion.
Along with the fundraise, they've announced a new pillar in their research agenda: Intentional Design, a push to expand the scope of what interpretability science can do, by complementing the current paradigm of reverse engineering how trained models work with a new approach focused on understanding and shaping the loss landscape to control what they learn in training, and ultimately, how they generalize.
We begin with a discussion of interpretability developments broadly, with Tom emphasizing the shift from techniques like sparse autoencoders, which transform a network's internal representations into sparse vectors in which each dimension represents a distinct concept, to new approaches that attempt to understand the intricate geometric structures these concepts inhabit in the model's latent space.
From there, we dive into their plans for Intentional Design and their first proof of concept: a technique for reducing hallucinations that uses a probe trained to detect hallucinations both to steer the model at runtime and to provide a reward signal for additional RL training.
Such training setups are not without controversy – people worry, understandably, based on results like OpenAI's "Obfuscated Reward Hacking", that models will simply learn to fool their monitors rather than truly correct their bad behavior. Dan & Tom meet this concern head-on, agreeing that "paranoia is a way of life" in alignment research and acknowledging that Intentional Design techniques are immature and probably shouldn't be used on frontier models today, while also arguing, first, that the pace of AI capabilities advances requires us to explore any & all possible paths to understanding and control, and second, that the specific details of the approach make all the difference.
In the hallucination reduction work specifically, the key trick was to run the hallucination-detection probe on a frozen copy of the model during training, so that the model being trained would hopefully find it easier to learn not to hallucinate than to find a way to evade detection.
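To make the frozen-copy trick concrete, here is a minimal PyTorch sketch of how such a setup could look – my own illustration, not Goodfire's code. The linear probe, the mean-pooling, and the simple negative-score reward are all assumptions for the sake of the example; the essential point is just that the probe reads activations from a frozen copy of the model, so RL can only raise the reward by changing what the model says, not by distorting the activations the probe sees.

```python
import copy
import torch
import torch.nn as nn

HIDDEN_DIM = 512  # hypothetical hidden size, for illustration only


class HallucinationProbe(nn.Module):
    """Linear probe mapping hidden states to P(hallucination)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence dimension, then score.
        return torch.sigmoid(self.linear(hidden_states.mean(dim=1)))


def make_reward_fn(base_model: nn.Module, probe: HallucinationProbe):
    """Build a reward function that runs the probe on a FROZEN copy of
    the policy model, so gradient descent cannot learn to distort the
    activations the probe reads."""
    frozen_model = copy.deepcopy(base_model).eval()
    for p in frozen_model.parameters():
        p.requires_grad_(False)
    for p in probe.parameters():
        p.requires_grad_(False)

    @torch.no_grad()
    def reward_fn(token_ids: torch.Tensor) -> torch.Tensor:
        # Re-encode the sampled completion with the frozen model
        # (assumed here to return final hidden states of shape
        # (batch, seq, hidden)) and penalize the hallucination score.
        hidden = frozen_model(token_ids)
        p_hallucination = probe(hidden).squeeze(-1)  # (batch,)
        return -p_hallucination  # higher reward = less hallucination

    return reward_fn


# Tiny smoke test with a stand-in "model" that just embeds tokens.
if __name__ == "__main__":
    toy_model = nn.Embedding(1000, HIDDEN_DIM)  # stands in for a transformer
    probe = HallucinationProbe(HIDDEN_DIM)
    reward_fn = make_reward_fn(toy_model, probe)
    ids = torch.randint(0, 1000, (2, 16))
    print(reward_fn(ids))  # one scalar reward per sampled sequence
```

In an actual RL loop (PPO, GRPO, or similar), this probe-based reward would presumably be combined with the task reward rather than used alone.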
More broadly, Tom asserts that a key principle is to avoid fighting backpropagation. Because models are such high-dimensional beasts, gradient descent will inevitably find ways around any attempt to prevent the model from learning what the loss function directs it to learn. Winning techniques must instead shape the loss landscape so that the model naturally wants to learn what we need it to learn.
In the final part of the conversation, we discuss some of Goodfire's many other recent papers, including their work with Prima Mente, which suggested a new research direction by revealing that a state-of-the-art model for predicting Alzheimer's diagnoses was basing its predictions on the length of cell-free DNA fragments, and a project showing that it's possible not only to determine which model weights are used for memorizing facts and which for more general-purpose reasoning, but also to improve model performance on some reasoning tasks by removing the memorization weights from the model entirely.
Along the way, we also touch on how Goodfire intends to balance its need for business growth with its public benefit mission as they decide what research to publish and when, briefly consider how well we should expect today's interpretability techniques to work on new and different architectures, get Dan's thoughts on the possibility of AI consciousness, and more.
As usual when I catch up on interpretability, I left this conversation impressed by how much progress has been made so quickly, and also mindful of just how vast neural networks are, and how much we still have left to discover and understand.
With that, I want to thank Dan and Tom for giving me this chance to drink from the Goodfire research fire hose, and I hope you learn as much as I did from this survey of mechanistic interpretability advances and introduction to the new paradigm of Intentional Design, with Dan Balsam and Tom McGrath of Goodfire.
Watch now!
Thank you for being part of The Cognitive Revolution,
Nathan Labenz