Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

Dan Balsam and Tom McGrath of Goodfire discuss their Intentional Design approach to mechanistic interpretability, from geometric latent-space methods to reducing hallucinations, and share results on Alzheimer’s prediction and balancing alignment research with a public-benefit business model.

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

Watch Episode Here


Listen to Episode Here


Show Notes

Dan Balsam and Tom McGrath from Goodfire return to explore the frontier of mechanistic interpretability and their new research pillar, Intentional Design. They explain the shift from sparse autoencoders to understanding geometric structure in latent spaces, and share a proof-of-concept method for reducing hallucinations using probes and RL. The conversation tackles concerns about reward hacking, principles for shaping the loss landscape instead of fighting backprop, and what this means for aligning powerful models. They also discuss recent Goodfire results on Alzheimer’s prediction, disentangling memorization vs reasoning weights, and how they balance commercial growth with a public benefit mission.

Use the Granola Recipe Nathan relies on to identify blind spots across conversations, AI research, and decisions: https://bit.ly/granolablindspot

LINKS:

Sponsors:

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Serval:

Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) About the Episode

(05:35) Unicorn raise and representations

(19:40) Intentional design vision (Part 1)

(19:46) Sponsors: VCX | Claude

(23:13) Intentional design vision (Part 2) (Part 1)

(37:40) Sponsors: Serval | Tasklet

(39:59) Intentional design vision (Part 2) (Part 2)

(43:06) Hallucination probes and RL

(56:30) Safety, publishing and business

(01:05:41) Interventions, backprop and sycophancy

(01:18:42) Compute and minimal reasoners

(01:32:46) Alzheimer's biomarkers and discovery

(01:37:20) Team, architectures and consciousness

(01:45:58) Episode Outro

(01:49:02) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Introduction

Hello, and welcome back to The Cognitive Revolution.

The Cognitive Revolution is brought to you in part by Granola.  Just yesterday, I happened to see Ramp's monthly report on the fastest growing software vendors, and the #2 company, that is adding the most new customers right now, is Granola.  Why?  Aside from advertising on The Cognitive Revolution, I would chalk it up to an extremely smooth and easy-to-use product experience.

If you're listening to this show, there's a good chance you could, in theory, build AI workflows that capture audio, transcribe it, use it in downstream prompts and workflows.  But … can your teammates?  That's where Granola really shines.  By delivering a polished product experience that anyone can immediately install and understand, and by introducing AI capabilities in the form of Recipes made by trusted thought leaders, Granola is making AI accessible to everyone.

See the link in our show notes to try my Blind Spot Finder Recipe and explore all the ways that Granola can make your raw meeting notes awesome – not just for you, but for everyone on your team, regardless of their relationship to AI.  

Now, today, I'm speaking with Dan Balsam and Tom McGrath, CTO & Chief Scientist of mechanistic interpretability startup Goodfire, who in less than 2 years since founding the company, have assembled an all-star research team, landed a first wave of blue-chip customers – including a couple that discovered Goodfire via Dan & Tom's first appearance on The Cognitive Revolution back in August, 2024 – published a remarkable series of results, and most recently announced a $150M Series B fundraise – at a valuation of $1.25B dollars.

Along with the fundraise, they've announced a new pillar in their research agenda: Intentional Design, a push to expand the scope of what interpretability science can do, by complementing the current paradigm of reverse engineering how trained models work with a new approach focused on understanding and shaping the loss landscape to control what they learn in training, and ultimately, how they generalize.  

We begin with a discussion of interpretability developments broadly, with Tom emphasizing the shift from techniques, like sparse autoencoders, that transform a network's internal representations to sparse vectors where each node represents a distinct concept, to new approaches that attempt to understand the intricate geometric structures these concepts inhabit in the model's latent space.  

From there, we dive into their plans for Intentional Design, and their first proof of concept, a technique for reducing hallucinations that uses a probe trained to detect hallucinations both to steer the model at runtime and as a source of reward signal for additional RL training. 

Such training setups are not without controversy – people worry, understandably, based on results like OpenAI's "Obfuscated Reward Hacking", that model will simply learn to fool their monitors rather than truly correcting their bad behavior, but Dan & Tom meet this concern head-on, agreeing that "paranoia is a way of life" in alignment research, acknowledging that Intentional Design techniques are immature and probably shouldn't be used on frontier models today, while also arguing, first, that the pace of AI capabilities advances require us to explore any & all possible paths to understanding and control, and second, that specific details of the approach make all the difference.

In the hallucination reduction work specifically, the key trick was to run the hallucination detection probe on a frozen copy of the model during training so that the modified model would hopefully find it easier to learn not to hallucinate than to find a way to evade detection.  

More broadly, Tom asserts that a key principle is to avoid fighting back-propagation.  Because models are such high-dimensional beasts, gradient descent will inevitably find ways around any attempt to prevent the model from learning what the loss function directs it to learn.  Winning techniques must instead find ways to shape the loss landscape so that the model naturally wants to learn what we need it to learn.  

In the final part of the conversation, we discuss some of Goodfire's many other recent papers, including their work with Prima Mente, which suggested a new research direction by revealing that a state of the art model for predicting Alzheimer's diagnoses was basing its predictions on the length of cell free DNA fragments, and a project that showed that it's possible not only to determine which model weights are used for memorizing facts and which are used for more general-purpose reasoning, but that you can actually improve model performance on some reasoning tasks by removing the memorization weights from the model entirely.  

Along the way, we also touch on how Goodfire intends to balance its need for business growth with its public benefit mission as they decide what research to publish and when, briefly consider how well we should expect today's interpretability techniques to work on new and different architectures, get Dan's thoughts on the possibility of AI consciousness, and more.  

As usual when I catch up on interpretability, I left this conversation impressed by how much progress has been made so quickly, and also mindful of just how vast neural networks are, and how much we still have left to discover and understand.  

With that, I want to thank Dan and Tom for giving me this chance to drink from the Goodfire research fire hose, and I hope you learn as much as I did from this survey of mechanistic interpretability advances and introduction to the new paradigm of Intentional Design, with Dan Balsam and Tom McGrath of Goodfire.


Main Episode

Nathan Labenz: Dan Balsam and Tom McGrath, CTO and Chief Scientist at Goodfire, welcome back to The Cognitive Revolution.

Dan Balsam: Thank you for having us again.

Tom McGrath: Yeah, thanks for having us.

Nathan Labenz: As always, you guys have been prolific and there's a ton to cover, but let's start with the big headline news. Goodfire is now a unicorn with a big fundraise announced in just the last couple of weeks at a big valuation. Congratulations and recap the headlines for us.

Dan Balsam: Yeah, no, we're very excited to announce this fundraise. It's really a testament to all of the hard work that the team has been doing. It's pretty crazy. We've only been around for a year and a half, so how much we've been able to accomplish and how much we've been able to grow over this period of time has been really awesome to see. I mostly think of it as, I'm very excited to take this capital we've been able to raise and deploy it in order to scale up what we're working on and continue to advance interpretability research. And we have a lot of new research to talk about. Talk about that, too.

Nathan Labenz: Yeah, that's putting it mildly. I took my eye off the good fireball for a minute before my son got sick last year. And then I've been less able to follow research. And when I got back to the blog in preparation for this, I was like, holy moly, there is a lot of stuff that has dropped. So we're going to do as much as we can today in the time we have available. We're not going to cover all of it and probably not even a third of it, but it is an impressive run for sure in terms of the team that you guys have been able to assemble the results that that you've put out. And I'm looking forward to unpacking as much as we can. As is my custom though, maybe let's start with kind of the real zoomed out view. We've done this a couple times in the past. I'll give you an interested outsider's take on what I think is going on in interpretability, and then you correct me, complicate, give me the next level of depth of understanding. I guess what I'm seeing is we're moving from understanding the concepts that models are thinking about or representing in their internal states, with things like sparse autoencoders, to understanding the circuits that are operating and doing the information processing with things like tracing the thoughts of a large language model from Anthropic and their transcoders of these wiring diagrams that show at least like simple operations, which already look pretty complex. I recommend people check out what it looks like for a model to add a couple two digit numbers to see something a little bit mind bending and arguably even a little hair raising. And, but that's come a pretty decent way at least. And then we've now also got this notion of learning dynamics or kind of understanding how the model becomes what it is. I associate that with folks like Timaeus, whose work still goes over my head. And you guys are getting into that space a little bit now as well. Do you think that's a useful way to build up? And how much progress, how would you characterize progress on each of those levels, if that is a good taxonomy?

Tom McGrath: Yeah, I think it's a good taxonomy. I think the one thing I would add to it There's a kind of meta-level question that's being asked, and I think it's been asked a few times since we were last on, which is this idea, this question of what is interpretability for? Why are we doing all of this interpreting? We can go back to that in a second. But yeah, I think that basically I would say that we are considering things that like steadily expanding levels of... We start with this like very atomic thing. What is even going on in the residual stream? We build our way out. And I think that You can see this generality is happening across a couple of axes. And one is this axis of circuits that you're talking about, where you go from representations here to like, how is that computed piece by piece? We've got transcoders, like you say, and crosscoders. And they let you say, this thing happens, and this thing happens, and this thing happens. The missing piece of this, of course, is still tension. We can come back to that in a little bit. But so that's one level of complexity that's getting added is like algorithm is complexity across layers. But when you look at these circuits, like you mentioned, the biology of a large language model paper from Anthropic, one thing that's interesting to me is that the way that you get these, like, how does the model add these numbers, right? This information is assembled out of a series of sort of execution traces. We have one answer from how does it add 17 plus 54, we have another answer of how does it add like, 13 plus 99 or something. And we try and piece-- these collectively, we hope, will give us something generalizing. When we say a circuit, I think what we really have in mind, though, is this idea of something that quantifies across inputs. There's a whole-- I could put these two variables, x and y, and I can put any arbitrary two-digit number in x and in y, and the circuit-- a circuit account of this should take account of all the possible settings of x and y, whereas what we have is like a collection of individual algorithm executions. And you can say, this becomes very clear at the level of circuits, but you can also imagine taking it, you can also try and take it down a step and say, when we go from, this is when we go from features, say there might be like a feature about a number being approximately 17 odd. You might have seen the model when models manipulate manifolds paper again from Anthropic or the Primamente fragment length work that Dan's team did. You have these like quite continuous manifolds that represent quantities. It's not literally continuous. You only get single bits of DNA, but it's like this it can take on many values and it sweeps out this space in the embeddings. And so if we look at this through the lens of sparse autoencoder features, say. So imagine we're tracing a helix, for instance, and this helix just winds round and round and round. It goes past the origin. We should imagine a sparse autoencoder feature is going out from the origin and zapping a particular point on that helix. And so that is going to detect, well, the number is about five, say, and it will. For three, it's a little higher, it's not zero. For four, it's a little higher. For five, it's high. Six, it's starting to drop off again as well as the sort of helix sweeps through its receptive field. And this is like a way you can do a lot with this. But I would say that the thing that we want really is the sort of the simpler structure that we want to get that helix. We don't want to just get, like, a set of little patches of the helix. People refer to this as a manifold, which will drive any, like, mathematicians listening to this wild. They are in many ways not literally manifolds, but let's say manifold now. Now, how does this relate to this circuits that quantify across possible inputs? The connection is quite simple. And that connection is that this manifold literally is the set of things you want to quantify across. So in order to have this sort of explanation of all of the possible inputs the circuit might take, I need to, first of all, map out that space of all the things it might take, then I need to push them through the circuit machinery. And this will take one manifold, and it'll split it up into another-- it'll push it forward into a new manifold with a new shape at a later layer, and so on. And so I think there's this extra level of complexity that we need to grapple with if we're really going to get satisfying explanations of neural networks, which is this kind of algorithmic explanation. And then when you get down into the computational nuts and bolts, that algorithmic explanation requires a sort of manifold type of explanation.

Nathan Labenz: Yeah, so let me just try to echo that back to you a little bit. It seems to relate very much to... I'm not sure if I have the right vocabulary for this, but... There's a lot of discussion around are all the features in models linear or are some of them not linear? And then there's like obviously different definitions and intuitions around what linear means. But I always go back to the days of the week as a canonical simple example that I think of where it's like you can have in your sparse autoencoder seven different activation patterns that correspond to seven in the sparse version and the autoencoder, seven different spots in that super long, what is it that I'm looking for? vector of of concepts. And those could be just due to the vagaries of training, whatever, those could be randomly distributed in that sparse auto encoder. But you can kind of find through your auto labeling process or whatever, okay, spot one is Monday and spot 5000, 27 is Tuesday, and then way down at the end is Wednesday, and then Thursday is over here in the middle. And crazy like that. And you're like, that seems weird. And that is, that's just, that weirdness is reduced when you look at the relationship in embedding space between those concepts because they often are either clustered or they're in some plane where there's a rotation through a plane, some sort of shape that makes a lot more sense when you think, geez, we actually do rotate through the days of the week. So in some sense, it makes a lot of sense that the model would represent those things as rotation through some sort of plane. And now you're taking that one level up and saying, okay, now you can have all these kind of crazy shapes. It becomes more of a topology sort of exercise where you're transforming these crazy high dimensional shapes through the circuits from whatever they start as to whatever they end up as. And that's, I think, the real thing you're highlighting as missing from my initial characterization, that we need to understand the space of the concepts and not just label them as present or absent at any given point in time, but there's a rich and very meaningful geometry of that as well.

Tom McGrath: Exactly. Yeah, that's exactly right. They've got this sort of higher order structure. And it's true, you could describe the days of the week in terms of there being seven separate things which have no relation to one another. That's a perfectly legitimate way to describe the space. And it could have been that's how model representations went. There was just nothing that-- they didn't have to lie in this roughly plane. They could have just been all over the shop with no relation to one another, but they are. There's been a lot of, I think, talking past the field has done a lot of talking past one another about the linear representation hypothesis. I think that, at the end of the day, there's clearly something there to be explained. Like, why do we have this intuitive and beautiful structure? There's actually one paper that's just come out, I think it was this morning, on this being driven by co-occurrence statistics and like symmetry in language. We might just be starting to like actually, the field has spun its wheels on this question, maybe because it just didn't have any traction in the kind of the fundamentals, which was just talking past one another. But I think that actually like people are starting to make some progress, like including us. So it's quite cool.

Nathan Labenz: Can you give a little bit more intuition around what is at stake with this? sort of are features linear or not linear? Because again, I was revisiting like the definition of this is you could probably give it better than I can give it, but it's like features are in a direction They can be added. This is the sort of classic man plus royal equals king, minus man gives you queen, minus royal gives you woman. You can move around in latent space in an additive way. And then there's also the idea that the intensity of the feature corresponds to like how important it is in the model's processing at any given time. But it does then strike me that doesn't quite handle the I'm not sure if it's in conflict with or if it just is an incomplete account of what's going on with the days of the week as we think of them in a plane, for example. Is there a direction that's the Monday to Tuesday direction that I need to be thinking about? I can't add Monday and Tuesday. That's not really a coherent concept anyway. But it does seem like something is different about that, and I don't have a super crisp intuition around why people are so worried in the first place around what if there's always comes, you've not accounted for non-linear features. So like, why do people harp on that so much?

Tom McGrath: It's partly because we're scientists, right? We like to have it. Like, we like to know what we're doing. If there's some structure, we like to explain it. I think that on its own would be sufficient descriptive explanation. Well, if we look at it from this sort of collection of execution traces, they look like this extremely fragmentary thing. The model does this thing for when it needs to add this pair of numbers, and it does this other thing for when it needs to add this other pair of numbers. And I think that when you look at things from this more like geometric perspective, they often look much more... unified and there's sort of real computation going on there. So simply for the fundamental question of what sort of object are neural networks internally, has a lot of downstream effects for instance, how much should we expect interpretability to succeed, which has a lot of... there's quite a lot at stake in that question, not least good fire, but there's quite a lot at stake in that question. And then again, down another level of sort of nuts and bolts-ness, our ability to do intentional design, which is something that I think we'll come on to in a bit, but our ability to guide neural network training sort of relies on our ability to understand the bits that we're guiding. And the thing that we would really want to be able to do is to change computations like as a unit. In this case, say that I have an example that involves days of the week, and I want the model to behave differently in a way that's invariant to the days of the week. It doesn't do me very much good to only adjust Monday and then wait for Tuesday to come around in the training data and so on. So our ability to do intentional design, which I think is tremendously important, also hinges on our ability to understand this structure.

Nathan Labenz: Yeah. Okay. Perfect transition to intentional design. This is the sort of let's say updated vision for the company basically, right? As of the new fundraiser put this out roughly the same time. The big idea. is, I think, pretty intuitive, that it would be great to be able to not just throw an unbelievable amount of data, an unbelievable amount of compute into some vast machinery and then get something out and then have to completely reverse engineer what the hell just happened, and instead have some sense of what is going on along the way so that, ideally, you could control it and get something that behaves the way you want it to behave in all sorts of different situations. You talk about looking for methods that scale with compute, looking for strategies that support or allow for the possibility of natural language feedback. You know, what is the big vision for intentional design?

Tom McGrath: Oh, before I get into that in more detail, I want to say I think that's one of a couple of things that Goodfire is doing. Intentional design is like the new idea that we're putting into the mix. It comes back to this question of what is interpretability for? Interpretability, in my opinion, is for scientific discovery. Monitoring and auditing and intentional design. So of these two, I think like monitoring and auditing and scientific discovery, well, relatively well understood now. And so we spent a bunch of time trying to flesh out what is meant by intentional design. But what do I mean? So I think the basic idea here is that it feels like we should be able to make training much more controllable. And like the reason that, and to make something controllable in the sense of having a closed loop control, like a feedback controller, you need to have an observation system and you need to have a control system. And I think that interpretability is the observation system we're training. You can see where does this-- if I put this data into the model and I ran whatever, I've got some data, I've got some loss, this produces a gradient. And I can say like, where will this gradient take the model? That's the sort of uncontrolled dynamics, like I throw in my plane, it's flying forward and it will just carry on flying. A gust of wind hits it, and now I'll go like this, right? The gust of wind is the data, I don't know. And then what we want is to be able to say, oh yes, that direction contains some good things which we would like to keep, and some bad things which we'd like to steer out, and maybe the good things we want to amplify, let's say.

Dan Balsam: One analogy I like to use is it's like a map for the lost landscape. the data implies some type of shape to the lost landscape. And you can imagine a bunch of different valleys there. And some of those valleys have behavior that we want that are desirable, and some of them have behavior that we don't want. And I think of the role of interpretability as producing this map, essentially. So when you get to a juncture in the road, you can say, oh, I should go this way, not that way. And I think that's what we're starting to unlock here.

Tom McGrath: Yeah, exactly. You can say once. Where are we going? And you can say that live, rather than waiting until the training run finishes and seeing what we've got. And say, oh, maybe we didn't want the responses to be quite so emoji-filled. Let's tweak the data and hope that we have some emojis, but not quite as many. So we should be able to specify, oh, we want it to land roughly there. And I think this is a part of machine learning that people think of it as magic. And it is like magic in the sense that an incredible result comes out, but also by someone doing far more work than you thought was reasonable. I think the magic trick analogy is actually quite good. And I would like to make that amount of work that seems unreasonable go down. I think it should be possible for us to specify a natural language, for instance, what we want to happen in training. One other thing that I should add, actually. When I say you in this process, you look at the gradients, you steer, you steer. Obviously, I don't mean you yourself, like looking at every data point and going, yes, more sychaplantic. I do, in fact, I mean a language model. When we want to do stuff that scales well with compute, scales well with model intelligence, that means interpretability gives us a handle for intelligence to hook onto in the training process, directly inside the training process, inside backprops, and then you have to have some intelligence to hook onto, and that intelligence comes from-.

Nathan Labenz: Yeah, I got a lot of intuition from one section of the intentional design blog post where you describe one method for breaking a gradient down into semantic parts. And I'll try to describe that too now. Everything's a mess in there is one starting point. So there is, at any given time, you're optimizing this single final loss function and you can change every weight throughout the entire model to make little contributions to getting better according to that final measure. But what are we learning is, by default, not obvious at all. So the method that you share in the blog post is basically saying, okay, for one thing, we have these SAEs now, so we know, and we can, I do want to dig into this and understand at what point in training does this actually become useful because this does seem to rely on there being relatively well-developed concepts that are instantiated, represented, I guess a better term. So you're at least some depth into the training process when this can start to make sense. But okay, we can create an SAE that allows us to identify which concepts are active at any given time. And then I thought the very kind of clever idea was saying, okay, if we look at what the gradient is changing in the residual stream inside the model. Because of superposition, we know that there's probably a lot of different concepts that are represented there, and bunches of them could be changing in all sorts of different ways at the same time. We'll try to decompose that by looking at what concepts are active, according to our SAE. Then we'll take the inner product. In other words, we'll try to basically look for the similarity between how the gradient step is changing the internal activations and each of those concepts that are represented so that we can say, okay, it seems like this change is really aligned to this concept. It seems to be really changing this particular concept. And this other concept, it's like aligned to and it's changing it somewhat. And some of these other concepts, it's maybe not changing so much. And then that gives you the opportunity to say, do I like that or not? Yeah. And so this is where then the language, the intelligence of the language model can come in and say, okay, do I, based on this example, based on this data that we are learning from, does this seem to be the right kind of thing to be learning? So the example that you gave in the blog post was, if you have data that consists of talking in pirate speak while doing arithmetic, you can, the gradient will be optimizing for both of those things potentially at the same time in order to predict the tokens that it's seeing. But then when you go and do the look at what's active in the sparse autoencoder, you'll see like features related to arithmetic and all the other features related to pirate speak. And then when you take the inner product, or again, look at basically also known as cosine similarity, right? Look at the alignment between in, in space, look at the alignment between these features and the changes that are being made. You can say, okay, it's changing. It seems to be upweighting pirate speak quite a bit, and it seems to be also upweighting do math right. And so you could prompt the language model to guide it. Hey, we want to be getting good at math here, but we want to be not over-indexing on whatever other vagaries of the dataset happen to be present. And then it could look at these things and say, okay, let's let's allow this part of the gradient update that aligns to the feature that we think is a reasonable feature to be updating, and let's not make those changes that are changing the other things that we don't want to change. You know, how is that a, how could that be problematic in other ways that I'm not anticipating?

Tom McGrath: Yes, I think that's a really good description of it, and there are ways it could not work, and there are ways it could be problematic. We can cover this in a second. So I think it, I think the idea of having a little guy inside backdrop who's looking at things and deciding what's going on is it seems quite powerful to me because it gives you a choice and it gives you the chance to spend compute in something and where you previously like, it happened purely mechanistically. The chain rule just grinds on layer by layer. Now, this is there's a bunch of questions here. One is how do you do it? I've given you a menu. But how do you select from the menu? And that's the sort of thing we prefer to as intentional design techniques. So there's this-- the obvious thing, if you're a machine learner, is to say, let's just project out the parts of the gradient that we don't like. We'll just remove that portion of the gradient. We'll just cancel it out. And that works very poorly. And the reason that it works very poorly is that the network wants to learn to be a pirate. The data is implying the network should become more piratical than it currently is. And it will find a way, unless you have a technique which is a little bit smarter, it will find a way to become a pirate. It's got these computations, it's got these components that support being a pirate all over the model. And if you project the gradient out halfway through, it'll just use one of the later ones. So that's an example of fighting backprop. Projecting the gradient out fights backprop. And it tries to get the model to, tries to get the model, like, it doesn't try and get the model, get gradient descent to want something else, it just tries to stop it and gradient descent will always win. Whereas something like inoculation prompting, like, tries to get the model to want something else. To give a quick recap of inoculation prompting for people who might not have heard of it before. The idea of inoculation prompting is like quite elegant in my opinion. So say that we have a dataset that implies some behavior. The example they often use is reward hacking, and they've done some good work on this. Say that the thing you want to learn, you've got a dataset or you've got an environment where the model will learn to reward hack its constant exploitable things in the environment. You might think that the thing to do is tell the model not to reward hack. But now when it reward hacks, it will reward hack anyway. even by accident. And I was like, This reward hacking thing is good. I didn't anticipate that it would be so good. I'll become more of a reward hacky kind of guy. Whereas the really nice insight from inoculation prompting is that if you tell the model it's okay to reward hack, then when it does reward hack, it'll be like, I expected that. I'm not going to learn anything from it. I guess I was a reward hacky kind of guy after all. And so this idea of explaining away is, I think, very powerful and something that, you know, inoculation prompting at first glance looks like a little bit of a bodge, but in fact, I think there's something like very deep and elegant in that principle. And so that's an example of something I would say does not fight gradient descent.

Nathan Labenz: Yeah, so the fighting of gradient descent there is, the reward is the reward, right? So the update is gonna be in the direction of getting more reward, and the question is, are you, teaching the model to overcome the instructions it has been given in pursuit of reward? Or are you trying to align the instructions that it's given and the rewards such that it maintains a general understanding of itself as the kind of thing that follows instructions, right? That's a mental model. I came away from that.

Dan Balsam: Yeah, you're not teaching it to my mental model too is that you're not teaching it to ignore its instructions. And implicitly, if you don't say you can reward hack, it's interpreting its instructions as you can't, which is, I think the correct default behavior. But then when it learns, it learns to like more broadly ignore because you didn't set it up with the right prior.

Nathan Labenz: Yeah, so many weird things going on.

Tom McGrath: It's a very counterintuitive thing. I think it's under, I think, It and a few techniques like it have not quite gone under the radar, but I think they're underappreciated. And I think one thing they expose is there are many more surfaces for intervention than you might think. You might think the reward function is just the reward function, that's all we got. But you can change a lot of things like, for instance, the prompt that the model is given. That's the surface that inoculation prompting intervenes upon. Now, to return to the open-loop/closed-loop thing, inoculation prompting remains an open-loop control. I just have my inoculation prompt, which says it's fine to reward hack, that's cool. And I just apply this, whatever data might come. But some data is not about reward hacking. Some data is about something else. And if it's something else, then the inoculation prompt really doesn't help you. So I think the important thing is there are two parts to this idea of closed-loop control. One is that you have control, and the other is that you have observation. And observation takes us back to this kind of decomposing the gradient as a simple example.

Dan Balsam: The thing that's pretty normal to do is that you would freeze some layers of a model, and then you would train only some layers, or you would attach a new head and you would train that, because you value the representations up to some point in the model that are already in the model, and you want to leverage them from some downstream task. This isn't exactly the way that we're thinking about intentional design, but maybe just as a quick analogy. Imagine if you could just selectively freeze circuits. You could say, that circuit's good. I don't want to change that. This circuit, this is the one that I want to update. Learn over this circuit, over this data set, but not the other ones. And I think when you frame it like that, it's really not that weird of a thing to be doing. It's just much more surgical than a lot of existing techniques.

Nathan Labenz: Well, maybe take a I was going to come to this later, but maybe it's a good time now to ask how you guys are thinking about balancing some of these tensions in just the overall nature of the business. It's a public benefit corporation with a couple notable, I think, relatively significant revenue projects for big companies that have been publicly disclosed. And there, there's a need for a lot more revenue, obviously, to support a billion dollar valuation. At the same time, there's this general mission of trying to not just develop these techniques, but presumably disseminate them and popularize them as well too. So that seems like a very tricky balance to strike or a tightrope to walk over time. Do you have any principles? Is there a way that that you've structured your thinking on this?

Dan Balsam: I think, uh, I don't know. At the highest level, it is really important for us to get our work out there. We shared the hallucinations work that we did, which contains in strong detail everything that we did there. And it's a bit of a recipe to do this type of work. There are lots of techniques that we're exploring in the intentional design space. We have different results, some of which will be hopefully getting out pretty soon, but it's really like a lot of research and a lot of greenfield research, and we're still figuring out what the right form factors are. We have seen in experiments that there are various ways to not fight gradient descent. And as we feel particularly confident in the research and in the results that we're getting, we'll share more with the world. I think overall, we do want to develop in public as much as possible. And I think we will, as we get more confident in the way things are going, I think we'll talk in more detail about it. But there are different techniques that work in different ways. There are also things that are out in the world, like positive preventative steering, for instance, which I think we can point to as an example of not fighting the lost landscape. And for those that are unfamiliar with that work, that's You can prevent certain types of misalignment. It's kind of similar to inoculation prompting in a way, like you can present certain types of misalignment by steering up on certain characteristics in a model. And I think like all of these is like, they're just like, they like one way to think about it, like I like the map for the lost landscape analogy, but also like it's just another way to think about is that it changes the lost landscape when you're doing these things, right? Like, like when you are applying some type of Inoculation prompting is a good example. You are changing the nature of the loss landscape by providing this prior over the dataset. So there's a lot of different techniques. There are a lot of things you could try that would fight gradient descent. Then you go to the next dataset example, and then you still have optimization pressure in the direction you don't want to go, and you're constantly fighting it. But there are lots of ways that you could intervene that don't fight gradient descent. because they fundamentally do change the structure of the lost landscape in a way that's durable.

Nathan Labenz: So obviously, models hallucinate. We would rather that they don't. And what can be done about it? One thing that can be done is to create a synthetic data set where you have known hallucinations that you have labeled going in. And I think the world has helpfully prepared some of those for us and open sourced them. And then you can train a probe to classify the internal states of a model into this is a hallucination or it's not. And then you can do a few different things, including running that probe at runtime and potentially intervening on the output of the model. But we've seen a bunch of these kind of token injection things over time in the reasoning development space, where every so often you just insert like, wait, let me think about this a different way. And then the model kind of takes another stab at it. So I'm thinking of this as a sort of similar thing where the probe goes off and says hallucination, and you force the next token to be like, but wait, I might be making this up. So then it will, at least some of the time, double back and realize that it was wrong and correct course. Another thing you can do, and this is where it gets like, I think really interesting, although notably the, those like interventions drive a lot of the reduction. But another thing you could do is use the presence of activations which were classified as hallucinations as a signal for reinforcement learning to try to get the model to not go into this state in the first place by just basically punishing it for getting into the hallucination state at all. That is really interesting, and there are some interesting nuances, too. Why don't I just ask you to, again, give me the double click and start to dig into some of those nuances?

Tom McGrath: Cool. So again, an excellent summary. I think that you have this probe. You have to assemble the probe from a bunch of ground truth, which is expensive to collect. If you were going to... if you're going to... run the ground truth process to instead give you rewards at test time. It would cost you like hundreds of thousands of dollars, and each call would take several, quite a few seconds, because it's like Gemini 2.5 with web search. So it goes off, it does a bunch of reading, and it goes, Okay, that bit was wrong. So you assemble this corpus, you pay a one-off fee for it, and you amortize that into a probe, and the probe lets you say, Oh yes, the model thinks this was probably a hallucination. And you might think that's a bit weird. You might be like, that is a silly representation for a model to have. Like, why don't you just not hallucinate in the first place? And probably there are a few reasons for that. One, post-training-- sorry, pre-training. Like, people just make stuff up in pre-training all the time. You can just go on your keyboard and write pre-training data. You just post on Reddit, like, people make stuff up. And so the model has representations for, oh, this person is probably just talking nonsense. But then why use them? Why use them in post-training? Sorry, why would the model learn to actually adopt this? One thing is you haven't fully necessarily specified its persona during post-training. And that might, again, that might sound strange, but you do want the ability to have the model make things up. If you're asking it to write fiction, you're asking it to make up facts. You might just want the model to play two truths and a lie with you or something, or whatever. So... It's like a useful capacity for models to have, but what we're trying to train them is no, you're not the making things up guy anymore. You're like saying correct things guy. And that's what we hope to do with the reward signal. It's like shape its kind of style more sharply towards being like the model is factually correct. It's interesting also that like the interventions, as you say, are responsible for a lot of the improvement. So there the probe fires, the probe goes off, and then we token inject, Ah, wait, let me check that. And then often the model will in fact just fix it out of its own memory. And this has a couple of knock-on effects. One is what was going to be an error that then would have cascaded through the rest of the context is cut off. So the model not only hasn't made the first error, but hasn't made errors based on that error. And the other is the sort of learning-in-context, again, what kind of persona it should be adopting. We're providing a gradual reinforcement, we're providing gradual examples in context of it checking its own work and that makes it more like calibrated on factual performance.

Nathan Labenz: So I think one obvious question people would have around this sort of work is How do we know we're really getting it to do the things we want it to do as opposed to shooting ourselves in the foot in maybe one of a couple potentially dangerous failure modes? Obviously, this is all like relatively low stakes stuff at the moment. We want to be mindful of how big the stakes could eventually be. I did one episode, gosh, it's been a little while now, but with, we called it Late in Space Hide and Seek. Scott Evans, Luke Bailey and Eric Jenner, and they basically showed that The models can learn to dodge, to avoid, to evade all sorts of latent space detectors. So that's one problem is that if you start running these detectors similarly to what you said before, like they kind of route around them and find other ways to do what they want to do with different kinds of activations, different kinds of patterns, whatever. And then there's also this result from OpenAI on on obfuscated reward hacking where you say, geez, we don't like your reward hacking. So if we review your chain of thought and we see that you are planning to reward hack, then we'll punish you for that. And then you get this sort of, I think what makes this kind of especially dangerous is that at first you get less reward hacking when you do that and you feel good about yourself, but then you do that too long and the reward hacking comes back because again, the incentive was still there and the reward is reward. So it's going to learn to get that reward one way or another, but it also learns to not do the thing that it gets punished for. And so you've essentially driven your reward hacking underground, driven it out of the chain of thought and eliminated your ability to detect it in the first place. So Zvi, friend of the show, calls that the most forbidden technique. And I am... I think you have an interesting and clever way to avoid that problem, but I'm not a hundred percent sure I understand if this is a fully robust way to avoid the problem or in the paper or in the blog post about it, you said it's good to billions of tokens. And I'm okay, should I expect that to be good to tens of billions of tokens? Hundreds of billions of tokens? At some point does this break down? So maybe... Describe what you did, first of all, and then we can speculate or theorize on how much we should trust it, I guess, is the key question.

Tom McGrath: The trick is quite simple, which is you run the probe on a frozen copy of the model. So we've got the student model that we're trying to train. We have the reward model, which is the model we've hooked the probe into. Now, in principle, you could have the reward model-- you could have the probe hooked into the student model. Nothing stops you from doing it. And in principle, you could use the probe as a signal for backpropagating through. Extremely bad idea. It does not work. Like, what will happen is... When people say, like, in Terpin training, they often assume that what's happening is this, like, absolutely the dumbest thing you could possibly do, which is take the model, put the probe in, backprop through the probe, congratulations, you have failed. And the reason for this is that... It's much easier. The backprop signal is very rich. It's much easier for the model to evade that by just the detector is here, the representations were initially here, and they just go off to the side. They just skew around it. Almost trivial for the model to do. But we have two lines of defense against that. One is that it's on a different model. One is that we just don't backprop through it, but it's on a different model, so we can't backprop through it. an effort of will to figure out how to make it to backprop through the other model and have the student model change. This is under a different model, and that model is frozen. And so what this means is to evade the probe, you know, to have the model's activations evade the probe, it has to be easier for the student model to figure out what kind of behaviors, like, not set off this probe, only operating in token space with like quite low dimensional reward. You just get a scale of reward about whether this was good or bad. So that's one alternative. The other alternative is to get rid of the behavior, which it turns out is easier in this case. So there's also some very good work by Far AI that came out a couple of days after we published, where they investigate this design space in more detail. And it seems like a few crucial things are like Your probe has to actually be good. If your probe is not good, then the pressure for the model to change its behaviour is much weaker. And if you backprop it through the probe, you're just immediately sunk. Like, they tried that experiment. It's clear, I'm sure it's also clear to them how it would turn out, but like, it's good that someone did the obvious thing. And it just immediately blows up, so don't do the silly thing. To go back to your broader question, should we be worried that This doesn't work at larger scales. Yeah, you should be worried about everything. This is alignment research. Paranuria is a way of life. But seriously, like, I... The fact that it worked as well as it did is definitely an update. It's not a probability zero to probability one update. There could always be some evidence sort of lurking just off to the side that we didn't see. It's like the model just begins to become, like, if you are on the model too hard, it just becomes a bit incoherent and silly. like that happened. We stopped training the model. If we kept training, the model just became a bit like nonsensical. So that happened before the probe optimization signal got subverted. We basically never, we didn't ever see that probe optimization signal get subverted. An important caveat is we can check this. We can check, we have ground truth through Gemini plus web search. which that we can then spot check to obtain like a validation on the ground truth. Now you might say maybe the probe gets subverted through this. But now what you're expecting is that the back prop signal, sorry, it's like the learning signal somehow encapsulates not only the probe, but also some sort of internal model of what a smarter model equipped with web search will go off and find. And I think that's just not plausible.

Nathan Labenz: So how I guess one thing I don't have a great intuition for, and there might be a certain trick that makes it more obvious, but the fact that the probe continues to work on the original frozen model and the student model is it's like not obvious to me that it should have worked out that way under kind of any circumstances necessarily, but I don't know, you know, exactly how you engineered it to encourage that to be the case. The thought that I had was some sort of like divergence penalty. A lot of these setups have some sort of some scheme for kind of tethering the weights to their original state to keep changes local as much as possible.

Tom McGrath: So we have a KL penalty on the reinforcement learning, but that's very standard. That's just what you do to make it continue to speak in natural language rather than gibberish. There's no divergence penalty on the probe. That's when I say there are hacks one could apply, that sort of thing is maybe a hack that you could apply. You say you're not allowed to-- you might have a reference set of examples, and you feed these through the policy, and the probes output on the reference set is not allowed to change, or you penalize the extent to which the predictions change or something. You could do stuff like that. It wasn't necessary. And I think what this says is it's easier for the model in this case-- I'm not trying to make a generalized claim here-- it's easier for the model. to change its behavior than it is to change its representations, in this case, in this kind of signal.

Dan Balsam: We would expect that this is contingent on the setup, on the representations, on a bunch of different factors as well. You know, like there is work, of course, as you mentioned, that is explored in the Far AI paper. They found various setups that would cause obfuscation. So our claim here is not that obfuscation doesn't happen or that it's never going to happen, but the dynamics of obfuscation are worthy of study. And there are cases in which I think we can demonstrate pretty convincingly that obfuscation doesn't happen, at least under the conditions that some people might thought that it might before.

Nathan Labenz: I don't know if, you know, we've talked on a previous episode about interpretability generally being pre-paradigmatic, and then I think you upgraded to quasi-paradigmatic at one point. For this type of thing, it seems like we're maybe back to pre-paradigmatic, but maybe you do have some rules for yourself. Like how would you think about, let's say you get called in, you get the call up to Anthropic and it's okay, hey, we're we're doing this for real. We want all your best techniques. You know, what maybe we're just not there yet to, maybe that's what it means to be pre-paradigmatic. You don't even answer that question yet, but how do you think about if the stakes are suddenly turned up? Which principles guide us in terms of what we should and shouldn't think about trying?

Tom McGrath: I think the first principle is, first, do no harm. And what I mean by that is, like, to run with your example, right? Anthropic has a plan for interpretability. That plan is to use it as the test set. If the stakes are high, I don't want to disrupt that plan. So first, do no harm. First, you'd have to-- I probably would say, on the current level of scientific development, we should not use this on a frontier model training room. I think that we are not in a position to-- we don't have a strong enough understanding of what we're doing. I think we can get one quite fast. But I think that the thing we would need to be confident in is that we had not nuked anyone's plans to use interpretability as a test set, for instance. like that people can do interpretability-based auditing at least as well as they could without these techniques. Unless we've got rid of all the problems that there are no problems to find, but right now the alarm should be going off. So I think that we would need to have run reasonably serious auditing games, say, or something along those lines to be confident that we wouldn't make the problem worse.

Nathan Labenz: Are there other kind of... I'm sure there are many other things that you would think about doing a sort of reinforcement learning from based on internal states to try to achieve. Do you have a sort of like set of things you think are safer, better, wiser, and others that are less safe, good, or wise for some?

Dan Balsam: Yeah, I mean, I think at our current level of understanding, I wouldn't recommend that we try to make the model less deceptive or something like that using these techniques. What we're primarily focused on using these techniques for, studying them, A, and then to the extent that we are applying them, it's for things like hallucinations. It's for things like concrete problems where we can measure things very carefully and the stakes are quite low, depending on how things go in those experiments. Our hope over time is that intentional design and the class of techniques associated with intentional design are critical alignment tools. That's why we want to develop them in the first place. So we believe that this is a very promising area of research, and we want to expand on it, and we want to explore it deeply. We would not suggest today that somebody go, and as Tom said, align their frontier model using these techniques. That's just not the state of things. And there may be classes of things which you just never want to use these techniques for ever. That is a possibility. I don't know that we are entering this pre-decided on that, but I could believe that that was a possibility. And if it was in fact the case that with certain types of things like deception, you never wanted to do it, then we wouldn't do that, right? Like, I think there's many different aspects of aligning a model. And currently we're focused on things that are not critical to like most of the extreme like X-risk downside scenarios. There's also a lot of value.

Nathan Labenz: The obfuscated reward hacking paper I thought was an outstanding contribution from OpenAI. And the main thing it shows is if you do this way, you have a big problem on your hands. So that obviously can be a major value driver in terms of the shape of a research contribution as well. Leaving people flying blind to possibly make this mistake while they were not even thinking about it is not necessarily a great spot for the world to be in either. So yeah, it's all very it's all very complicated and many trade-offs and many highly contextual judgment calls probably in, in all sorts of ways.

Tom McGrath: Just to go into that a little bit further, one thing in like exploring an area which you think has potentially like huge upside, but also some things that you, some like bad things in it potentially, who knows? You should always bear that possibility in mind is that you don't necessarily want to just immediately publish everything. This is the other reason, right? They said the commercial reasons in the interest of not generating an enormous eye roll from some of your audience. I didn't be like, oh, also safety stuff. But legitimately, if you think there's important stuff in this, so it's worth exploring, but also dangerous stuff, so it's worth not just-- and it's worth exploring, but giving yourself a line of retreat. And if you published everything apart from the final step where you go, oh no, it was really bad, then you have not left yourself a line of frigid. So that's why we're being a little bit more like that and commercial reasons, why we're being a little bit more cagey than is natural or comfortable for me as a scientist.

Nathan Labenz: How do you, slightly different question, the version I asked earlier around balancing, how do you monetize this kind of thing in the first place? The kind of popular nugget, we're in a domain in terms of just how quick everybody is still learning how to make all this stuff work, where there are secrets that could be communicated in three sentences or whatever that are worth tens of millions of dollars. And it strikes me that's that's the sort of thing that you're developing, right? I do wonder how you think about is like one strategy might be like IP law. Could you are there techniques that you could patent or something like that where you could then license them but have some sort of defense of them. That obviously then intersects also with the mission question. But even leaving the mission aside for the moment, I do wonder how techniques like this are effectively monetized. And maybe it's audience segmentation where you work with some companies that just absolutely need the help and then other companies that can implement on their own learn what they learn from you and they can implement. But yeah, what's the how do you think about that?

Dan Balsam: Yeah, so The business model that we are currently operating under is kind of like a pound. So we go work with organizations that either have models or want to, say, take an open source model and adapt it in some type of way. And our deals start in the seven figure range. And so we work with them to help get them, help understand their models. help get them models that work really well for the things that they care about most in the world. And this is like cross life sciences and enterprise, financial services, government, this is, and then We deploy like a wide variety of techniques to this end. And then we use interpretability for multiple things in the stack. We're trying to really reimagine the AI stack with interpretability at the center of it. So this includes things like inference time guardrails as part of what we want to provide to people. And then it also involves like model adaptation. And so right now, a lot of this is more traditional training techniques. But over time, we want to make this more intentional design of models being able to provide the specification for a model, and then receive a model that behaves that way. And we just think of this all as one unified stack, like an interpretability-powered stack, and we work with partners to help them intentionally design their models. I think it was a longer-term question, which was like, hey, what if we solved alignment? What would you guys do with that? And I think if we found ourselves in the situation where we had solved alignment, I think we would-- I mean, there's many different worlds that we could be in. We obviously would not just keep that to ourselves for profit. We would find a way to make sure that disseminated to the benefit of of humanity. But the thing that we're doing is that like we're going to market and developing our philosophy on intentional design directly in interaction with the market, because that's how you See if your techniques really work. As I said earlier, we're doing inference time guardrails. If we destroyed monitorability, we would destroy one of our value propositions in the process of doing that too. And I think it's really important that we go out, we interact with the world, we develop these techniques. We develop them as in public as we're capable of developing them in situations that are initially low stakes. And we build our way up and our understanding up towards the higher stakes situations over time. And yeah, if we ever found ourselves in a situation where we do believe that like we had the key to be able to align models, or if we found ourselves in a situation where we decided that actually these techniques are dangerous, we would make the appropriate decisions from there.

Nathan Labenz: One question I always have around these sorts of like late stage interventions is what is the model like qualitatively after this, again, late stage surgery has been done to it? And I think, for example, there was the sort of tamper-resistant fine-tuning paper, which I thought was a really interesting technique, but it was like, oh, man, but the models do get a lot worse when that's applied. That stood out as an example of where the cost was pretty significant. And even in a project that I was very tangentially, or another one of these Forrest Gump moments for me where I was stumbling through what turned out to be a notable scene. with the emergent misalignment work from a wine Evans and company, the I think like the most, it's super interesting stuff, right? And you're like, oh my God, I trained on bad code or I trained on bad medical advice. The model became generally evil. What a bizarre and surprising discovery. And like how scary that is, right? But then one somewhat, at least somewhat valid criticism I think of that kind of work is the Model also got really dumb when you did that, in general, compared to the starting model. You know, it sometimes like responds in code to like things that it shouldn't respond in code to at all. And there was a filter of the generations that was just like the coherence filter. Like some of the responses are just not coherent. So we could clean that stuff up a little bit to try to get a clearer signal. But it's often, that kind of stuff is often lost. And so if you're thinking like, geez, how scary is emergent misalignment, I think it I don't want to say it's not scary. I do think people should take to heart that there could be very surprising knock-on effects to whatever late-stage fine-tuning they want to do. But at least for the models that I actually interacted with as part of that project, I think it is fair to say Probably nobody's going to deploy these in a super broad setting because they're not very good in a super broad setting anymore. In having been fine-tuned, they have also been really narrowed and it's just not the kind of thing that people are really going to use as an open-ended world facing general purpose assistant anymore. So the same question could be asked here, right? Okay, we drove hallucinations down. Is the model equally good as it used to be in other respects? Is it to what degree another kind of sort of hack on this whole setup could be like, it might just learn to say, I don't know all the time, and maybe it won't answer any factual questions anymore. So now I've got something that just says, I don't know. One way to not hallucinate ever is to always say, I don't know. So how much kind of general characterization of the reduced hallucination model did you do, and what did you observe in that review?

Tom McGrath: We did quite a lot, both in terms of like benchmark capabilities where we found essentially no degradation. Trying to think where it goes up by a percent on one, it goes down by a percent on the other. Is that just noise? Almost certainly. So the model basically remained intact as A in terms of its capabilities. And we also checked the thing that you mentioned there of does it just go, I don't know. Well, one thing, you can't score very well on MMLU by answering, I don't know. You have to actually make some positive claims. But also you can just measure in the completions and say we just take the long fact completions of the model after the training and the interventions, measure the number of claims that are being made. It doesn't go down. I should caveat that. We have the hallucinations viewer, there's like a data viewer you can go in. We show a couple of the most egregious policy errors. It's not flawless. There are occasionally very truncated responses. We did the work to put those really upfront in the viewer. But we had to search-- Adijia Connor had to search really hard to find. We rotten cherry-picked some examples there. We found the worst cherries on the tree and put them in the viewer. But broadly, it seems to do very little damage to the model.

Nathan Labenz: Does that surprise you? I guess the whole thing in the AI industry is like the dog that caught the car. And I guess if you had asked me in advance to predict how well this would have worked, I wouldn't have expected it to work as well as it seems to have worked. Were you also surprised?

Tom McGrath: Yeah. Honestly, surprised. It's quite nice. I think that one thing maybe is that the probe is quite well calibrated. And so you can use it to provide like continuous, relatively dense rewards, rather than a sort of GRPO thing where you have something happened in this trajectory and it was good or something happened in this trajectory and it was bad kind of thing. And so that means that the span, we have relatively short spans with consistent properties with calibrated continuous rewards. And that makes learning much easier. And when learning is easier, you don't break as much. So I can tell you a story about why we might have expected it, but nevertheless, it was still better than I expected.

Nathan Labenz: So can you tie this back to not fighting backprop? Like, and maybe a way to help me develop my intuition for this is, is there a version of this that would have been the fighting the backprop way? Yeah.

Tom McGrath: Backprop through the probe. That's directly stepping on the rake. You just, you take the probe, you backprop through the probe, And you're like, now it's not only are you fighting gradient descent, you've like, that's the straightforward example.

Nathan Labenz: Yeah, that's just kind of driving off a cliff of gradient descent, right? Yeah. Yeah. It seems like there's a middle version as well where, and I don't know exactly what it would be and maybe you don't know exactly what it would be either because it's not a great idea. And so you didn't think about designing an experiment this way, but there is this sort of In the like inoculation prompting thing, we do have a sense where there's an inherent tension where we were like, we don't want you to exploit weaknesses in our environment, but we're gonna reward you if you do it. And so then, you know, that tension creates this sort of, I understand that to be at the heart of this concept of not fighting back prop. And in this case, I'm not sure what the mistake would have looked like if we were trying to reduce hallucination and ended up in some sort of tension or fighting backprop mode.

Dan Balsam: Maybe it's some type of competing incentive and structure, like incentive structure, right? Like hallucinations, like one reason that they can happen is like, I don't know, maybe sycophancy adjacent, like feeling like the user has to receive an answer of some kind. So if you're providing like competing incentives, perhaps that could be a slightly different story there.

Tom McGrath: That's really good. Yeah, that seems like a good experiment. Raters like confident answers and don't always have the means to check if they're wrong. Yeah. If this was in the context of a broader post training run, that'd be very interesting.

Nathan Labenz: Yeah. I wonder if you could do a obviously we know Grok is gonna become the most truth seeking model in the world, in the cosmos. An idea that comes to mind that, again, I don't know if like it should. Does this get into forbidden technique territory? I'm not sure. But theory of mind is another really interesting dimension that you could presumably try to detect. Maybe it'll be a little harder, a little more subtle to detect, but like the classic story of like why we should be afraid of RLHF models is because we are not reliable raiders. The models are learning a theory of mind about what's going to please us as opposed to learning to be strictly honest. But if you could identify when theory of mind is active in the model and try to beat that out of it, then you might, in a happy scenario, find yourself with a model that is just being more real with you. But I also do wonder, do you think the same setup would work or would you have any qualms about that?

Dan Balsam: I think probably there would be different training techniques that you'd want to use in that situation. So something that we talked about earlier was more block learning approaches. So like Tom's pirate example as a good one there. So in those cases, you have some optimization pressure that's like present and you want to change some solutions that the model could learn and you want to be able to suppress certain solutions over other solutions by intervening. some type of way in the trading process. So we are without going into too much details about unpublished work like this is we're exploring something like pretty similar. So we're looking at ways in which, you know, like preference optimization can go wrong and then exploring ways in which interpretability guided training can help prevent problems with preference optimization from emerging. So things like sycophancy are a great example. You know, Tom brought up emoji use earlier as another example. You know, like some of these are like quite mundane, but then some of them have like pretty serious repercussions for users as well.

Nathan Labenz: Yeah. So going back to the technique that you're describing, the intentional design, you would instruct your agent to block the updates that were increasing the please the user feature in as it exists in isolation from other ways of being correct or helpful.

Dan Balsam: Yeah, but just to clarify one more time, it's not block the updates. That's the difference. It's reshape the landscape such that the gradient no longer points in the direction of the please the user representations.

Tom McGrath: And I expect that the way this would actually happen is like, You've got the agent that's watching the gradients and is deciding what to do. And that has a much more general document like a constitution, say, or a model spec, or whatever you want to call it, that says the such-and-such model is designed to be maximally truth-seeking. And then you can infer from this that you shouldn't be sycophantic, you should be truth-seeking, and this is a bad behaviour to have in response to the situation. Like, you shouldn't learn sycophancy from your preference data, for instance. If the thing that you've been told has to be like maximally truth-seeking. The sycophancy and theory of mind thing is actually quite interesting. It takes us back to the circuits thing we were talking about way earlier. Because theory of mind is a broadly useful capability. I think your model would be really bad if you were able to get rid of its theory of mind in its entirety. It wouldn't be able to follow, it wouldn't be able to do the useful thing where models try and intuit what you want. for instance, but you definitely, but like theory of mind is a part, is almost certainly like a necessary ingredient for sycopancy. So you don't want to like completely nuke the theory of mind a bit. You just want to say, but don't use it for sycopancy, right? So there's a circuit there and you have to intervene in the right part of the circuit.

Nathan Labenz: Yeah. The complication of this is dizzying to say the least. What does the compute overhead look like for this? I think we've heard stats from Anthropic that they're like willing to pay up to, or maybe are paying up to something like 5% of inference compute for constitutional classifiers. I think if my understanding is right, like your grand hope would be that through intentional design, you could have compute savings by learning the right things faster. I assume we're not there yet today, right? So I assume we're still in the domain of compute overhead, but What does that look like? And what do you think the roadmap is to potentially even saving on compute with some of these techniques?

Tom McGrath: I think at the moment you pay a substantial, it depends on what you do. There are some things you pay very little extra, but there are other versions where you can pay a substantial overhead. I think the route to computational efficiency comes from sample efficiency. that say that you learn in one sample something that would have taken you 100 samples, now your plot budget is 100 times larger than it was. You can do a lot with that. And that's assuming that data is this infinitely available resource, which it is in some cases and it's not in many other cases, particularly at the frontier. So I think the path to compute savings, to there being an alignment windfall here, runs through sample efficiency. But I think there are good reasons to expect that to happen.

Nathan Labenz: And how, that obviously relates to like pre-training as well, right? I was just thinking, yeah, if you could make the model always, it's almost like my little catechism that I recite for myself, like what happened in the original grokking paper to make sure I continue to have command of that. So sure, if you could get that thing to generalize an order of magnitude faster than it actually does by not blocking, but up to massaging the lost landscape so that it doesn't go in the memorization direction. That would be amazing. But I do also wonder, and obviously that's a very narrow model, does this kind of thing, how far back in that training process can you actually start to apply these things? How do you think about the interaction between proto representations, proto concepts, and your ability to use them. I have no good intuition for that at this point.

Dan Balsam: Yeah, so that's just an empirical question that we don't have the answer to. I'm curious to hear Tom Opine on whether he has any hypotheses there. I guess my own guess would be, you know, like you don't have to wait till the end of pre-training, but like sometime in pre-training you can start doing this type of thing. You know, like representations like form in these kind of like stepwise ways sometimes where you have, These phase transitions, but these phase transitions are caused by themselves by the accumulation of prior representations that are necessary to go through that complexity transition. So my guess is that there's lots of ways you could leverage this. Primarily, so far, we're more focused on the post-training and on the more end of the process. That's where we focus first. But I would guess that there are points in pre-training in which you could do this in certain ways, but I think the overall structure of that problem is currently pretty not well understood.

Tom McGrath: I would agree. It seems really hard. Which is not to say never, right? But like we're already attempting one extraordinarily hard thing in this post-training direction. This is very much like not a consensus thing. I think most people think this is like hard and possibly doomed to fail. That's fine. But I don't want to layer on another another extremely hard thing. If we got into pre-training, it'd be like two extremely hard things. One, like pre-training itself. Just painful. Two, like, how do you deal with the evolution of representations and the much more sort of fundamental kind of evolution they go through during pre-training? I don't think the field of interpretability has a good answer to that yet. And so, one step at a time, right?

Nathan Labenz: So many different connections to be made, obviously. I used to be very interested in concepts around curriculum learning and also around like better initializations? Are there ways to start the training process with some sort of purified, and this is maybe a good moment to at least touch on this other paper that you had around the curvature of the lost landscape, because it does I actually took a walk one time, I was briefly Carl Schulman's roommate in New York way back in the day, and I once took a walk with him and he was giving me this thought experiment around how living forever, you might say you want that, there's a lot of situations in which sort of the continuity of some entity, if you allow yourself to think really creatively about the compromises that might be made, at some point it doesn't really matter anymore, and sure, you could draw a through line, but once the thing has been pared down to its most core survival mechanisms, the things that you actually valued about yourself or that you valued about this thing are lost anyway. And so I think this is, and that could be bad in the sense of, I wouldn't want to go through that as a human. That was the thought experiment he was taking me through, but it could be good in the sense that if you can identify the cognitive core of a model, then that maybe could be something you could take back in time and start with in the future. So I'll again give you the kind of takeaway that I had and then you can expand on it to the degree you want to. Basically, I understand that you started with an observation from other research that was noticing that the model's ability to memorize and recite facts or passages from literature or whatever are brittle in the sense that if you go looking for a weight that you can perturb, in, or maybe a couple weights that you can perturb to throw off a model's ability to recite some historical passage or whatever, you can find them. Which makes sense because very specific facts presumably are stored in a relatively small part of the model. Otherwise, how would they store all the facts? So that seems fairly intuitive that there's like a very specific circuit that saying the Declaration of Independence kind of depends on. And if you mess with that, it won't be able to do that narrow thing anymore. Okay, cool. So the, another way to say that is that the lost landscape around that memorized content is jagged. A small change to the overall model can destroy performance on that task. Now, flip that and say, what if we look at it in batch? If we look at it in batch and there's a whole bunch of different things, then sure, there's gonna be a bunch of things that are memorized.

Nathan Labenz: But there's also gonna be a bunch of things that are more generalized capabilities of the model. And so now if we go looking for weights that will destroy performance at the batch scale, then the weights that we find when changed do indeed destroy performance, those must be the core ones. Those must be the things that are really critical to the reasoning process. And conversely, the things that we can change and they don't change the batch modules were probably just connected to some, individual less important thing, less commonly used thing. And again, the other way to say that is the loss landscape is sharp around the, around these like core cognitive capabilities. So you change the weights that really drive those or that embody or instantiate or whatever that power those capabilities. You're going to have major performance loss across the board. Okay, then. Let's say we go through and try to classify these weights according to this this dimension of these, these weights when perturbed cause massive broad capability loss. And these other ones just perform when perturbed don't, they maybe only destroy some memorization or whatever. Sort weights on that basis and then just truncate the list and say, okay, we're just going to cut off the bottom of all these weights that we think are probably associated with memorization. We'll keep the ones that are associated with core stuff. And indeed, that seems to work. So much so that the, at least on a couple of dimensions, a couple of different task types, performance actually improved from removing all those weights that had been identified as associated with, like, esoteric facts and not with these kind of core reasoning capabilities. I thought that was really interesting and it does make me think, boy, run that process a few more times and you could get to some highly abstract reasoner that doesn't have a ton of facts maybe at all or can could potentially I don't know how far you could take that this is back to the the Carl Schumann thought experiment of could you take it all the way to the level of the thing knows nothing except it has these these sort of logical circuits but it does suggest a path to me around how to take something like this back in time and think if we started from a more pure logical reasoner and added facts into it, that could be a much more controllable path, right? We'd have something, we would be starting with something there, the circuits of which would be like a lot more naturally interpretable.

Tom McGrath: There's a lot there. That's a great summary, by the way. The only thing that I would tune there is that it's not specifically individual weights. It is like elements, it's like eigenvectors of the Hessian, but that does not matter. Just think of weights, it's fine. Collections of weights. Yes. Where to go with this? First of all, yeah, like, we are definitely, like, the connection between curvature and memorization is not original to us. But the idea of, maybe if it's about memorization, it will affect strongly one thing in the batch, or like in the big mega batch, and not most other things, and that will cause it to wash out and be low value. across the megabatch. I think that is original, and it's quite nice. So what we're talking about here is really like a sort of higher moment property. It is the case that the mean is low, but also the variance is high. So I think you'd be able to-- there's a possibility where you have something which is like low, but across the whole batch. We can't distinguish that from like high, but on one thing with the statistic we compute, but you could in principle do it. So one thing that we'd hope for from this is that we would be able to shrink the model down. Not only does the model not know this stuff, but it also doesn't pay the parameter cost of knowing this stuff. And we never pushed this all the way, but I think it probably is not as effective as data-based approaches to minimizing the model, which I think are very promising. So for instance, there's some quite cool work. I'm blanking on who did it, but like on pre-free training, for instance, where you train with like synthetic data from context-free grammars, say, or like from some sort of very symbolic domain. And the idea is that it just gets the model to make these very pure information processing circuits. Or you might try a sort of data augmentation approach where you take an article, you pull all of the facts out of the article, put them in a preamble. for that preamble in the context window, but don't include it in the loss, for instance. Now, the model can reason out by induction from the context, from the fact, the sort of open book that you've given it, and we should learn to deduce things. That also seems possible. And I think these approaches feel intuitively more likely to me to give you a kind of minimal reasoner. But there are only so many people, a good fire, only so many hours in the day, and so we haven't really pushed it yet. But I think there's a lot of promise there. But then the final thing is, would such a thing, in fact, be more interpretable? I don't know. Is a giant thicket of logical entailments actually that interpretable? Or is it like so, does it have so little semantics? It has no, like, rich semantics. You just get lost in the forest. I honestly don't know. I've never seen one. So I think it's hard for me to reason out a priority.

Nathan Labenz: Yeah. OK. I like that paper. That was a very again, if nothing else, it's a very fun one for me to crack.

Tom McGrath: I think it's really cool. This is the sad thing about prioritizing stuff is there's some stuff that I absolutely love that I can't spend as much time on. I think this is a wonderful paper. It's beautiful. And I think that it would be cool to spend more time on it. It even gives us a sort of regularizer. essentially, we can say, now I have a lever to ask if I'm fine-tuning, say, I can say, oh yeah, just keep the generalizing bits. Now that's not something we've really tried, but it'd be interesting to explore using that as a regularizer for your fine-tuning process, where I might ask, don't give me the memorization, because eventually that's like you were talking about in emergent misalignment. If you fine-tune hard enough, you've just screwed the model up. But what if you just were able to keep the generalizing bits, maybe you wouldn't screw it up too much?

Nathan Labenz: I think that notion of shrinking the model also is a really interesting one that in terms of the big picture of like, how are we going to get to a world full of highly capable AIs that broadly goes well. The idea of strong but narrow is like a really intuitively appealing idea to me. That's obviously Drexler's comprehensive AI services version of that. And I don't know anything about what Safe Superintelligence is doing, but in listening to his conversation with Drakesh, I came away with a sense that they were looking to create something that was this proto, I don't know, proto-agent or proto-whatever, proto-service provider that would sink and maybe even shrink into its role As it gets really good, it sounds like his vision is that it would lose other capabilities so that it would really dial into its particular context. I've definitely found myself coming back to that idea over and over again. How small could you make something that could be really good at what it does, but that for a company, for example, that wants customer service tickets handled effectively, small, as you mentioned too, you don't have to pay the parameter cost, that could be great. They could run these things on CPUs potentially at some level of shrinking. And then they really don't have to worry about what it's going to do out of domain because it would just have no ability to handle that really at all. And that could give you potentially a lot of comfort.

Tom McGrath: It's like quite exciting. There are only so many GPUs in the world. So if everyone is going to have their own personal AGI, then you might have got to have a lot more GPUs or a lot smaller models.

Nathan Labenz: Let's talk about Alzheimer's. So this is on the, obviously, the front of learning about the world, advancing science by figuring out what it is the models have learned that allow them to be so good at predictions and actually getting conceptual understanding. Tell us about what you guys learned about how the Primamenta model is predicting who has Alzheimer's.

Dan Balsam: Yeah, so this is related to our scientific discovery work. So basically, you can think of like the role that interpretability plays, like one way that I like to think about it is when your model has problems, what interpretability helps you do is debug your model. It's like a form of model debug. You understand what went wrong, and then you use that information to help you give a better model in some type of way. When your model's already good at something, then the thing that interpretability can give you is knowledge extraction from that model. We do a lot of work with partners in the life sciences, and our work in the life sciences is focused on taking biological foundation models, and then understanding what's happening in them with the goal of ultimately converting this into new knowledge in the form of biomarker discovery or potentially down the line, like druggable targets and drug discovery. And so Prima Mensa is a organization that is focused on neurodegenerative diseases and they, such as Alzheimer's and Parkinson's, and they trained a epigenetic foundation model called Pleiades, So this is trained on cell-free DNA fragments. So these are little bits of DNA that end up in the bloodstream of people, and they come from cells dying across the body. And there's been a lot of prior work that has shown that you can actually use these cells-- it's pretty minimally invasive. You just do a blood trial from a patient. And you can use these for various types of diagnostics. So there has been a lot of work, for instance, that's focused on cancer and using cell-free DNA fragments for cancer detection. So they trained an epigenetic foundation model that was a autoregressive model that was trained to predict the structure in these cell-free DNA fragments. And then from there, they used the embeddings of that model. I'm glossing over some of the steps. They used the embeddings of that model in order to predict whether patients had Alzheimer's. And so they brought us in to understand what their model was doing. And we applied interpretability techniques, series of interpretability techniques, in order to figure out what was the signal that was driving that Alzheimer's prediction. And we actually discovered that it was something that was a little surprising. So there's a little bit of nuance here. But basically, there had been attempts in the literature for Alzheimer's detection using methylation statistics and cell type of origin, which are two specific things you can get out of cell-free DNA, but not specifically using fragment length. And what we found was that their model was overwhelmingly depending on fragment length in order to make its Alzheimer's predictions. And this was really surprising to us because this was not what we'd expected and not what the Alzheimer's literature based on was in the literature. Fragment length, you know, had a history for cancer specifically, but not for Alzheimer's. And so once we learned this insight by studying the model, we worked with Primomento to construct a proxy model, which took this insight and then was able to recapitulate a lot of the performance of the original model with a very simple logistic regression. And we were able to generalize better than the baselines in the literature to an independent cohort that we had access to. And so the high level here that's exciting is that I think this is one of the first examples, maybe the first example of learning something new from a model by studying it and then coming up with a testable hypothesis, which then down the line, this is still early, it was a pilot study, we need to expand more cohorts and these things take time, but it gave us a testable hypothesis which we now can explore and we're considering wet lab analyses and other things in order to bring this forward. We just think it's an early example. We're doing lots of other work in the life sciences with other partners as well and we'll have more to publish there soon. I just think it's an exciting early example of what can be done with interpretability and the way that you can use your understanding of these models to make concrete, testable hypotheses. In this case, about the biological world and diagnostics.

Nathan Labenz: I did want to compliment you guys on the blog, and I would just recommend the blog to everybody. It is I think this is one of the first times that I have done this much prep for a conversation and not really had to go into the papers themselves all that much. But the blog posts have done an excellent job of helping me understand what's going on, giving me the right level of detail, and just being quite accessible while also not dumbing it down too much. So I think that's the level of investment there, I think, is very apparent to the reader, and I really do recommend the blog highly. Ready for a lightning round?

Dan Balsam: All right, let's do it. And big shout out to Michael Bielen on our team who writes a lot of the blog posts in collaboration with the scientists and engineers on the team.

Nathan Labenz: Yeah. Great job by him. So how do you compete for talent with Frontier model developers? That's one big question. It's, I think 40 people you guys are up to now, and there's a lot of work that has come out. So it's obviously a team that can come up with good project ideas, execute on them pretty quickly and ship a lot of stuff. These people are going to clearly be in demand. Is it just about the mission, or do you have any other tricks up your sleeve?

Tom McGrath: Partly about the mission. I think that we are trying to do something which is very different, very exciting, very big, and so people are scientifically ambitious. We're a great place for them to be. Partly about the kind of scientific culture that supports that. I think that we try and think of first principles, try and look at very empirically driven kind of thing, not hold any kind of not hold any particular idea too tightly, is the aim. It's always hard to actually achieve that in practice. But I just think we have a very good scientific culture that I think people come here and they're, Oh, I like it here. I think I'll stay. And finally on that is snowballs. Once you have good people, then one, they know good people, and two, people want to come and work with them. So that feels like a big, that engine feels like it's started to work. well now. So I think that's-- and a lot of hard work. Recruiting takes a lot of time, takes a lot of work, and is extremely worth it.

Dan Balsam: I think we also have a pretty differentiated research vision, a different vision of the future than a lot of the labs do too. And I think that's appealing to a lot of people. And although we're working to figure out our identity as a company in a lot of ways, and we've certainly solidified a lot of stuff, I think it's just like exploratory in a way that is really, really hard to do at the big labs. And I think that's one of the things that's really helpful for us is that like the possibility space is very open for us as a startup.

Nathan Labenz: You guys have put out some stuff around the sort of what you see as the highest importance or highest leverage, open problems in mechanistic interpretability. One is work on alternate architectures. I think one thing I don't have a great sense for is how much of the technique that you're developing will work if, for example, nested learning becomes the next big thing and now we're all in it. We've gone from a transformer world to a nested learning world.

Dan Balsam: My expectation for nested learning, I mean, we could look at Be interesting to look at this, but my expectation is that interpretability techniques would still work. Like there has to be like semantic information that gets like passed through the bottlenecks in any learning setup. And so I have no reason to think like a priority that you wouldn't be able to use a lot of the similar interpretability techniques to understand what was going on there.

Nathan Labenz: I've been generally very encouraged from what little work I have seen applying interpretability to alternative architectures that it mostly has worked pretty well. like Mamba type architectures seem to have been remarkably interpretable. But do you think there's any prospect for other architectures perhaps being more interpretable? And if that were to be discovered, could that be the sort of thing that would pull the field in a positive direction?

Tom McGrath: Part of the problem with is that people don't go looking for interpretability. Like it turns out that I don't know to what extent this is like extremely robust. But if you just, but it seems quite robust, if you look at the neurons inside the Transformers MLP, this is work from Translucent if I remember correctly, they often just are interpretable, like the sparse autoencoder was inside you all along. And we've had Transformers for how many years? And people are just like, oh wait, the MLP neurons, they're interpretable. So maybe we should just look a bit harder, like MOEs, I think, like a mixture of experts, like individual experts are generally not interpretable. But why should they be there? Like, 64 of them, whatever. Like, a language model has to do more than 64 things, so a given expert should be polysemantic. But there's some recent work, again, I'm blanking on the details, like the authors, but rooting paths. Also interpretable. Amazing. So the affordances are sometimes there, we just forget to look for them. Which is to say that, and like, both of these things push us in, like, it might be the models Almost get, this is Panglossian almost in its optimism, but maybe models get better to the extent they are more interpretable. Obviously that's not literally true, but like MOEs have pushed us on the performance frontier, but they also give us a new affordance for interpretability. Maybe the correct MLP width is simply because it happens to make the the hidden layer inside the MLP roughly interpretable and that makes the computations easier. Maybe there's something like, there's a deep principle here, I don't know.

Nathan Labenz: Cameron Berg of AE Studio did some really interesting work on the Goodfire API, looking at what models say about their own consciousness. What do you think about AI's consciousness?

Dan Balsam: I think it's a complicated question. I think that it would be very difficult to confidently rule out the consciousness of most existing frontier systems today. I think they probably aren't. I think that the probably is doing a lot of lifting. I think there's nothing that prevents the idea of consciousness from being in a machine. I think any computation, you know, like you can come up with definitions of consciousness or you can come up with sort of explanations that that might preclude that. I don't find them particularly convincing. So I think it's like fairly likely that it should be possible to build machines that have experience in some meaningful sense. I think we probably haven't today, but I think it's important to take that question pretty seriously. And it's hard to know whether interpretability could give us full insight into that question, but I think maybe it can. And if there's anything that can, probably interpretability would be the thing that could do it. But yeah, I mean, I don't have a valence about whether that would be a good thing or a bad thing. I just think it's a distinct possibility that if it's not a problem that are not a thing that exists now, that it could be a thing that exists in the future.

Nathan Labenz: You want to give just a closing call to action? We're in the early stages of AGI. It feels like people are calling Opus 4.6 in Claude code AGI, and it's only going to get more real from here. Why should people seek out the careers page at Goodfire or otherwise invest their precious time and energy into interpretability?

Dan Balsam: Yeah, I mean, I think it's definitely hard not to feel the AGI right now, so I can super relate to that. I think interpretability is important for a lot of reasons. Like I think when I imagine the futures that we could walk in, there's a future in which we are building, it feels like given to me that we are building super intelligence and that's happening quickly. and you can talk about how quickly, but the way that I see it is like kind of two doors. There's one door where we build super intelligence that we don't understand at all, and then there's one door where we build super intelligence that maybe we have a shot of understanding. And I think... Fundamental research and interpretability and fundamental research and like intentionally designing models are really important paths for us to get there. And we're doing all types of exciting work, but it's not just like a theoretical exercise. Like we're going out, we're making discoveries in the life science, we're working closely with partners to help their models like behave better, reduce hallucinations, be more reliable. This is a A really important field to develop for the future of technology and also something that progressively unlocks a lot of value along the way. So if anyone is interested in what we're building and working towards that mission with us, please reach out to us. We would love to talk.

Tom McGrath: Yeah. If you want to be part of the most exciting and beautiful scientific quest that's going on at the moment, I think it's got to be interpretability. And if you want to make it useful, I feel like Goodfire is the place to be, so that's why.

Nathan Labenz: Love it. Congratulations on unicorn status and congratulations on a great run of research. Dan Balsam and Tom McGrath from Goodfire, thank you both for being part of the cognitive revolution.

Tom McGrath: Thank you for having us on.


Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to The Cognitive Revolution.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.