Mechanistic Interpretability: Philosophy, Practice & Progress with Goodfire's Daniel & Tom

In this episode, Daniel Balsam and Tom McGrath of Goodfire discuss the future of mechanistic interpretability in AI models. They explore the fundamental inputs like models, compute, and algorithms, and emphasize the importance of a rich empirical approach to understanding how models work. They provide insights into ongoing projects and breakthroughs, particularly in scientific domains and creative applications, as they aim to push the frontiers of AI interpretability. They also discuss the company's recent funding and their goal to advance interpretability as a critical area in AI research.
SPONSORS:
Box AI: AI is delivering truly measurable productivity — strategic companies are already seeing a 37% productivity edge. Discover how in Box’s new 2025 State of AI in the Enterprise Report — read the full report here: https://bit.ly/43uVP52
Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive
ElevenLabs: ElevenLabs gives your app a natural voice. Pick from 5,000+ voices in 31 languages, or clone your own, and launch lifelike agents for support, scheduling, learning, and games. Full server and client SDKs, dynamic tools, and monitoring keep you in control. Start free at https://elevenlabs.io/cognitiv...
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive
PRODUCED BY:
https://aipodcast.ing
CHAPTERS:
(00:00) About the Episode
(04:44) Introduction and Welcome
(05:22) Framing the Field of Machine Learning
(06:11) Empirical Data and Interpretability
(09:51) Challenges in Model Experimentation
(10:28) Unsupervised Learning and Interpretability
(12:12) The Role of Compute and Algorithms
(14:48) Analogies in Interpretability (Part 1)
(16:22) Sponsors: Box AI | Oracle Cloud Infrastructure (OCI)
(19:13) Analogies in Interpretability (Part 2)
(19:40) Philosophical Questions in Interpretability
(23:19) Current State and Future Directions
(32:20) The Paradigm of Interpretability (Part 1)
(34:54) Sponsors: ElevenLabs | NetSuite | Shopify
(39:32) The Paradigm of Interpretability (Part 2)
(41:43) Competing Approaches and Techniques
(48:14) Machine Learning Techniques for Better Decomposition
(57:21) Minimum Description Length and Interpretability
(59:27) Understanding Minimum Description Length
(59:56) Sparse Autoencoders and Optimization Targets
(01:03:35) Challenges in Model Reconstruction
(01:05:02) Dark Matter in Scaling Analysis
(01:06:43) Exploring Features and Interpretability
(01:19:21) Scientific Discovery and Interpretability
(01:43:52) Applications of Interpretability Techniques
(01:50:43) Goodfire's Mission and Future Directions
(01:53:46) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
TRANSCRIPT
Nathan Labenz: Dan Balsam and Tom McGrath, CTO and chief scientist at Goodfire. Welcome back to the Cognitive Revolution.
Dan Balsam: Thank you so much for having us.
Tom McGrath: Thanks, yeah. Great to be on.
Nathan Labenz: I'm excited. We haven't been able to make this happen quite as often as I would have liked, but we're going to make up for it by going long and in-depth today. So, really excited to get the update on what you guys are building as a company, which I understand there's some great news on, and also just to check in on what we have learned as a community about models and how we understand how they work over the last few months, because obviously there's nowhere in the world changing faster than that. So for starters, I wanted to go high level and just ask you to frame the field. I mean, I think everybody in the general ML space at this point has internalized this data, compute, and algorithms paradigm. These are the three legs of the stool that are enabling progress. There's the sense that they all contribute equally. And then on the interpretability side, I'm tempted to slot in models for data and say, models, compute, and algorithms are maybe the things, and seemingly a lot depends on the quality of the models. But then there's still a role for data, so how do you guys think of the fundamental inputs of what you're doing?
Tom McGrath: I think one thing that's interesting here is that these inputs are very important, right? And the models have changed, and so these have changed along with them. But another thing is, it's a very empirical activity in the sense that you're looking in a fairly fine-grained way at the data of what's happening inside models. So, progress in algorithms, for instance parameter decomposition or SAEs or whatever, we can get into these later, but one of the inputs here is actually very rich empirical data: when you just look, basically, what are activations like? When we tinker around with models in a relatively hypothesis-free way, how do they actually behave? And I guess this empirical input is always part of progress in algorithms, but I just want to up-weight it for interpretability because we're really doing this as a natural science. But yeah, I broadly agree with the decomposition, with models in the place of data, I suppose, and then there's compute, and I think as a field we would like to be able to use much more compute. And then there's algorithms, and I think it's a good decomposition. I would say that we don't have the transformer for interpretability yet, and it could be that the thing we are blocked on for finding the transformer for interpretability is simply understanding models better, in a way that we can then generalize to a new inductive bias.
Nathan Labenz: Yeah, that's really interesting. And you prompted me also to try another paradigm mapping exercise. You said we're doing this in a relatively hypothesis-free way. That maps in my mind to pre-training, unsupervised learning, right? The SAE paradigm has largely been: run a ton of data through, basically from the same original dataset. You can tell me more about how that's curated, maybe, especially for interpretability work, but it's a relatively hypothesis-free way. Is there any equivalent to post-training yet in the interpretability world?
Tom McGrath: Oh, so when I say... That's interesting. When I say hypothesis-free, I mainly mean a person sitting down and tinkering with models and being like, "What structure is there?" So for instance, this famous "Not All Language Model Features Are Linear" paper. They're finding structure in activations, and this structure is interesting. It's this high-dimensional manifold, and you don't necessarily get to this by being like, I have a hypothesis that things are this way. You just... Like a natural scientist, someone out observing the stars or something, they have some hypotheses in mind, but they're also just going out and looking at things. And so that's what I mean, but it's interesting that I also am very keen on the unsupervised approach to interpreting models, like SAEs, for basically the same reason, except that every architecture comes with a hypothesis. The SAE's inductive biases are... Well, the SAE has a fairly strong hypothesis that features are literally directions in embedding space.
Nathan Labenz: Yeah, there's something really interesting there around... And again, the model quality seems to be really important, right? Because a lot of the hypothesis-free tinkering, even vibe coding, right? I mean, Neel Nanda recently put out a video of him doing some vibe coding research, and that's all premised on the idea that you can run a little experiment and get something back relatively quickly on an iteration timeline where ideally you could sustain focus. Maybe you have to go take a walk and come back. But it's not really long time frames or really large compute budgets, whereas as you mentioned, you have to have some conviction to throw large amounts of compute at an SAE at scale. But I wonder, how do you think about the challenge of... Can you do that rapid experimentation on the truly large advanced models? Or are you limited to working with something like GPT-2 scale, and does that create a fundamentally different regime from the things that you end up scaling up?
Tom McGrath: It's really a question of infrastructure. If you have no infrastructure, then it's hard. Building that infrastructure is hard, but the right infrastructure makes experiments relatively easy.
Dan Balsam: The only thing I'll add is, when thinking about unsupervised techniques as hypothesis generators for how the model could be working, there's no way we're going to be able to scale to superintelligence without making our interpretability techniques unsupervised, and that's one of the things that really motivates us and is our mission at Goodfire. Narrowly superintelligent models already exist in scientific domains, and this is what we spend a lot of our time working on. When you're working with a genomics model, you're working with a model that we have priors about. There's lots of bioinformatics research attempting to understand the genome, but we're also working with systems for which our statistical techniques are not as explanatory as we'd like, and that's the motivation for moving towards AI to begin with. In the process of training unsupervised models, it really gives us a grounding about where to look in the model. It gives us candidate experiments, candidate hypotheses to run. One thing we believe as a company is that unsupervised learning often looks worse until suddenly it looks better. If you went back in time, there was a point where massive pre-training in an unsupervised way on large corpora was giving you worse performance than bespoke models that were purpose-built. We think interpretability is likely to follow a similar arc. We're still not sure exactly what those pieces of technology will be. But each new item on the tech tree in interpretability unlocks new questions that we can ask, new ways that we can look at the problem. And as Tom was saying, eventually building towards a solution that we can just toss a lot of compute at in order to fully unlock what's going on inside.
Nathan Labenz: Am I interpreting you correctly to say that right now the field still feels bottlenecked more on algorithms than compute? Certainly not models, there's a lot more in models we can figure out.
Nathan Labenz: The two candidates would be compute and algorithms. It seems you're saying algorithms are still where it's at, and we need to figure out how to apply the compute.
Dan Balsam: I think that's right in some sense, but perhaps a softer version. Existing tools are already powerful enough to do things that are useful. That's a big part of what we're doing at Goodfire when working with customers; we use existing techniques to look at a model and help customers understand it better. But we are under no illusions that we've cracked interpretability. To fully reverse engineer what's happening in a model, we need brand new techniques and new paradigms.
Tom McGrath: On the compute front, it's a question of whether, like frontier models, we're bottlenecked by the literal availability of compute, or will be soon. We're very far from that. But we could spend a lot of compute. The question is, do you get value for money? I would like to be able to do a million-dollar interpreter model training run and say, "Yes, let's put a million dollars on it," and feel like I've got a million dollars' worth of information.
Nathan Labenz: Perhaps there's a way to express it as a ratio between the size of an experiment and... that's maybe not quite the right way to say it. But given the option of choosing a really big run or more people to come up with more ideas for smaller runs, it's clear that you'd rather have more interpretability researchers exploring the space more thoroughly before...
Nathan Labenz: ... scaling up.
Tom McGrath: We couldn't productively do a single million-dollar training run. That's one of the things that indicates if you're bottlenecked on algorithms: Can you productively spend that much compute? We could spend it, but at the moment, it wouldn't be a productive way to spend it.
Dan Balsam: There's also the question of how you use the tools you have. For our customers, when we've trained an SAE for them on their model or any type of interpreter model, that's where the work begins, not where it ends. I think of an SAE as a window into the model. You can only see some things, but those things can still be really useful and provide a lot of value. Chris popularized the biology analogies for mechanistic interpretability, and those really track from my perspective. If you go back to the 19th or early 20th century, what did you have to do to learn anything about cells? You had to put things on a slide, stain the slide, look in a crude microscope, and you were looking at a cross-section of something not in its natural environment. You had to make inferences based on looking at a lot of these about the actual biological structures. But all of modern biology was built on that. Over time, we developed better microscopes. We developed better techniques for looking at organisms in more natural ways. We're a little further on in interpretability maybe than that analogy implies, but it's the same thing. You can learn and make novel advancements in science at any state of the technology. Simultaneously, we want to push what we can learn with the tools we have while also pushing to get better tools.
Tom McGrath: For a while, people had to make their own lenses, right? Which I guess is like making your own SAE training code base. Probably you want to just go and buy the lenses from ZEISS or something like that, and then you can become an expert and create tools. And then the field can advance. There's some famous dictum in science about things advancing via methods and ideas and then experiments or something like that, in that order.
Nathan Labenz: Yeah. There's an Adam Smith lesson here. The extent of the market, the degree of specialization, and the sophistication of the supply chains are all pretty early. And we are kind of in that, you know, maybe just exiting the grind your own glass lenses phase of biology. Let's hope we don't find ourselves in the gain-of-function research lab leak and invasive species phase of biology before we're ready to handle them. But, you know, I guess one place where this analogy maybe breaks down a little bit is on this question of fundamental units, and this is a philosophical question I've been trying to wrap my head around better. I'm sure you guys have good thoughts on it. When we look at an organism, right, and we look at its genome or we look at proteins, we're pretty confident we're talking about real things. I guess they're maybe quantumly fuzzy on some margin. But we have a pretty good sense that a gene is a gene and a protein is a little machine. And here in the interpretability side, the features, if you will, that are identified or learned by an SAE or similar techniques, and you might want to, if appropriate, separate SAEs from other techniques in your answer here. But these things seem to be approximations, or there's some gap right there between what is going on in a model and what is going on when it's sparsified in this particular way. And so I'm really interested in how you think about the relationship between these features that are learned and the labels that we give them, and how much correspondence you think there is there, and is that on a spectrum, and how should we think about it?
Tom McGrath: Yeah. So it's interesting. I suspect if you asked a biologist, they probably have a lot of corner cases about, is it really a gene? You know, biology is, if nothing else, a great supply of corner cases just because of the rich complexity of the world. But yes, I think there is a definite sense there that a gene is a natural abstraction. It's a good way to talk about the world. And so this process of, you take a model, you sparsify it, well, now we have introduced some degree of lossiness, right, because we're not capturing all of the computation. You can see this from the loss, you can see this from the reconstruction error. But we are capturing what look like very interesting and interpretable things. But that takes you on to the next level of, okay, there's a thing. There's a feature in your sparse dictionary, and now we assign this label. And this is another area for a gap, a gap that you can fall in. And I think that we can be in the business of closing both of these gaps a great deal. And you can close the gaps in multiple ways. So the first gap, right? Gap one, where we're talking about the distance between the model and our approximation of the model. How do you close this gap? Well, one answer is you do the machine learning better. You just make a better SAE, so you capture more of the loss. And there has been a bunch of work in this direction; I can provide a bunch of papers later. Another is you try and answer the question, what does it mean for something to be a good abstraction? And you use that as inspiration for new methods. So for instance, what would it be for an SAE feature to be a natural unit of computation? Well, it's not completely clear. I think there are actually some interesting but probably quite resolvable issues there. One thing it might mean is that it is involved in consistent computational paths. You know, a feature is a natural unit of computation if it is involved in other computations that make sense. And so maybe one way of saying this is that I think things get a lot cleaner, or feel like they will get cleaner, as we move to circuits rather than just single-layer activations, because otherwise you don't really have a great way of validating. Now you can validate by intervening on a feature and seeing how things change. That's like circuits, except you just haven't tracked the circuit. Now you're entering into the second gap. You know, I've intervened on a feature. You might say, "Well, I intervened on it and it didn't do what I expected," but that might be because my expectation was wrong, that I've simply fallen into the second gap where I've given the feature an incorrect description. It is a unit of computation that the model does, but I've just called it the wrong thing. Now, how do we narrow the second gap? I think the answer here is probably that we just get better at doing experiments on interpretability. So the way that we currently assign labels is, and hopefully I won't offend Nick by saying this, a little primitive. So Nick Cammarata, when he invented this automated interpretability technique.
Nathan Labenz: And principal investigator at Goodfire.
Tom McGrath: Yes, thank you. I was a 100% true follower. He's wonderful. So anyway, hopefully I won't offend him by saying the current method of assigning semantics to features is a little primitive. What we do is we give a frontier model a bunch of examples of where the feature fired and we say, "Well, here are these examples. What's the feature?" This gets you some way, but it doesn't get you the whole way. For instance, if you were to ask me with access to the model, "What is this feature?" I wouldn't only do that. I would also try steering the feature and see what happens. I might look at other things that projected into that feature or where it goes downstream, how it relates to other features, all that sort of thing. So there are many more things that I personally would do, but we can't currently get frontier models to do this. It may just be a matter of scaffolding. We just need to build this kind of scaffolding such that they can use their capability set. That was a really run-on answer, but I think there are these two gaps that we can narrow both.
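For concreteness, the labeling loop Tom is describing can be sketched in a few lines. The helper callables here (top_activating_examples, llm_complete) are hypothetical stand-ins for whatever retrieval and frontier-model APIs a real pipeline would use; this is a sketch of the idea, not anyone's actual system.

```python
# Sketch of the basic auto-interp loop: show a frontier model the contexts where
# a feature fires most strongly and ask it for a short label.
# top_activating_examples and llm_complete are hypothetical helpers passed in by the caller.

def label_feature(feature_id, top_activating_examples, llm_complete, k=20):
    """Return a natural-language label for one SAE feature."""
    examples = top_activating_examples(feature_id, k=k)  # list of (text, activation)
    shown = "\n".join(f"- activation={act:.2f}: {text}" for text, act in examples)
    prompt = (
        "Below are text snippets where a single feature of a language model "
        "fired strongly, with its activation value.\n"
        f"{shown}\n"
        "In one short phrase, what concept does this feature represent?"
    )
    return llm_complete(prompt)
```

As Tom notes, a richer version would also steer the feature, inspect its upstream and downstream connections, and feed those observations back into the labeling call.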
Nathan Labenz: No, on the contrary, I kind of want to expand on maybe...
Dan Balsam: Okay.
Nathan Labenz: ... both points because I mean, it's all really fascinating and important. So the first bit, I guess, I really like that it's just sort of decomposing the problem into the two gaps. First is, can we reconstruct? And I mean, this is literally what the SAE is trained to do, or the other techniques, is to reconstruct, right? It's a reconstruction loss. Do what the model originally was supposed to do. There, I think I'd love to know a little bit, maybe, you said you could provide a bunch of papers later. I'd love to hear what is the state of the art? The deepest dive I've done into literature recently was the Anthropic tracing model thoughts pair of papers.
Dan Balsam: Mm-hmm.
Nathan Labenz: And I was kind of struck there by, it seemed like there's a lot that is not being reconstructed, I guess, is what I would bottom line it as. So I'd love to get a sense of where is that state of the art and maybe in narrative form, if not in fully cited paper form, what has been the trajectory of improving that? What have been the advances, the unlocks, whatever? And then on that second gap, this sort of anticipates another question I had around just, what is the inference time scaling paradigm for interpretability? It sounds like the answer there is, "Well, today, we run data through the thing, see what activates what, and collect the things that cause the max activation, then try to describe them qualitatively. And in the future, we could do a lot more." And then you can kind of develop a little bit the vision for the sort of higher order auto-interpretability. I think both of those are really interesting mini-lectures that you could go on at as much length as you want on.
Dan Balsam: I think before we dive in there, can I give a quick meta thought on this entire question? Most measurement apparatuses that you could deploy in most scientific contexts are reductive in some way, right? There's some set of assumptions that you're making in how you should interpret the data that you're getting from the measurement apparatus. The microscope analogy with SAEs I think works really well when explaining this to people, because there are three things that you can do with a microscope, right? You can figure out what you're putting on the slide; that's the dataset that you're passing through the original model, those are the actual activations themselves. You can get a different SAE on one dataset than on a different dataset. There's how you stain the slide; maybe that's the loss function or how exactly you're looking at things, which is going to affect the structure of what you get back in different ways. And then there's the size of the lens; that's the expansion factor of the SAE. It's important to understand what we're doing with our existing interpreter models as a specific lens, a specific way of looking at the computation of the model rather than the whole picture. I think this is true whether you're trying to do circuit work with sparse approximations or layer-level activation work with sparse approximations. But that's not an abnormal thing to do in science. You take some set of assumptions, you know they're not always right, but you know that they're right enough sometimes that you can start getting traction and running new experiments. So from our perspective, it's not this all-or-nothing thing. Like Tom is saying, we can keep pushing on the fronts of, how do we address the limitations in the existing tools, within the paradigm they're operating in, with the sets of assumptions that we know are sometimes wrong but are right often enough to be useful and give us more information? And then simultaneously ask the question, what's a better set of assumptions? And these have to happen in parallel. You can't just do one at a time, because you would never find the right assumptions if you weren't testing the limits of your current ones. So from my perspective, this isn't a unique thing about interpretability. This is just how science works.
Tom McGrath: Yeah, exactly. And it's funny. People remember Kuhn's Structure of Scientific Revolutions for the revolutions, right? The crises. And yet, the dominant mode of science is normal science, where you're going along and you're generating actually productive knowledge about the world. Maybe on foundations that will later get a bit shaky or get overturned, but you're still generating knowledge about the world. And then there's this idea that there are anomalies, and the anomalies pile up unanswered and lead to a crisis, right? But that's where the anomalies come from. The anomalies come from the business of doing normal science. So even if you want to generate a paradigm shift, often the answer is to just try and do normal science until it becomes untenable. And I think that's maybe where we are at the moment. We have probably a proto-paradigm. I think we've been reluctant to admit it, but I think we probably have a proto-paradigm in interpretability. And so we should push it. We can do a lot of useful stuff. We should just keep pushing it, we should keep doing the useful stuff and wait for the anomalies to reveal themselves. And I think...
Nathan Labenz: I was going to ask actually, are we still pre-paradigmatic? We've upgraded ourselves now to-
Tom McGrath: Ah.
Nathan Labenz: ...proto-paradigmatic.
Tom McGrath: I'm going to say proto-paradigmatic. Maybe I should have some courage in my convictions and say that I think we're entering our first, the first paradigmatic phase of interpretability. Or, well, no. Okay, so this is a bit fuzzy, right? What is a paradigm? A paradigm is a social thing. I don't think there's consensus. There's not the kind of consensus that would lead me to say there is a field-wide paradigm in interpretability. I would say that among a reasonably large group of people, there are the raw materials for a paradigm. I suspect if the field were Anthropic, but ballooned to the size of the global interpretability community, it would be correct for me to say that there was a paradigm. But because there isn't a level of consensus, I can't really yet say there is a paradigm.
Nathan Labenz: So what would that paradigm be? How would you describe the Anthropic and Goodfire axis paradigm?
Tom McGrath: I'd say, one, neural networks contain things which are understandable, right? This is actually worth stating. For a long time, this was not generally accepted. I don't know if it is yet generally accepted, but this is the sort of down in the basement of the paradigm.
Nathan Labenz: As a quick interjection there, is that an artifact of just earlier models? Because the way I would tell that story is, in the original GPT era, there were still some things probably that were meaningful enough, but there was also so much noise that people could very easily have been excused for just being like, "Eh, you're tricking yourself." Or, "You might find some spurious correlation here or there, but I don't really buy it." And they maybe just haven't updated since.
Tom McGrath: I think, yeah, you're going to interpret the model that you have, and if there are lots of flaws in the model, those might be what you're finding with your interpretability tools. And if you're entering with the prior that, oh, I should be looking for and recovering this specific thing, and you can't recover it, that could just as easily be evidence that the model isn't doing what you thought it was doing to begin with. When you go through the process of debugging a model using interpretability techniques, the thing that you might find is, oh, the model has memorized a bunch of its training data or something like that. You have some belief about what your model is doing and how it's modeling the task, and that belief might be wrong, and that could throw off the perception of interpretability if you're not bringing in unsupervised, unopinionated techniques that can work across the entirety of the end-to-end interpretability stack. And I think to that point, we just didn't have unsupervised techniques that can work at least a good percentage of the time across the end-to-end interpretability stack until quite recently. There's a funny U-shaped thing where a lot of the early connectionist papers actually do look at individual neurons and say, "Oh look, this neuron learned this thing, this neuron learned this thing." And they could do that because there were 12 neurons. And so it's funny, it started out and everyone wanted to look at the neurons and had some success. And then for reasons that are opaque to me, but I suspect someone could find out, it became somewhere between unfashionable and considered to be a bad idea or impossible to look at individual neurons. And now it's kind of come back into vogue. So it's like interpretability just went into the wilderness for a bit.
Nathan Labenz: Yeah. Okay. So now give us the paradigm.
Tom McGrath: Yeah. Okay.
Nathan Labenz: So one is there are things that make sense that we can interpret.
Tom McGrath: Yes. There are things to interpret. Interpretability is possible. And then what other parts of the paradigm are there? I suppose there are features: representations are linearly decodable. Or at least linear decoding is a reasonable way to talk about features. There may be higher order structure, right? You might have features that are arrows in space. It might be that actually multiple features lie on some manifold or in some subspace, but it's a sensible way to talk about representations as lines through embedding space. The third part of this paradigm is this idea of superposition. Because if you're going to have vectors in a vector space, then the natural conclusion would be, well, I'm in a d_model-sized space, right? Does that mean the model can only think of d_model things? Probably not, right? A language model can think of more than 4,096 things or something. So the other part of this paradigm is superposition, which is this idea that the way you squash more of these feature vectors into the same space is by allowing them to overlap a little bit. And this creates a bit of interference, a bit of noise in the representations, which the models are still able to deal with. Now I guess another part of this is that magnitude along the feature direction constitutes intensity. And the other thing, rather obviously I suppose, is that features connect to form circuits. And that is basically, I think, the paradigm. This is the Anthropic paradigm, I would say. And if you were to blow this up to the size of the world and if there were consensus on this, then I guess it has enough structure to be called a paradigm.
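A toy illustration of the superposition point: far more nearly orthogonal "feature" directions than dimensions can share one space if a little interference is tolerated. The dimensions and counts below are arbitrary, chosen only to make the point.

```python
# Illustrative sketch of superposition: many more feature directions than dimensions
# can coexist in one space if we tolerate small pairwise overlap.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 4096          # far more features than dimensions

# Random unit vectors in d_model dimensions are nearly orthogonal on average.
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

overlaps = features @ features.T
np.fill_diagonal(overlaps, 0.0)
print("max |cos| between distinct features:", np.abs(overlaps).max())
print("typical |cos|:", np.abs(overlaps).mean())   # roughly 1/sqrt(d_model): small interference
```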
Nathan Labenz: Is there any competing proto-paradigm, or are there just other people claiming we'll never have one at all?
Tom McGrath: Not in the sense of something that provides a complete worldview. When I say a complete worldview, all I think about is neural networks. It doesn't say anything about dinner, but I don't think about that very much, so a complete worldview for neural networks. You could say that something like parameter decomposition suggests a separate paradigm, but that's really a difference in emphasis. Parameter decomposition is talking about the weights; the SAE-type paradigm is talking about the activations. Now, I would say they are different levels of emphasis. A way to think about this, maybe, is that, unsurprisingly, we need both. To make this seem a bit more intuitive, a neural network is, in a very dull sense, a causal model. Every neuron is a node in your causal graph, and all the weights specify the edges in your causal model. It's a very big, but very homogeneous and not very interesting, causal model. One way of thinking about what we're trying to do in interpretability is we're trying to create a causal abstraction. We're trying to create another model which is a reduced version of this model, but that will also be a causal graph. So when we're having this debate between should we decompose the parameters or should we decompose the activations, we're asking, should my graph have nodes or should it have edges? Well, probably it should have both. It's a graph. This is why I think that actually they're not necessarily competing paradigms. They're just two independent ways of thinking our way towards the broader causal abstraction that I think we need.
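A minimal sketch of the "network as a causal graph" framing: intervening on a hidden unit (a node) and rerunning the forward pass is the basic causal operation that both weights-based and activations-based decompositions build on. The tiny two-layer network and the specific intervention below are purely illustrative.

```python
# Toy causal-graph view of a network: hidden units are nodes, weights are edges,
# and an activation patch is a do-style intervention on one node.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))

def forward(x, patch=None):
    h = np.maximum(W1 @ x, 0.0)          # hidden nodes of the causal graph
    if patch is not None:                # intervene on one node, then let effects propagate
        idx, value = patch
        h[idx] = value
    return W2 @ h                        # edges into the output nodes

x = rng.normal(size=4)
print(forward(x))                        # baseline output
print(forward(x, patch=(3, 0.0)))        # output after ablating hidden node 3
```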
Nathan Labenz: Would you highlight any work that has focused more on the weights than the activations? Certainly from my perspective, it seems much more news and excitement is coming out of the activation space.
Tom McGrath: It's funny, because some of the earlier stuff, let's say in the mechanistic interpretability extended universe, because I can't immediately recall some of the other papers as well, I can have a look. But some of the earlier stuff, the learned equivariance work in the Circuits thread for instance, was a weights-based analysis. But then the more recent parameter decomposition things have really been coming up. Lee's group, formerly at Apollo; now he's a principal investigator at Goodfire. Attribution-based parameter decomposition was the milestone thing recently. There's another recent paper following up on this based on ideas from the loss landscape. I'm blanking on the title. Brianna Christman is the lead author. I can send it to you afterwards. So that's what happened in the past, and then SAEs got a lot of traction and so that took a lot of the focus, and now I think the weights-based work is coming back in again.
Nathan Labenz: One other school that possibly comes to mind, or maybe contrasts in approaches, is the bottoms-up versus top-down. I think of Dan Hendrycks and representation engineering or circuit breaker-type work being less focused on unsupervised discovery of what the model contains, and more on the contrast that I care about. Let's make sure that we refuse under certain conditions. Do you think those are fundamentally different approaches, or are those ultimately reconcilable?
Dan Balsam: They solve different problems, and they're both important things to look at from a safety perspective. When we zoom out and think about the alignment problem broadly, the solution could come from interpretability, but doesn't necessarily come directly from interpretability. Interpretability is the measurement apparatus that can make us confident that any other techniques are in fact doing the things that we think they're doing. For instance, are we aligning the chain of thought of a model faithfully to its computation? I don't know how we would have any hope of answering that question if we don't have meaningful ways of constructing graphs that represent the model's computation in some way. I think a lot of these core questions in alignment, at the end of the day, it could be that we solve alignment through bottoms-up, but it could not be. Without the bottoms-up, I have no idea how we would know that we solved it through any other means.
Nathan Labenz: Okay, I think that's really clarifying stuff. I think people will find the high-level mental models there quite interesting. Maybe I'll add a third chapter. So, or the gaps, right? If we structure this in terms of opportunities to make things better, we've got better reconstruction, lower loss. Eager to hear a little bit about what the progress has looked like there. Then we've got better labeling or inference time scaling, getting to higher and higher orders of automation and confidence that what we are describing, what we say is happening is in fact what is happening, that we're understanding it correctly. And then what you alluded to, what I'll maybe add on as the third thing, is the move from activations to circuits. And I do think Lee Sharkey's paper there was super interesting and could be a whole episode on its own, but we can do a mini one as part of this.
Tom McGrath: I should mention the Anthropic circuit tracing results as well. I think everyone wants circuits. The question is, how do we get them?
Nathan Labenz: So take us through those three chapters and again, take all the time you need because I'm here for it.
Tom McGrath: Okay. So should we talk about the first one first then? That's how have people been doing the machine learning better? Because here, the machine learning part of it is how do we best learn this decomposition of the models, or within the... Let's just stick within the SAE paradigm, because then I can be very concrete. How do we learn good decompositions of the models? And some of this is you start with the SAE, and it has L1 sparsity regularization, it has a ReLU activation function, and then people sort of hill-climb a bit in the classic machine learning way of, well, the L1 sparsity has certain properties, it causes features to shrink, which also has a predictable bad effect on the reconstruction loss. And so there are various solutions to this, things like JumpReLU or batch top-k. There's some hill climbing on this work. And then things like end-to-end SAEs, where instead of training purely on the activation reconstruction loss, you take the model, and as the computation is going along, you put the SAE in the middle and train such that the model doesn't get too much worse. And it's called an end-to-end SAE. And all of these things improve on the loss-recovered versus sparsity frontier. What else is there? There are many other approaches to dictionary learning, where dictionary learning is the broad class of things of which the SAE is an instantiation. The dictionary here being your collection of vectors, your quiver of arrows is the dictionary, and you're learning a dictionary that can do a sparse decomposition. And things like using gradient pursuit instead. I saw an interesting blog post on a residual quantized autoencoder. Matryoshka, all these things are other ways of using compute better. You haven't really changed the fundamental assumption that features are directions in space, but you've explored how, under this basic assumption, we can create a machine learning architecture that improves these metrics, the loss recovered. What else? Trying to think if there are any others. I doubtless will have missed some and offended people by leaving their papers out. That's just a failure of my memory. Apologies to people whose papers I've forgotten.
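As a reference point for the terms above, here is a minimal sketch of the vanilla recipe Tom starts from: a ReLU encoder, a linear decoder, and a reconstruction loss plus an L1 sparsity penalty (the term responsible for the feature-shrinkage issue he mentions). Shapes, coefficients, and the random stand-in activations are placeholders, not anyone's production setup.

```python
# Minimal sketch of a vanilla sparse autoencoder trained on model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):                      # acts: [batch, d_model]
        latents = torch.relu(self.encoder(acts))  # sparse codes
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, acts, latents, l1_coeff=1e-3):
    # reconstruction error plus L1 sparsity penalty (the shrinkage-inducing term)
    recon_loss = (recon - acts).pow(2).mean()
    sparsity = latents.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=768, d_dict=768 * 16)
acts = torch.randn(32, 768)                       # stand-in for real model activations
recon, latents = sae(acts)
loss = sae_loss(recon, acts, latents)
loss.backward()
```

Variants like JumpReLU or batch top-k swap out the activation function or sparsity mechanism while keeping this same overall shape.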
Nathan Labenz: No one expects you to be comprehensive in real time. The Matryoshka concept was one that I was keen to explore. I've heard a little bit about that. It has a very, I don't want to be overly allured by this, but it has a very appealing vibe to me, where it's like there should be some structure that gets finer grained as you go, and basically the idea there is a sort of tree structure, right? I mean, it's not really structured that way, but that's the short version of the idea.
Tom McGrath: It would be ideal if it were a tree. Unfortunately, there's no explicit tree structure involved. Matryoshka is really interesting actually, both as an architecture and from the point I was talking about earlier. We were talking about this idea of models, compute, and algorithms, and algorithms being bottlenecked on observable experimental data, on us knowing new things about models, and Matryoshka is an example of this happening. So how did we get to... when I say we, I had no personal part in it. How did the field get to Matryoshka? Well, my understanding is that there's this idea of feature absorption, and feature absorption happens when SAE latents specialize far too much. So instead of there being, I won't go into the mechanism, it's a bit involved, but essentially instead of there being, for instance, a feature for 'token starts with the letter A', you now get a feature for 'token starts with the letter A and is not the word aardvark' or something like that. And then you also have a feature for 'token is aardvark', because it turns out this gets lower L1 sparsity loss. And so this thing called feature absorption was a motivation for Matryoshka. Now, how did we get to feature absorption? Well, someone looked at some SAE latents and said, "That's funny. That looks wrong to me." I think the answer was actually that they were training linear probes. So, what letter does this token start with? Which feels like a silly, niche thing to do without context, but actually it was a very smart thing to do, and it led to this interesting discovery of what happens with SAE latents that was relatively hard to predict. Certainly no one predicted it a priori, as far as I know. And then Matryoshka turns out to be a good way to fix this. Now, a more desirable thing would be what you said, Matryoshka but as a tree. So what actually happens with Matryoshka is that you have a series of nested groups of features, and so we predict using the first group... well, there are two variants, and the simplest one to talk about uses groups. We predict with one group, right? And now we have some residual error term, and then we use the second group to predict that residual error term, and so on as we go up the shells. What I think would be quite desirable is instead being able to say, "Well, this feature fired, which means that now I'm going to up-weight this other feature firing." And so you'd have this explicitly encoded tree structure. The idea of a tree structure is very interesting for minimum description length reasons. There's a really cool paper on minimum description length there. Okay, I can come back to that in a minute. But enforcing this discrete sampling behavior is relatively hard. It's much easier to just do this very soft, differentiable thing. I think it's harder to do the kind of thing that we both would like to do.
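One way to make the nested-group idea concrete is to penalize the reconstruction obtained from each nested prefix of latents, so the earliest groups are forced to carry the most general features. Details differ across Matryoshka variants (and from the predict-the-residual phrasing above), so treat this as an illustrative sketch; the group sizes are arbitrary and the decoder bias is omitted.

```python
# Sketch of a Matryoshka-style reconstruction loss over nested groups of latents.
# Assumes d_dict >= sum(group_sizes); group sizes are purely illustrative.
from itertools import accumulate
import torch

def matryoshka_recon_loss(latents, decoder_weight, acts, group_sizes=(512, 2048, 8192)):
    """latents: [batch, d_dict]; decoder_weight: [d_model, d_dict]; acts: [batch, d_model]."""
    total = 0.0
    for k in accumulate(group_sizes):             # nested prefixes: 512, 2560, 10752 latents
        partial = latents.clone()
        partial[:, k:] = 0.0                      # keep only the first k latents
        recon_k = partial @ decoder_weight.T      # decode from the nested prefix
        total = total + (recon_k - acts).pow(2).mean()
    return total
```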
Nathan Labenz: Matryoshka in general, I wonder how much of this, and maybe this could again be abstracted a little bit to interpretability in general as a subfield coming in the wake of broader ML developments. In general, it's been applied in a bunch of different places, right? And it boils down to finding a way to make the first bits of any given thing the most meaningful. I think of it as ordering the data so that the most meaningful stuff comes at the front. And you've seen this in embeddings and weights and whatever, where you can have a short embedding that is pretty meaningful. And then the longer you go toward the full embedding, the more meaningful it gets. But you're always getting the next most relevant bit of information at each stage. So intellectually, would it be right to guess that a lot of the things happening in doing machine learning better are people looking for techniques that have been developed previously in other contexts and saying, "This worked before, maybe it will work again here?" And then having applied that technique, developing a story of what was happening after the fact. That's how I expect a lot of this would be working, but I trust you to tell me differently.
Tom McGrath: I am not sure. I expect that a lot of it was worked out from first principles, and that it turns out someone may have done something related before. But thinking about this context, you arrive at, oh, this is a problem. Now, there has been a lot of work done on sparse autoencoders in the past, around the early 2000s. A lot of work went into sparse autoencoders. Dictionary learning has previously been an active topic of ML research. But I think we're rediscovering a lot of things at the moment, and it could be that one actually high-alpha activity in dictionary learning, or in this more ML side of interpretability, is simply to hit the history books. I say history books; I mean papers long, long past, and see what has been done that hasn't yet been applied. One example I think is quite funny: the 2024 NeurIPS sparsity tutorial was very good, but one of the slides early in the talk was about what the tutorial is not about. And it said this tutorial is not about sparsity for interpretability, i.e., sparse autoencoders, and then listed various other kinds of applications of sparsity. But I saw this and I was like, "Yes, great. Now I'm going to learn some things that no one else in the field knows," and there's actually some really good stuff in there. So doing that activity a bunch of times would probably be very high return.
Nathan Labenz: Although coming back for a second to your question about ways this parallels other things in machine learning, that could be a good segue, Tom, to talk about minimum description length because that's really interesting.
Tom McGrath: Yes. So Michael Pearce recently joined Goodfire. I guess this is happening quite a lot.
Nathan Labenz: You may have noticed that we have some of the world's best interpretability researchers working at Goodfire.
Tom McGrath: Yeah. Michael Pearce is one of the authors of this paper on minimum description length, and I think it's a really neat idea, because it basically gets to what you were saying earlier, that what you want is effective, compact descriptions of what's going on. Ultimately what we're trying to do in interpretability is describe neural networks. I'm trying to describe it to you, or Claude is trying to describe it to me or whatever, and we're trying to find decompositions that are easy to describe and that describe the model accurately. So the minimum description length is this idea that what makes a good description is a description you can transmit in relatively few bits. And the idea is that, rather than using sparsity as our regularizer and our metric, where progress in this SAE paradigm has typically been quantified on this Pareto frontier of sparsity versus reconstruction loss, under minimum description length there's really only one metric, and that is how many bits it takes you to describe what's going on. And there are various technicalities about how you actually do this with an SAE. I think basically it's a metric that we can probably mostly agree is a good idea. The problem is how you actually implement it, how you optimize it, and so on. But it has various other nice properties. If I wanted to have this tree structure in my features, under an L1-type sparsity regularization this wouldn't actually be preferred. But under minimum description length, I can have a feature in, you know, I've got a tree, I have the root and the branch, well, the trunk and the branch, right? If I can say that the branch is high probability given the trunk, but otherwise low probability, then this has a shorter description length than having to describe both of their probabilities independently. It's hard to give a good overview answer of this.
Nathan Labenz: Maybe one way to approach it is what is the artifact that we get with a sparse autoencoder? I have the intuition that the thing may have 10 million nodes or whatever, but only 100 of them are going to light up on any given forward pass. And that could be a hard cap or there are different...
Tom McGrath: Mm-hmm.
Nathan Labenz: Batch top K is an interesting wrinkle on this where you can define exactly how many nodes will be active or control that in a few different ways. But if I apply the minimum description length, what is the thing that I get out?
Tom McGrath: Yeah. So the idea of minimum description length is more like it's a way by which we should compare all of our various approaches to decomposing models. If for the same level of accuracy in terms of decomposition, one has a shorter description length than another, then we should prefer the one with the shorter description length. So for technical reasons, something which is tree structured would generally have a shorter description length than something which is not tree structured, which is just a bag of features. As long as you know the conditional probabilities along the tree, you generally have a shorter description length. And that matches with our intuitive understanding of it's easier to talk about things in terms of relationships between parts than it is by just enumerating all of the parts separately. But actually how you calculate this is quite difficult. The concept is simple: in general, prefer the thing with the shorter description length. All the meat is in the question, well, how do you calculate the description length? And so I'm just going to blast it in there.
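A back-of-the-envelope version of the accounting: the cost of a sparse code is roughly the bits needed to say which latents fired plus the bits needed to say how strongly. This is a crude illustration of the comparison Tom describes, not the actual metric from the minimum description length paper.

```python
# Crude description-length estimate for a sparse code: index bits + value bits.
# bits_per_value is a placeholder for however finely magnitudes are quantized.
import math

def description_length_bits(n_active: int, dict_size: int, bits_per_value: float = 8.0) -> float:
    index_bits = n_active * math.log2(dict_size)   # which latents are on
    value_bits = n_active * bits_per_value         # their (quantized) magnitudes
    return index_bits + value_bits

# Comparing two hypothetical decompositions at the same reconstruction quality:
print(description_length_bits(n_active=64, dict_size=16_384))   # denser code, smaller dictionary
print(description_length_bits(n_active=32, dict_size=65_536))   # sparser code, larger dictionary
```

Exploiting structure (for example, conditional probabilities along a tree of features) is what lets a decomposition spend fewer bits than this flat bag-of-features accounting.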
Nathan Labenz: Is it something that you can apply directly as an optimization target as well?
Tom McGrath: Not yet. I think we'd like to be able to, but not currently.
Nathan Labenz: Gotcha.
Tom McGrath: It is in some sense the ideal optimization target, if only we could optimize it.
Nathan Labenz: I don't know if it would be batch top K versus a more naive sparse definition or maybe some other...
Tom McGrath: Yeah.
Nathan Labenz: ...that you would highlight, but how has the optimization target improved? That's basically us conceptualizing the problem better, right? And you can measure that by this minimum description length. My sense is...
Tom McGrath: Oh, I see.
Nathan Labenz: ...that batch top K is a better way of conceptualizing the problem than slightly more naive ways that came just before it.
Tom McGrath: I think they would have the same description length, more or less: an SAE with batch top-k versus an SAE with ReLU versus an SAE with JumpReLU. All of these would have approximately the same description length, I believe, all else equal. But being able to have something that's expressed in terms of a tree structure, or there being various subspaces, these would have a different description length. Or for instance, to take a detour into circuits for a minute, if I had to describe something with cross-layer superposition by enumerating each of the separate layers that I cared about, and there being one SAE feature for this layer, one SAE feature for this layer, one SAE feature for this layer, plus also noting that they're the same thing, that has a longer description length than me saying there is a feature which is spread across these three layers. So I think it's on a higher level of description. It's on a higher level of abstraction than, say, batch top-k versus JumpReLU.
Nathan Labenz: So all in, where are we on doing the machine learning better part? Again, my most recent point of reference is the Anthropic tracing model thoughts. And my sense was there's a lot that's not reconstructed, right? So if you...
Tom McGrath: Mm-hmm.
Nathan Labenz: ...were to push, if we say, what's the best case scenario? Can we reconstruct a model that kind of works? Or does it not really work at all still? And is it therefore only limited to these very... because in the Anthropic work there, much of what they were doing was prompt-specific analysis with custom error terms thrown in for those particular cases that they wanted to study. And that doesn't... Certainly there's still a lot to be learned there, but I guess how close are we to effective reconstruction of models as it stands today?
Tom McGrath: So I think there are two aspects to that question. One is, can we essentially throw compute at the problem to get there? And the other is, how much will it take? If the answer to the first is yes, we can, then how much more will it take? And if you think about this in terms of scaling curves, this is like asking, does it plateau or does it not plateau? Is there an irreducible error? And this question has been examined a little, not as much as I would like it to be examined. Did we talk last time about the Dark Matter of Sparse Autoencoders paper by Josh Engels and collaborators? I don't think we did. It's a really interesting paper.
Nathan Labenz: Yeah, I don't think so, but we definitely should dive in.
Tom McGrath: Yes. So what they did is this scaling analysis, and they said, "If you were to keep scaling SAEs up, would you be on track to recover all of the activation?" And their experiments suggested that the answer was no. If you trace the scaling curve out, there's a substantial amount of dark matter: the scaling curve does not go up to 99.99 or 100% or something like that. It actually plateaus. It's actually on track to not recover everything, and no one really knows what dark matter is. That's why it's dark matter. If it were easy to understand, we would already have understood it and probably baked it into a new kind of SAE architecture. But this experiment hasn't been repeated as far as I know, and I feel like there's a lot to pull on there if you did it with updated models and updated techniques. I didn't even mention, on doing the machine learning better, this recent SPADE paper, which is a much more expressive version of sparse autoencoders. We should put that in the links. But if you were to redo this analysis with the latest techniques, would you still find that there's dark matter or not? I suspect the answer is yes. I suspect you would still find dark matter, and getting rid of that seems like a pretty big deal.
Nathan Labenz: I mean, we still have
Tom McGrath: other way of
Nathan Labenz: in all the scaling laws, right? There's always
Tom McGrath: Yeah.
Nathan Labenz: a constant term on all the transformer scaling laws too, right? So would it be... I mean, the prior guess would be that there's some dark matter minimum that you might approach and you'll have to chip away at it in the same way that we do with a normal loss, which would be you have to 10X your inputs for the next increment. And I guess that would be
Tom McGrath: Ah, no, so the question is, even if you were to scale up compute as much as you possibly could, would you... Well, it's the difference between a curve on a log plot that is a straight line and a curve that eventually bends. And if your curve bends, then you have a problem, because you're no longer on track to reach 100% reconstruction. And I think the claim of the dark matter paper is that it bends. And it bends because there are things that we really cannot efficiently reconstruct. Now, I don't really know what they are. I think figuring out what they are would be very important. It could be that there are things like memorization, or various kinds of higher order structure that SAEs are just not very good at reconstructing and learning. It could be that they're actually noise. Because the dark matter paper was on reconstruction accuracy, well, actually fractional variance unexplained, it could be that the dark matter is just noise that has no relevance to model outputs. And so if you were to repeat the analysis with loss recovered, for instance, it might look different. There's so much that we don't know that's actually very important.
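The shape of the question can be made concrete with a toy curve fit: model fractional variance unexplained as a power law plus a constant, and ask whether the constant (the "dark matter") is nonzero. The data below is synthetic, generated from the model itself, purely to show the fitting step; it is not from the paper.

```python
# Toy version of the dark-matter question: does FVU(width) = A * width**(-alpha) + C
# have a nonzero irreducible term C? Synthetic data only, for illustration.
import numpy as np
from scipy.optimize import curve_fit

def fvu_curve(width, A, alpha, C):
    return A * width ** (-alpha) + C

widths = np.array([4e3, 1.6e4, 6.5e4, 2.6e5, 1e6])          # hypothetical SAE widths
true_fvu = fvu_curve(widths, A=2.0, alpha=0.35, C=0.1)       # synthetic ground truth with C > 0
fvu = true_fvu + np.random.default_rng(0).normal(0, 0.002, size=widths.shape)

(A, alpha, C), _ = curve_fit(fvu_curve, widths, fvu, p0=[1.0, 0.3, 0.1])
print(f"estimated irreducible FVU (dark matter): {C:.3f}")    # nonzero C means the log-log curve bends
```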
Dan Balsam: But okay, so just to make sure I'm not confused, and maybe I am. Going back to just general transformer scaling laws of whatever vintage we want to look at, there's always this constant term. I interpret that as the scaling laws do not suggest that we're ever going to have zero loss, but on our way to the theoretical minimum, we have this 10X for the next increment of progress.
Tom McGrath: Mm-hmm.
Dan Balsam: And does that theoretical minimum imply the same kind of bend? Is that the same kind of theoretical minimum that you're describing, on the capability side?
Tom McGrath: Presumably.
Tom McGrath: I mean, presumably they both imply that at some point it must bend, unless it hits the minimum exactly, just bang, and then goes horizontal. There's an interesting difference between the scaling laws, though, which is that there are two sources of irreducible error, I suppose, in a language model scaling law. One is the uncertainty of the world. If I literally have two strings in my pre-training corpus that are identical up to token T, and then at token T plus one they diverge, well, now I have a floor. I can never get down to zero loss training on this corpus, because it's 50/50: given the prefix, the outcome is no longer deterministically predictable, even if I had a perfect model of the data. So that's one thing. The other is architecture dependence, where there are things the transformer simply cannot capture, and these are somehow detectable from the scaling law; we expect it to asymptote out because of architectural limitations. The difference, I suppose, is in what we're willing to accept. Maybe capabilities people would disagree, but it seems okay for a language model not to reach zero loss. I would much prefer that my SAE or my interpreter model achieve perfect reconstruction, or at least that reconstruction be very high. That seems pretty important.
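A tiny worked example of the first source of irreducible error, with hypothetical probabilities: the entropy of the true next-token distribution after a fixed prefix is the best cross-entropy any model can achieve on that prefix.

```python
# Minimal sketch of the point above: if the same prefix is followed by two
# different tokens 50/50 in the corpus, even a perfect model cannot reach zero loss.
import math

def irreducible_loss(continuation_probs):
    """Entropy (in nats) of the true next-token distribution after a fixed prefix.
    This is the best achievable cross-entropy for that prefix."""
    return -sum(p * math.log(p) for p in continuation_probs if p > 0)

print(irreducible_loss([0.5, 0.5]))   # two equally likely continuations -> ~0.693 nats
print(irreducible_loss([0.9, 0.1]))   # a dominant continuation shrinks the floor -> ~0.325
print(irreducible_loss([1.0]))        # fully determined next token -> 0.0
```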
Dan Balsam: I think there's also the question of what abstractions you are working with here. What are the limitations of the technique? If we had wide enough SAEs, what would the features actually be?
Tom McGrath: Mm-hmm.
Dan Balsam: Dataset examples. They would be specific dataset examples for every activation pattern it has seen, because that's what it's going to learn. So in the context of an SAE, you are trading off generality against memorization and specificity.
Tom McGrath: I think this is a really interesting intellectual exercise, and I think it's very useful. If you think about higher-order structure, this is also somewhere you can soak up features. Say, hypothetically, that my true data lies on a ring around the origin. There is only one feature, which is the angle: how far around the ring have I gone? But if I want to try to reconstruct this data with SAEs, then I can put infinitely many features into it; they will sit at different points on the circle. So I can soak up an infinitely large amount of compute, whereas, in fact, what I really needed was one parameter. This is another example of there being fundamental scientific insight, fundamental knowledge about how models represent their activations, that we need to convert into inductive biases.
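Here is a minimal sketch of that ring example on synthetic data. The 1-sparse nearest-atom code below is a crude stand-in for an SAE, not a trained one, but it shows the shape of the problem: the dictionary needs ever more atoms to chase the curve, while a single continuous parameter, the angle, describes the data exactly.

```python
# Sketch of the ring example: data on a unit circle is exactly described by one
# parameter (the angle), but a 1-sparse dictionary code (a crude stand-in for an
# SAE) needs more and more atoms to drive reconstruction error down.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=5000)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # true data: points on a ring

# "SAE-like" code: k atoms placed on the circle, each point reconstructed by its
# nearest atom (a 1-sparse, unit-coefficient code).
for k in [4, 16, 64, 256]:
    atom_angles = np.linspace(0, 2 * np.pi, k, endpoint=False)
    atoms = np.stack([np.cos(atom_angles), np.sin(atom_angles)], axis=1)
    nearest = np.argmax(X @ atoms.T, axis=1)               # closest atom per point
    err = np.mean(np.sum((X - atoms[nearest]) ** 2, axis=1))
    print(f"{k:4d} atoms -> mean squared error {err:.5f}")

# The "right" description: a single continuous feature, the angle.
recovered = np.arctan2(X[:, 1], X[:, 0])
X_hat = np.stack([np.cos(recovered), np.sin(recovered)], axis=1)
print("one angle parameter -> mean squared error", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```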
Dan Balsam: To add a concrete example to the ring one: there has been work showing that transformers can encode days of the week, for instance, on a ring. This makes sense, because you can think of days of the week as a mod-seven operation: you add until you get to seven, then you loop back around to one. But if you are training an SAE, what are you going to recover? You are going to recover Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. In the case of days of the week, you can think about that particular abstraction both ways and it makes sense, but for certain types of abstractions it is very reductive to look at it that way. A good example: we know transformers can do addition, and the way they do modular addition is with a set of trigonometric operations. If you look at Anthropic's latest work with the CLT and what they pulled out of Claude in terms of doing addition, it looks like overlapping heuristics. You can look at that and say, "This transformer is doing addition in this very unintuitive, dumb way." And maybe it is, or maybe this is an artifact of the measurement apparatus, and it is computing some function over some continuous geometry, but the nature of the measurement apparatus means we will never be able to see that continuous geometry. So I think these are really important, interesting questions. It does not mean that it is not helpful, or that you are not gaining information about the model, or that it is not telling you something important when you are doing this reductive form of measurement. But you do have to keep in mind: even with a perfect SAE, what type of information would you struggle to recover?
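A simplified illustration of the trigonometric story for modular addition, the "clock" picture from the modular-arithmetic interpretability literature rather than the exact circuit of any particular model: encode the operands as angles, add the angles, and decode by matching against each candidate answer.

```python
# Simplified illustration of the "clock" view of modular addition: encode a and b
# as angles, add the angles, and decode by matching against every candidate answer.
import numpy as np

def mod_add_clock(a, b, p):
    angles = 2 * np.pi * np.arange(p) / p
    summed = 2 * np.pi * (a + b) / p                 # angle of a plus angle of b
    scores = np.cos(summed - angles)                 # logit-like score for each candidate c
    return int(np.argmax(scores))                    # peaks exactly at c = (a + b) mod p

p = 7  # days of the week
assert mod_add_clock(5, 4, p) == (5 + 4) % p         # e.g. Saturday + 4 days wraps around
assert all(mod_add_clock(a, b, p) == (a + b) % p for a in range(p) for b in range(p))
print("clock decoding matches (a + b) mod p for all inputs")
```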
Nathan Labenz: Related somewhat, intentionally perhaps, but the mention of modular arithmetic takes me back to the grokking paper. I remember this one plot where the training accuracy shoots up quickly as the model memorizes all the training data it sees, and then it is orders of magnitude later that the grokking effect happens. That happens over the last order of magnitude, which is important to keep in mind: that last order of magnitude is 90% of the training time. It is weird that way, going from ten to the five to ten to the six steps, after it had memorized by ten to the two steps. A mental model I have of any model I am working with is that some things have probably been grokked, some things are mid-grokking, and some things are not at all grokked at the time I am using the thing. This in a way seems analogous to the fundamental world uncertainty when you are trying to interpret a model. You are making an assumption that things are grokked, to some degree. The notion of these features and circuits and all this stuff implies a level of grokking that is not just random, bizarre memorization that happened to be the first thing that worked on the training set. I imagine that is a fundamental challenge of all of this: some of the things you might want to recover may only be partially learned in the first place.
Dan Balsam: I think what we really want is a sort of interpreter model that's algorithmically neutral. If it's grokked, then I want to see that it's grokked. If it's a half-baked heuristic, I want to see the half-baked heuristic. And if it's a memorized thing, I want to see the memorized thing. Actually getting that is, again, difficult. But I think you're absolutely right to call out this sort of... It's easy to assume that everything is going to be clean, but a lot of the time they are going to be messy heuristics that maybe even make up the bulk of the computation most of the time.
Nathan Labenz: Okay, so we're moving into parts two and three: the gap between what a feature is in some grounded sense and what we label or understand it as, and then obviously moving into circuits as well. Maybe the best way to talk about this is: what are the downstream things that we're actually using interpretability for today? The way we imagine closing this gap between the grounded truth and the labeling seems to be increasing automation of a lot of these different techniques, cross-validating them against each other to make sure things are actually checking out at scale. But right now we're largely not scaling all that out; we're largely exploring in a more naturalist way, and we've got all these different tasks that we'd like to be better at. Can you tell us the story of recent progress in terms of what actual utility we're getting from our interpretability helpers?
Dan Balsam: Yeah. So, I'd love to talk about all the applied work we're doing. Before diving into that, I want to make one more point on features, if that's okay?
Nathan Labenz: Please.
Dan Balsam: All right. Which is that even within features, taking as a given the SAE paradigm for recovering features, there's some percentage of them we can explain with the way that we label today, and some percentage we can't. One thing that's come up a lot in our work with customers is trying to develop a taxonomy of the different types of features that you might find. The way that we label features right now predominantly comes from ground-truth external data: it's connecting some pattern inside the model, some direction in activation space, to some pattern in the inputs of the model. And these can get pretty abstract. In our work with R1 and with reasoning models, we managed to recover features that are pretty neat, that are qualitatively different from the things we saw in language model features, and that seem to represent important units of the model's reasoning process. But then there are also many features which could decompose in similar ways, could be explained, they're not dark matter features, but they refer to algorithmic processes happening inside the model itself that aren't easily visible from looking at input data or output predictions. A clear example of this would be in-context learning. Something happens in in-context learning; people have looked at it and studied it. There's an algorithm implemented by the model that can be difficult to observe from the outside, but that has really important implications for the downstream prediction. So when we look at biology models, for instance genomics models, which is something I'll elaborate on more when we get into the applied work that we're doing, some percentage of the features correlate very strongly with known biological structures, and then many features don't. And for those features, the question is: are those biological structures that we don't know yet? Or are those structures of the computation of the model itself? How can we tell the difference? In either case, that's extremely useful information, because information about how you model these biological systems is, in and of itself, really important to the question of scientific discovery. So I think it's important, when building up your taxonomy of features, to think of it as a spectrum. There's the percentage of features that we can explain by observing the inputs, and we can explain those with high levels of confidence, though our confidence goes down as we move into finer detail. Then there are the features that represent something more abstract than just something about the inputs: something about the model itself that we need to understand, or maybe something in the scientific domains about the inputs that we don't understand yet. You keep going along that spectrum and eventually you get to things that we struggle to explain with current techniques, which is why we need to invent new techniques to push even just the interpretation of the features that we can recover. And then there's the dark matter. So -
Nathan Labenz: Yeah.
Dan Balsam: So there are a bunch of different dimensions of this problem. But I think it's important for the viewers to also think about this taxonomy of features. What might a feature be doing? And it might be doing something that's actually entirely invisible from both the input and the output.
Nathan Labenz: Yeah, so if you were to do few-shot learning, one thing that strikes me as a way to bridge the gap there, and again, feel free to de-confuse me, would be to expand the window. There's lots to recommend about the Anthropic work of course, and the interface is characteristically really nice for being able to explore what they're doing in an interactive way and probe into it yourself. When you click on a feature, it shows, "Okay, here are the examples from the dataset that maximized that feature." You're typically looking at a pretty short snippet, because you're three tiers down a UI, from a webpage to an embed to a little pop-up within the embed.
Dan Balsam: Yeah.
Nathan Labenz: I don't actually know how big those snippets typically are.
Dan Balsam: Right.
Nathan Labenz: But if you imagine a few-shot learning feature being, "Okay, we have recognized that there is a recurring pattern here in some very abstract way and our job is to continue it," "our" being the model.
Dan Balsam: Mm-hmm.
Nathan Labenz: I guess in this narrative. Then you wouldn't see that if you had a 20 token window, let's say. But if you zoomed out to a 20,000 token window, you might see the whole thing. And I imagine some similar things could be happening in biology where you're way downstream of an activator sequence that turns this thing on in the first place, or what have you. So, is there a sharp distinction between these grounded features and the computation features? Or is it just a question of our ability to-
Dan Balsam: Yeah.
Nathan Labenz: ... zoom out far enough to see the pattern accurately?
Dan Balsam: Yeah. This is a really great framing of the problem. I think in-context learning is a good example here because, yeah, you could zoom out and include more context. Generally, we do auto-interp, which is this process of labeling with different amounts and different types of context depending on the problem and the domain in which we're operating. In-context learning is an interesting example, right? Because if you zoom in too much, you wouldn't be able to see it. If you zoom out, you can probably tell it from the prompts, but it's also an algorithm that we already know to expect in the model to begin with. And we would expect that a frontier model can probably identify a meta-pattern if it sees a bunch of examples as well, because this is a type of thing we've identified that models can do and have already described pretty well. In principle, if you zoom out to the entire genome, and you had some features that were active in a bunch of locations across the entire genome, and you had a frontier model go and look at those, its ability to label them effectively for you in an automated way is bounded by whether or not an explanation was in its training data to begin with. And in many cases in the scientific domains, there's no such explanation. We're working with sets of abstractions that are pushing the frontiers of human knowledge in some way. So we need some way of labeling and thinking about how these features compose that can push past that. You need to be able to look at a bunch of different contexts, say, look at 100 different genomes, see these patterns, and reason about what they might be doing together. You would have to be an expert in the human genome, perhaps even beyond the level of the greatest human experts today, to reason about that just on the basis of inputs. So one thing we're interested in is: can we break this down into a set of easier problems? If we have 10% of these features that we can explain just from patterns in the inputs in an automated way, can we work with domain experts and scientists to keep pushing that frontier of possibility even further, explaining more and more of what the model is doing?
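For readers who want the shape of the auto-interp loop described above, here is a hedged sketch: gather the contexts in which a feature fires hardest, widen the window when the pattern seems to live at a larger scale, and hand the snippets to a labeling model. The function names and the labeling call are hypothetical placeholders under stated assumptions, not Goodfire's actual pipeline.

```python
# Hedged sketch of an auto-interp loop: collect top-activating contexts for one
# feature and ask a labeling model to describe the pattern. `get_feature_activations`,
# `tokenize`, and `ask_labeling_model` are hypothetical placeholders.
from typing import Callable, List, Sequence, Tuple

def top_activating_contexts(
    docs: Sequence[str],
    get_feature_activations: Callable[[str], List[float]],  # per-token activation of one feature
    tokenize: Callable[[str], List[str]],
    window: int = 20,   # tokens of context around the peak; zoom out for "algorithmic" features
    k: int = 10,
) -> List[Tuple[float, str]]:
    scored = []
    for doc in docs:
        toks = tokenize(doc)
        acts = get_feature_activations(doc)
        peak = max(range(len(acts)), key=lambda i: acts[i])      # strongest-firing token
        lo, hi = max(0, peak - window), min(len(toks), peak + window + 1)
        scored.append((acts[peak], " ".join(toks[lo:hi])))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

def auto_interp_label(contexts: List[Tuple[float, str]],
                      ask_labeling_model: Callable[[str], str]) -> str:
    prompt = "These text snippets all strongly activate one feature. Describe the pattern:\n"
    prompt += "\n".join(f"- {snippet}" for _, snippet in contexts)
    return ask_labeling_model(prompt)
```

The `window` parameter is the crux of the zoom-in versus zoom-out question discussed here: too small and a pattern like in-context learning is invisible; large enough and the labeling model at least has a chance to see it, provided an explanation was in its training data at all.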
Nathan Labenz: So, how's that going? Because that sounds like maybe one of the more important questions we'll touch on today, right? I mean, one of my great hopes has been, and I know you share this, and are actually moving the frontier on it. But the idea that we can use these unsupervised approaches on natural data and then look in and see what the model is learning, and then learn it ourselves and actually have confidence in what we're talking about and make new discoveries. This seems like a really great driver of new discoveries. But you're complicating my naive optimism a little bit with this idea that we don't necessarily know if the features correspond to the real world or just to the internal model. That's a level of confusion that I'm certainly excited to hear how you're going to resolve.
Tom McGrath: Oh, I think they still correspond to the real world, in the sense that algorithmic features can correspond to the real world. Newton's laws do not exist in the world; they're a way of describing the world, right? We can talk about the idea of velocity as a type of feature, but things just have a velocity, they don't have the concept of velocity. The structure is still useful.
Dan Balsam: Yeah. And so this is one of the big ambitions and goals of Goodfire: to crack this question. We're working directly with customers across different scientific domains, with their scientific models, with this goal. For instance, viewers might have seen the research we've done with Arc Institute, recovering features that correlate very strongly with known concepts in the genome. Where Arc Institute and ourselves are pushing this collaboration now is moving towards unsupervised techniques that can help get us new information about the genome. We're actively working on this research, and hopefully we'll have some exciting things to share in the not-too-distant future. At a high level, if the model is generalizing correctly, at least sometimes, it's on this continuum where it groks some things and has memorized others. What has it grokked? What has it memorized? To be able to generalize, it must be learning really meaningful things about the underlying systems, so recovering those, even in an algorithmic form, tells us really interesting things about the system being modeled. A reasonable question you may have is, "Well, how do we think about the genome today?" Annotations in the genome are done using bioinformatic algorithms developed by humans, with strong priors. In some cases they work extremely well, and in some cases they actually don't work that well at all, but well enough to be a good starting point. Genomes are massive and their information complexity is extremely large, so we don't have good first-principles techniques to answer a lot of the bioinformatic questions we might want to answer. But it seems like some of these models are able to learn things that make these tasks easier. So it's going to take time. This is an important scientific project that's pushing the frontier of both transformer modeling in genomics and interpretability. But at the end of the day, going back to the paradigmatic beliefs that Tom mentioned earlier: the model is doing something, and it's doing something meaningful. And we have windows into the model now. They might not be the right windows to tell us everything, but we have strong reason to believe that these windows are good enough to start telling us important things. That's really why we're pushing on this very hard.
Nathan Labenz: When you say it's going to take time, do you mean Dario's two years to a country full of geniuses in a data center? Is that the kind of time we're talking about?
Dan Balsam: I'm going to give a hot take real quick, which is that for using mechanistic interpretability for scientific discovery, I think there's a decent chance that even in a world where you had a bunch of geniuses in a data center, this would be their preferred way of doing science. There are so many barriers to running experiments in the physical world: normative barriers, but also just physical barriers, parallelization barriers. The rate of scientific progress could be rapidly accelerated by moving as much scientific experimentation as is reasonably possible into simulation, and doing it on chips. That looks to me a lot like what people are doing when they're training autoregressive genomic models or diffusion-based materials models: they're running simulations of the physical world on chips. And interpretability is then your scientific toolkit for going in to actually understand what that simulation is doing and extracting principles from it that can inform further experimentation, and ultimately real-world scientific progress. An interesting example here is how pharmaceuticals are created. They're often developed with something in mind, and then they're tried. They have some range of side effects and some range of effects. Most of the time they don't work, and the vast majority of the time they don't work the way people initially think they will. These companies keep the drugs around because the process of manufacturing a new drug is expensive, and they hold the patents, so they just try them for a bunch of other things, and eventually maybe they find something that works, and that becomes the treatment. There are many examples of this. That is an extremely inefficient way to do science, and it's going to remain an extremely inefficient way to do science even in a post-AGI world. So to me it just seems pretty likely that you're going to want to run simulations. And if you're running simulations, why wouldn't you want to train giant models? And if you train giant models, why wouldn't you want mechanistic interpretability to help you make sense of your simulation?
Tom McGrath: So, I guess we have... Yeah, we're doing all right either way, right? Either timelines are longer... It wouldn't be an AI podcast without timelines, so I was wondering when we were going to get to it. Either timelines are longer and mechanistic interpretability is useful, or what are the geniuses in the data center doing? Mechanistic interpretability.
Nathan Labenz: Yeah. I think that does make a lot of sense to me as a convergent evolution basically, right? I mean, either way it sucks to have to do wet work. So as much as possible you want to move it all into silicon, learn as much as you can that way. Hopefully it's accurate. It won't always be right, but one way I phrased the question in the past is, when is it worth it to do the wet work? And as much as possible you want to elevate and validate your hypotheses before you actually take it into the wet lab to muck around with. And whether it's human or AI genius steering that simulation process, for all the same reasons, it seems like it's where we end up.
Dan Balsam: Yeah. Yeah. That's exactly how we think about it. The core thing that needs to be true here is that mechanistic interpretability, as it matures as a science, significantly pushes the frontier of the types of experiments, and the quality of the results of the experiments, that you're able to run on hardware before bringing them to the wet lab. That strikes me as overwhelmingly likely to be true. And again, we're doing this because we believe there is a really meaningful possibility to impact people's quality of life by bringing interpretability directly to models that could have real downstream scientific impact. Another good example of this is biomarkers of disease. In clinical contexts, there are two reasons to be interested in interpretability. There's the discovery reason we've already talked about, but there's also the fact that having explainability for diagnostics is really important. AI can often perform extremely well on these tasks in closed settings, but if you have an AI system that misdiagnoses someone, you can't go to their family and tell them, "Sorry, it went wrong and I have no idea why and there's nothing I can do about it." That core debuggability is an essential feature in clinical contexts as well. But you also want to be able to say, "Oh, the AI gave a surprising recommendation in this case, a surprising diagnosis." Was that diagnosis a result of the AI being wrong, or was it a result of some important input pattern that we hadn't seen before?
Nathan Labenz: One quick aside on multimodality, because you're doing this across reasoning models, for example, and across different models from science, genomics, and there's obviously lots more to come in terms of proteomics and higher orders of abstraction there. One mental model of superintelligence that I've been playing around with lately is basically: take a reasoning model at roughly the current level, and give it the same depth of integration into 20 modalities of interest that we already have with image in GPT-4o and Gemini Flash, where they can clearly manipulate the image in a way that shows a deep integration of your instruction and the visual space they're operating in. It's no longer going through this bottleneck of prompting the language model to prompt the text-to-image model; it's all joined in latent space, and you see the qualitatively different results. My baseline superintelligence case has been: do that again for 20 more modalities, many of them in natural-world modeling domains. And there you'll have a superintelligence, because you'll be able to reason around and manipulate things, but the things you'll be manipulating will be much closer to the fundamental stuff of reality. And in many of those cases, these are just things that people can't do, right? Nobody has an intuition for how a protein is going to fold, for example, or at least nowhere near the level that AlphaFold does.
Dan Balsam: So we talked a little bit about this last time. I want to start by saying that I have a wide distribution over timelines. Geniuses in a data center in two years is not outside of my Overton window at all. But on this particular point of just cramming a bunch of modalities in: I think there's a way in which we're still constrained in AI progress by human data, even in this RL regime, that's worth understanding. Why is it possible for these models to reason over images and text really well? Because we can construct tasks for which we have good reinforcement learning signals, and for which this is a capability they need to acquire to complete the task. When you start bringing in the scientific domains, we're often working with extremely sparse signals, where it's very hard to reason about what task I could easily train an AI to do such that a deep intuitive understanding of the human genome was necessary to perform it, while also having enough examples and a clear enough sense of the task itself to create a strong reinforcement learning signal in the first place. And then, for what percentage of the overall task the model is performing do these modalities actually integrate well? I think it's not a coincidence that combining image and text in problem-solving is something humans do all the time, and that this was one of the easier things, from an RL perspective, to train AI to do once you reach a certain level of intelligence. It's not just that those are easier modalities to combine in some abstract sense, though maybe they are; it's that we as humans are very well set up to think about and construct an optimization target at the intersection of those two things. So I think this gets at the question: do you hit a wall when you get to human-level intelligence in some way, shape, or form? Why would you, or why wouldn't you? Tom and I might have slightly different beliefs here, but my belief is that we just don't know. It's easy to draw the line out from what we're seeing and say these patterns extend to everything. But it could also just be true that these patterns extend to the things they've extended to so far. Just to come back for a second to the question of agents: why has it been easier to build AIs that can solve elite coding problems better than any human than to build AIs that can order DoorDash for me successfully? One of these things seems like it should require much less intelligence in some sense, yet it's been much harder to get models to do. I think it's just because it's harder to get good training signal in the agentic use cases. This is not something where we've captured a lot of data, or where you can write algorithmic verifiers very easily, and so as a result it's been harder to do. When you're extending to the scientific domains, it's even harder. If we had a "doing science well" verifier, that would be great.
Tom McGrath: Yeah.
Dan Balsam: But we don't. And I'm not saying that one doesn't exist, or that it's not possible to come up with one, or that, if we get general-level intelligences, they couldn't work on this problem and make progress on it. But it's a level of abstraction for which we don't have any evidence, and for which we have reasons to believe that setting up the conditions in which an AI could learn that task, in the way that we currently train AIs, might actually be pretty hard.
Tom McGrath: I suppose another way to put this is that the question here is, are AIs currently experiencing catch-up growth in the sense that a less economically developed country might benefit from catching up to a more economically developed country? Or are they just on this growth trajectory? Hard to disambiguate.
Nathan Labenz: Yeah, I mean, I think that is where the other modalities, and especially the models from different fields of science, seem like pretty strong evidence to me that they can do pretty critical tasks at an obviously superhuman level, right? We've gone well beyond just folding: interactions and multimers and all different kinds of molecules and metal centers now with some models and whatever.
Tom McGrath: Mm-hmm.
Nathan Labenz: I mean, it's really pretty far along in terms of... I also did an episode on it, the guys from Orbital Materials, on figuring out the mechanism of the potassium ion channel. And it's like, geez, this has gone pretty far already. What hasn't happened there is the integration of this sort of chain of thought kind of thing with the more, what I call it, just intuitive physics for whatever different domain the model happens to be trained in. It clearly has a better ability to take a random set of proteins and guess how they'll interact than I do. What it doesn't have is the sort of outer loop to be like, "I should try this. Maybe I should try this. What if I swap this one out for this one?" And so people are sitting there writing scripts against these models to do that outer loop or just maybe even doing it based on their own human intuition one by one in some cases. But I have a hard time imagining a world where that integration doesn't happen. It seems like we can get enough synthetic training data and set up a situation where it's okay, here's what it looks like to just grind through a bunch of these things to get to the point where the reasoning and modality X integration eventually just drops into place, right? I imagine that happening.
Tom McGrath: I think you're right.
Nathan Labenz: I kind of can't imagine it not happening.
Tom McGrath: The question is the depth of integration, I suppose. You might think that things are very deeply integrated in the sense of being in the same neural network, but even there you have no guarantee that you don't have two models in a trench coat: this half of the weights does the quantum chemistry and this half of the weights does the language, and there's not really any crosstalk, so the parts doing the language have no access to the process knowledge of how to do the chemistry. The way you would expect to get this crosstalk is if you can construct paired data, or data where the language ability is bearing on the chemistry and vice versa. We can do that relatively easily, and provide supervision of various sorts relatively easily, in image and language, because we understand both of those domains quite intuitively. We can give reward, as in we can give approval and train a reward model and then do RLHF, or we can obtain paired data from the internet. But where does the paired data or natural supervision come from to couple two modalities in domains where we don't have the same kind of intuitive understanding, because we can't give the same kind of approval? Now, it might be that the answer is more like you say: they're initially two separate halves of the neural network, let's say, and then you do some sort of training task where you gradually elicit the language reasoning in order to drive the chemistry part. But that doesn't feel like the same depth of integration as you get through pre-training, where you really do seem to have this big mass of compute all more or less able to access the rest. Well, actually, even that's not necessarily true: how much of a language model, when it's talking about one domain, is actually able to elicit its information from another?
Nathan Labenz: Maybe let's just do other applications, the move to circuits, and I mean, you could frame them conceptually or potentially by customer profile.
Dan Balsam: Yeah.
Nathan Labenz: ... there's the monitoring, classification, steering way of thinking about it, and then there's the retail and different models you're looking into. You could attempt both if you want to squeeze them in.
Dan Balsam: Yes, so I think there are three applications right now that we're very excited about. The list of potential applications is very long, but we're a startup, and we need to prune the tree a little bit and focus on what we think the highest-leverage bets are for us to take. The first is scientific discovery, which I've already talked a lot about, so I won't spend too much time there. Just to quickly summarize: you have these models, they're modeling physical systems, and they're able to model them in ways that traditional methods can't. So they must be learning something important that human beings currently don't understand. We want to explain that, and the hope and the vision is that those explanations are in and of themselves scientifically useful in pushing the frontier in science. The second application is more of a guardrails approach. So what
Nathan Labenz: Right.
Dan Balsam: ... interpretability techniques give you is a window into the model at inference time. If you think about the way a lot of enterprises currently set up their guardrail systems, oftentimes they're playing Whac-A-Mole with the prompt when they're setting up the prototype. The number of rules they specify, the contextual information in the prompt, balloons, and you end up in this very natural situation where your task performance is degrading because of all the information you have to keep adding: "Don't do this, don't do this, do this, don't do this." So people then move naturally to LLM-as-a-judge, but the scaling properties of LLM-as-a-judge are quite poor, because now you have a separate frontier model call. It's great for the labs' pockets, but it's not great for the consumer. If you have 1,000 rules, maybe you can bunch them together into different checks, but at the end of the day you run into the same problems of task degradation and poor scaling properties. You could train or fine-tune a small model to help guardrail your larger model, but many organizations lack both the data and the machine learning expertise to do this effectively. What we can do with our techniques is offer a cheap inference-time solution: we can watch the model's internal cognition and use it to trigger programmatic responses when certain things happen, when the model is potentially going to output certain things, is outputting them, or is reacting to certain inputs. This could be something like: if the model looks like it might be thinking about PII, flag that for manual review. Or there are certain topics I never want my model to talk about. Of course, there are limitations to what can be done; there are jailbreaks, there are all types of things. But for real practical use cases, there's a lot of opportunity here to offer cheap, fast, and effective checks in real-world scenarios, with much better scaling properties. Then the third category of application we're really excited about is creative models. We're going to be launching a demo soon; in fact, by the time this airs, it might already be out, demonstrating what you can do with image models when you start to understand their latents. In unsupervised ways, we're able to recompose the elements of an image from this deep understanding of what's happening inside an image model. We think this offers new types of user experiences. Even as image editing tools and other forms of AI continue to get better and better, you're still very much locked into bespoke forms of interaction, whether that's prompting or highlighting a region for inpainting. There are certain key interactions we had in earlier classes of image design software, like the ability to drag something, reorient something, or change some subtle property, that get lost in that process. So this image demo is a cool example, but we think we can push it further, and the value prop is even clearer in video and music, where the cost of editing is very, very high.
And wouldn't it be amazing to have generative music AI where you could say, "Actually, I want a little more saxophone in this saxophone solo," and you could just very strategically intervene on the output in a way to get specific new generations that adhered to exactly what you wanted in some context?
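To make the guardrail idea described above concrete, here is a minimal sketch, assuming you can capture residual-stream activations at some layer and already have feature directions for the concepts you care about: project the activations onto each direction, threshold, and trigger a programmatic response. The names, directions, and thresholds are hypothetical placeholders, not Goodfire's actual API.

```python
# Minimal sketch of an inference-time, feature-based guardrail: check the model's
# internal activations against known feature directions and trigger a programmatic
# response when one fires. Everything here is a hypothetical placeholder.
import numpy as np

# Each guardrail is a unit-norm direction in activation space plus a trigger threshold.
GUARDRAILS = {
    "possible_PII":      {"direction": np.random.default_rng(1).standard_normal(4096), "threshold": 6.0},
    "forbidden_topic_x": {"direction": np.random.default_rng(2).standard_normal(4096), "threshold": 8.0},
}
for g in GUARDRAILS.values():
    g["direction"] /= np.linalg.norm(g["direction"])

def check_guardrails(residual_stream: np.ndarray) -> list[str]:
    """residual_stream: (num_tokens, d_model) activations captured at some layer.
    Returns the names of any guardrails whose feature fires above threshold."""
    fired = []
    for name, g in GUARDRAILS.items():
        activation = residual_stream @ g["direction"]          # per-token projection
        if float(activation.max()) > g["threshold"]:
            fired.append(name)
    return fired

# Usage: a cheap check per generation step, with no extra frontier-model call.
acts = np.random.default_rng(0).standard_normal((128, 4096))    # stand-in activations
for name in check_guardrails(acts):
    print(f"guardrail '{name}' fired -> route to manual review / block / rewrite")
```

The point of the sketch is the scaling property Dan describes: each additional rule is one more dot product per token, rather than one more prompt instruction or one more judge-model call.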
Nathan Labenz: Maybe just my last thing before giving you the floor for the close-out pitch on Goodfire is, where are we? I think it was maybe a year and a half ago that we first talked about... Maybe just a year. Time is compressed. The idea that with these tools, you can allow many more people to get involved in the process of understanding models. And I've been fortunate enough to have a couple early previews of different model steering and exploring interfaces that you guys have built, and I wonder how you would characterize where we are on that today. Are the interfaces far enough along that we are now effectively able to enlist human intelligence, or are we still working on that and still more in the auto-interpretability world because we haven't quite cracked the paradigm that allows people to make intuitive use of all these new feature spaces that you're opening up for them?
Dan Balsam: It depends on the problem you're trying to solve, which I know is not a very satisfying answer. But in the case of images, as you'll see in our demo, you can reason about images and image features very intuitively. There are ways to visualize them that don't rely on any external assumptions about inputs. It's purely about visualizing what that feature does to the output. So there are fewer assumptions at least, and our flash image demo will demonstrate some of this in that context. There has been a lot of work internally at Goodfire, and also at other labs, and especially in the open source community, on different UXs and visualizations you can have. They all solve problems. They are all a way of looking at it. Right now we are not in a place where we have a single interface that can tell you anything you need to know about an arbitrary model. We are working on some things in the biology domain that we hope will generalize. But it's like everything else. All these tools have limitations. It's about the problem you are trying to solve and what you can learn in the process of trying to solve that particular problem.
Nathan Labenz: Cool. It has been a fantastic conversation. I really appreciate the time and I want to do it more regularly going forward. Take us home. Maybe give a few highlights, things you are most excited about, things you guys are looking for. Last time, if I do say so myself, I understand there were a couple inbound business opportunities, so you can put the bat signal out there for whatever you want to come your way this time.
Dan Balsam: At a high level, we are looking for people to join our team at Goodfire. If you want to work on what we all think is the most important unsolved problem in the world right now, you can join us to help reverse engineer what is happening inside of models. One of the really exciting things about Goodfire is that we are pushing on this problem from a bunch of different angles. We are not saying let's only push what we can do with existing tools, or let's only develop new tools; we are doing both at the same time, in dialogue and in concert with each other, which is really important to the ultimate progress. We are looking for great engineers and great scientists who are very motivated by our mission here and want to open up the black box. From a customer's point of view, if you fit one of the profiles we talked about, if you are someone training a scientific model, if you are an enterprise looking for more reliable usage of your LLMs in production, or if you are someone training a creative model and looking for new ways to open up the experience in that creative model, please reach out to us. You can email me at dan@goodfire.ai. We would love to talk to you and understand whether there is a good opportunity for collaboration.
Nathan Labenz: I mentioned it at the top, but we should also probably mention you just raised a bunch of money, including taking Anthropic's first ever external investment. So what more beyond that headline would you impart to people?
Dan Balsam: We raised $50 million led by Menlo Ventures, and a million dollars of that from Anthropic as Anthropic's first ever corporate investment. We are extremely grateful to have investors who really believe in what we are building and the mission, the problem we are trying to solve here. We think it is really important to have a company not directly tied to a scaling lab that is directly trying to solve the problem of interpretability. We intend to use all this money we have raised to help customers understand their models and to push the frontier of understanding in mechanistic interpretability. As we were saying before, hopefully across a bunch of scientific domains and unlock lots of new awesome experiences we can have with AI. Our core belief here is that interpretability is as big as AI itself. So there needs to be a lab that is focusing on interpretability and nothing else because the size of the opportunity here is really large. Grateful to have all of the support from investors like yourself, Nathan, and all the people who really believe in the thing that we are building and we are motivated to get out there and start making things happen.
Nathan Labenz: It is definitely a good candidate for the most important problem facing the world today. It has been a pleasure, and definitely a fascinating journey, trying to keep up with all the progress you guys are making. Again, come back soon. But for now, Dan Balsam and Tom McGrath, CTO and chief scientist at Goodfire, thank you again for being part of the Cognitive Revolution.
Dan Balsam: Thank you so much for having us. Thanks for having us.