Untangling Neural Network Mechanisms: Goodfire's Lee Sharkey on Parameter-based Interpretability
Today Lee Sharkey of Goodfire joins The Cognitive Revolution to discuss his research on parameter decomposition methods that break down neural networks into interpretable computational components, exploring how his team's "stochastic parameter decomposition" approach addresses the limitations of sparse autoencoders and offers new pathways for understanding, monitoring, and potentially steering AI systems at the mechanistic level.
Watch Episode Here
Read Episode Description
Check out our sponsors: Oracle Cloud Infrastructure, Shopify.
Shownotes below brought to you by Notion AI Meeting Notes - try one month for free at: https://notion.com/lp/nathan
- Parameter vs. Activation Decomposition: Traditional interpretability methods like Sparse Autoencoders (SAEs) focus on analyzing activations, while parameter decomposition focuses on understanding the parameters themselves - the actual "algorithm" of the neural network.
- No "True" Decomposition: None of the decompositions (whether sparse dictionary learning or parameter decomposition) are objectively "right" because they're all attempting to discretize a fundamentally continuous object, inevitably introducing approximations.
- Tradeoff in Interpretability: There's a balance between reconstruction loss and causal importance - as you decompose networks more, reconstruction loss may worsen, but interpretability might improve up to a certain point.
- Potential Unlearning Applications: Parameter decomposition may make unlearning more straightforward than with SAEs because researchers are already working in parameter space and can directly modify vectors that perform specific functions.
- Function Detection vs. Input Direction: A function like "deception" might manifest in many different input directions that SAEs struggle to identify as a single concept, while parameter decomposition might better isolate such functionality.
- Knowledge Extraction Goal: A key aim is to extract knowledge from models by understanding how they "think," especially for tasks where models demonstrate superhuman capabilities.
Sponsors:
Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive
Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive
PRODUCED BY:
https://aipodcast.ing
CHAPTERS:
(00:00) About the Episode
(06:07) Introduction and Background
(10:09) Parameter Decomposition Basics (Part 1)
(21:29) Sponsor: Oracle Cloud Infrastructure
(22:38) Parameter Decomposition Basics (Part 2)
(34:23) Computational Challenges Explored (Part 1)
(36:16) Sponsor: Shopify
(38:12) Computational Challenges Explored (Part 2)
(49:39) Loss Functions Optimization
(01:03:27) Method Limitations Discussed
(01:09:11) Stochastic Parameter Decomposition
(01:30:46) Causal Importance Approach
(01:44:15) Feature Splitting Solutions
(01:55:25) Future Applications Scaling
(02:00:36) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Transcript
Nathan Labenz (0:00) Hello, and welcome back to the Cognitive Revolution. Today, I'm speaking with Lee Sharkey, principal investigator at mechanistic interpretability startup Goodfire, about fascinating recent work that he and coauthors have done to start moving beyond analysis of the concepts that neural networks represent between layers and actually begin decoding how they compute within and across layers. We begin by discussing why an understanding of concepts isn't enough. On one level, this needs no justification. An approach that explains only the meaning of the intermediate results between layers, while it does take a serious bite out of the black box problem, leaves the layers themselves as smaller but still unexplained black boxes. But more concretely, research has also highlighted important weaknesses of the feature-centric approach. Conceptually, a sparse representation of features inherently loses a lot of potentially important information, which is encoded in the structure of how features are clustered together and otherwise meaningfully arranged in space. For example, in some networks, the days of the week are represented not by seven random directions in activation space, but by a set of directions that lie together in a plane, such that a simple rotation operation can act as a sort of next-day function that converts a given day of the week into the next. If history is any guide, there is presumably a lot more such critical complexity to be discovered. And so with this motivation in mind, and taking inspiration from sparse autoencoders and similar techniques that effectively separate clean concepts out of their usual state of superposition, Lee and team have similarly set out to decompose a neural network's parameters, which of course are used to process layer inputs into outputs, into simpler subcomponents that they hope will correspond to interpretable mechanisms that the network has learned.
Their first approach, published earlier this year, was called attribution-based parameter decomposition. Lee describes this as a sort of model unmerging: if you imagine counterfactually having started with a giant mixture-of-experts model and then merging all the experts into a single instance of the architecture, parameter decomposition would be the process of unmerging to recover those hypothetical original experts. Personally, I visualize it as expanding a neural network, which of course has width and depth defined by its architecture, into a third vertical dimension, where each vertical slice consists of a simpler subnetwork that's presumably needed only a small fraction of the time, such that when all the vertical slices are again collapsed down to a single model, they all add up to the original network weights. Whatever mental model you prefer, amazingly, by constructing a loss function that incentivizes three things: first, faithfulness, or how accurately the sum of the sparse networks reconstructs the original full network; second, minimality, or the idea that as few network slices as possible should be active for any given input; and third, simplicity of the discrete subnetworks derived from the process, you can actually get this to work. I really do think it's worth taking a moment to appreciate how incredible this is. As Lee says, channeling Ilya, sometimes the models really do seem to want to learn. That said, given the complexity of the target, you shouldn't be surprised to learn that this first method was still far from ideal. In addition to being computationally expensive and requiring massive memory for all those parameter copies, training success was extremely sensitive to hyperparameter choices, and there were conceptual issues with the gradient-based way that performance was attributed to parameters.
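To make the three-part objective concrete, here is a minimal NumPy sketch of how such a loss might be assembled. The tensor shapes, the weighting coefficients, and the use of the nuclear norm as a "simplicity" (low-rank) proxy are all illustrative assumptions on my part, not the paper's exact formulation:

```python
import numpy as np

def apd_loss(W_orig, components, active, lam_min=1e-3, lam_simp=1e-3):
    """Illustrative three-part APD-style objective (a sketch, not the paper's exact form).

    W_orig:      (d_out, d_in) original weight matrix
    components:  (k, d_out, d_in) candidate parameter subcomponents
    active:      (k,) soft scores for which subcomponents matter on this input
    """
    # 1. Faithfulness: the subcomponents must sum back to the original weights.
    faithfulness = np.mean((components.sum(axis=0) - W_orig) ** 2)

    # 2. Minimality: as few subcomponents as possible should be active
    #    for any given input (an L1-style sparsity penalty).
    minimality = np.abs(active).sum()

    # 3. Simplicity: each subcomponent should be a simple (low-rank) object;
    #    the nuclear norm (sum of singular values) is a standard proxy for rank.
    simplicity = sum(np.linalg.norm(c, ord="nuc") for c in components)

    return faithfulness + lam_min * minimality + lam_simp * simplicity
```

The interesting tension is that the three terms pull against each other: perfect faithfulness is trivial with one dense component, but minimality and simplicity push toward many sparse, simple pieces.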
Specifically, in some cases, critical components that had approached a local performance maximum would have a near-zero gradient and would be misclassified as unimportant. These weaknesses naturally inspired the second method that we discussed, stochastic parameter decomposition. For efficiency, instead of describing network subcomponents as sparse instantiations of the original network architecture, the new approach breaks each weight matrix into rank-1 components, which are matrix operations that read from one direction in activation space and write to another, and which can later be regrouped into larger units. Also, to better identify which subcomponents really matter, they replaced the gradient-based attribution with a novel approach that uses stochastic masking to help the network learn to predict each component's causal importance, effectively identifying which elements the overall network can't do without. Overall, it's a more scalable, stable, and accurate approach, and on toy problems with known ground-truth answers, it is able to successfully recover the expected mechanisms. Of course, as with any branch of interpretability, plenty of work remains. One big question is how we can algorithmically group the rank-1 subcomponents into semantically meaningful mechanisms. Much as sparse autoencoders require a feature labeling process, we still need to figure out how the rank-1 components ladder up to conceptually intuitive transformations. And of course, there will be plenty more challenges with scale. But as Lee explains, the promise of this work is tremendous. This kind of understanding could enable everything from surgical unlearning of capabilities to the identification and interpretation of novel scientific insights that models may have learned, including and perhaps especially from non-language data.
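As a rough illustration of the rank-1 idea, a weight matrix can be written as a sum of outer products, each reading from one input direction and writing to one output direction; stochastic masking can then be caricatured as randomly ablating subcomponents and checking whether the output survives. The shapes and sampling scheme below are toy assumptions for illustration, not the actual SPD implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# A weight matrix as a sum of C rank-1 subcomponents: W = sum_c outer(u_c, v_c).
# Each subcomponent reads from direction v_c in input space and writes to u_c.
C, d_out, d_in = 8, 5, 6
U = rng.normal(size=(C, d_out))
V = rng.normal(size=(C, d_in))
W = np.einsum("co,ci->oi", U, V)          # full weight assembled from rank-1 pieces

x = rng.normal(size=d_in)
y_full = W @ x

# Stochastic masking (caricature): sample a random binary mask over the
# subcomponents and run the masked weight. Comparing the masked output to the
# full output hints at which subcomponents the computation can't do without.
mask = rng.random(C) < 0.5
W_masked = np.einsum("c,co,ci->oi", mask.astype(float), U, V)
y_masked = W_masked @ x

error = np.linalg.norm(y_full - y_masked)  # large error => ablated pieces mattered
```

In the real method, the model learns to predict each component's causal importance from many such stochastic ablations rather than relying on gradients, which sidesteps the near-zero-gradient misclassification problem described above.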
All things considered, this conversation is a great chance to visualize what's going on inside neural networks, to ponder how much we've learned and how little that still seems to explain, to marvel at the fact that gradient descent and backpropagation can effectively optimize such complex architectures even under such complicated multipart constraints, and to appreciate the combination of brilliance and plain hard work that goes into figuring all this out. Right up there with understanding our own biology, and arguably no less important or challenging, understanding how and why neural networks do what they do is one of the grand scientific challenges of our time. So I really hope you enjoy this conversation about decomposing neural networks in parameter space with Lee Sharkey of Goodfire. Lee Sharkey, principal investigator at mechanistic interpretability startup Goodfire, welcome to the Cognitive Revolution.
Lee Sharkey (6:14) It's great to be here, Nathan. Thanks so much. A huge fan of the show.
Nathan Labenz (6:17) Thank you. That's an honor. I'm excited for this conversation. I always love to learn, and I expect to learn a lot over the next hour and a half or so. Real briefly before we get into the technical work that's gonna be the primary focus: you've moved recently. We've had two guests from your previous organization, Apollo Research, on the podcast over time, focusing on risks of deception and identification of deceptive behaviors in frontier models. And we've also had two different episodes with folks from Goodfire. Dan and Tom have done a great job orienting us to everything that's going on in mechanistic interpretability. You were doing mech interp at Apollo and recently moved to Goodfire. So maybe give us just a little bit of an update on how that came about, and then we'll get into the work itself.
Lee Sharkey (7:07) Yeah. Absolutely. So I started Apollo Research with a bunch of folks, including Marius Hobbhahn, who you mentioned you had on. And we were focused on detecting and ideally mitigating deceptive behaviors in frontier AI systems. One of the goals there for my team was to focus on the mechanistic interpretability side of that, where if we can read the thoughts, so to speak, of these frontier AI systems, maybe that will give us a leg up on detecting deceptive behaviors without necessarily needing to rely on their outputs. And yeah, I was really happy with the work we got done there, and I really love the organization we built and continue to support the work that Apollo does. I continue to be somewhat involved there. But we figured that it was probably better at this time for Apollo to really double down on the evals side of things, which left my team, which primarily focused on mechanistic interpretability rather than evals, a bit in the lurch. But we all agreed that this was the right step. And meanwhile, maybe six months before we actually made this decision, I had, it seems like ancient history now, ended up connecting Tom McGrath and Eric Ho, who went on to co-found Goodfire together. So I had a B-character role to play in the founding story of Goodfire. I was obviously very aware of Goodfire and massively supported the work they're doing. The team is absolutely amazing. And it just felt like a really natural fit to move over from Apollo to Goodfire, and it made sense for some of the team to come along as well, so we can basically continue the stuff that we were working on there.
Nathan Labenz (9:19) Was it 3 of you that made the move?
Lee Sharkey (9:21) Yeah. So me, Dan Braun, and Lucius Bushnack.
Nathan Labenz (9:25) Gotcha. Cool. Well, that's great. I and I assume the $50,000,000 Goodfire raise to support both team and compute certainly didn't hurt the value proposition either. Right? And we'll get into a bit more, you know, where the compute costs are for some of this work, but everything seems to be, you know, dependent on a healthy dose of compute. So the ability to draw those resources in from the private sector, you know, definitely as opposed to, you know, Apollo being philanthropically funded. Yeah. Also, you know, it seems like it could be a big differentiator long term for the ability to scale what you're doing.
Lee Sharkey (10:07) Yep. A 100%.
Nathan Labenz (10:09) Cool. Well, we're gonna, I think, primarily focus on two papers today, one of which came out still under the Apollo banner and the more recent one under the Goodfire banner. The first one is called "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition." I'll try not to retread well-covered ground too much, so I definitely would refer listeners who wanna go deeper on that particular paper and have a really thorough exploration of it to the AXRP podcast that you did not too long ago. I thought Daniel did a great job there, and I listened to the full thing to help inform myself coming into this. So maybe I'll just ask you the first question: what is attribution-based parameter decomposition, or maybe even more narrowly, what is parameter decomposition? And how does it differ from the schools of interpretability that people are probably familiar with? You can probably assume people, if they've listened to our feed much, are at least aware of the basics of SAEs.
Lee Sharkey (11:16) Yeah. So what is parameter decomposition then? Well, let's contrast it with activation-based decomposition. In interpretability, typically in the past, we've wanted to understand what is going on inside a neural network. And so what we've done is we've collected lots of activations of how the network has processed its input data. You've got these intermediate activations, and then it spits out an output. And what we've typically done is we started with the idea that, well, these activations should, in some sense, be involved in different things that the network is doing in order to compute its intelligent output. There is some sort of learned algorithm that the network has learned, basically, to exhibit its intelligent behavior. And activation decomposition basically is the idea that we can look at these intermediate activations, piece them apart, and say this part does this, and this part does that. By contrast, parameter decomposition is philosophically very similarly motivated, but instead of picking apart the activations, the idea here is to pick apart the parameters. Now why might you want to look at the parameters? Well, in some sense, the thing that we're really interested in is the neural network. And the neural network is not its input data, although it interacts in very special ways with it. The neural network is, in some sense, the parameters, the architecture that pieces them together, the nonlinearities that connect them. And it uses all these components to transform the input data into activations, and transform those activations until it gets to the output. In some sense, then, the parameters, the architecture, and the nonlinearities are the thing that we really want to understand. They are implementing the algorithm that the network has learned.
The activations, in some sense, are along for the ride, but they do interact in very particular ways; it's not quite as separable as that. But the basic idea here is that the network is using its parameters in different ways for different inputs. For an input where there is a cat, it will take these inputs and spit out a cat label, say. And for another input, maybe a picture of the Eiffel Tower, it may not use very many of the same parts of the network as it did when looking at a cat. So in some sense, there's this idea of modularity. Parts of the network are doing one thing, other parts are doing some other thing. There's a specialization in what the parameters are doing. And the aim of a parameter decomposition is to find these modules that are doing specific jobs, specific computations. The term we use for what we're looking for here is mechanisms: the mechanisms that the network is using in order to compute its behavior.
Nathan Labenz (14:50) Sort of just flush out that cat example a little bit more and just say some really basic stuff. All these neural networks are composed of layers. Often the layers are the same, you know, exact structure repeated over and over again, although not always. We've observed by doing interpretability on the activations that we can see things like fur has been detected and eyes have been detected and a tail has been detected. And we sort of see this, like, gradual move from low level features to higher order features as we go through the layers. And so there's this story of kind of understanding. It kind of mirrors, especially in these visual cases, of mirrors what I understand it would also be happening in the human visual system. Mhmm. And so that's remarkable coincidence. Yeah. But when we are purely looking at the results of the intermediate calculations, we can sort of say what concepts are active at any particular time with lots of caveats. Refer to earlier episode with Dan and Tom about, the sort of philosophical gaps between, like, the labels of these concepts and the underlying sort of, you know, what exactly is is being activated there and how does it relate to the labels is a bit fraught. But nevertheless, bracketing that for the moment, we have these sort of labeled concepts that we can say, okay. This seems to this feature always lights up for all these inputs. All these inputs, you know, seem to be for, and so that seems to be the fur concept that's activated. When that and the and the tail and the, you know, whiskers and the eyes are all the pointy ears or whatever are all activated, then in, you know, future layers, we get the cat activated. But we're not saying anything there about how the transformations are happening from layer to layer. Right? That's the big gap that has been sort of left by the sparse autoencoder work for future work. And and this parameter decomposition is basically is that future work that is now saying, okay. 
How do we actually move from these concepts through the layers 1 to another?
Lee Sharkey (17:05) I agree with that. I'll say also that it's not the only approach you might consider using in order to piece together how these representations at one part of the network may become other representations at other parts. There are other approaches that remain in activation space that you might also consider using. But, yeah, they both try to achieve this idea of characterizing the computation between representations rather than just identifying the representations themselves. Maybe one way of thinking about this is: activation-based methods identify the variables used in computations, while parameter decomposition and other such approaches aim to find the computations rather than the variables.
Nathan Labenz (17:56) So why do we need that? Why isn't it enough to say, well, we've got concepts A, B, and C in layer 1, and then in layer 2, we see concept D? Doesn't that tell us what the neural network is thinking, so to speak? What is not answered by that level of analysis that we still need to get clarity on?
Lee Sharkey (18:19) It gets somewhat philosophical. However, I'll try to convey how I think about this. What does it mean for the network to use particular variables? We can go in and we can say, look, there's this set of activations, and this set of activations seems to correspond to data points on which there are cats. But maybe we can also find groups of inputs where there's a cat, but it's a cat standing in a particular position. One group of your inputs is a cat standing in one position and another in another position. Now does the network use the fact that there are two different positions of these cats, or does it just have a cat variable? What would it mean to be able to distinguish between a network that just has a cat variable and a network that has multiple different cat variables? My proposition here is that what it means for a network to use particular variables is that these are the variables that the network does its computations over. Now it may be the case that this network that had multiple cat-in-different-positions variables actually might not use those things. To the network, they're all just cats. But you may nevertheless be able to look really in detail at the dataset and find lots of little distinctions within the activations, which the network might not itself actually use. In some sense, we want some way to identify the set of things that the network is using. And to do this, you might say, well, this is putting computations first and representations second, rather than putting representations first and computations second. It's like saying that the fundamental variables, so to speak, that the network is using are those over which it does computation.
And we can maybe find things that the network seems to represent, but there's not a particular reason to say that this is a feature the network uses just because we can find it in there. Does that make some sense?
Nathan Labenz (20:56) Yes. I think so, though it, as you said, does get a
Lee Sharkey (20:59) little philosophical.
Nathan Labenz (21:04) I was also thinking of the finding of multidimensional features as sort of another window into why the assumptions that underlie the SAE paradigm may need to be elaborated and may need a little richer treatment. You wanna tackle that as well? Hey, we'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing-fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high-availability, consistently high-performance environment and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Lee Sharkey (22:38) Sure. Yeah. So since some of the earlier SAE work in the latest bout of sparse coding work in the deep learning space, which maybe started late 2022 and picked up in 2023, some of the observations people have had are things like, well, it looks like the representations that we're finding seem to be spread across multiple different layers, and it's kind of confusing to think about what that might mean. There are also observations that these things we were finding, well, they were directions in activation space, particular patterns of neural activation, but that didn't seem to be the whole story. There seemed to be some sort of multidimensionality to them. It wasn't just one direction. It was perhaps a plane of directions, or even higher-dimensional spaces potentially. And it was kind of unclear how to think about these higher-dimensional features, so to speak; it was philosophically somewhat confusing. These were several of the reasons that motivated some of the thinking that led to attribution-based parameter decomposition, which was: well, what if a multidimensional feature just is an input variable to a multidimensional computation? And that's indeed why the network has bothered to go to the length of structuring its representations in this way. Without there being some sort of multidimensional computation applied to it, you might ask, well, why did the network bother to structure its activations in this really ordered way? There must be something for which it is using that order.
And this was why we were thinking, well, maybe it's the computations that are defining the structure of these features. And maybe the same kind of thing might apply in the multilayer case. For certain networks like transformers, which use a residual stream, it's very easy for computations to be spread over multiple layers, because it just works out kind of straightforwardly for them to do this. And it was kind of confusing to think about this in terms of SAEs as well. It just made some natural sense to think, well, maybe it's the same feature if it interacts with a part of the parameter vector that happens to span multiple layers. And then for this multilayer edge case, it just made a bit more sense to think about the computations, rather than the representations, as primal.
Nathan Labenz (26:11) And so an example of that is the days of the week, or I think the original grokking result with modular addition would probably fall under this heading. I guess it could vary across different networks, but at least in some settings, it has been found that it's not like there are seven day-of-the-week features that light up independently, where this is a really strong Monday signal or a really strong Tuesday signal, but rather there seems to be a plane in space through which the day of the week kind of rotates. And remarkably, it seems to sort of crystallize; the graphics on this are pretty amazing sometimes, where you've pretty much divided the circle into seven equal shares, and each day of the week has a different direction in this plane, as opposed to each of them having their own independent linear directions. And I think a similar structure, if not the same mechanism, has been found in the grokking result, where to do this modular addition, things are first kind of translated to an angle, then the angles are added trigonometrically, and then the result of that trigonometric calculation gets mapped back onto the final number. And that is seemingly hard to... maybe you can give a little more technical intuition for why that would be really hard to figure out just with features. I don't know exactly what you would find if you applied the standard SAE methodology to one of those things. Because if I'm not oversimplifying the standard SAE methodology, you would be expecting to see these seven distinct features, or in the modular addition case, I'm not even sure exactly what you'd expect to see.
But I don't know that you would expect to see a sort of cyclic trigonometric structure
Lee Sharkey (28:27) Mhmm.
Nathan Labenz (28:28) If you were looking for all independent features.
Lee Sharkey (28:33) Yeah. I I think it's probably easiest to think about the days of the week case. It so to be clear, it can it can be both, that there is both a, like, a multidimensional aspect to these like, representations as well as individual, individual days aspects to these these representations. And the reason is that sometimes you will you might want to, in some tasks, you might want to say, well, the day after Tuesday is and you you might just apply the, the computation that, like, you know, rotates the day of the week feature, because now you've just got 1 mechanism that you can apply to every single day of the week. But you might just some settings, you might really just want 1 day. You might just say, I was on my way to the shop on and there's really just, you know, a 1 direction that you wanna wanna say here. Say it's Monday. And so in some settings, you may just want, computations that apply to these individual, days and some settings where you might want these, like, higher dimensional computations, kind of 2 dimensional 1 that might rotate it. Right? And this is, I think, you know, an important aspect of of parameter decomposition. There's not, like, 1 basis that you might use to interpret, you know, the the whole model. And it's it's more that there are steps in an algorithm. And, you know, sometimes, these steps will, yeah. I guess just sometimes these some steps will be useful and others will not. And, yeah, I guess let's see. So the SAE and the SAE kind of approach might find might might not find this, like, shared, you know, rep shared computation that just rotates from, you know, 1 day to the next, it might find, you know, conditioned on prompts that, say, the day after, or, you know, the fall the day following or the day so on, following Tuesday is you you might find such sorry. You might find, like, 1 for Tuesday, 1 for Wednesday, 1 for Thursday, just to to rotate around every single time. 
This is just because it's trying to find sparsely activating parts of the dataset that have a particular shared property, and in particular, parts of the dataset that share a particular direction in activation space, which is not necessarily all the ways in which inputs might share properties. One of the ways in which they might share properties is that they use similar mechanisms. And these mechanisms may do the same operation but to different inputs, depending on what the prompt is. I'm not sure if that was clear, but this is one of the differences you might expect between parameter decomposition and an SAE-like approach. Now, there are other approaches that might get closer to what you might expect or want to find with an approach that identifies the computations. For instance, Anthropic recently released an update on MOLTs, M-O-L-T. I will need to remind myself what that stands for, but there's a similar principle there, where instead of looking at individual directions, it's looking at more multidimensional computations.
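The "one shared rotation mechanism versus one feature per day" idea can be sketched in a toy example. This is purely illustrative (it is not from the episode's actual experiments): days are one-hot vectors, and a single permutation matrix implements "next day" for all seven inputs at once, which is the kind of reused mechanism an SAE's per-direction view can miss.

```python
import numpy as np

# Toy illustration: one shared "rotate to the next day" mechanism.
# Days are one-hot vectors in a 7-dimensional space.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def one_hot(day):
    v = np.zeros(7)
    v[days.index(day)] = 1.0
    return v

# One shared mechanism: a 7x7 permutation matrix mapping each day to the next.
# np.roll shifts the identity's rows so that column i lands on row i+1 (mod 7).
next_day = np.roll(np.eye(7), shift=1, axis=0)

# The same single matrix handles all seven inputs; no per-day component needed.
assert np.allclose(next_day @ one_hot("Tue"), one_hot("Wed"))
assert np.allclose(next_day @ one_hot("Sun"), one_hot("Mon"))
```

An activation-based method that only looks for sparsely activating directions might instead learn seven separate day-transition features, one per prompt pattern, even though mechanistically there is just this one matrix.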
Nathan Labenz (32:34) Yeah. So I guess maybe another very simple way to say it is we want to also understand the functions. Like, if the whole project here is to understand why the AIs do what they do and possibly be able to intervene in certain ways, we're just going to have to dig in and figure out the nature of the transformations as well as the intermediate results. And the intermediate results can shed some light on that. Mhmm. But at times there may be assumptions in that approach that leave us blind to, and I thought a key point you made was, functions that are reused on different inputs and do the same semantically meaningful transformation on a given input. That is something the SAE approach could be entirely blind to, but which obviously is a pretty important aspect and potentially happens a whole lot. I guess we don't really know yet how much that may happen, but it certainly seems like it probably should happen a lot. Right? I mean, especially as we get to more and more powerful networks, it would seem like we are moving away from, graduating perhaps in a sense from, sort of simple rule-based combinations of distinct features, and probably toward higher and higher level abstractions, which would presumably then correspond to more
Nathan Labenz (34:13) functions that can operate over ranges of inputs and do the same useful transformation to some space of inputs. Yep. In a way, that's chunking reality. Right? And that's an important aspect of chunking reality that it seems like the SAEs are sort of hinting at but not really directly characterizing.
Lee Sharkey (34:37) This feels reasonable. Yeah. I think one of the cases that I like to think about is an SAE or, say, a transcoder, which is an SAE that takes as input some activation and spits out a prediction of what the activation will be at the next layer. Both of these will have a problem where, suppose the network is just doing a simple transformation. Suppose it's a rotation, or suppose indeed that it's actually just an identity transformation. You're going to have to spend a lot of representational capacity to represent the computations the network is doing at this particular transformation. Why? Because, well, say you've got a million different features, and there are a million different features in the output, especially in the case of the identity transformation, there are going to be a million there. And you would need one input feature and one output feature for each such input-output pair. Whereas, in essence, what the network is doing is actually very simple. It's just a transformation of this type, whether it's an identity or a rotation, and you've just got one object doing that transformation. That's the kind of thing we want parameter decomposition to be able to find, if it is indeed the case that networks are doing that kind of thing.
Nathan Labenz (36:08) Yeah. Okay. So we've got some examples of that coming up later. We'll continue our interview in a moment after a word from our sponsors. Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one, and the technology can play important roles for you. Pick the wrong one, and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready-to-use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert, with world-class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.
Nathan Labenz (38:13) So let's bracket that for a second and maybe just talk about how it works, the intuition for how the whole setup works. I think there are a lot of similarities to SAEs. I'll do the quick recap of the SAE. To do an SAE, you insert this very wide layer between two layers of a network, the thinking being that we know the normal width of a network is way smaller than the number of concepts it can handle. So clearly there's some superposition of concepts. It's not like one neuron corresponds to one concept. Any direction in this four or eight or sixteen or however many thousand dimensional space can represent a concept. And so the goal is to untangle those into a sparse thing, and for that you need this really, really wide layer. So you train through this reconstruction loss, where the goal is to project out into this really wide space but have a sparsity term in the loss function, so that some small, reasonable number of truly relevant features are activating, and then project back into the dense space so that you are recovering the behavior of the model. But then you can look at which things are activated, and you go through this labeling process at the end to say, okay, here are all the inputs that led to this particular thing lighting up; what does that in fact seem to be? And with that, we now have this thing that we can hopefully use for monitoring, detection of potentially bad concepts being activated, potentially steering. That's where Golden Gate Claude comes from, etcetera, etcetera. It seems like there's a pretty similar motivation here, saying, okay, if concepts are densely packed in this superposition way at the activations, that is, the results of the intermediate calculations, then perhaps the same is true about the computations themselves.
The network itself represents or combines a huge number of probably much simpler, more semantically intuitive calculations, computations, transformations, functions, whatever the exactly right word is there. And they're all densely packed and overlapping, and the sparsity of the dataset is also a key thing here, right? Because the reason this can work is that certain concepts that may be pointing in very similar directions almost never occur together. And so even though they may look very similar, since they almost never co-occur, that's fine; the network can sort of get away with it. And there's probably something like that happening in computation space as well. And so the same question kind of applies. Can we break this up into a really wide version of it that separates the flattened network we originally trained into all these more atomic units of computation? And if we can do that, then do they appear to actually have intuitive meaning, so we can start to go through them painstakingly and understand what each one is doing? How am I doing there in terms of creating the motivation? Anything I'm missing or anything you would correct or complicate?
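The SAE recipe described above, a wide overcomplete layer trained with reconstruction plus a sparsity penalty, can be sketched in a few lines. This is a minimal illustration, not any lab's actual implementation; the sizes (a 16-dim activation expanded into 128 latents) and the L1 coefficient are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-dim activation expanded into 128 SAE latents.
d_model, d_sae = 16, 128

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1

def sae_loss(x, l1_coeff=1e-3):
    # Encode into the wide, overcomplete space; ReLU keeps features nonnegative.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decode back into the original activation space.
    x_hat = f @ W_dec
    # Reconstruction pulls x_hat toward x; the L1 term pushes most latents to zero.
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum()
    return recon + sparsity, f

x = rng.normal(size=d_model)
loss, features = sae_loss(x)
```

Training would minimize this loss over many activations; afterward, each of the 128 latent directions is inspected and labeled by looking at the inputs that make it fire.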
Lee Sharkey (42:03) There's plenty I would complicate, but I think it's a really good intro. Yeah, I think that's just the right way of thinking about it. We think that there are many different computations going on inside this network, more indeed than it has neurons. And the way we think networks can do this is that they spread their computations out over their computational units, where these computational units are the neurons. By spreading it out, you get some nice properties. Even though you might overlap with other computations, because you spread this out you can kind of suppress the noise that comes from this overlap. You can basically fit more in, because you're silencing the stuff that you don't really want to be there, the kind of computations you don't want to be active right now. It's very similar in principle to the idea of representation in superposition, just brought into computational space. And the other thing you mentioned: there's this extra layer to what the network is doing. We don't just have a layers dimension and a width dimension anymore. We have this extra dimension where you're asking, what computation or computations is the network using at this particular layer, in this particular width dimension? And this does have precedent in other areas of deep learning. Mixture of experts is a kind of example of this, where you have layers, you have individual networks, but you can do different computations at the same layer, at the same neuron dimension, so to speak, depending on which expert you use.
And so there are senses in which this is kind of like, well, suppose every single network is just a big mixture of experts that's been smushed together. Can we separate out the experts? One way I've heard it described is that it's like model unmerging, where model merging is when you've combined two networks that do different things into one network that can do both. And we kind of want to do the opposite. We want to split this network that can do many different things out into networks that can each do only one thing, or as small a number of things as possible.
Nathan Labenz (45:03) Yeah. So the setup, to actually do the training of this, to actually do this splitting. I think your work does a great job of really emphasizing the loss functions. And I think there are a couple of different versions of it. We'll maybe start with the first one, and you can explain how the second one is new and improved and better in a couple of different key ways. But if I understand the first one right, it's basically saying, okay, let's make a bunch of copies of the network, with essentially the same footprint to start, and we'll have the constraint that all these different copies must sum to the original. So if you had, whatever, a billion parameters in your original model and you decided you're going to split this into a million different subcomponents, you would then have a quadrillion parameters, and we can maybe talk about the computational challenges and the memory requirements for things like this. But okay, now we've got the constraint that those million copies must add up to the original. We then want to say, okay, akin to the SAE sparsity requirement, to make what comes out of it hopefully intuitive and tractable and semantically natural, we want to have as few of those be active as possible for any given input, which you call minimality. And then the final one is, for each of those things, we also want to make them as simple as possible, so that they individually are interpretable. So again, we can split this into, and I sort of think of this as giving the network height. Obviously the width of the network is how many neurons it has at each layer, and depth is how many layers it has.
I think of this as splitting it into a vertical height dimension, where instead of having all this stuff happening in the same track of computations in a way that we can't untangle, we're now untangling it into all these different, hopefully atomic, computational subcomponent units, only a few of which will be in use at any given time. And ideally those can be distilled to their simplest form while still, of course, reproducing the original network behavior. A visual I have of this comes from the Tegmark group. I think Ziming Liu was the lead author on that paper, "Seeing Is Believing," where they trained very small toy models on simple problems, but with a strong sparsity feature, and they had these great graphics where you could watch an initially randomized network learn a particular function. You could also see that most of the weights would drop out to zero as the thing crystallized into its simplest form that could still do the task it needed to do. Those are relatively simple toy problems. So again, what would you add to my setup? It's those constraints, and now we're going to do this giant training optimizing all of that together, which is kind of an amazing thing to me. I felt this way about SAEs, and I think I feel it even more strongly here, that you can put all those constraints into a single optimization problem and it works. If you told me that in advance, if you said, here's my idea, I'm going to put all these different terms into one single joint loss function and try to optimize it all together at the same time, I would say good luck. It seems hard to find something that's actually going to work in that space. So, yeah, tell me if I'm missing anything, or anything you think is important to add for understanding there.
And then I'm really interested in like, how do you account for the fact that this actually can work?
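The three constraints described above (faithfulness, minimality, simplicity) can be written down schematically. To be clear, these are not the paper's actual loss functions: the minimality and simplicity terms below are illustrative proxies (a top-k attribution cutoff and a nuclear-norm low-rank penalty), and the sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one 8x8 target weight matrix split into 5 components.
d, n_components = 8, 5
W_target = rng.normal(size=(d, d))
components = rng.normal(size=(n_components, d, d)) * 0.1

def faithfulness_loss(components, W_target):
    # The components must sum to the original parameters.
    return np.sum((components.sum(axis=0) - W_target) ** 2)

def minimality_proxy(attributions, k=2):
    # Schematic: penalize attribution mass outside the top-k components,
    # i.e. only a few components should matter on any given input.
    idx = np.argsort(attributions)[::-1]
    return np.abs(attributions[idx[k:]]).sum()

def simplicity_proxy(components):
    # Schematic: nuclear norm (sum of singular values) as low-rank pressure,
    # so each component uses as little computational machinery as possible.
    return sum(np.linalg.svd(c, compute_uv=False).sum() for c in components)

attributions = rng.random(n_components)
total = (faithfulness_loss(components, W_target)
         + minimality_proxy(attributions)
         + simplicity_proxy(components))
```

The surprising empirical fact discussed next is that jointly optimizing terms like these actually finds sensible decompositions.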
Lee Sharkey (49:39) Yep. Yep. So, yeah, I think you did a great job of introducing the various losses. Just to recap: first of all, what are these things that we're summing up? We're calling them parameter components, and these are the things we want to approximate one individual job, or one individual mechanism, that the network has learned. So we have the faithfulness loss, which makes all these parameter components sum up to the parameters of the target model. We have the minimality one, which basically makes the decomposition do the same job as the original network, but using as few of these parameter components as possible. Then the simplicity one, which says we want these things to use as little computational machinery as possible. And that one is pretty important. Well, they're all important, but this one is important because a great way to satisfy the first two losses is just to use the original network itself: you're only using one thing, and it sums up to the parameters of the original model, but you haven't done any work to decompose it. So you need them all to be simple. The way we operationalize simple is that they should be low rank and involve as few layers as possible, and we can get into the technicalities of that in a bit. But, yeah, I hadn't actually made the connection between, say, the work that, did you say Ziming?
Nathan Labenz (50:59) Yeah. I think so.
Lee Sharkey (51:00) My apologies for messing up the name. But yeah. So I hadn't really made the connection between that kind of work and parameter decomposition. But the idea here is: suppose you aren't optimizing a network with those kinds of sparsity constraints and, I believe, some sort of locality constraint that they might have used, so that neighboring neurons did similar things, if I recall correctly. Suppose you weren't optimizing it with these constraints. Well, the network is in some sense still able to learn, I would guess, a very similar algorithm, no matter whether or not it's trained to be sparse in a particular basis and constrained so that these neurons are close together. I would guess we just need to figure out the way the network is doing the same algorithm, but in some sort of basis that we don't have direct access to. If the computations are the same, both have to be sparse; we just need to find the basis in which they are sparse. So, yeah, we're assuming the network is doing sparse computation even if we haven't optimized for it, and then we're just trying to find the basis in which it is indeed sparse. With regard to, sorry, how do networks manage to do these things? Well, I think it was Ilya's observation, as you said: the networks, they just want to learn. I think there's something really deep to this, in that the larger your network, the more ways it has to get things right.
And this is a pretty powerful principle, I think, just because it means it's actually much, much easier to find the right way to do something if you're simultaneously looking in many different directions in parameter space. So it's in some sense somewhat unsurprising that we could add many more different constraints, and it'll find some reasonably satisfying way to satisfy all of them. But it is kind of counterintuitive to our low-dimensional brains to think in these terms, and it's still kind of amazing. But I should also qualify this and say, well, the algorithm that we're talking about, attribution-based parameter decomposition, where your parameter components are very large and it's very computationally expensive and where the algorithm basically has a bunch of problems, it is still somewhat hard to get this particular algorithm working, which is why we ended up doing the follow-up work. And so even though it is still kind of amazing that it can find some solution that satisfies all these things, it's caveated with the idea that, at least for this algorithm, it was somewhat hard, but less so for others.
Nathan Labenz (54:17) Yeah. One practical question on the faithfulness concept, which is the idea that all the different vertical dimensions, let's say, of the identified subcomponents have to sum up to the original. Can you help me develop my intuition for exactly what's going on there? And I should say, for most of the problems that you've worked on so far, you could complicate this, but it seems like we're roughly at the toy-models-of-superposition phase of this work, where the models that we're studying are rather small. And then a big question, of course, is going to be what it looks like to scale this up. Mhmm. So it's not like a million layers. Yeah. Just the key point there. Yeah. But when one parameter in one particular layer gets turned up, that means that that same parameter position in all the other layers has to be turned down by the corresponding amount. Right?
Lee Sharkey (55:30) In these different parameter components, yes. As long as they all sum to the original parameters.
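The sum constraint Nathan is asking about can be made concrete with a toy example (the matrices here are invented for illustration): if one entry in one component goes up, the same entry must come down by the same total amount across the other components for the decomposition to stay faithful.

```python
import numpy as np

# Toy illustration of the faithfulness constraint: three parameter components
# that must always sum to the original weight matrix W.
W = np.array([[1.0, 2.0], [3.0, 4.0]])
c1 = np.array([[0.5, 1.0], [1.0, 2.0]])
c2 = np.array([[0.3, 0.5], [1.5, 1.0]])
c3 = W - c1 - c2  # whatever is left over

assert np.allclose(c1 + c2 + c3, W)

# Turn one entry of c1 up by 0.2; to stay faithful, the same entry must come
# down by 0.2 somewhere among the other components.
c1[0, 0] += 0.2
c3[0, 0] -= 0.2
assert np.allclose(c1 + c2 + c3, W)
```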
Nathan Labenz (55:38) Yeah. I said layer there, but I meant vertical layer, which is a concept maybe we should retire. But I am thinking about it in this vertical
Nathan Labenz (55:51) visualization, I guess, for myself. One thing I wasn't clear on there: I was thinking about this concept of attention sinks that I've seen in the past. I'm recalling this on the fly, but it's been found, in certain cases at least, that having a few sort of junk token positions at the beginning of a transformer can be really useful, because in the absence of that, that is, in the normal approach, attention has to look back to something. And so whatever tokens happen to be in those first five to ten token positions become really important, because everything kind of has to look back to that. And so the introduction of this buffer, which they called an attention sink, was a way to say, let's not overweight the initial tokens of this sequence. In some cases you might just want to look back to sort of nothing, and kind of recognize that these initial tokens aren't actually super critical to predicting what comes next. So let's not force all attention to land on those initial tokens.
Lee Sharkey (57:01) Mhmm. Mhmm.
Nathan Labenz (57:02) And with that attention sink concept in mind, I was kind of wondering, could there be a sort of junk sink concept in this setup? How do I not end up in a situation where the new subcomponents that I've created are learning whatever they're learning, which is recreating the initial model behavior. Mhmm. But maybe some of the other components, which would never need to get used, are in fact just taking on the opposite of whatever the ones that are actually doing the computations are learning. And could that create a sort of conceptual disconnect, where my new vertical layers are learning computations, yes, but how do I know that those are faithful? I know that they still sum, in that sense of faithfulness. But maybe the computations that have been learned in these components could be quite distinct from what is happening in the original network. And maybe that's being hidden by the fact that some of the other new vertical layers are just absorbing all these gradient changes in ways that don't actually mean anything, because those never get used at all. And you've maybe just trained some new stuff that didn't exist in, or doesn't necessarily correspond to, a mechanism in the original.
Lee Sharkey (58:28) Let me see if I have understood what you're saying. So we have these parameter components, and they satisfy the constraint that they all sum to the parameters of the original model. And you're worried about the possibility that even though they all sum correctly, there are a lot of different ways in which you can sum, and in one of these, some of the computation that, say, one of these parameter components is doing is kind of canceled out by another one. Is it the case that both of these would be active?
Nathan Labenz (59:08) Well, I'm assuming that if there is sort of a junk sink component, then it would presumably very rarely be active.
Lee Sharkey (59:18) Mhmm. Mhmm. So in many networks we do expect there to be a junk component, and a junk component that actually does nothing. And this is just because neural networks are degenerate: there are many different ways to implement not just something that works well, but the very same algorithm. One degeneracy you might think of is that if you scale up the weight before a ReLU and scale down the weight after the ReLU, it's very much the same algorithm. The same amount of activation gets put back into the residual stream, but there's a, well, in fact it's a one-dimensional space here in which we can move the parameters such that exactly the same thing is done. And there are many more such degeneracies in neural networks. So we basically expect one of the degeneracies networks might use is, say, that all of the activations the network ever sees are orthogonal to some direction in parameter space. That direction in parameter space is, in some sense, never used, and you could indeed ablate it entirely, and it would just not affect the outputs of the original model. Now, this would need to be an overparameterized model. You might not expect this kind of thing in the underparameterized case, such as language models, but in, say, an MNIST model of a certain size, you might expect a junk component that is just not really used for the algorithm. But I'm not sure this is exactly the same kind of junk component that you're talking about. Maybe you have something else in mind.
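The ReLU rescaling degeneracy Lee mentions can be checked directly. The weights and scale factor below are arbitrary example values; the point is that ReLU is positively homogeneous, relu(a * x) = a * relu(x) for a > 0, so scaling the input weight up and the output weight down by the same factor leaves the function unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# A tiny one-hidden-unit "network": y = w_out * relu(w_in * x).
w_in, w_out = 2.0, 3.0
alpha = 5.0  # arbitrary positive rescaling factor

xs = np.linspace(-2, 2, 101)
original = w_out * relu(w_in * xs)
# Scale the pre-ReLU weight up by alpha, the post-ReLU weight down by alpha.
rescaled = (w_out / alpha) * relu((w_in * alpha) * xs)

# Exactly the same function, implemented by different parameter settings.
assert np.allclose(original, rescaled)
```

Moving along this direction in parameter space changes the parameters but not the computed function, which is one concrete source of the "junk" directions discussed above.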
Nathan Labenz (1:01:19) Well, I guess maybe another way to frame the question is: how do you know where the real learning happened, where the real algorithm is implemented? Another inspiration for this question goes back to some of the Neel Nanda work on the Othello board state. I recall that there was at least the possibility of confusion where you want to know what is being represented inside the model, but if you train a big detector model on the internal states and try to predict the board state from the internal state of the model, at some point you're like, well, maybe my new network has learned to do that, but that doesn't really mean my original network had a sort of semantic understanding of this concept. Maybe that only came online in the interpreter model that I trained. Yep. And so I guess I'm worried about a disconnect between the original model and, you know, we've split it into these many copies. Those individual subcomponents are evolving through this training process as they learn to be simple, to reproduce the behavior, and to have only a few of them activated at a time. Yep. But I guess I'm just wondering, what is tethering those resulting subcomponents to the original? Because I could imagine that I can change, change, change, change, change in a functional way that gets me the desired behavior. Yep. And I can offset that in some other one of these subcomponents, which maybe never gets used. And if I do that long enough, hard enough, I could arrive at a spot where the algorithm the new thing implements is quite distinct from the original. And I'm not sure how I would know that.
I think there's something that I'm not catching that is preventing that from happening conceptually.
Lee Sharkey (1:03:28) So for attribution-based parameter decomposition, that is, I think, a fairly reasonable concern. I don't know if it happens in practice. I don't know if the other losses in some sense implicitly penalize it. When we move on to stochastic parameter decomposition, I think it will become a lot more obvious why this can't be the case. We can talk about that now, or I'm happy to park it until we get there.
Nathan Labenz (1:03:52) Yeah. Well, let's do it. I think we're pretty much there. So we can add this to the list of possible problems, whether or not it's an actual problem with the original method. Mhmm. But let's just run through what made this not the last word, why you needed an upgrade. You mentioned that it's not super easy to get working: there's a lot of hyperparameter dependency, and it's not super easy or stable to train. So that's one. There's the computational expense, which maybe you can unpack a little bit more. There's this conceptual-drift possibility that I'm flagging. Yeah. And then I also wanted to hear a little bit about the reliance on attribution methods, and why that's a problem as well. So maybe we can just run through those, and then we'll move on to the new hotness.
Lee Sharkey (1:04:42) Yeah. Yeah. So this algorithm was extremely janky. It was just very difficult to get it to do what we felt was a sensible thing to do. The later algorithm has much less of this issue. For instance, there were a bunch of different hyperparameters that were necessary to tune in order to get a decomposition that made sense to us. One of the reasons here was that we were using a top-k parameter, where you're only allowed to activate the top k most attributed parameter components for any given input. And you want to train these attributed ones to do better on the task on which they were attributed, so you're basically training them to reconstruct the output. This top-k parameter is one of the very fiddly things to get right here. It's very, what's the word, discontinuous, basically. You can very quickly go from one component being active to, you update the parameters ever so slightly such that another one comes online instead. Well, now, for a very small nudge in your parameter space, you've got a very different output. You're implementing a very different function for a very small change in parameter space. That's just inherently hard to optimize with gradient descent. Another issue is just the sheer number of different hyperparameters. But I think the other main one is really relying on your attributions to be right. If they're in some sense wrong, or systematically biased, there may not even be a stable optimum in your training landscape. Because suppose you were at a good point in your training landscape, but your attributions were slightly off.
They might nudge you somewhere else, and then your attributions are systematically biased in some other way over there, and so they might nudge you somewhere else again. The attributions that we were using were gradient based, and there's some intuitive sense in which this is kind of capturing what we wanted. What we wanted was some number that told us how important this parameter component was on this input. And gradients are a kind of proxy for this because, well, if your output changes a lot for a small change in this parameter component, that is to say, if the gradient of the output with respect to this parameter component is large, well, you can say that it's an important parameter component. But there are some cases where this is just straightforwardly not the case. One example might be attention. Suppose you have a transformer and it's attending very, very strongly to one particular sequence index: attention there is almost 1, and everywhere else 0. Well, if you nudge the parameter component that is responsible for this very strong attention by some small amount, the attention is basically saturated here, and so the gradient is actually very small. It's actually not going to move your attention by very much, despite this being a very mechanistically important parameter component, because it's implementing such strong attention. So this is just one of the ways in which gradients may not be a good attribution method. We actually want some other method that tells us more accurately how important this parameter component was. And this was one of the developments that we introduced in the follow-up paper on stochastic parameter decomposition.
Nathan Labenz (1:09:10) I think that's a perfect transition. So that brings us to stochastic parameter decomposition. My general read of the paper is that it's very much a natural follow-on. Right? In fact, I went through the AXRP podcast on the first paper and wrote the first half of this outline, and then when I read the introduction to the new paper, I was like, basically, the first half of this outline mirrors the introduction to the new paper. I felt good about that. So it's basically the same goal, the same concepts, not exactly the same loss functions, but the same conceptual constraints, just operationalized differently this time, addressing the weaknesses that you just ran through in the first case, with better results. So maybe take us through it. I do have some questions on the intuitions for some of the decisions. But take us through what's new and improved first, and then we can kind of unpack exactly why you did it.
Lee Sharkey (1:10:17) Yep. So one of the new things is in the headline, stochastic parameter decomposition. The stochasticness relates to the replacement for the attributions in the previous paper. Now, we still need, in some sense, these attributions, but to avoid confusion with the previous method, we'll call them something else. We'll call them causal importances. If something is causally important for the network's algorithm, it should be attributed, and we just want something that will let us approximate how causally important each subcomponent is. But before we get into the details of the causal importance calculations, it's also important to note that one of the other major differences from the previous method is that we are now not using this frankly ridiculous size for each of the parameter components. They're not a randomly initialized copy of the original model. They are in fact just a randomly initialized rank-1 matrix for each layer. The way we think about this is that in the previous method, we had these parameter components that spanned all layers and all potential ranks in those layers, and now we are basically just pre-splitting them up. We're saying, pretend everything was just localized in rank-1 matrices in one layer. And later on, the idea will be that we group these things back together into full parameter components, such that if we did indeed have, say, rank-2 parameter components, or parameter components that span multiple layers, we could indeed find them, but post hoc, after we'd grouped these things together. So those were two of the major differences from APD, attribution-based parameter decomposition.
And so it's probably now reasonable to get into the causal importance calculations, which is where the stochasticness comes in.
Nathan Labenz (1:12:47) Can we take just one more second on how you're breaking the original network down into these rank-1 components? Mhmm. So again, in the original, you've got a network that has a certain architecture and a certain parameter structure, and you make n of those. Is it really like random initialization on all those copies?
Lee Sharkey (1:13:12) Yeah. And it's similar for the subcomponents that we're now using in stochastic parameter decomposition. These are called subcomponents rather than components because we will group them together later on. But yeah, these subcomponents and parameter components are all randomly initialized, and it's this faithfulness loss that makes them all sum to the parameters of the original model. And that faithfulness loss is pretty strict, so they don't really get to deviate very much from the parameters of the original model as you sum them.
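The faithfulness constraint Lee describes can be sketched in a few lines of numpy. This is an illustrative toy, not Goodfire's code: the variable names, the single-layer setup, and the plain gradient-descent loop are my own assumptions, and real SPD trains this loss jointly with the causal importance terms.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # width of the (single) toy layer we're decomposing
C = 20  # number of rank-1 subcomponents (can exceed d)

W = rng.normal(size=(d, d))        # "original model" weights, held fixed

# Each subcomponent c is an outer product u_c v_c^T, randomly initialized small.
U = rng.normal(size=(C, d)) * 0.1  # write directions
V = rng.normal(size=(C, d)) * 0.1  # read directions

def faithfulness_loss(U, V, W):
    """Squared error between the sum of subcomponents and the original weights."""
    W_hat = np.einsum('ci,cj->ij', U, V)  # sum_c u_c v_c^T
    return np.sum((W_hat - W) ** 2)

# A few thousand steps of plain gradient descent on faithfulness alone.
lr = 0.02
for _ in range(5000):
    R = np.einsum('ci,cj->ij', U, V) - W   # residual
    U -= lr * 2 * np.einsum('ij,cj->ci', R, V)  # dL/dU
    V -= lr * 2 * np.einsum('ij,ci->cj', R, U)  # dL/dV

print(faithfulness_loss(U, V, W))  # close to 0: the subcomponents sum to W
```

The point of the sketch is the constraint itself: however the C rank-1 pieces get divided up, their sum is pinned to the original weights, which is what Lee means by the subcomponents not getting to deviate.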
Nathan Labenz (1:13:48) But they do converge quickly to...
Lee Sharkey (1:13:52) Yeah.
Nathan Labenz (1:13:52) To, like, sum back to the original, yeah. Interesting. I guess I would have naively thought in the first case, and I don't know why this wouldn't work, you could probably tell me: if you're going to divide the thing into 1000 vertical layers, why not just take all the parameter values, divide them by 1000, and have sort of 1000 weak copies of the thing, some of which can then be turned up or down? Why randomly initialize instead of doing something more principled, some simple transformation of the trained network itself?
Lee Sharkey (1:14:30) Yeah. We explored a bunch of different initializations, and I actually don't recall the exact one that we used in the APD paper. The example that you gave, of dividing it down into exact copies, is I think something that we explored, and we may even have used a randomized version of that. But you can't really do this in the stochastic parameter decomposition case. Why? Because, well, one of the ways you might consider doing this is having, say, a 1000 x 1000 matrix and dividing it up into, say, 10,000 subcomponents. Now, I don't know if there's a guarantee that if you sum up all but one of these subcomponents, the remainder is also rank 1. Does that make sense? You assume that your 10,000 subcomponents sum to the parameters of the original model, and the idea would be, well, we might as well just constrain this sum to be exactly the parameters of the original model. So what we'll do is we'll just take the sum of the first 9,999, and then for the remainder, we will just let that be whatever makes the sum exactly equal the parameters of the original model. But I think then this final one may be of arbitrary rank. That feels intuitive to me, but I'm not sure if it's 100% true. So if we want them all to be rank 1, then we can just randomly initialize. There is a slightly more involved initialization such that they do point in similar directions to the original model, such that their cosine similarity with the original model is at least not negative at initialization, despite being rank 1. But the details of that are probably not super important. I think it should just work with random initialization. Yeah.
Nathan Labenz (1:17:08) So how should we think about these? The original setup is pretty intuitive for me to envision. Okay, I've got this thing, I split it into all these vertical layers, each one gradually becomes simple over time, and not that many are active. And so instead of this sort of dense and uninterpretable computation, I now have a handful out of 100 or 1000 that are actually active, each relatively simple, and together they create the same output. I can sort of visualize that, and they feel like circuits. Each of those things feels like a circuit: if I activate these 8 circuits for this input, it works; a different input takes 12, and it works. Each one intuitively feels like it's doing some sort of information processing through the original architecture, so that's kind of intuitive. Here, I'm a little less able to tell that story, because at each layer, we are breaking up whatever matrix exists at that layer into a bunch of rank-1 parts. How should I think about those rank-1 parts? I had to go refamiliarize myself a little bit with what exactly rank is and what it means. In the simplest terms, it's a much simpler, smaller part of the overall transformation, specifically referring to dimensionality. You have some n-dimensional space; something that is rank 1 basically exists as a line in that space, something that is rank 2 exists as a plane in that space, and so on. So instead of a transformation that can operate in n dimensions all at once, we've got a bunch of transformations that each operate on one linear direction in that broader space, and now we're composing those, I guess, in a way that recreates the original.
But how should I be thinking about those rank-1 things? Like, why do that, I guess?
Lee Sharkey (1:19:24) Yeah. I guess I don't really view them as fundamentally different. Consider the case in the old algorithm where we have this parameter component, and this parameter component can span all layers and all potential ranks. One of the penalties that we were emphasizing was this simplicity penalty, where we don't want it to involve too much computational machinery: we want to be able to study objects that are as simple as possible. The way we operationalized this was that we penalized the rank of the matrices in this parameter component, and also penalized it for existing in multiple layers. The idea being that if this parameter component really was just localized in a few layers and a few ranks, the rest of its parameters should become 0 in all the other layers, and there should just be a low-rank set of matrices in the layers in which it is involved. And maybe I should go into a little bit more detail about why that is simpler. If it's rank 1, as you mentioned, this matrix basically reads and writes in one single direction in activation space. The only direction from which it reads is the direction defined by its right singular vector, and the direction to which it writes is its left singular vector. This matrix will not be used if the activations don't, in some sense, overlap with that read direction. And if the write direction just has no relevance downstream, no causal importance downstream, then this rank-1 matrix won't do anything.
So given that we're already looking for these low-rank matrices in the full parameter components, the idea in stochastic parameter decomposition, or rather one of the ideas we introduced there, since it's not a fundamental part of the algorithm, was that we'll just start with these low-rank versions of the subcomponents. And then later on, we can aggregate them, such that if we want to combine two in order to create a rank-2 parameter component, we can do that. And if we wanted them to exist across multiple layers, we can aggregate them together as well. So they're really just chunked versions of the same fundamental object, this parameter component, this vector in parameter space. But one is just something that we'll group together later on.
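The read/write picture Lee describes can be made concrete with a tiny numpy example: a rank-1 matrix is an outer product of a write direction (left singular vector) and a read direction (right singular vector), and it literally does nothing to inputs orthogonal to its read direction. The names here are my own illustration, not code from the paper.

```python
import numpy as np

d = 4
read = np.zeros(d);  read[0] = 1.0   # right singular vector: the direction it reads from
write = np.zeros(d); write[1] = 1.0  # left singular vector: the direction it writes to

# Rank-1 matrix: writes `write`, scaled by how much the input overlaps `read`.
W1 = np.outer(write, read)

x_aligned = np.array([2.0, 0.0, 0.0, 0.0])  # overlaps the read direction
x_orthog  = np.array([0.0, 3.0, 1.0, 0.0])  # no overlap with the read direction

print(W1 @ x_aligned)  # [0. 2. 0. 0.]: writes 2.0 along the write direction
print(W1 @ x_orthog)   # [0. 0. 0. 0.]: the subcomponent "does nothing" here
```

This is the sense in which a rank-1 subcomponent is simple: one number (the overlap with the read direction) fully determines its effect on any input.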
Nathan Labenz (1:22:37) Okay. Well, let's come back to how the grouping is done, because I think that's another quite vexing, potentially quite deep question. But for the moment: we've broken the matrix up, and if it's an n-dimensional space, we've now got not just n but possibly more than n rank-1 pieces. So how about a little help on that intuition as well?
Lee Sharkey (1:23:06) Yeah. One of the core ideas in the representations-in-superposition line of work is that there are these directions in activation space that each individually represent a single feature, a single thing in the world, but there are more of them than there are neurons to represent them. And the same idea holds true here. In parameter space, we can represent more directions in activation space by using more rank-1 matrices. This gets back to the read and write directions. If we want to read from more directions than we have neurons, well, we just have more ranks than we have dimensions in our activation space, more ranks than we have columns in the matrix. So it's possible to read these representations in superposition if we have more of these rank-1 matrices in our individual matrices. The idea then is that these rank-1 matrices are what will implement individual computations, or parts of individual computations, that the network is doing to compute its behavior. The challenge is basically to identify which ones are important for a given input data point, and when they're not important. And that's where the stochastic calculation of the causal importance comes in, which I can get into if you like.
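A quick numpy sketch of the superposition point: three rank-1 subcomponents, each reading and writing its own feature direction, can live inside a 2-dimensional activation space, even though 3 > 2. The setup (three symmetric directions, identity-like components) is my own toy, chosen only to make the arithmetic clean.

```python
import numpy as np

# Three unit "feature" directions crammed into a 2-D activation space, 120° apart.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
feats = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# One rank-1 subcomponent per feature: reads feature i, writes feature i.
subcomponents = [np.outer(f, f) for f in feats]

# Their sum is a perfectly ordinary full-rank 2x2 matrix...
W = sum(subcomponents)
print(W)  # ≈ 1.5 * identity (the three directions are symmetric)

# ...but each subcomponent still responds to "its" direction individually.
x = feats[1]                 # an input along feature 1
print(subcomponents[1] @ x)  # ≈ feats[1]: this subcomponent reads strongly here
```

So the dense weight matrix W hides three distinct rank-1 "reads", which is the parameter-space analogue of more features than neurons.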
Nathan Labenz (1:25:12) Yeah. Maybe one more question before we go to the causal importance. We've broken these things into all these sums, and there's an interesting question of what the ratio is: how many of these rank-1 subcomponents do you need for different sizes and complexities of the original network? In the relatively simple problems studied so far, that ratio isn't super high. I'm kind of wondering, is that ratio going to become extremely high if we get into the language modeling space? Seems like probably yes.
Lee Sharkey (1:25:45) Potentially. We will probably have more computations than we have neurons. That's just the name of the game in taking things out of superposition. But one of the questions that I ask myself often is, well, is it going to be more or less efficient than, say, the dominant approaches at the moment? These dominant approaches may be sparse autoencoders, transcoders, cross-layer transcoders, and so on. It's unclear, but I think there's reason to believe that if this is indeed a reasonable way to decompose networks, there are probably going to be fewer of these subcomponents than there are, say, latents in a sparse autoencoder or a transcoder. The reason being that there's no upper bound to how many latents you can have in your sparse autoencoder or your transcoder, whereas there is a kind of upper bound on the number of parameter components that you might have. Why? Because they all have to sum to the parameters of the original model. And this prevents you from just adding in an extra component or an extra latent to keep driving your reconstruction loss down.
Lee Sharkey (1:27:21) Basically, it should level off a bit faster, I would say. But this is an empirical question that remains to be answered for the kinds of models that we care about, like large language models or other similarly interesting models.
Nathan Labenz (1:27:38) Is that sort of like saying that while reality is complicated, the rules of physics are simple? Like, the number of fundamental transformations, the number of fundamental functions that are used to carve up reality, should in theory just be a much reduced space that can then operate on many more inputs? Sort of thinking along the lines of platonic representation hypothesis type notions.
Lee Sharkey (1:28:15) It feels like it's getting at something true. Although it's not obvious to me right now how to connect it to the way in which I think of things. But yeah, it feels like it's getting at something true.
Nathan Labenz (1:28:28) Yeah. Well, more work to be done for sure. And how expressive can these rank-1 things be? Like, they can represent, for example, a rotation. Is that right?
Lee Sharkey (1:28:40) I think they can represent a rotation of one direction to another, but I'm hesitating because I'm trying to recall the technical definition of rotation. I think it is true, yes, that it can basically do a rotation and also a scaling. The rotation is a very simple kind of rotation: to the extent that an input activation projects onto one direction, it will now project into this other direction. You can get arbitrarily high-dimensional rotations up to the dimensionality of the input. So I think it is true to say that it can do rotation, but it's a very simple, low-dimensional kind of rotation.
Nathan Labenz (1:29:30) But that would be enough, for example, to handle the transformation of, like, Tuesday into Wednesday with a sort of next day operator.
Lee Sharkey (1:29:39) I think we basically need two for this particular transformation, because these variables live on a plane. You would basically need to project two directions in a particular way. You're basically leveraging two such one-dimensional rotations in order to do this two-dimensional rotation, I think.
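Lee's claim that the "next day" map needs two rank-1 pieces can be checked directly. A minimal sketch, with my own framing of days-of-the-week as seven points on a circle: the advance-one-day map is then a 2-D rotation, its SVD has two singular values of 1 (so it's rank 2, beyond any single rank-1 matrix), and the sum of its two rank-1 SVD terms recovers it exactly.

```python
import numpy as np

theta = 2 * np.pi / 7  # "advance one day" on a 7-point circle (Tuesday -> Wednesday)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# SVD of a 2-D rotation: two singular values, both 1, so rank 2.
# One rank-1 subcomponent cannot implement it alone.
U, S, Vt = np.linalg.svd(R)
print(S)  # [1. 1.]

# But the sum of the two rank-1 SVD terms recovers the rotation exactly.
term0 = S[0] * np.outer(U[:, 0], Vt[0])
term1 = S[1] * np.outer(U[:, 1], Vt[1])
print(np.allclose(term0 + term1, R))  # True
```

This also previews why the later grouping step matters: the two rank-1 terms only make sense as a unit, as one rank-2 component.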
Nathan Labenz (1:30:08) So this gets us back also to the grouping concept being quite important, so let's come back to that again in a second. I think I've belabored the rank-1 thing enough. We've broken this thing up into some ratio greater number of rank-1 subcomponents from the original n-dimensional matrix. Now tell us about this swap: out with the attribution method, in with the learned estimation of causal importance.
Lee Sharkey (1:30:47) Yep. So one of the ways you might think about what the attribution method was doing is that we wanted something that said: if we turn this parameter component off, it shouldn't really affect the output. And the way we did that was we literally did turn them off and only kept on the top k parameter components. What this is really saying is that it shouldn't matter whether this parameter component is on or off. It could be on to its full extent, off to its full extent, or anywhere in between, and it really should not affect the output. The top-k approach was just not really great at optimizing for this, such that if you did turn a component on a little amount, it probably still would do stuff. Whereas what we're basically doing in stochastic parameter decomposition is we are in fact learning a function that predicts, for a given input, how ablatable, how turn-offable, this subcomponent is. And if it is turn-offable, then it shouldn't matter how much we turn it off by; it is just causally irrelevant. So we have this causal importance function that predicts one number, which we call the causal importance value, for this subcomponent on this data point. It is a number between 0 and 1. To the extent that it is 0, we can randomly mask this subcomponent on this input anywhere between 0 and 1; that is to say, we can turn it on randomly anywhere between 0 and 1. And if this causal importance value is 1, well, then we don't get to randomly mask it at all. It just has to be on, because it's really, really causally important. We literally just have a neural network that spits out this one number for each subcomponent.
And if the subcomponent is very causally important and gets the right causal importance value of 1, well, then great. But if it is causally important and the network accidentally spits out, say, 0.5, well, now this parameter subcomponent can be masked anywhere between 0.5 and 1. And if it's masked anywhere along this random distribution that's not 1, it's going to damage the loss. This means that gradients will be able to flow into the causal importance function that produced this number, using basically the reparameterization trick to let us pass gradients through this random sample, such that it actually does learn to approximate how ablatable this parameter component is for a given input. This was not my idea, but I think it's a really great idea that I actually haven't seen elsewhere in the attribution literature. And I think it has applications far beyond just the interpretability approach we're taking here. It doesn't have to be applied to parameter components. It could be applied to, say, parts of an input image, in the kind of way that attribution methods have been done in the past. We haven't done that ourselves; we'd be keen to see someone do it. But yeah, it's an attribution method that really captures just how causally important this component was for this input.
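The masking scheme Lee walks through can be sketched in a few lines. This is a forward-pass-only illustration with my own function name (`stochastic_mask`); the actual SPD implementation runs inside an autodiff framework so that gradients flow through the causal importance value g via the reparameterization m = g + (1 - g) * eps.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_mask(g, n_samples=1):
    """Sample masks m ~ Uniform[g, 1] via the reparameterization m = g + (1-g)*eps.

    g is the predicted causal importance in [0, 1]. Because eps is sampled
    independently of g, a framework with autodiff can push gradients through g.
    """
    eps = rng.uniform(0.0, 1.0, size=n_samples)
    return g + (1.0 - g) * eps

# A causally important subcomponent (g = 1) is never turned down...
print(stochastic_mask(1.0, 5))  # all ones

# ...while an unimportant one (g = 0) is scaled anywhere in [0, 1].
print(stochastic_mask(0.0, 5))  # uniform random samples in [0, 1]

# In the decomposition's forward pass, subcomponent c would contribute
# m_c * u_c v_c^T to the layer's weights on this input; if ablating it never
# hurts the loss, g_c can safely be pushed toward 0.
```

The key property is exactly the one in the conversation: at g = 1 the component is fully on and nothing is randomized, while at g = 0 the component must tolerate being scaled to any value without damaging the output.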
Nathan Labenz (1:34:54) So, first of all, that's a pretty big matrix, right, or network, that's been added for that learning? Because it's got to go from the input space to the output space of all the subcomponents, which are all the rank-1 breakouts at all the layers.
Lee Sharkey (1:35:17) This causal importance function can basically be of arbitrary architecture. It can indeed, as you suggest, be a map from the input to the number of components per layer times the number of layers; that would be a very large network. But this is not in fact how we implemented it in the paper. In the paper, we just had a very, very small thresholding network, somewhere between, say, 16 and 128 parameters. It gets a little bit mathematical here, but it's not super complicated. You multiply the activations by the right singular vectors of the parameter components, which is basically asking how much these parameter components are reading from the activations: is this one reading a lot, is this one reading not very much? You're measuring the overlap between the activations and the right singular vectors of these components. And to the extent that the parameter subcomponent has a lot to read there, to the extent that it's active, this gives you a number that you can do a fancy thresholding on with a small neural network. That's just 16 times however many parameter components you have, which on the scale of the overall network is not very large. And it doesn't really even need to be a very fancy neural network at all; it can just be some sort of simple learned threshold. We found in practice that it helped to learn thresholding functions that were not completely straightforward and had some kind of structure to them. But overall, it was just a slightly fancy threshold.
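A rough numpy sketch of the shape of this computation, as I understand it from the conversation: compute each subcomponent's overlap with the activations, then pass each overlap through a tiny learned gate. The per-subcomponent scale-and-bias sigmoid here is a stand-in I made up for the small thresholding network; the paper's exact gate architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

d, C = 16, 40
acts = rng.normal(size=d)    # the layer's input activations
V = rng.normal(size=(C, d))  # read (right singular) directions, one per subcomponent
V /= np.linalg.norm(V, axis=1, keepdims=True)

# "How much does each subcomponent have to read here?": one scalar per subcomponent.
overlaps = V @ acts  # shape (C,)

# A tiny learned gate maps each overlap to a causal importance in (0, 1).
# Illustrative stand-in: a per-subcomponent scale and bias followed by a sigmoid.
scale = rng.normal(size=C)  # would be learned jointly with the decomposition
bias = rng.normal(size=C)
g = 1.0 / (1.0 + np.exp(-(scale * np.abs(overlaps) + bias)))

print(g.shape)  # (40,): one causal importance value per subcomponent
print(float(g.min()), float(g.max()))  # all values inside (0, 1)
```

Note the cost Lee highlights: the gate adds only a handful of parameters per subcomponent, because its input is a single overlap scalar rather than the full activation vector.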
Nathan Labenz (1:37:19) That's interesting. And then when the whole thing is trained and run, that's now part of the reconstruction, right? Because, again, all this is jointly optimized: this whole prediction of which subcomponents are going to be active, and the structure of the subcomponents themselves, that's all being jointly optimized in one big compound loss function. And then when you're actually running the replacement network for study on different inputs, that is still part of it. Right?
Lee Sharkey (1:38:00) It can be, and usually is, part of it, but you can throw away this causal importance network, just sum up all these subcomponents together, and run it like the original model. If training has gone well, this should look very much like just running the original model. It's like running everything with a mask of 1 instead of 0; it should do exactly the same thing, and the causal importance function is not necessary. What the causal importance function is letting you do is basically telling you which parts are actually being used on a given input. And you can imagine a world where, if there was a subtask of particular interest, you might be able to throw away most of the network and only keep the subcomponents that are used on this distribution of interest, and that may indeed, depending on how large the model is, be a much smaller set of parameters than the whole model. But yeah, that's future work.
Nathan Labenz (1:39:25) Why does this work better? I guess, how does it work better? Tell us about the results: maybe go problem by problem through what we had in the last version and talk about the updates. You can get into the toy problems, compare and contrast. And then again, I'm really interested in whether you have any intuition for why this works. It still seems very magical that all this just shakes out.
Lee Sharkey (1:39:50) Yep. So with regard to attributions, this is just much closer to what we actually wanted from gradient attributions. It doesn't have the kind of biases that gradients would have. It should, for instance, be able to cope with the saturated attention case that I mentioned earlier when we were talking about gradient attributions. I think it might also be time to revisit the garbage component that you mentioned. We were talking about a case where maybe the subcomponents are actually doing a slightly different algorithm from the original network. Because in APD you could switch them off and never activate them, you could compensate for one active component doing a different thing while still summing up to the parameters of the original network. In stochastic parameter decomposition, you can't really do that anymore, because you're randomly activating every part of the network basically all the time. And if a component is doing something that it shouldn't be doing, something that modifies the activations in a way that the original network just didn't, then it's going to be bad for the loss. It's important to realize that we're not actually able to turn off subcomponents in SPD; there are no dead components or anything. What it means to be a dead component is not that you never activate, but that you always activate randomly. And I think this is just fundamentally different from being completely silenced all the time, and it gets rid of a lot of pathologies, like the one that you mentioned.
It also gets around the discontinuity that I mentioned in the top-k optimization. It's no longer the case that you're only activating the top k while the rest are silent; everything, again, is active all the time. In APD, where you're using top-k, if you're silenced, gradients aren't really flowing through you, so the parameters of your parameter components can't really learn to be better. Whereas if you're always on, randomly, all the time, gradients are always flowing through every subcomponent. So there's just a lot more continuity in the learning process. And this pans out in practice. The algorithm still requires you to get hyperparameters somewhat right. However, it's not like attribution-based parameter decomposition, which was not only sensitive but just not well behaved: you changed one thing, and it changed in an unpredictable fashion. In stochastic parameter decomposition, fine, there's a range in which your hyperparameters need to sit, but because things are well behaved, you can actually find them; it's just easier to understand which thing you need to change in order to get to the right solution. So it's just much nicer to work with than the previous algorithm. Now, that said, the method that we introduced in the paper is not perfect. There are plenty of things that, even since publishing that paper, we've focused on changing and found better versions for, like sampling this randomness and the gates we used in the causal importance functions, just to make it a better-behaved algorithm overall.
And there's still optimization of the algorithm to do, but it is just significantly better to work with than the previous method, which makes us excited about scaling this up to larger models, which we've begun to do for, e.g., small transformers, including language models.
Nathan Labenz (1:44:16) Okay. Let's talk then about feature splitting. You had said that some of the results here show that the new method avoids feature splitting. Coming from an SAE perspective, I sort of think feature splitting is good. I think of it as: as I make my SAE bigger and bigger, I get finer-grained resolution on the features, and I expect to see something like vanilla get resolved into French vanilla and normal vanilla, or whatever. And that seems good. So tell me: in what sense is it bad, in what sense should we be glad, and how has this helped us avoid the problematic version of feature splitting?
Lee Sharkey (1:45:01) Yep. So we've talked already about one of the examples, with the cat in a number of different positions, and it's not clear whether the network really is using, in some sense, these more fine-grained features that you can identify. From a practical point of view, we don't want our explanations to explain more than they need to. We could explain how the network takes one of these cats in one position and applies circuits to it, then do it for the next cat in another position, and so on. But we're wasting our time if the network is just not really using these variables in its algorithms. So we basically want to avoid unnecessary explanation; we want our descriptions of what the network is doing to be minimal in some sense. A particularly pathological case of feature splitting is feature absorption, where in a smaller SAE you might have a feature for words that begin with the letter E, but then in a larger SAE, one with more fine-grained features, you have a feature for "elephant" and then a feature for everything else that begins with the letter E, or it splits into other words that begin with E. In some sense this decomposition, the one with more features, is just less interpretable, because maybe there was no need for us to split out "elephant". Again, we just want the variables on which the network is working. And there's no obvious way to avoid this using, say, SAEs.
Now, it may be the case that there's a perfect stopping point for SAEs: beyond it, you've got SAE features corresponding to variables the network is not using, and before it, only variables the network is using. But that's not obvious either, because, fine, you could find these sparsely activating directions in activation space, but there's no guaranteed reason to expect that they are indeed the way in which the network is chunking up its computations. Maybe it's in fact a group of these SAE features over here that the network is actually using, and maybe you actually did need to split further for this group over there. There's just no guaranteed reason that SAE features will find the variables of the network's computations. Now, this is not necessarily a criticism that applies to various other variants of the sparse dictionary learning paradigm. And I do want to caveat everything we've been talking about here: there are people whose views I respect a lot who see issues with our approach, and we see issues with theirs. Overall, I don't want to give the impression that the algorithm we're putting forward here is the final word on anything in interpretability, or that it's definitely finding the right thing. There's a lot more research to be done on this, and I just wanted to make clear that there's still more to do.
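The absorption example can be written out as a toy sketch. The "latents" below are hand-written booleans standing in for SAE features, purely to make the failure mode concrete:

```python
# Toy illustration of feature absorption; these are not real SAE
# features, just hypothetical indicator functions.
def small_sae(word):
    # The smaller dictionary has one generic latent.
    return {"starts_with_e": word.startswith("e")}

def large_sae(word):
    # The larger dictionary grows a dedicated "elephant" latent, and the
    # generic latent quietly stops firing on the absorbed word, so the
    # label "starts_with_e" no longer means what it says.
    is_elephant = word == "elephant"
    return {
        "elephant": is_elephant,
        "starts_with_e": word.startswith("e") and not is_elephant,
    }
```

The danger is that nothing in the larger dictionary tells you the split happened: per-latent labels become misleading even as reconstruction improves.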
Nathan Labenz (1:49:08) So then how about the actual grouping, labeling, and sort of semantic understanding of all these rank-1 units? There was one bit, in one of the toy problems, and we don't have time to do the full setup, but there was the sentence that SPD splits its corresponding W_out into 50 subcomponents. In other words, this one matrix has been broken down into 50 rank-1 things that all remain active. And then the paper says that these 50 subcomponents appear to be effectively part of a single rank-50 component comprising the entire W_out matrix. And there I'm like: what does that mean, and how do I know it's right? Doesn't it leave us with a similar problem to where we started? If we've got 50 things and we group them back into one thing, how did you know to do that, versus determining that there are 50 different things happening? That part seemed to go by very quickly and I wasn't quite clear on it, let alone on how to extrapolate it to larger and larger models in the future.
Lee Sharkey (1:50:30) Yep, that's a really great question. I think what's important to appreciate is that whenever we're optimizing these things, and whenever we're saying this is the right decomposition, there's an asterisk. We're saying: for a given amount of decomposition, this is the decomposition that breaks the network up into as few components as are necessary to explain a given level of its functionality. We can scale the amount of decomposition up and down and break up the network into larger or smaller parts. And there are reasons to prefer some regions of this spectrum over others. One reason might be that as you decompose more and more, there comes a point where your reconstruction loss gets a lot worse. Correspondingly, there may be a point where your causal importances, the values that are supposed to tell us how important each subcomponent is, behave a bit pathologically. For instance, they might all typically take fractional values rather than values near one or zero. Ultimately, if something is causally important, fine, it can be partly causally important, which sheds light on how much loss we might incur by ablating it. But by and large, we prefer regimes where it's a bit more binary. So there are points along this spectrum, as we modify the hyperparameters that control how much decomposition or activation sparsity is going on, that give us clues as to which decompositions make sense. And in some sense, none of these decompositions are right.
This holds true for sparse dictionary learning and for parameter decomposition: none of these decompositions are right in some sense, because we're dealing with one fundamental truth, the original network, and we're really just trying to get a lens onto what it is doing under the hood. None of our lenses are going to be perfect; they're always going to introduce approximations of some sort. That's just a necessity. We're all trying to discretize a fundamentally continuous object in various different ways, and whenever you do that, you're simply going to give up some nonzero amount of approximation quality. But we're willing to pay those costs to get tractable explanations, even though there isn't one actual ground-truth decomposition.
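One way to see how 50 always-co-active rank-1 subcomponents could be merged back into a single rank-50 component is to compare their causal-importance patterns across inputs. This is a hedged numpy sketch with made-up importance values; the paper's actual criterion may differ:

```python
import numpy as np

# Hypothetical causal importances over a small batch (rows = inputs,
# columns = subcomponents). Subcomponents 0-2 always fire together;
# subcomponent 3 fires on its own schedule.
ci = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
])

def group_subcomponents(ci, thresh=0.99):
    """Greedily merge subcomponents with near-identical importance patterns.

    Subcomponents that are always causally important together behave as
    one higher-rank component, like the 50 W_out subcomponents merging
    into one rank-50 component in the toy problem discussed above.
    """
    k = ci.shape[1]
    groups, used = [], set()
    for i in range(k):
        if i in used:
            continue
        group = [i]
        used.add(i)
        for j in range(i + 1, k):
            if j in used:
                continue
            num = float(ci[:, i] @ ci[:, j])
            den = float(np.linalg.norm(ci[:, i]) * np.linalg.norm(ci[:, j]))
            if den > 0 and num / den >= thresh:  # cosine similarity
                group.append(j)
                used.add(j)
        groups.append(group)
    return groups
```

Here `group_subcomponents(ci)` merges columns 0-2 and leaves column 3 alone; co-activation statistics are just one plausible signal for this kind of grouping.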
Nathan Labenz (1:54:03) Yeah. Okay, maybe last question; I'll bundle a couple of last questions into one, and you can take as long as you have. Where do we go from here? How expensive is this going to be to scale up to, let's say, the small-but-meaningful language model regime, your sort of 3Bs or 7Bs? Is that the next leap you're going to try to make? And what sort of practical applications do you find most exciting? Obviously, SAEs can do the sort of monitoring and steering. Are there fundamentally new use cases that you see opening up here? And, if you have time for it, is there an intersection between this and the science models? I know Goodfire has been doing some interp work on science models, and that's one of the things that excites me most, in addition to just keeping the general-purpose models on the rails. The idea that we could make scientific discoveries by understanding what the models have learned is pretty exciting. So yeah: what's the next chapter? What does it cost to scale up? What new capabilities does it unlock? Any comments on science? And then we'll let you go.
Lee Sharkey (1:55:26) Cool. So, yeah, the next steps are definitely to scale this up to more respectably sized models; I think we've tried it out in the region of low single-digit billions. We've also scaled down to even smaller language models, in the millions, just to get some early traction before scaling back up to those sizes. There are a lot of interesting empirical questions we want to answer there, like how much do we actually get to decompose these language models into low-rank pieces? Maybe the things they decompose into are too high-rank for us to understand, in which case we would basically need to do extra work after this first step of decomposition, but we would at least get a reasonable decomposition into a bunch of components that are themselves easier to isolate and understand.
Then, yeah, some of the applications I think this opens up, or could open up: I think it should make, say, unlearning more straightforward than with SAEs, because you're already working in parameter space. You already know this is the direction that does X, and you can just modify this vector in parameter space; we have a more straightforward lens onto functions as they relate to parameters. Now, whether or not this beats some gradient-descent-guided approach is a separate question. I also think monitoring is something this will be useful for, to the same or a greater extent than, say, sparse autoencoders or other approaches. Why? Because sparse autoencoders might be detecting a particular region of the training dataset, and perhaps too narrow a region. What we really wanted was: whenever the network is using function X, that's when we want to pay more attention and go into detail on what it's doing, rather than asking whether there was an activation in this particular direction. There's going to be some correlation between the two, but there will be some things we want to monitor for that are better thought of in terms of functions. To get less abstract: maybe there's a function for deception, and there are many different situations in which a model might be doing deception.
The input directions might be different, and so the intermediate computations might point in different directions, in a way that SAEs might struggle to identify as "this is the one direction that does deception". There may instead be many different directions that do it. But stochastic parameter decomposition, and, we hope, later versions of this approach, might be able to identify the individual components that are doing this functionality. And then lastly, yeah, we really want to extract knowledge from these models, some of which are superhuman at some of the tasks we've trained them to do. We want to be able to think in the terms the network itself is thinking in, because that, in some sense, is the knowledge the network is using. So we hope this may be a way to get a lens onto that, in a way that may or may not generalize better than other approaches. But that remains to be seen.
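The unlearning idea described above really is just subtraction in parameter space. A minimal sketch, assuming a decomposition into rank-1 subcomponents; the sizes and the "unwanted" index set are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 3, 4, 5  # toy sizes

# Rank-1 subcomponents of one weight matrix: W = sum_i u_i v_i^T.
U = rng.normal(size=(k, d_out))
V = rng.normal(size=(k, d_in))
W = np.einsum("io,ij->oj", U, V)

def ablate(W, U, V, unwanted):
    """Remove the subcomponents implementing an unwanted function.

    If, say, subcomponent 2 were identified as part of a hypothetical
    deception circuit, subtracting its outer product edits the behavior
    directly in parameter space, with no gradient-based unlearning pass.
    """
    W_edited = W.copy()
    for i in unwanted:
        W_edited -= np.outer(U[i], V[i])
    return W_edited
```

The edited weight is exactly the sum of the remaining subcomponents, which is what makes this more direct than activation-space steering: the function is deleted from the algorithm itself, not merely suppressed on particular inputs.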
Nathan Labenz (1:59:28) Cool. Well, we'll definitely be following your work here. It's fascinating stuff, and I'm glad to have you on while we're still in the relatively early phases of it. I think this mega-project of figuring out why these things do what they do, and whether there's any way to really make ourselves confident that we'll be able to predict what they're going to do in novel circumstances, is about as high a priority as any in my mind. I appreciate that you're working hard and have taken at least one good bite out of that problem, and I'm really looking forward to hopefully seeing this line of work mature into something that can tackle the biggest models and the hardest problems. It'd be great to ablate that deception circuit if we can find it. So, any other thoughts you want to leave people with before we break?
Lee Sharkey (2:00:21) I think that's honestly it, Nathan. It's been a really great conversation, we had a lot of fun, and I'm just super pleased to be on.
Nathan Labenz (2:00:28) Lee Sharkey, principal investigator at mechanistic interpretability startup Goodfire, thank you for being part of The Cognitive Revolution. If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine network, a network of podcasts where experts talk technology, business, economics, geopolitics, culture, and more, which is now a part of a16z. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And finally, I encourage you to take a moment to check out our new and improved show notes, which were created automatically by Notion's AI Meeting Notes. AI Meeting Notes captures every detail and breaks down complex concepts so no idea gets lost. And because AI Meeting Notes lives right in Notion, everything you capture, whether that's meetings, podcasts, interviews, or conversations, lives exactly where you plan, build, and get things done. No switching, no slowdown. Check out Notion's AI Meeting Notes if you want perfect notes that write themselves, and head to the link in our show notes to try it free for 30 days.