Demystifying LLMs with Mechanistic Interpretability Researcher Arthur Conmy
Explore the frontier of AI interpretability with Arthur Conmy's ACDC approach, automating the discovery of sub-circuits within transformers. Sponsored by NetSuite.
Watch Episode Here
Video Description
Join Arthur Conmy and Nathan Labenz in this captivating and accessible discussion as they embark on a deep dive into the cutting-edge world of interpretability research. Discover how pioneering researchers have isolated sub-circuits within transformers that are responsible for different aspects of AI capacity. Arthur introduces us to the groundbreaking ACDC approach, a revolutionary method co-authored by him, which automates the most time-consuming aspects of this intricate research. If you’re looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive
TIMESTAMPS:
(00:00) Episode Preview
(04:40) What attracted Arthur to mechanistic interpretability?
(07:49) LLM information processing: General Understanding vs Stochastic Parrot Paradigm
(14:00) ACDC paper: https://arxiv.org/abs/2304.14997
(14:45) Sponsors: NetSuite | Omneky
(24:30) Putting together data sets
(32:39) How to intervene in LLMs' network activity
(36:00) Setting metrics to evaluate the production of correct completions
(44:20) The future of mechanistic interpretability research
(50:00) Extracting upstream activations in the ACDC project and evaluating impact on downstream components.
(56:00) Anthropic research findings
(01:08:00) 3-Step process of the ACDC approach
(01:22:00) Setting a threshold and validation
(01:27:00) Goal of the approach
(01:32:00) Compute requirements
*Correction at 1:33:00: Arthur meant to say "quadratic in nodes"
(01:35:30) Scaling laws for mechanistic interpretability
(01:40:00) Accessibility of this research for casual enthusiasts
(01:46:00) Emergence discourse
(01:56:00) Path to AI safety
LINKS:
Towards Automated Circuit Discovery for Mechanistic Interpretability https://arxiv.org/abs/2304.14997
https://arthurconmy.github.io/
SOCIAL MEDIA:
@labenz (Nathan)
@arthurconmy (Arthur)
@cogrev_podcast
SPONSORS: NetSuite | Omneky
-NetSuite provides financial software for all your business needs. More than thirty-six thousand companies have already upgraded to NetSuite, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform, head to NetSuite at http://netsuite.com/cognitive and defer payments on a FULL NetSuite implementation for six months.
-Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that *actually work* customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Full Transcript
Arthur Conmy: 0:00 A really ambitious goal of interpretability is one where the whole architecture of the forward pass can be understood by a human, or at least where these high level concepts, like the whole routing to a particular expert, have some meaning to humans. And I think it's possible that we can get to this stage with mechanistic interpretability. But I think it's worth noting that even if this fails pretty badly, it's still possible to do interpretability of narrow tasks, like an understanding of the dangerous capabilities, so we can at least remove those dangerous capabilities even if we don't have an understanding of all capabilities of the model.
Nathan Labenz: 0:42 Hello and welcome to the Cognitive Revolution where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost Erik Torenberg.
Nathan Labenz: 1:05 Hello and welcome back to the Cognitive Revolution. Today I'm excited to take a deep dive into mechanistic interpretability with Arthur Conmy. Looking back on the show over the last few months, I realized that I'd mentioned the topic of mechanistic interpretability many times, repeatedly highlighted it as one of the most promising paths to long term safety, and shared a few of the canonical results that most inform my AI worldview. But we've never really got into much detail about how mechanistic interpretability work is actually performed. Today, we're getting into those details. Now, this is an advanced topic. So while we definitely take our time to explain the key concepts in the simplest possible terms, we do assume throughout that you understand the differences between related concepts like parameters and activations or attention heads and MLP blocks. If those distinctions aren't already clear, I might suggest watching part one of my AI scouting report first, as the fundamentals that I present there really are meant as a foundation for a conversation like this one. Beyond that foundation, Arthur really is the perfect person to guide us, as he's just published a new paper and software library that aim to accelerate mechanistic interpretability work by automating the most cumbersome and tedious parts of the typical research workflow. Beginning with the core questions that mechanistic interpretability seeks to answer, and describing the conceptual basis and experimental setups that are most commonly used, Arthur does a great job, both in the paper and in this conversation, of providing clear, even intuitive explanations for how researchers are starting to pry open the black boxes that are large language models. It's really fascinating to learn how sub circuits are discovered within transformers and to see the effective but often quite alien problem solving strategies that models learn. I'm also really excited by Arthur's vision for how mechanistic interpretability work could one day allow us to inspect powerful AI systems for the emergence of concerning capabilities even during the training process. While we're a super long way from being able to do that reliably today, such mastery of how AI systems work would be an outstanding development for safety, reliability, and performance. This is an unusually accessible introduction to mechanistic interpretability from a practitioner who is among the best in the field. I learned a ton, and I think you will too. So I hope you enjoy my conversation with Arthur Conmy. Arthur Conmy, welcome to the Cognitive Revolution.
Arthur Conmy: 3:37 Thanks, Nathan. Good to see you.
Nathan Labenz: 3:39 Yeah. I'm super excited about this. So, folks who've listened to this show for a little bit have heard me certainly mention the concept of mechanistic interpretability a few times, and I have mentioned that I'm excited about it as a research direction, and it seems like one of the most promising paths to long term safety with AI systems is to understand them in a deeper way, in a more penetrating way than we do right now. But we've not really got into the details too much other than looking at a few headline results of grokking and a few seminal things that give people a sense that progress is happening here. This, for many, I think, will still be their first real deep dive into mechanistic interpretability. So I'm excited to get into all that with you. Maybe for starters, could you just tell us how you think about mechanistic interpretability? What is it? What are the goals? What attracts you to it perhaps as well?
Arthur Conmy: 4:36 So mechanistic interpretability is the reverse engineering of the learned algorithms that neural networks implement into human understandable concepts. So the idea here is that neural networks, machine learning models, are an algorithm which turns inputs into outputs, but it's very opaque how exactly that model is turning inputs into outputs. And in mechanistic interpretability, we aim to explain how that happens in terms of the internal components of that model in a human understandable way. So not that this matrix is multiplied by this matrix and produces the outputs, but here are the high level variables the model's using internally to produce outputs from inputs. And so I see my research goal is to improve the world's understanding of how neural networks operate by, for example, explaining more of the behaviors of neural network models in terms of their internal components. And I think my background, which got me into mechanistic interpretability, was probably studying mathematics in my undergraduate degree, where almost the whole subject is about actually understanding how things happen. And I was drawn to this area in machine learning because machine learning is in general not a field where the understanding of how algorithms are operating is the way that further techniques are made. It's just a blind optimization procedure, but it doesn't have to be that way. And my motivation is to continue to make machine learning and the algorithms learned by neural networks more understandable in terms of human concepts.
Nathan Labenz: 6:27 So there's a few levels to this, I suppose, in terms of how much we could achieve. Right? And I think different results get to different stages of this depth of understanding. The way I think about it, and tell me if you would present this a little differently, is first, you might just ask, how is it that the models are doing what they do? Can we even just describe how information is being processed in any way that's more enlightening than just everything connects to everything and we don't really know? If I then went up a level, I could say, okay. Well, I can just describe that and we're gonna get into the work that you've done to help zero in on the part of the network that seems to be most important for a given task. But then I could go beyond that, and I could say, well, okay. I can see that these are the parts that are lighting up, but why does that work? What is it actually doing? Is there a way for me to understand that in any sort of intuitive way? And then I guess maybe a third level would be, to what degree can we understand or determine if the things that it has learned, the strategies that it has learned are in fact general and constitute some meaningful understanding, or are they sort of still in the stochastic parrot paradigm of, yeah, you might get decent results on things that look like the training data, but there's not really a deeper understanding here. And I guess further too, you might say, well, is there anything we can do to encourage actual understanding or maybe discourage, I guess, depending on exactly what you're looking for from your systems? How do you think about those layers? Would you adjust that mental organization of this?
Arthur Conmy: 8:10 Yeah. Definitely. I agree with the first two layers, Nathan, you proposed there, and think about those two a lot. To me, mechanistic interpretability has just two steps essentially. Firstly, you find what the important subcomponents of this huge neural network are that matter. And then having established this subset that is important, you can then ask, well, what is the meaning of that subset? Which I think maps onto your two levels very well and I think describes a bunch of research that has already been done. I'm sure we'll get into that further. And yeah, I think that then the third level is one of many potential use cases essentially of mechanistic interpretability, where this could enable us to answer the questions of, are these models doing reasoning or modeling humans that they're interacting with? Or is it just heuristics and statistics that is not doing anything intelligent beneath just a ton of rote based rules? And we don't know the answer to that question at this point in time. Yet the stochastic parrot hypothesis is still a hypothesis that models are just parroting nonsense, but it looks correct to us because we haven't probed deep enough and no one does probe deep enough with their evaluations. And if we can actually understand the algorithms these models are implementing, we can have a yes or no answer to the question of whether there are just, in general, surface level heuristics or there are actual algorithms which go beyond just normal heuristics.
Nathan Labenz: 9:52 Correct me if you would disagree with this, but my general sense is that it's always both, and we just don't really know which is the case for any given task and model that we might want to consider. Right? It seems to me that, maybe not always both, but certainly once you get to a certain level of scale, there seems to be some generality that starts to appear, which, even if we haven't proven it yet, we've seen enough examples of meaningful grokking to believe that more is happening. But we just don't know for any particular task under consideration at the start. Has this been understood, or is it just statistical correlation that appears to be making sense? Is that your understanding too?
Arthur Conmy: 10:37 Yeah. I think that it's pretty clear that there are cases when models are reasoning or performing the correct algorithm to produce certain completions to problems. You can think of basic math problems or basic reasoning problems that have been turned into benchmarks and then destroyed by various large language models as affirmative tests. There are some cases of reasoning going on. But at the frontier of capabilities, where you suddenly have the next size of model that can do something that previous models couldn't do, such as GPT-4 surprising many people with its coding abilities, it's unclear whether, at that scale when the ability first emerges, it's actually just pattern matching that's worked in enough cases to convince humans, or whether there's something deeper going on where at some level you reach some understanding of code. And I think this is quite an important distinction, whether frontier capabilities are incredibly surface level when they first emerge or whether they can be learned in generality straight away, because the emergence of capabilities and the unpredictability of new things that models can do is quite important for the future risks of systems. Because if we can't predict what's going to come, then we would at least like to know that there's hopefully a surface level heuristic rather than a completely general solution to something that we thought was very difficult, because this could cause quite a lot of instability and unpredictability when we deploy systems. So we're already getting quite into the high level motivation for why these issues are quite serious. But yes, I agree that there's both surface level heuristics and general reasoning ability in these models. And I think that the problem is distinguishing the two, particularly at the frontier, if that makes sense.
Nathan Labenz: 12:35 Totally. I've said, probably on a few episodes at this point that for me, the most important sentence in the GPT-4 technical report is quite a short one. Certain capabilities remain hard to predict, and there they show the reversal of what had been an inverse scaling law where, I'm sure you're familiar with this, bigger models had been more susceptible to hindsight neglect or hindsight bias sort of reasoning fallacies until GPT-4 when suddenly, sure looks like that has been grokked. I don't think it's been, certainly hasn't been proven in public. I don't know what they know internally. But to go from everything's getting worse as the models get bigger to all of a sudden GPT-4 is perfect definitely suggests that there are some phase changes that happen on the frontier. And the fact that even OpenAI, for all the smooth curves that they can plot, can't really predict on any given task whether it's gonna be understood or not. That definitely creates a lot of unpredictability, as you said. So let's get back a little later to what we might do to encourage things to be more understandable from the start. And for now, just dive into what you have done to help us make sense of the models that we do have at present. So I love the way that you approach this paper, and I also enjoy the clever name, the ACDC, which you give to the algorithm. I think first, it's really useful just to describe the general process that one takes to mechanistic interpretability. I think you did a beautiful job of that in the paper, and this will be the first time that folks have heard this level of detail. So for starters, how about just giving us this overview of the workflow that mechanistic interpretability researchers tend to pursue? We'll get into each of the three parts in more detail. And then, of course, get into, especially, the part that you've automated.
Arthur Conmy: 14:40 Cool. Yeah. That sounds great, Nathan.
Nathan Labenz: 14:42 Hey. We'll continue our interview in a moment after a word from our sponsors.
Arthur Conmy: 14:46 So, the way we set up the three steps of the mechanistic interpretability workflow, just to prepare in advance. Firstly, choosing a behavior or a task that a neural network can perform, such as, as we discussed, the ability to predict modular addition in a language model trained to get the correct answer to modular addition sums. And then secondly, after picking this behavior, we then define the scope of the interpretation that we're aiming to explain. So this means that sometimes it's possible to explain the whole computation that the model performs in terms of individual neurons, for example, which communicate between different layers of your model. But this is in general pretty difficult since there are a huge number of neurons and computation is not often localized to individual neurons. So other researchers and projects I've worked on have instead looked at the attention heads and MLPs of transformers. So an MLP consists of many neurons, but you can consider it as just one component. And so a researcher who maybe thinks it's a bit too ambitious to explain how a task is performed in terms of the individual neurons may wish to explain the task in terms of the MLPs, the whole MLPs that are important in the model. Maybe it's just a subset of them. So then that's the first two steps. And then the third step is to perform a bunch of intervention experiments that find those attention heads or MLPs or neurons that are important for that task at hand. And this is generally the part of the process that is most laborious from the human side, and hence why we thought it was a good fit for trying to automate in terms of the circuit discovery process. So, yeah, those are the three steps. It's a high level overview. I'm happy to get into that if you'd like, Nathan. Or if there are any other questions that seem to stick out to you, happy to discuss.
Nathan Labenz: 16:54 Yeah. I definitely want to get into each one in more detail. But just to echo it back to you, first is identifying a behavior of interest. I mean, let's just get into the details now. I'm struck that there are a lot of trade offs here. Right? You've got, first of all, what models do you have access to? What models can you actually manage to scale your approach up to handling? What are those models capable of? So those are some very practical constraints in general. People seem to be doing this work at this stage on relatively small models, and those models are only capable of relatively modest tasks, certainly compared to the likes of GPT-4 and Claude 2. So maybe just give us a little bit more insight into how you choose what models you're working with and how you identify behaviors that you think are actually particularly worth this depth of investigation.
Arthur Conmy: 17:52 Yeah. Sure. That's a good question about the first step, of what do you mean by picking a task, or the criteria to choose here? And I think that some of the criteria which feel important to me are localization of a task, such as something which can be thought about on its own as distinct from the rest of the natural language computation. So one of the tasks we considered was a task where a model's able to predict a future year in a given century from a previous year in that century. And in this case, this is quite localized to individual token completions that are about particular years that have been incremented from the previous year. Whereas tasks that are somewhat more vaguely defined, such as the model produced a nontoxic response or a non harmful response, are often really open ended and difficult to pin down. And so it's generally harder to interpret tasks that have quite a vague definition and can be completed with a huge number of different tokens, for example, because this is just going to be distributed all through the network and be quite difficult to pin down. So that's one consideration on task choice. That's mostly a question of tractability, how tractable is it to actually be able to complete this interpretability project. And you could often narrow down wide concepts into smaller ones to get around that issue. And then the second consideration would be something like, is this task relevant to something that bigger models can do, and is it confusing from the perspective of the actual capabilities of these models? So as examples of this, a lot of people are interested in models' ability to recall facts, for example, to produce completions to sentences that require knowing something about the objects in that sentence. And this is important because we'd like to know, if we train our language models on this much data and this kind of person is mentioned this many times, for example, well, will the model be able to store that information? Will it know that information and stuff like that? And so this turns out to be quite an interesting problem because it certainly matters for models we deploy later, and there are lots of things that natural language models do because the training data distribution is extremely diverse, and so selecting something which is useful as models get bigger is another important consideration. So, yeah, those are the two considerations for choosing a task.
Nathan Labenz: 20:48 That's really interesting. So just on the first point about locality, it strikes me that there's probably at least a rough mapping onto, or maybe from, tasks that we ourselves have some ability to describe how we're doing and tasks that we might hope would be tractable. For example, a toxic or nontoxic response. If I look at my own behavior, I'm not always that super clear as to why I generated a toxic or nontoxic response. Whereas I think I have a clearer sense of how I'm thinking about something like the example you gave about the years, a sample prompt from the paper. The war lasted from 1517 to '15, and then it's up to the language model to complete that. Introspectively, it feels like I have a better sense for what I'm doing. How much does that kind of introspective decomposition of tasks feed into your task selection? Not to say it should or shouldn't, but I'm curious.
Arthur Conmy: 21:57 I think that we always would be choosing tasks that involve something that humans know how to do, because then we can put metrics and measures on how much the model does them, since we can figure out how we do them. But I think the interesting point here is that models often do things in very different ways to humans. And we found several examples of this in the circuits we discovered, where when models are choosing the correct name to put on the end of a sentence, they often aggregate all the names together and then remove the duplicated names, which is not how humans reason about names at all. So I think it's the case that we always choose tasks that as humans we can understand how we perform them, at least at this point in time. But we don't always observe that the language models perform these tasks in the same way as humans at all.
Nathan Labenz: 23:00 Yeah. Certainly, it's always important to keep in mind the just profound, profound differences between the way that we do things and the way that the models are often, seemingly, in fact learning to do them. So okay. Well, there'll be more opportunity, I think, to unpack some of these examples and see how this plays out. When you're putting together a dataset, I also understand that it's important to have contrasting examples, where you want to set up a situation where you can look at the difference between an example completion that's doing what you want it to do and one that's not doing it. Can you give us a little bit more intuition for that? Is it like, for the year example, would the contrasting be doing it right and doing it wrong? Like, getting the wrong answer, from 1517 to 1515, something nonsensical. Is that the kind of super sharp contrast, or is it just other tasks that are kind of not this task?
Arthur Conmy: 23:57 Cool. Yeah. That's a great question. So when we define these tasks, as we explore in our paper, a crucial component of this is the selection of two datasets. And the first one is, unsurprisingly, a bunch of prompts or inputs to a model which have the behavior identified, but you also need a contrasting set, as you've just mentioned, Nathan, which is crucial to find the components of the model that actually do your original task. And why do we do this? Why do we need a comparison of two different datasets? This has to do with how, in the language model computation, it's not possible to just find the subset of the model which does your one task and just completely ignore the rest of the model, because that is still going to need to do some computation in the forward pass when you're running the model to be able to produce the correct completion. You can't just take your model in your programming language and say, no, I don't want that component. Just remove that. You have to be somewhat more clever. And so the default path that's taken in most machine learning research is pruning. And what pruning does is set certain weights in the network to zero to remove the effect of that component of the model. So if it's a neuron, it will now never fire if its weights are zero. But this is actually quite problematic, because the model in training and normal runs is not used to seeing just zeros from a huge number of components in its forward pass. It's used to seeing different values which fall on some distribution. And then sometimes something fires slightly more than usual, let's say, and that will then cause the model to produce the completion it does in the current setting. And so this is a quite long point, but the crucial finishing touch here is that if you have a contrasting set of examples, you're able to just set the model's internal activations to the activations on the corrupted dataset. And this doesn't have the problem that a lot of machine learning research has, that it just sets activations to zero and the model is now essentially confused as to what it's doing because it's never seen zeros before. Instead, the model is just counterfactually seeing different outputs from earlier components. So that's the intuition for needing this contrasting set of examples. And in the case of the years, where we're predicting the war lasted from 1517 to '15, and this example is from a great paper by Michael Hanna and fellow authors where they find how GPT-2 can do greater-than, if you're interested. This then has a baseline example that those authors chose, where I believe it says just the war lasted from 1500 to, and here literally any completion can work: 1500 to 1500, 1500 to 1501, 1500 to 1599. And so the model doesn't need to do this greater-than operation to find out the future years. And so this serves as a perfect baseline, because now, when you compare your two datasets, the model components that are important on the greater-than dataset, 1517 to some future year, but aren't important on the baseline dataset, 1500 to literally any year in the 1500s, are actually the model components doing that greater-than computation. So I guess the key point here is that greater-than is implicitly an operation which is not just the super general algorithm of predicting just any year.
So our technique, this formalism of the mechanistic interpretability workflow, is able to specifically zoom into the task at hand, which is about predicting a greater year rather than just any year at all. So yeah, I hope that provides a bit of a longer story for why the baseline for mechanistic interpretability is quite important, and it's not sufficient to just have zeros as a baseline.
Nathan Labenz: 28:49 Yeah. Cool. Okay. So just to send some of that back to you, hopefully to make sure everybody listening along is with us. What I'm understanding is that, in anticipation of part three, where you're going to be systematically eliminating parts of the network to figure out which parts matter most, it has been found (and you have an intuition for it, but I assume it started as an empirical finding, as most do) that just a hard elimination of different parts of the network, where we literally take them straight to zero, is in fact too far outside of what the network has learned to expect and learned to process. And so it creates these other problems, and you actually can get better results by replacing instead of deleting outright: you replace whatever values with something that's representative and normal in some sense for the model. But to do that, that's where you need these baseline examples, to have a sense for what that normal would be in this case.
Arthur Conmy: 30:06 Yeah. That's exactly right. So I think a key intuition pump that I have for why we'd like to avoid setting things to zeros is that possibly the model's components have an implicit bias term, essentially, that is not present in the literal bias parameters. But the weight matrix, just on average, for example, is outputting a value in some particular range, and it's unlikely that this range will be zeros overall. And so it's super useful to use a baseline rather than just zeros. It's not my primary research contribution, but when I was at Redwood Research, who did mechanistic interpretability research, some of their research output provided evidence for this claim that it's not a great idea to set activations to zero, and that instead corrupting your model with activations from a different dataset example may be more representative of the model's computation.
Nathan Labenz: 31:09 As I think I mentioned, I'm always very careful about analogies, but, again, just for intuition's sake, if you're trying to do something like this on a human and you literally just removed part of their brain entirely, then you might imagine that other parts of the brain would be quite disturbed by that and be like, hey. Wait a second. We're expecting signal from here, and not getting it. And the whole system can kind of go haywire. I mean, that's where you get into literal lobotomies, I suppose. So instead of actually totally disabling a part of a network, you say, let me just return this to sort of baseline activity so that other parts of the network aren't disturbed and they get something along the lines of what they're accustomed to seeing.
Arthur Conmy: 31:57 Yeah. That sounds exactly right. And I definitely do think in terms of similar analogies to interventions you do on humans sometimes, because it is helpful to choose between different interventions you could do on models. And I agree that the zeroing intervention often is equivalent to just removing something in a human body or something if we were giving an analogy to medical interventions. And in general, this wouldn't be the way you would treat someone. You'd rather go for a placebo of the hormone or something they're being treated for rather than just removing their body's ability to produce that hormone. That would be the default strategy to treating something.
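To make the intervention described above concrete, here is a minimal sketch of corrupted-activation patching versus zero ablation on a single component, assuming the Hugging Face transformers library, GPT-2 small, and a name-completion style prompt like the one Arthur mentioned earlier. The choice of component (the layer-0 MLP), the prompts, and the helper names are illustrative assumptions, not the exact setup from the ACDC paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Clean prompt (the behavior we care about) and a corrupted baseline prompt.
clean = tokenizer("When Mary and John went to the store, John gave a drink to", return_tensors="pt")
corrupt = tokenizer("When Mary and John went to the store, Bob gave a drink to", return_tensors="pt")
assert clean["input_ids"].shape == corrupt["input_ids"].shape  # needed for position-wise patching

component = model.transformer.h[0].mlp  # the single component we intervene on (layer-0 MLP)

# 1) Record the component's activations on the corrupted baseline prompt.
saved = {}
handle = component.register_forward_hook(lambda mod, inp, out: saved.update(act=out.detach()))
with torch.no_grad():
    model(**corrupt)
handle.remove()

def zero_hook(module, inputs, output):   # zero ablation: a value the model never saw in training
    return torch.zeros_like(output)

def patch_hook(module, inputs, output):  # corrupted patching: a realistic, on-distribution value
    return saved["act"]

def top_tokens(inputs):
    with torch.no_grad():
        logits = model(**inputs).logits
    return [tokenizer.decode(t.item()) for t in logits[0, -1].topk(3).indices]

print("no intervention:   ", top_tokens(clean))
for name, hook in [("zero ablation:     ", zero_hook), ("corrupted patching:", patch_hook)]:
    handle = component.register_forward_hook(hook)
    print(name, top_tokens(clean))
    handle.remove()
```

In practice the same replacement is repeated for many components and scored with a task metric, which is where the conversation goes next.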
Nathan Labenz: 32:39 Okay. Cool. So we're getting close to the end of part one and definitely anticipating part three. Again, part one is: identify a behavior of interest, a dataset that demonstrates it, and a metric to evaluate it. So let's then talk about the metric to evaluate. This is one of the areas of the paper where I was a little bit confused, because my intuition out of the gate was just: if I'm trying to figure out what parts of a network are important to doing a certain thing, then the intuitive metric that I would want to look at first would be, can I do some of this neutralization or corrupt patching, as it's called, and basically get the same output? But you take a few different approaches, where it seems sometimes you are just trying to make sure that outputs are minimally changed, but other times you're looking at other kinds of metrics. So give us a little bit more about how you think about identifying metrics and why there are different metrics here in the first place.
Arthur Conmy: 33:44 Sure. Yeah. I think that metrics are mostly a question of the practitioner's choice in selecting a task. So to give an example, in the case of the war sentences, so again, the language model completes sentences like the war lasted from 1517 to '15, yada yada, we deem the language model to be correct when it answers 1518 or 1519 or some future year, and incorrect when it completes the sentence with lasted from 1517 to 1516. And so there is some human judgment call here that some completions the model chooses, like 17, 18, 19, are the correct ones and we want to measure that. And there are some completions that the model probably places some probability on, because these models usually output a distribution over each completion they could create, that we deem incorrect. So the practitioners in this case decided to measure how good the model was at this task by summing the probabilities on the correct completions, like the future years, and subtracting the contributions from years that were less than that. And so it wasn't sufficient for those researchers to just find the subsets of the model that were similar to the original model, because the original model was wrong in some ways and would sometimes predict the wrong, earlier years. And so the metric allows you to be a little bit more fine grained and measure an exact behavior that the model has, rather than just hoping that the model's distribution, which includes some incorrect parts, is correct. So, yeah, I guess the high level thing here is that language models can be wrong and are often wrong, but you can select for that if you know what the right answer is and put this into a metric. Does that make sense?
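For intuition, here is a minimal sketch of the kind of probability-difference metric Arthur describes for the greater-than task, assuming GPT-2 small via Hugging Face. The exact prompt wording, the year range, and the lack of any normalization are illustrative assumptions rather than the metric from the paper he cites.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "The war lasted from the year 1517 to the year 15"
start_yy = 17  # two-digit suffix of the starting year

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits
probs = logits[0, -1].softmax(dim=-1)  # distribution over the next token

def mass(years):
    # Two-digit continuations like "18" are single GPT-2 tokens; years below 10 are
    # skipped here to avoid leading-zero tokenization edge cases.
    ids = [tokenizer.encode(str(y))[0] for y in years]
    return probs[ids].sum().item()

# Probability on strictly later years minus probability on the same year or earlier.
metric = mass(range(start_yy + 1, 100)) - mass(range(10, start_yy + 1))
print(f"probability-difference metric: {metric:.4f}")
```

The same function can then be evaluated on a patched model, like the sketch above, so each intervention is scored on exactly the behavior of interest rather than on overall similarity to the original model.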
Nathan Labenz: 36:06 Yeah. It's interesting. I mean, I guess we'll get into this a little bit in the upshot for the findings portion as well. But it does still feel a little bit confusing to me, inasmuch as this model has been trained to do this thing, or maybe it's not been trained to do specifically this thing, but it's been trained, and this is one of the things that it seems to be able to do. If we then go in and start cutting out parts of the network or patching them with something neutral, it doesn't seem like it should get better. I guess, in some cases, it could, but that would be a weird, random result, or seemingly random. I mean, if you have intuitions for that, I definitely want to hear them. But then I kind of wonder, why not just stick to the simplest thing of keeping the same original behavior, as opposed to looking at this task specific measure that's cooked up seemingly ad hoc?
Arthur Conmy: 37:10 Yeah. I think I want to push back slightly on the intuition there of how language models produce their completions, because often when we're using the GPT-4 and Claude chatbots that are state of the art, it feels like they're perfect in some completions. It all looks correct, and there's very few errors. But the pretraining task for models, at least most of the computation procedure to produce that great model at the end involves predicting text from across all the internet, so across all books and all forums and all just general sites on the internet. And the model probably always has some uncertainty over what exactly the location on the internet essentially of the prompt is that it needs to use to produce better completions that are more likely to be correct. And so in some sense, it's juggling a lot of different heuristics, but some include just producing what to humans seems correct, like a year that's in future, but there's some balancing of ironic completions or jokes or something where it has to balance some probability that there's just a dumb mistake here, essentially. And then there's also some uncertainty that maybe this web page on the internet got transcribed poorly so that the document will just cut off at this point. And when you think about just how diverse the Internet of different text pages is, it's then less obvious that the model should be completing what, to us humans, seems like the correct response, just because there are so many contexts in which any particular sentence could arise. And this introduces just so much complexity that I guess my intuition for having metrics for behaviors and mechanistic interpretability is you can control for this long tail essentially of possible reasons why completion would arise by specifically choosing completions that are correct in the sense of a year that's larger rather than smaller. So, yeah, I think that a language model training produces a very wide diversity of outputs, some which seem correct to humans and some that don't at all, and we can control for this.
Nathan Labenz: 39:49 To try to synthesize that for myself, I guess what you're saying is that the models in general have not been trained super specifically on this task, and they have not been trained on a super clean dataset. That's probably another reason the frontier models are performing so much better: they probably have a lot cleaner dataset than some of the earlier open source stuff that is accessible for this kind of work. But because they have such a mess going into training, they've learned all these different things, and there's all these sub circuits running in superposition on top of each other. And so, when you zero in on this task, it is actually reasonable to expect that there is some sub circuit that in isolation could do a better job than the model itself is doing. And your goal with defining a task specific metric is to zero in on that, perhaps even better than baseline model performance, by stripping away this other stuff that in fact might be hurting overall performance on this task.
Arthur Conmy: 40:59 Yeah. Exactly. I agree entirely. Yeah. And just to put it in perspective, the number of tokens these models are trained on is just mind blowing. And you imagine or rather you look at the statistics of how long a human would need to spend reading just text monotonously. It's orders of magnitude longer than a human lifetime to consume the amount of text that these models consume. And therefore, as individual humans, we actually cannot model the full diversity of the distributions that we're training these models on. So to me, it's actually not surprising that there's just substantially more complexity than there is to human text completion inside of these language models.
Nathan Labenz: 41:50 That's all my questions on part one. So we have identified a behavior of interest. We have put together a dataset that demonstrates it and has some contrasting examples that are similar but don't have exactly that same critical behavior. And we've got a metric to evaluate it, which could be, by default, just minimal change to the model's output. But we do have some reason to expect that we might even be able to get better performance if we're savvy about defining a more intentional metric for evaluation. That's a big theme I keep trying to reinforce for folks in general, just at the base level of model creation in the first place: clever formulation of a loss function has been one of the big unlocks in the last few years, really. Right? Just moving to this next word prediction or next token prediction in the first place was a stroke of genius that allowed for all the data to be used. But that's just one loss function, and there's certainly a lot more work to be done there to come up with better optimization targets. So that brings us to part two. And for this part, again, I don't want to be overly reliant on analogies, but here it does seem that in a lot of different areas of science, one of the core decisions that you have to make is: at what level of zoom, or what level of abstraction, am I going to study this problem? So in biology, you've got ecology and you've got, on the other end, genetics, and you've got a lot of layers in between. You can study cells. You can study systems within the body. You can study an individual organism. You can study a species. You can study all these different layers. So it seems there is a reasonable analogy here that you have to do the same thing. Right? You could look at every single activation, but for multiple reasons, that becomes either computationally intractable or just too much of a mess. And so you have to pick how zoomed in we want to get for the purposes of this particular analysis. So how can you develop our intuitions a little bit more there for how you think about this decision?
Arthur Conmy: 44:21 Yeah. The analogy to biology, while it's important to be guarded around giving analogies to different fields, I think that, broadly speaking, I expect the development of this mechanistic interpretability field to progress more like biology than fields such as physics, because there are a number of parallels between the development process of the complex systems that are neural networks, essentially, and the evolution process, which also trained on a very stupid goal function but then gave rise to incredibly complex behaviors along the way. And so, beginning with an agreement with the analogy here, in terms of what the choice is and how the choice is made in mechanistic interpretability research, I think it's mostly a question of how ambitious the researcher is, essentially, in terms of how they're pushing this frontier of the best explanations that we currently have of certain behaviors, behaviors of different complexity, with sufficient depth to that explanation of a behavior. It's a young field, and there are really not many researchers doing research on mechanistic interpretability, but we already have neuron level explanations of how toy transformers complete the correct completion to modular addition, which is work by Neel Nanda and collaborators. And we have much worse understanding, but still some understanding, of how GPT-2 small completes its predictions solely in terms of attention heads and MLPs. And then to give a third example down the line, other researchers, I think the paper is called ROME, or the Eiffel Tower is in Rome, have looked into the factual recall of models and then edited that recall by looking at where facts are stored in terms of whole attention layers, which each include a bunch of attention heads in parallel. And their work groups together whole layers of the model to then say, well, the factual recall is isolated to these layers. And so really it's a question of how ambitious your project is. And the correct answer to how ambitious you should be is not always more. If you try to explain things at a really low level and this is just extremely difficult, then it's unlikely that projects will be successful. And so I think as a research community, people in mechanistic interpretability research are just trying to improve this frontier of getting better and better at explaining harder behaviors, but in a more zoomed in way, and giving their contribution to the field that way. And so it mostly comes down to a judgment call on the researchers' part for where they're aiming. And so, yeah, that's an example of three different levels: the neuron level in the modular addition work, the attention head level in the GPT-2 small work covered in our work, and then the attention layer, or several layers, level that's explored in the factual recall work. And I think all of these are pretty good contributions at different parts of the frontier.
Nathan Labenz: 48:07 And, again, it just seems there is intuition here around things like the best way to explain the flight of a fly ball is not to go to the quantum mechanical level. So you want to use a level of description that is actually meaningful to the person who's absorbing the output. So in some sense, it's like you're optimizing for the human audience's ability to understand the results as much as anything else.
Arthur Conmy: 48:40 Yeah. And also, as a quantum mechanic, to actually be able to finish your project to explain flight there. I wish you luck if you're a quantum mechanic trying to do that, but I also don't expect it to be successful. So that's the trade off that is being made here.
Nathan Labenz: 48:55 Yeah. So how does this work in practice? At a conversational level, this sounds, I don't want to say easy, but it sounds like there are a few levels that you could zoom into that seem pretty natural. And if we're chatting about this over lunch, we can easily say, oh, why don't we try looking at the attention heads as the level of zoom for this particular project? But then, obviously, you go back and actually code this and make it something that can feed into the third step of actually automating the process of isolating these subgraphs. So how hard is that from a coding standpoint or a notation standpoint? That sounds kind of hard to me. I know I'm not the world's greatest coder, but I feel like I would have a hard time going from, okay, we've decided we want to zoom in on the attention head level, or individual MLP blocks will be our unit of consideration, to then actually figuring out how do I express that in code as a causal acyclic graph. That's where it starts to sound a little more challenging. How hard is it in practice to do that?
Arthur Conmy: 50:05 Yeah. I would say this was definitely one of the more fiddly parts of the ACDC project, to translate these high level intuitions into something which is able to be modified inside code. I do think that there are pretty good libraries. This is mostly a coding question at this point, to be able to extract and edit the internal states of machine learning models. Once a researcher has some familiarity with how language model forward passes work, it's not so difficult to then add attachment points into your code base to extract those activations, because generally they are represented cleanly. And in our work, we made a library for researchers to be able to more cleanly edit the impact of one particular model component specifically on one later downstream component, because that's the part which is somewhat harder from the implementation perspective, editing the specific effect one earlier model component has on one later model component. Because by default, your code just runs through end to end, and one model component affects all downstream components. But to do interpretability, you'd like to be somewhat more fine grained and look at the impact that an earlier upstream component has on each individual downstream component. So that's the part that's difficult, though it is pretty easy to get at least started with isolating individual attention heads. And there's now a lot of educational material trying to get more people to do mechanistic interpretability and to have fun doing these sorts of experiments.
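To illustrate the harder part Arthur mentions, editing one upstream component's effect on one specific downstream component, here is a toy, self-contained sketch of an edge-level intervention in a two-component residual stream. The tiny made-up model and its weights are assumptions for illustration only; real circuit-discovery code does this with hooks on actual transformer components.

```python
import torch

torch.manual_seed(0)
d = 4
W_a, W_b, W_out = (torch.randn(d, d) for _ in range(3))  # two components plus an unembedding

def component_a(resid):   # upstream component: reads the stream, writes back
    return torch.tanh(resid @ W_a)

def component_b(resid):   # downstream component
    return torch.tanh(resid @ W_b)

def forward(x, a_out_for_b=None):
    """Run the toy residual stream. If a_out_for_b is given, component B sees that value
    as A's contribution, while the direct A -> output path keeps the clean contribution."""
    a_out = component_a(x)
    resid_after_a = x + a_out
    b_input = resid_after_a if a_out_for_b is None else x + a_out_for_b
    resid_final = resid_after_a + component_b(b_input)
    return resid_final @ W_out

x_clean = torch.randn(d)
x_corrupt = torch.randn(d)

a_out_corrupt = component_a(x_corrupt)                       # A's activation on the corrupted input
clean_logits = forward(x_clean)
edge_patched = forward(x_clean, a_out_for_b=a_out_corrupt)   # only the A -> B edge is corrupted

print("change from patching the A -> B edge:", (edge_patched - clean_logits).norm().item())
```

If a task metric barely changes when a single edge is corrupted like this, that edge is a candidate for removal from the circuit, which is essentially the test that ACDC automates across all edges.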
Nathan Labenz: 52:03 The difference here is that if you are trying to just execute procedurally the transformations of the transformer, you take in data, you apply some mechanisms that ultimately cash out to linear algebra, you take the results, and you just keep going. Right? But once you're past a certain layer, you can leave that stuff aside. By default, at inference time, you don't necessarily keep track of how layer 1 interacts with layer 8 or whatever. You don't even necessarily think of it that way in the implementation, because you're just doing one thing at a time. It feels like this happens, then this happens, then this happens, and it feels very linear. But there is this kind of empirical, conceptual finding that, because of the residual stream, if I understand correctly, you can actually have critical interactions that do not proceed layer to layer but actually skip layers, or you have these, obviously I'm just making up fake examples, but if this happens in layer 1 of this particular network, then layer 5 also lights up. And that's not the kind of thing that the naive forward pass implementation really looks at. So the hard part, or maybe not the hard part, but certainly a conceptual leap that one needs to make here, is understanding that this causal graph is a bit more complicated than just the direct procedural implementation.
Arthur Conmy: 53:45 Yeah. That's exactly right. So this was a really beautiful finding from the interpretability team at Anthropic, the AI lab. As you mentioned, Nathan, there's a residual stream, which in normal machine learning is usually just referred to as the hidden state of the network, which is transformed by each layer incrementally. But the researchers at Anthropic realized that if you're implementing a model with residual connections, which just means that once you apply some transformation to this hidden state, you then add the original hidden state back to the transformation, you incrementally update by adding a transformation of the current state at each step. And this gives you a whole way to view the forward pass of transformers, and many other models, as a continual stream of information which the model components read from and then write back to. And this is a really useful finding for the whole mechanistic interpretability research field. It has this key consequence that we can model the impact that one very early layer component has on a specific downstream component, which is a really nontrivial finding, but it's really beautiful once it makes sense, from the Anthropic paper, A Mathematical Framework for Transformer Circuits, which was a substantial inspiration for our work and which we added to.
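A small numerical illustration of the residual-stream framing Arthur describes: the hidden state at the end is the initial embedding plus the sum of every component's write, which is what lets you attribute an early component's contribution to any later reader. The toy components below are arbitrary assumptions; only the additive bookkeeping is the point.

```python
import torch

torch.manual_seed(0)
d, n_components = 8, 6
x = torch.randn(d)                          # initial embedding entering the stream
weights = [torch.randn(d, d) for _ in range(n_components)]

resid = x.clone()
writes = []
for W in weights:
    write = torch.tanh(resid @ W)           # each component reads the current stream...
    writes.append(write)
    resid = resid + write                   # ...and writes its output back into it

# The stream at the end is exactly the embedding plus the sum of all writes.
reconstructed = x + torch.stack(writes).sum(dim=0)
print(torch.allclose(resid, reconstructed))  # True
```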
Nathan Labenz: 55:30 I think the graphical representation that they use in that paper, which presents the residual stream as the central object, and all of these other things, the attention heads and the MLPs, as side loops that are making a contribution, as opposed to centering those and having this kind of random line that goes around them, which is how, you know, so many of the transformers are designed, that was definitely an eye opening and clarifying reframing for me, and that is in the early slides of the AI scouting report. So, definitely another good reminder to check that out to get a little bit more grounding on some of these earlier results. This ultimately does still sound kind of hard to me, because you have a clean forward pass implementation, but now you have a causal graph with many more paths that it ultimately turns into for the purposes of this type of sub circuit isolation work. We'll come back to that in just a second. One thing that jumped out to me, and this is kind of a footnote, the kind of thing that may be really obvious to all the practitioners like you who do this every day but is not necessarily so obvious to people who are trying to catch up to you, is the idea that all of these circuits are acyclic, meaning there can never be a loop in the design of the circuit. It all has to be one way; the language of forward pass and back propagation sort of suggests that. But I'm interested in your thoughts a little bit on this cyclicality. I guess, for one thing, do you have anything more to add? And then on the other hand, I wonder if that represents some sort of frontier in model development, where if we could figure out how to have some sort of cyclical loop. Certainly, that's something that we have. Right? We have some sort of ongoing feedback mechanism where our current state interacts with our future state and our past states in ways that are not just waking up and executing a single forward pass in isolation each time. So, yeah, do you have any thoughts on this cyclicality? It's a little bit of a digression from the main topic, but does it feel like that is something that will never come online because our techniques just don't support it? Or do you feel like that is maybe a candidate for another future capabilities unlock?
Arthur Conmy: 58:06 Yeah. I really like this observation, and I think it's one that researchers like me who spend their days trying to understand these models forget. It's a design choice, which has informed all the things that I look at, that these models are just end-to-end rather than cyclic at all. And so I think it's a great observation, because it's one I just wouldn't have noticed, since I spend all my time working with the end-to-end models. And yeah, the constraints of backpropagation, and requiring gradients for each individual component aggregated end to end, do enforce that the forward pass of the model computes a purely linear process which goes through different edges but is always going forward in some sense. And yeah, the mechanistic interpretability research community has mostly focused, or is mostly focused at the moment, on these transformer language models, which all operate under this acyclic paradigm. And yeah, it's essentially a whole other dimension, which there is no current work on, to be able to unlock some interpretation of, and that definitely could be something which would really advance our understanding, particularly because a lot of the current use cases of language models are maybe not quite acyclic, but certainly more recurrent, which means feeding back into themselves, which in general mechanistic interpretability hasn't managed to get a great handle on yet. So while we can talk a lot about the ability to understand how models produce one-token completions, and how this is a really exciting open research direction, we don't have much understanding of how models produce helpful rollouts of completions, so whole poems, or prompts that go on and do some chain-of-thought reasoning, for example. We just don't really know the mechanics of how the model uses its computed first token to then produce its computed next token. And we also don't have a great understanding of these more common agents deployed on the internet, such as the AutoGPT models, how they work, or whether they are meaningfully different from the models that just compute one forward pass; instead, these models continually produce more and more actions that the model then tries to take, observing some consequence from each action and acting on that further. So yeah, all this discussion is premised on really individual, single forward pass completions. There are ways it could be extended, but we certainly haven't done them, and I'd love for future mechanistic interpretability research to hopefully grapple with these harder problems of recurrence, and then plausibly even cyclic models, if people find ways to make this work with backpropagation.
Nathan Labenz: 1:01:19 Yeah. I kind of see two more little follow-ups in this digression, and then we'll get to the core of your contribution. If I understand correctly, the core constraint here, and the reason for having this no-cycle, no-loops, acyclic constraint, is basically that we just want to have easy computation for backpropagation. Right? You can work your way back, and at each step you can say, well, we know how all the other stuff already plays out, so we're using the chain rule, and at each step it's an easy calculation. Whereas, I guess, if you had a loop, that would seem to suggest something more like a differential equation type of math dynamic, and then you would have much harder math on your hands. Is that basically the issue there?
Arthur Conmy: 1:02:10 Yeah. That seems correct, that we have found this easy way to train models under backpropagation. And it's certainly not the optimal way, but it's a common lesson in machine learning, essentially, that a lot of progress is generally gained from pushing simple techniques super far compared to creating incredibly intricate and complex techniques over long periods of time. And so, yeah, the article by Rich Sutton, a famous machine learning researcher, calls this the bitter lesson: that generally, smart methods in machine learning, if they can't absorb large amounts of computation, lose to much simpler methods if the simpler method can scale a lot. So I think this forward pass, non-cyclic paradigm of models is probably a consequence of this simple backprop setup for language models, in fact, being very scalable to large amounts of compute, whereas cleverer architectures may bake in better assumptions and, in theory, have more useful properties, but aren't as easy to just scale with a lot of compute. And I think that's the focus of my research, I'd say, overall: focusing on the techniques that are simple in principle but actually scale to quite formidable consequences.
Nathan Labenz: 1:03:54 All makes sense. The bitter lesson, we learn it over and over again. What about a change to the loss function, on the other hand? So I'm thinking of a recent paper, and I'm hoping to interview the authors of this one as well. I believe it was out of Stanford, where it made the rounds for having a backspace token. That was kind of the headline: we introduced the backspace token, and now the model can course-correct. I've only read the paper superficially so far, but it also seemed to involve a different loss function. They talked about recasting the process as an imitation learning challenge as opposed to just next-token prediction, and, therefore, the optimization seems to be over a longer set of tokens, and that can feed into this ability to do the backspace action when the results seem to be getting too far outside of the normal distribution. What do you think of it? How does that kind of loss function switch potentially relate to this kind of mechanistic interpretability work?
Arthur Conmy: 1:05:02 Yeah. I don't think as much as a researcher about the loss function these models are trained on, or what the optimal choice of loss function is there. But I certainly think it's a really exciting direction for interpretability to try and choose loss functions that are more interpretable by default. And I was excited to hear that you spoke with Ziming Liu, who's done some work on changing the loss function of models to make them more modular. I think that this work is exciting, and it's not something which I personally worked on, but I always enjoy seeing changes to the default setup which can hopefully incentivize models to be more easily amenable to our explanation techniques. And yeah, I think that one lesson here, which is quite useful, is that it's exciting to have interpretability and mechanistic interpretability techniques that can hopefully work no matter what the training setup is or how models change. So we'd like to have approaches which will work even if the game changes slightly and people do things differently in future. And so this was a substantial motivation for the work on just pinning down a model's computational graph in full generality, because this wasn't tied to having the particular transformer architecture that's basically ripped off of the GPT-2 and GPT-3 papers, but could potentially be used for any sort of model. So, yeah, I think it's a good idea, particularly because machine learning moves so fast, to be open to approaches that will still work if the board game changes as machine learning progress continues.
Nathan Labenz: 1:06:56 So that brings us, and I appreciate all your time and willingness to go down some of these rabbit holes with me, but I think that brings us finally to your core contribution. Just zooming out for a second, this is a three-step process: identify a task, have a dataset that can demonstrate the task, and have an optimization goal to evaluate how a subnetwork is doing against that task. That's all part one, getting set up. Two, figure out how to represent your network as this causal graph. And now three, what you have created is a piece of software that can automate the otherwise extremely tedious process of systematically working its way through all the branches of this graph and figuring out which of these actually do anything and which we can cut as we look to zero in on what's actually doing the core part of this work. So tell us about how that works.
Arthur Conmy: 1:07:55 Yeah. Cool. That sounds great. We got so into explaining the side tangents of why mechanistic interpretability research does all these things that that was a far longer than necessary introduction to what the contribution of ACDC, so automatic circuit discovery, is, since ACDC is really just a three-step algorithm that imitates the human process for trying to interpret neural networks, but does this via software rather than requiring a human in the loop. And so, given all the extensive description of the setup that we went through, the three steps are: firstly, selecting this computational graph at some level of abstraction; then, at a given node in that graph, looking at all the input edges to that node, and one by one removing them by setting their activation to the activation on the baseline dataset, and then measuring whether setting the activation along this particular edge to the baseline dataset decreases the model's performance on the downstream metric by a given amount. If this is a large decrease in model performance, then we keep this edge in the graph, but if it didn't seem to matter at all, we can remove this edge. That's step two, which we then just recurse, in the third step, through all the nodes. So that's the high-level overview of all of ACDC, but it really is just three steps to find a subgraph of the model's whole computational graph.
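For readers who want the shape of that loop in code, here is a schematic sketch of the pruning procedure as Arthur describes it. Everything here is a hypothetical stand-in rather than the real ACDC library interface: the graph representation and the `metric_with_edges` callable are illustrative, and the real implementation patches cached activations rather than treating the model as a black-box function of an edge set.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)

def acdc_sketch(
    nodes_reverse_topo: List[str],                     # nodes from the output backwards
    incoming: Dict[str, List[str]],                    # downstream node -> upstream nodes
    metric_with_edges: Callable[[Set[Edge]], float],   # runs the task metric with only
                                                       # these edges left "clean"
    tau: float,                                        # per-edge importance threshold
) -> Set[Edge]:
    kept: Set[Edge] = {(u, v) for v, ups in incoming.items() for u in ups}
    baseline = metric_with_edges(kept)
    for v in nodes_reverse_topo:                       # step 3: recurse over the nodes
        for u in incoming.get(v, []):                  # step 2: try each incoming edge
            edge = (u, v)
            if edge not in kept:
                continue
            trial = kept - {edge}                      # corrupt just this one edge
            score = metric_with_edges(trial)
            if abs(baseline - score) < tau:            # edge barely mattered: prune it
                kept, baseline = trial, score
    return kept                                        # the discovered subgraph

# Toy usage with a made-up metric in which only two edges actually matter:
incoming = {"mlp_2": ["head_0", "head_1"], "out": ["mlp_2"]}
metric = lambda edges: 1.0 if {("head_0", "mlp_2"), ("mlp_2", "out")} <= edges else 0.2
print(acdc_sketch(["out", "mlp_2"], incoming, metric, tau=0.05))
```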
Nathan Labenz: 1:09:59 So you've got this process for neutralizing components of the graph and asking how performance compares on the output metric. And then if the performance is sufficiently degraded, we decide, okay, that's one we need to keep, whereas if it's not sufficiently degraded, then, okay, we can throw that away. So I guess two questions there. One is just procedural: you have all these different length connections, right? If I'm starting at the end, my last MLP block or whatever, it's influenced by all previous layers, but those are mediated through the residual stream. So what does it actually mean to say, if I'm looking at the connection between, let's say, the eighth and final MLP block and the first attention layer, that the direct connection there is being mediated through the residual stream, where all this other information is also flowing? What does it mean to knock that part out? You can't take the whole residual stream out. Right? So what does it actually look like to cut that kind of connection?
Arthur Conmy: 1:11:08 Sure. So in the example of a very early layer attention head that might be just one of the inputs to one late-layer MLP, we would generally write the input to that MLP as the sum of all the previous components, because, as mentioned, in the residual stream the input to the MLP is just the sum of all the previous components that have added to the residual stream. So if we want to corrupt the effect that this singular attention head in the early part of the model has on a far downstream component like an MLP, what we can do is take the input into this MLP at the end, subtract the clean contribution from the early attention head, and then add in the corrupted output of this attention head. This will preserve all the other clean activations which are the inputs to this MLP and will just corrupt the contribution which comes from that one early attention head. And so, yeah, that's the process that we use to edit just this singular connection from, in this example, an attention head to an MLP.
Nathan Labenz: 1:12:47 Let's go back to the years example. So you're looking at the greater-than task, the war went from 1517 to 15 blank, and you've got your contrasting example, which might be the war went from 1500 to 15 blank, such that you don't necessarily need to do the greater-than, because any two-digit number there would work. Right? So you then have a situation where you're like, alright, I've got all my activations for the actual greater-than task, and I've got all my activations for the very comparable task that doesn't require the actual greater-than operation. I hear you. I get the idea that I can express all the inputs to a late layer as the sum of all the outputs from the earlier layers. But you're only doing one adjustment to the sum at a time. So if you're just looking at this one layer, I'm a little confused still, because if I were to change the outputs, those would also change how each middle layer of computation actually works. Right? But you're not doing that. So if I'm looking at the connection between layer 1 and layer 8, I'm not necessarily changing the sum for the purposes of looking at layer 7. Is that right?
Arthur Conmy: 1:14:10 Yes. You're totally correct that we do not edit the effect that, in the example, the early attention head has on all the middle components. And to provide an implementation detail which might help to understand the process here: we actually simply cache the corrupted value and the clean value of this early attention head. The benefit of caching is that we can run the forward pass up to this final MLP without having made any changes to that early attention head. But then, once we're at this MLP, we have the cached clean and corrupted values, so we can force this MLP to have essentially seen the corrupted value, even when, in the whole forward pass so far, the early attention head only ever saw the clean input. So it's a matter of caching those two values, just saving them in your Python code, nothing more complicated than that, so that once we're at the downstream node we can do the editing. And this keeps the effect of the earlier attention head on the middle components separate from its effect on this particular MLP.
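Here is a self-contained toy sketch of that caching-and-patching trick, using plain linear layers in place of attention heads and MLPs (an illustration, not the ACDC implementation). The early component's output is cached on the corrupted input, and only the late component's input has the clean contribution swapped out, so the middle component still sees the clean value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
early = nn.Linear(d, d)    # stands in for the early attention head
middle = nn.Linear(d, d)   # a middle component we do NOT want to disturb
late = nn.Linear(d, d)     # stands in for the late MLP being patched into

def forward(x, early_override=None):
    e = early(x)
    resid = x + e                      # residual stream after the early head
    resid = resid + middle(resid)      # middle layer always sees the clean value
    late_input = resid
    if early_override is not None:
        # Only the late component's input gets the corrupted early contribution.
        late_input = late_input - e + early_override
    return late(late_input)

clean_x, corrupt_x = torch.randn(d), torch.randn(d)
with torch.no_grad():
    cached_corrupt = early(corrupt_x)               # "cache" the corrupted head output
    patched_out = forward(clean_x, cached_corrupt)  # clean run, one edge corrupted
    clean_out = forward(clean_x)                    # fully clean run, for comparison
print((patched_out - clean_out).abs().max())        # the late output changes; the middle path did not
```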
Nathan Labenz: 1:15:43 So, again, to make sure I have this straight, what I understand you to be doing for starters is defining how much degradation in performance on our optimization metric we will tolerate, and if the degradation is below that level of tolerance, then we'll cut that portion of the graph. But I was a little confused, because I was thinking: different tasks, different graphs, everything might be so different. Does it make sense to make that the free parameter, or would it make more sense if something else were tracked? This is where my abstract math isn't quite strong enough to always know what's going to work or not. But I found myself wanting to reframe the question a little bit and ask: how sparse can I make this graph? How much can I cut it while still maintaining some overall level of performance? So why set that kind of individual, operation-by-operation threshold as opposed to some global notion of how sparse we can go while still succeeding at the task?
Arthur Conmy: 1:16:50 Yeah. This is a fantastic question, because I would much rather have an algorithm where you choose the sparsity of the graph and it then gives you back a subgraph with that sort of sparsity. The reason that instead we just have a threshold, which measures the amount that a single edge matters, is purely a tractability question: we don't know how to design an algorithm, or at least I don't know at this point in time, which at the end will give me a graph that's sufficiently sparse, let's say, but in the process aims robustly towards that conclusion, because in advance you just don't know which of the huge number of subgraphs will be the one that is best at explaining how the model does a task. And I think this is a general iterative optimization problem in machine learning: really, we would like to specify what we want in the end, but this doesn't give us a tractable way of getting there. If you just ask for a graph with some target sparsity and immediately select a graph with that sparsity, it will in general be absolutely hopeless at the task. So we need an iterative way to get there, and this could be gradient descent, or it could be the ACDC algorithm, which goes node by node. But it is just a pretty hard problem to have that end goal be a useful target throughout the optimization process. So your intuition is completely correct that we would rather specify the level of sparsity of the graph, but instead what we've got is this proxy measure of the local corruption amount that's allowed. It certainly can give you some indication of how sparse the end graph will be, though, because the ACDC iterative process first processes the output node. So once you've run with a particular threshold, you'll be able to see how many nodes are connected to the output, and this provides some guide to how many nodes will be upstream: if you have half of the nodes outputting to the end connection, then it seems like you're going to have a pretty dense graph overall that includes most of those connections. Whereas if it's including just two of the nodes and that wasn't enough for you, because you thought there were more components that mattered, you could adjust the threshold. So while sadly we don't have a hyperparameter which can give us the exact sparsity of the end result, the early behavior of the algorithm does give you some indication of how sparse the end result will be. But yeah, it's a great concern. I hope people in the future can make a version which gives graphs of a specified sparsity, but for now, we don't have a method of doing this.
Nathan Labenz: 1:20:21 So for now, you're kind of sweeping through the parameter space for what that threshold should be, and then eyeballing initial results and saying, well, when I set the threshold such that it tolerates a lot of degradation, everything got cut, and it looks like this has gone too far; versus if I set it to only tolerate a tiny bit, then you might see something that still looks super dense and feel like it hasn't gone far enough. And so you're using a certain amount of taste to figure out where in that trade-off you want to be.
Arthur Conmy: 1:21:01 Yeah. I think there's a good distinction here between two ways in which we tested the ACDC algorithm. One thing that happened was validation tests, which swept over a huge number of parameters to see what the best performance was for this algorithm in different regimes, from the very sparse regime to the pretty dense regime. But then we also tested at least one use case where this was used in practice by researchers who were trying to find a particular behavior of a model and where it was computed in the graph. And in this case, you're not so worried about sweeping over all possible parameters. You're looking for a revealing subgraph, which helps you begin your research into how the model is doing a particular task. So I think there are two modes there: in machine learning work, you need to validate that your technique is actually helpful for recovering circuits, but then in practice you can do an early stopping, essentially, once you've found something which reveals how the model does a certain task.
Nathan Labenz: 1:22:25 Yeah. The two ways that you validated the results are super interesting to me as well. I was a little bit surprised by the order of presentation, not that that's the most important thing. But when I think about how I would know, having come up with this technique, how would I demonstrate that it actually works? To me, the obvious thing is to show that it can continue to do the original thing as well, or not so much worse, or maybe even in certain circumstances better, than the fully dense graph. That part makes total intuitive sense to me. But is there a reason that you prioritized the other one for discussion earlier? The other one being looking at what folks like Neel Nanda have actually found through their own non-automated, painstaking approaches, and comparing what the ACDC algorithm isolates to what they isolated by hand. Why was that the first place to go for validation?
Arthur Conmy: 1:23:21 Yeah, that's a good question. So just to clarify the experiment that was performed: it was an experiment to see how well the ACDC technique, as well as differing techniques that we repurposed from the literature, were able to find the circuits that previous work found. I don't think it actually included an example from the Neel Nanda line of work, but it did include, for example, the greater-than year example that we discussed a lot. And I think the reason we chose this measure, of how much our technique reproduced the work that practitioners had done, was that our motivation was to make something which is helpful for mechanistic interpretability research, which is the first step in the process of actually giving semantic meaning to the different components in models and what these components are doing. And so I think it is helpful to get some indication of how performant your subgraph is, which you mentioned as the second evaluation that we chose. But the purpose of the ACDC algorithm was certainly something that we hoped practitioners would use to explain models, rather than to get models that are just really good at predicting years, because this is not something which is actually useful to people. Like, how good is your model at predicting years? So our priority was the first step on the path to understanding the semantic meaning of components. And high performance is maybe correlated with that, but it's not as direct as just finding the important components that researchers had in practice found were the semantically meaningful components. So I hope that makes sense as the two different evaluations and why we were excited about reproducing previous work. Though it is certainly flawed, and we can get into that if need be.
Nathan Labenz: 1:25:31 If I understand, because it's kind of an audience-driven thing where your goal is to create a tool that will be adopted by interpretability researchers and to convince them that this is actually meaningful, you wanted to show that you could recreate earlier results that they all know about and hold in high regard.
Arthur Conmy: 1:25:52 Yeah. I think it's true that it's an audience problem, but I want to clarify that I'm not overselling this approach. In terms of an approach for finding the best sub-circuits that do different behaviors, like producing correct years, this would probably not be a very competitive approach, because we're doing these interventions on the edges that involve a substantial amount of caching and recomputation, which would be inefficient compared to other ways you could elicit model capabilities, because it's just a substantially larger amount of compute. So that's just not really the area that we are competing in to make a good technique. We're competing for something different, which is simply the discovery of the semantically meaningful components. So I'm not overselling my work. I don't think it would be a very good algorithm for getting very good subgraphs at particular tasks, but that was never the goal either.
Nathan Labenz: 1:26:59 So that would contrast with the Ziming Liu work from the Tegmark group that we talked about earlier, where they are taking a different approach, modifying the loss function during the training process to create sparse networks by design. Is that kind of what you're contrasting against? Like, that would be the approach to finding the sparsest network that can do a task, and you are instead trying to create a tool that can also do that, but is doing it downstream of this very messy training process. So it's not really optimized for the best possible circuit, but it's optimized for finding what circuits do in practice exist given current techniques.
Arthur Conmy: 1:27:41 Yeah. I think this is a useful distinction between post hoc interpretability, which our work is an example of, and training-process interpretability, or selecting for interpretability, which I think the Ziming Liu work is a great example of. So here, we're assuming we have some fixed model and it's a black box, and we want to open up the black box to understand what's happening within it. That's just the premise of the work. But complementary work, a different direction you could take, is designing training processes that incentivize interpretability, through modular structure in that example. And I just think these two approaches complement each other, because on one hand, if the architectures that we were studying were more modular, this would make mechanistic interpretability much easier. But at the same time, work that's building into the loss function some hopeful notion of interpretability does need to be validated to actually be interpretable down the line, because models can learn strange solutions which appear interpretable but are not in reality as interpretable. And some work by Anthropic on a technique called SoLU, an activation function used instead of ReLU or GELU, is actually an example where the researchers tried to choose a training process which led to a more interpretable model but found that the model was hiding its superposition, in this case via confusing routes that they had created by introducing that new technique. So, to reiterate, the two paradigms of post hoc interpretability and designing training processes for interpretability are complementary and different in terms of the approaches that you would take to try and reach both of those goals.
Nathan Labenz: 1:29:53 It's just important to keep in mind that sparsity does not necessarily mean it's super interpretable, and certainly doesn't mean that it's generalized in a way that we would consider to be grokking or representative of some more fundamental, non-stochastic-parrot understanding. You could have sparsity and still have all those other problems at the same time. Cool. Well, then let's talk a little bit, you alluded to it for a second, about the compute that goes into this. How much compute did this take? What kind of resources do you need? How accessible is this kind of stuff? Do the techniques that you have today scale up to large-scale models if you just have enough compute, or is there not enough compute in the world to apply something like this to a GPT-3? Tell me about all the compute considerations with this line of work.
Arthur Conmy: 1:30:49 Sure. So this work was done with compute that was mostly from FAR AI, a research group that one of the collaborators, Adrià Garriga-Alonso, works with. And this was not a super cluster from one of the huge labs that we were working with. We could see practitioners get results on the GPT-2 small language model in half an hour or an hour of runs when they worked super well and were pretty sparse. Though there are definitely cases where compute is somewhat of a bottleneck to ACDC, and particularly to scaling it. So the two cases that come to mind are, first, that when you don't select the threshold appropriately and you include lots of edges, then you tend to have to search through every single node of the computational graph, and since your computation grows with the number of nodes that are present, this becomes extremely expensive as your technique includes and searches through each node. So the first case is when you don't choose the correct threshold, and this sometimes can be frustrating and then leads to slow runs and slow feedback loops, which we hope future work will be able to improve on.
Nathan Labenz: 1:32:19 GPT-2 small is how many parameters? Like, 10 million?
Arthur Conmy: 1:32:23 Yeah. So that's the second sort of worry about the compute: GPT-2 small is a roughly 100 million parameter model. So this is large compared to the statistical models or things that people used 10 years ago, but it's incredibly small compared to the hundreds of billions of parameters of GPT-3 and larger models. And because our technique is iterative over all the edges, which in fact scale almost with the square of the number of nodes involved, currently this is not feasible at all for a GPT-3 size model, and isn't really even very efficient for models that are at billions of parameters. So we're not really able to scale up an order of magnitude beyond GPT-2 small at present. I, at least, am excited for further interpretability research to hopefully scale to those sizes, and already people have done some interpretability work on the Alpaca model, a 7 billion parameter model. And I know of a number of follow-up works to this work that could plausibly be able to scale up to that size while automatically finding circuits. So it's an open problem, essentially. I would like to see more work on it.
Nathan Labenz: 1:33:42 Yeah. Interesting. So even with GPT-2 small, though, I assume we're not using the individual neurons as the nodes. Right? So there's still some zooming out. So could you give an intuition for how that number of nodes scales with model size? It seems like there's almost a different scaling law, or a different scaling intuition, that one needs to develop here. Right? Because it seems like more layers would definitely make a big difference. So even with a certain number of parameters, depending on the width and the number of layers, you could maybe set things up where the number of edges to consider actually could vary quite a bit.
Arthur Conmy: 1:34:31 Yeah. That's a good point. I hadn't even considered that in the discussion: in fact, almost all of the research in the main text of our paper focused on the abstraction level of the important attention heads and the important MLPs in these large language models, rather than being more specific, down to the individual neurons, or less specific, at the level of whole attention layers. I was solely talking about the abstraction level of the individual attention heads and MLPs and the connections between them, including, in fact, the individual query, key, and value parts that are the inputs to attention heads, for example. We were able to isolate those. And I think this roughly mirrors the pace of progress of people's interpretability projects, because, after all, it was just last year that I was fortunate to work with collaborators on the IOI, or Interpretability in the Wild, paper, which was the first work that was able to reverse engineer a circuit inside GPT-2 small. And then this year we now have the greater-than circuit in GPT-2 small, for example. These works are both at this attention head and MLP level, which is the point at which we can do experiments on GPT-2 small with ACDC. I think it would still be too slow if you were looking at individual neurons, because there would be too many connections, but the existing interpretability research hasn't really been able to understand these GPT-2 models at the neuron level anyway. So yeah, some sort of bad news and good news there, I suppose, for understanding GPT-2 neurons with ACDC.
Nathan Labenz: 1:36:35 Even keeping that level of abstraction, where the focus is on the attention heads and the MLP blocks, if you were to try to take the leap of 1000x the parameters, right, from order of magnitude 100 million on GPT-2 small to order of magnitude 100 billion on Llama 2 or GPT-3 or what have you, how does the compute requirement of this process scale? Does it go as the square of the increase? Like, does a 1000-fold increase in parameters end up being a million-fold increase in compute?
Arthur Conmy: 1:37:15 Yeah. I think it's, on average, slightly worse than scaling with the square of the number of nodes or parameters you're introducing. Obviously nodes and parameters are not the same, but this is because as you increase the number of nodes by a factor of two, say, you're roughly increasing the number of edges in the graph by a factor of four, because these networks are highly connected. It turns out that your layer 0 heads have an impact on almost all downstream layer heads. And so as you increase the number of nodes by a factor of two, you roughly increase the number of edges by a factor of four, and because this algorithm is iterative over each of the edges, this leads to the quadratic increase. And then your whole forward pass is now more expensive as well, because you're dealing with a bigger model, which accounts for something on top of the quadratic cost of more iterations. But it turns out that the process of just iterating over each edge is the slow part, the bottleneck, rather than the forward pass cost. So that turns out to be the biggest bottleneck.
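As a rough back-of-the-envelope illustration of that quadratic-in-nodes scaling (the specific node counts and scale-up factor below are illustrative assumptions, not figures from the paper):

```python
# Node count for GPT-2 small at the head/MLP abstraction level: 12 layers x 12
# heads + 12 MLPs. The 30x scale-up factor is purely hypothetical.
nodes_small = 12 * 12 + 12
scale_up = 30
nodes_big = nodes_small * scale_up

# In a densely connected DAG the number of edges grows roughly with nodes^2.
edges_small = nodes_small * (nodes_small - 1) // 2
edges_big = nodes_big * (nodes_big - 1) // 2

print(edges_big / edges_small)  # ~900x (about scale_up^2) more edge interventions,
                                # before counting the more expensive forward passes
```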
Nathan Labenz: 1:38:34 In practice, how easy have you guys made this today? For somebody who obviously has an interest in the subject, but I definitely feel like I have some weaknesses when it comes to notation and I'm not the greatest at managing a ton of indexes. There's a lot of indexes to manage when you're doing this kind of work. Is the work that you've put out developed enough where somebody like me can actually get in there and do it? Or how much of a burden still remains for the kind of casual investigator to get in and start to figure some stuff out?
Arthur Conmy: 1:39:12 Yeah. We tried to make our library well written for all practitioners by building it on top of Neel Nanda's mechanistic interpretability library called TransformerLens. So TransformerLens, which you can find on GitHub, is a library which makes mechanistic interpretability of generative language models far easier than the default implementations in Hugging Face, for example, or in online tutorials. It was originally developed by Neel Nanda, who again has helped a lot with making this mechanistic interpretability field easy to skill up in, with a bunch of tractable research directions, so thanks to him for this resource. But now the ACDC library can load any of the models that are in the TransformerLens library as a computational graph, with all the connections between the nodes as different edges. So this includes the whole GPT-2 line of models, as well as a bunch of the smaller toy language models released by EleutherAI, such as the Pythia models, and a bunch of toy models that have different activation functions, such as the GELU and SoLU activation functions. So out of the box, you can use this thing with a ton of language models that are available to mechanistic interpretability researchers in the TransformerLens library. And this includes models like the Llama models, but we think that ACDC will probably be a little bit too slow on these larger models for now, and so we're excited to see future work scale it up.
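For a flavor of the TransformerLens workflow Arthur describes, here is a minimal getting-started sketch; the exact function names and tensor shapes below are recalled from the library and may differ between versions, so treat them as assumptions and check the TransformerLens documentation before relying on them.

```python
from transformer_lens import HookedTransformer

# Load GPT-2 small as a "hooked" model whose internal activations are all exposed.
model = HookedTransformer.from_pretrained("gpt2")

tokens = model.to_tokens("The war lasted from the year 1517 to the year 15")
logits, cache = model.run_with_cache(tokens)  # cache activations at every hook point

# Circuit-discovery methods like ACDC read from and patch into caches like this one,
# e.g. the per-head attention outputs at layer 0 (assumed shape: batch, pos, head, d_head):
print(cache["blocks.0.attn.hook_z"].shape)
```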
Nathan Labenz: 1:41:06 If all goes according to plan, we're presumably going to see people starting to isolate a lot of subgraphs. What would you say, then, is the state of our ability to actually make sense of these subgraphs? My understanding is that it remains a very artisanal sort of process, with its own workflow of trying to figure out what algorithm a subgraph really instantiates, and whether that constitutes understanding. All of that is kind of in the eye of the beholder. There are these various techniques around editing and looking for behavioral change, and there are probes to try to figure out what internal states actually map onto real-world states that maybe the model didn't even necessarily see in its training data. But in general, do we get to a satisfying conclusion for most of these subgraphs that we identify, or not so much?
Arthur Conmy: 1:42:10 Yeah. I think this is mostly an open question, and I'm excited to see work that provides evidence either way, hopefully for the interpretability of these raw subgraphs of huge numbers of attention heads and MLPs, or that they are somehow misleading or confusing and not useful, because that would be a useful piece of evidence that mechanistic interpretability is hard. So far, in the use case we found, a practitioner used this to find out whether the GPT-2 small model was able to produce completions that were the correct, or expected, gender for different names in a sentence. So it would turn ordinary women's names into the she pronoun, which was essentially just the bias of the model to expect that as a completion, and ordinary male names into he, again a bias of the model to produce that completion. It revealed this structure where the model was aggregating information about the name at a surprising position, the token right after the name. So the model will just take the name information and move it to the next token position in the residual stream, so a different residual stream, and that is what gets funneled downstream into the expected gender completion, the he or she completion, based on the biases in the training data. And so this was a case where the ACDC algorithm could have given a confusing mess as to how the model did this particular pronoun completion, but actually it was fairly interpretable: oh, it was aggregating information on this position that was the token after the name, and the researcher could clearly see that through a bunch of the MLPs and then could draw that conclusion, which was certainly a nontrivial conclusion and would have taken a long time to find by hand, since it's a computation which occurs in the internals of the model. It's not a function of the inputs and it's not a function of the outputs; it's just this internal position which matters a ton. And so in the limited examples we had so far, it turned out to be a pretty easy process, but I expect there are definitely cases where it's much harder, and I'd like to see further evidence on whether it's in general a lot easier or in general still quite hard.
Nathan Labenz: 1:44:47 Fascinating. I'm trying to envision that, and I certainly appreciate the importance of these internal states, which some might be bold enough to call emergent properties or emergent behaviors. Do you have a take on, I guess I'll just call it, the emergence discourse?
Arthur Conmy: 1:45:06 Yeah. On emergence, it's definitely a concept which is attractive to talk about because of its connection to unpredictability and the longer-term worries that new AI systems will be qualitatively different from current AI systems. But I do think it is often a question of which metric you choose to measure your property under. So abilities of large language models often seem emergent when we look at token completions: our billion parameter model, for example, can suddenly do three-digit addition. We give it three-digit addition sums, and it is now suddenly able to generally produce the correct answer, whereas the 100 million parameter models produce just rubbish on the same inputs. And this feels to us like something that's emergent, because suddenly the model's great at this, and previously it was absolutely hopeless. But often, when you hear these statistics or read these papers about the emergent capabilities of models, they're solely looking at a particular metric, in this case the probability that the model gives the correct addition completion. And actually, language models, and models generally, are trained on the logarithm of the probability that the model gives to certain completions. And so follow-up research has found, it's been called a mirage in one paper, that once you're looking at the logarithm of the probability that the model gets the correct three-digit addition sum, for example, then progress looks really smooth, and it just increases gradually in the log of the number of parameters of the model. But it just happens that exponential growth is extremely fast, and so at one moment you're at 1% likelihood of producing the correct addition sum, and then suddenly you're multiplying by 50 or whatever and you're at 50%, and this looked like a qualitative change that came out of nowhere, but really you were just staring at the wrong metric. So my broad take is that for now we are not very good at finding the right metrics to measure models under, and so we resort to just looking at their outputs and sampling what happens. Even the best evaluations that exist, from the team at the Alignment Research Center who found a bunch of somewhat dangerous capabilities of GPT-4, still in general used the technique of just looking at what the model's outputs were. And we should expect that these probabilities of completions grow exponentially along the scaling curve, because we train on the logarithm of the probability of the completions. And so I think that currently we're likely to see more emergence, but it's mostly because we're looking at the wrong metrics. And I'm certainly excited about digging deeper into the internals of models through interpretability or other methods, because by default I expect we'll see emergence, but we could do so much better.
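A toy numerical sketch of that point, with entirely made-up numbers: if the per-token log-probability of the right answer improves smoothly with scale, the probability of getting a multi-token answer exactly right can still look like a sudden jump.

```python
import math

# Smooth, evenly spaced improvement in per-token log-probability across model scales.
log_probs_per_token = [-8.0, -6.0, -4.0, -2.0, -1.0, -0.5]
answer_len = 3  # suppose the correct answer spans 3 tokens

for lp in log_probs_per_token:
    p_exact = math.exp(lp * answer_len)  # probability that every token is correct
    print(f"log-prob/token {lp:5.1f} -> exact-match prob {p_exact:.6f}")

# The left column improves in even steps, but exact-match probability sits near zero
# for most of the sweep and then appears to "emerge" at the largest scales.
```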
Nathan Labenz: 1:48:38 Yeah. That's fascinating. I've been kicking this question of emergence around from a bunch of different angles as well and trying to figure out, first of all, what matters. And I guess one way to maybe think about what matters, tell me if you see this differently, is just asking how practical it is to zoom in on these things in the process of training. Because my intuition is that what really matters for users, society, and companies is, at the end of a training process, what can a model do or not do? How general is that capability? And is it grokked in some way that reflects a meaningful understanding, or is it still just stochastic parroting? That seems like the key thing that matters most. It does seem to be true, per that mirage paper, that if you find one of these surprises and then rewind and say, well, okay, let me actually measure that performance at every increment of the training process, then you can plot a smoother curve. And it seems like there's this kind of phase change that is often happening between a correlation paradigm and a more algorithmic paradigm, where one is dropping while the other is rising, and that does, from what I've seen, often take an order of magnitude more training to play out, or sometimes even more. But it seems like it's still going to be really hard if you're training a system like GPT-4. First of all, as you're training it, you don't know what things will emerge that you could then later come back and plot a smooth curve on, so you're going to have a hard time knowing what things to even look at incrementally along the way. And then there's just the compute tax of that: if you were to say, every batch, I want to run a million diagnostics and benchmark a million things at every step, that becomes massive overhead. So I look at that mirage paper and I'm like, you definitely found something that is quite helpful to understanding what's going on in the training process. But from the standpoint of society, or even model developers, it doesn't feel like it allows us to get around this problem that we don't know what's going to come out of the model at the end of a big training run, or at least not without a significant overhead imposed on the training process. Would you challenge anything there or correct me on anything?
Arthur Conmy: 1:51:24 No. I definitely agree that it seems an extremely difficult problem to predict what are essentially unknown unknown capabilities. We don't know how far training on predicting next words, and then maybe being RLHF fine-tuned on top of that, gets us in the limit. How many capabilities does this actually get us? Will this be competitive with the best humans at maths, for example, or will it never reach anywhere close to a graduate maths student? I don't know what the answer is here, and there are so many other tasks in this ballpark that plausibly could emerge or plausibly can't; we just don't know where they'll come from. But I am more concerned about mostly known unknowns in the space of different evaluations of things models can do. As an example, a lot of AI safety research has established that there are often convergent instrumental goals that models will have. So if the training target involves one of a huge number of objectives, it is useful for the model to gain power or resources to achieve those goals, because power and resources are just very helpful for a huge number of things that the model could want to do, such as convincing people to send certain messages on the internet or acquiring certain objects on the internet or something. You would like to have more money and more influence to get those things. And so my take here is that we need to look at these known unknowns to have helpful evaluations and predictions of different model capabilities. We can think about these known unknowns as concepts which theoretically are likely to emerge through sufficient training because of the instrumental convergence arguments, even though current models don't do very much of this, seeking power essentially. So once we restrict to a certain number of capabilities that seem like they could be quite dangerous if we have future powerful AIs, then I hope that we can develop better evaluations to figure out how close or how far our current models are from gaining these certain dangerous capabilities that are known purely from the theoretical angle for now.
Nathan Labenz: 1:54:16 This debate kind of always comes up around interpretability. Maybe not always, but I personally just find it fascinating work that I'm very curious about, independent of its consequences. For me, it passes the "interesting on its own merits" test. But as I mentioned at the top, to me it feels like it's a pretty promising path to safety. It seems like you're sketching out a vision where the holy grail of mechanistic interpretability, at least for safety purposes, would be to figure out how models might implement some of these most concerning behaviors and then be able to detect that, the formation of those subgraphs, in the training process. That would be the dream scenario. Right? Anything to add to that?
Arthur Conmy: 1:55:09 Yeah. This sounds exactly like what I think of as a speculative but incredibly beneficial application of the mechanistic interpretability techniques that I and a number of other researchers and my collaborators have worked on. So I agree with this characterization, and I will just point out that I'm well aware that this problem that's been sketched out, where there are known theoretical dangerous capabilities that powerful AI systems could have, can definitely be approached with other approaches to safety. We don't need a mechanistic understanding of AIs to be able to hopefully steer them away from dangerous capabilities, or at least to know when the dangerous capabilities are present. But it's certainly the case that mechanistic interpretability has a uniquely specific approach to isolating and understanding those capabilities, because it would hopefully be able to explain those capabilities in terms of their exact location in models and the exact reasons why the capability emerged, rather than just a litmus test that goes positive or negative for whether the capability is there. So that's the wider dream of interpretability with regard to applications and safety.
Nathan Labenz: 1:56:28 What do you make of the argument that I do sometimes hear that it's like, yeah, everything's kind of dual use, and yeah, we can understand this stuff better, but that's also just going to feed into accelerating the increasing power of systems in general, and so maybe it's not so good? I don't find that super compelling. I don't really have a great knockdown reason for it, other than that I don't know what else to do but try, because it certainly seems like everything is progressing regardless. Right? So I wouldn't pin the potential for a runaway loss-of-control scenario on mechanistic interpretability by any means.
Arthur Conmy: (1:57:08) Yeah. I want to be careful because I guess we both have a similar opinion here that I also don't find the arguments for the danger of mechanistic interpretability research are extremely compelling. And since we both sort of perhaps had this opinion, I don't want to misrepresent the opposite view. But to me, it seems like the vast majority of capability gains in machine learning that have been relevant to the development of the most powerful systems have not come from advances in transparency or insights about how models work. There's a great discussion of this exact question in an alignment forum research post on pragmatic AI safety, which discusses that you can just survey where capability gains to vision models in machine learning and to language models in machine learning have come from. And the vast majority have come from basically engineering hacking to find something which works slightly better than the alternatives, while no one really understands why this works slightly better than the alternatives, such as picking a loss function that's just predicting the next token that turns out to work really well at absorbing capabilities, or in the RLHF process to pick a reward, which just chooses between 0 and 1. It's just a preference between one and the other thing. To me, these things didn't arise from a deep understanding of how to model language or how to model human preference. But as far as I understand it, arose from trying a number of alternatives and then eventually selection pressure leading to these being the best of the bunch. And so under this worldview of progress in machine learning, I think that currently mechanistic interpretability is very unlikely to contribute to the bulk of further performance improvements in machine learning models. I guess that's my first disagreement with the perspective that mechanistic interpretability could be dangerous for its dual use to making ever more powerful AI systems. And then my second disagreement with the perspective that mechanistic interpretability could be harmful overall is that I think that mechanistic interpretability, if it works, this is all premised under it being useful because currently we haven't found a stellar application to the models that matter, but we hope we can get there. The second reason that I think it has a greater positive side to a negative side is that it plausibly gives us a way of designing and understanding AI systems in a different way to the current understanding of systems such that we could develop maybe more powerful AI systems. This is the worry, but they would actually be understandable to us. We would understand how these AIs are computing the outputs that they're processing from inputs. And to me, this may involve more powerful AIs, but would substantially reduce the risks of deploying these systems because a lot of the risks from the alignments of AI systems come from being able to specify your objective and trying to get something from an AI system, but not understanding the process through which the AI system achieves that end goal. And this essentially is the alignment problem that specifying the end state isn't enough because it either is really hard to specify that end state as an outer alignment problem in the jargon, or the AI system learns a solution which was just totally unintended and maybe internally optimizes that is this inner alignment problem, even if you chose the right goal. 
But to me, interpretability and mechanistic interpretability could be a way if it can work to develop AI systems where we understand that middle process between our specification of the goal of the system and the AI system being able to actually execute and achieve that goal. So that's my 2 reasons for being optimistic about the impact of mechanistic interpretability research. Arthur Conmy: 1:57:08 Yeah. I want to be careful because I guess we both have a similar opinion here that I also don't find the arguments for the danger of mechanistic interpretability research are extremely compelling. And since we both sort of perhaps had this opinion, I don't want to misrepresent the opposite view. But to me, it seems like the vast majority of capability gains in machine learning that have been relevant to the development of the most powerful systems have not come from advances in transparency or insights about how models work. There's a great discussion of this exact question in an alignment forum research post on pragmatic AI safety, which discusses that you can just survey where capability gains to vision models in machine learning and to language models in machine learning have come from. And the vast majority have come from basically engineering hacking to find something which works slightly better than the alternatives, while no one really understands why this works slightly better than the alternatives, such as picking a loss function that's just predicting the next token that turns out to work really well at absorbing capabilities, or in the RLHF process to pick a reward, which just chooses between 0 and 1. It's just a preference between one and the other thing. To me, these things didn't arise from a deep understanding of how to model language or how to model human preference. But as far as I understand it, arose from trying a number of alternatives and then eventually selection pressure leading to these being the best of the bunch. And so under this worldview of progress in machine learning, I think that currently mechanistic interpretability is very unlikely to contribute to the bulk of further performance improvements in machine learning models. I guess that's my first disagreement with the perspective that mechanistic interpretability could be dangerous for its dual use to making ever more powerful AI systems. And then my second disagreement with the perspective that mechanistic interpretability could be harmful overall is that I think that mechanistic interpretability, if it works, this is all premised under it being useful because currently we haven't found a stellar application to the models that matter, but we hope we can get there. The second reason that I think it has a greater positive side to a negative side is that it plausibly gives us a way of designing and understanding AI systems in a different way to the current understanding of systems such that we could develop maybe more powerful AI systems. This is the worry, but they would actually be understandable to us. We would understand how these AIs are computing the outputs that they're processing from inputs. And to me, this may involve more powerful AIs, but would substantially reduce the risks of deploying these systems because a lot of the risks from the alignments of AI systems come from being able to specify your objective and trying to get something from an AI system, but not understanding the process through which the AI system achieves that end goal. 
Nathan Labenz: 2:01:58 When I try to envision what that might look like, a future that combines better understanding, hopefully better control, but also increasing power, and maybe more power per unit of compute, the first thing that comes to mind is a mixture-of-experts, or mixture-of-sparse-experts, sort of architecture. I'm imagining something like the Ziming Liu paper we've talked about a couple of times, which creates these very small, very sparse, almost crystalline-looking subgraphs, scaled up, with some mechanism where you've got a lot of those modules and only use a certain number at a time. Then you could see which modules were loaded in to handle a particular case and ask what those modules do. That seems potentially pretty promising to me. How does that relate to your, obviously still somewhat vague, vision for what might eventually come online?
Arthur Conmy: 2:03:05 Yeah, I think this is a nicely concrete example of a really ambitious goal of interpretability, where the whole architecture of the forward pass can be understood by a human, or at least where high-level concepts, like the routing to a particular expert, have some meaning to humans. And I think it's possible that we can get to this stage with mechanistic interpretability. But it's worth noting that even if this fails pretty badly, interpretability of narrow tasks is still possible, like the power seeking in certain scenarios that we mentioned. There could plausibly be a circuit which does this particular power-seeking task, and having understood that circuit in the network, we can understand why the training process arrived at this solution, or we can just remove that circuit entirely before we deploy the system. Ideally my picture of the future isn't "oh man, we have this misaligned system, but we'll just remove that part and deploy it anyway." But I think this is a graceful degradation of the ambitious goal of having a whole architecture that makes sense: an understanding of the dangerous capabilities, so we can at least remove those dangerous capabilities even if we don't have an understanding of all the capabilities of the model.
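To make the "remove that circuit" idea concrete, here is a minimal sketch of what knocking out a small circuit of attention heads could look like, using the TransformerLens library. The specific heads listed are hypothetical placeholders, not a real dangerous-capability circuit; in practice they would come from a circuit-discovery method such as ACDC rather than being hand-picked.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

# Hypothetical "circuit" as a set of (layer, head) pairs. Placeholders only:
# a real list would come from a circuit-discovery method such as ACDC.
CIRCUIT_HEADS = [(9, 6), (10, 0)]
LAYERS = {layer for layer, _ in CIRCUIT_HEADS}

def ablate_circuit_heads(z, hook):
    # z is the per-head attention output, shape [batch, position, head_index, d_head]
    for layer, head in CIRCUIT_HEADS:
        if hook.layer() == layer:
            z[:, :, head, :] = 0.0  # zero-ablate this head's contribution
    return z

fwd_hooks = [(f"blocks.{layer}.attn.hook_z", ablate_circuit_heads) for layer in LAYERS]

tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
with torch.no_grad():
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
# Comparing clean_logits and ablated_logits shows how much the behavior depended on those heads.

Zero-ablation is the crudest possible intervention; mean-ablation or patching in activations from a corrupted prompt follow the same hook pattern, just writing different values into z.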
Nathan Labenz: 2:04:42 Well, you've been extremely gracious with your time. I have maybe one more question, and I'll give you a chance to touch on anything else we haven't covered. What are you looking at? I try to keep my eye on the horizon in my self-proclaimed role as AI scout, always looking for what's happening that maybe isn't being talked about a ton yet but seems to have transformative potential. Are there things that you see right now, or maybe haven't seen yet but are keeping an eye out for, that you think could change the game, so to speak? Either on the side of making things a lot easier, so that maybe something like the Ziming strategy becomes mainstream and models become much easier to interpret, or on the flip side, you mentioned that recurrence makes things harder. There was a paper in the last few days, I believe, out of Microsoft Research around retention. They propose a somewhat different mechanism, which I don't really understand yet, but they are bold enough to call it a possible successor to the transformer. It seems like there's potential for divergent paths here, and this may be an indication that the history, in its particulars, could end up really mattering. You can imagine, and it seems to me very likely, that there are multiple very viable architectures to be found, just starting from the fact that we have the human brain that works pretty well and the transformer that works pretty well; there are probably other things that will work pretty well too. And it seems like some of those things may be much more or much less amenable to being understood. So I wonder what you're keeping your eye out for in terms of things that could shake the snow globe or rearrange the game board in a substantial way?
Arthur Conmy: 2:06:31 Yeah, that's a really good question about looking forward: what am I thinking about and looking for, and what is worth having on your radar? In terms of mechanistic interpretability and interpretability research, a common theme which I expect to be part of a lot of the next generation of contributions to the field is higher-level motifs in language models. We spoke a lot today about the circuit framework: we break up a large subgraph into the individual components, like attention heads and MLPs, that are given to you by the architecture. You read what a transformer is, and then you learn, okay, it has attention heads and MLPs. But in my current work, and I've heard of a number of other groups doing the same, we're trying to go beyond these abstractions of just the heads and the MLPs in the model to look for higher-level motifs. To be concrete, in work I've been doing, you can find that certain behaviors, such as the suppression performed by these negative heads in GPT-2 small, generalize to the whole distribution of training text. You can find this motif, which we call copy suppression in upcoming work, across the whole training distribution rather than just on narrow tasks, and it's distributed across several different heads rather than living in a single head. I know of a number of other groups who are also going beyond the circuit paradigm, where you explain different model behaviors in terms of given components like attention heads and MLPs, and are instead aggregating different components and weighting them in clever ways to build higher-level motifs, of which there are very few examples in the literature at this point. I expect that this is the next big phase of mechanistic interpretability research: going beyond narrow circuits and low-level details to high-level motifs about how these large language models are doing computation. So, yeah, I'd stay on the lookout for higher-level motifs that occur across or between different model components. That's my personal current direction and what I'm excited to see other groups work on.
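As an illustration of what checking that a behavior generalizes to the training distribution can look like in practice, here is a minimal sketch, again using TransformerLens, that compares average next-token loss on a text sample with and without a single attention head. The layer and head indices and the tiny text list are placeholders for illustration, not results from the copy suppression work.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # placeholder head to study

def zero_head(z, hook):
    # z: [batch, position, head_index, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

def avg_loss(texts, fwd_hooks=()):
    losses = []
    for text in texts:
        tokens = model.to_tokens(text)
        with torch.no_grad():
            loss = model.run_with_hooks(
                tokens, return_type="loss", fwd_hooks=list(fwd_hooks)
            )
        losses.append(loss.item())
    return sum(losses) / len(losses)

# Stand-in corpus: a real check would use a large sample of training-like
# web text, not a handful of sentences.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "In 1969, astronauts first landed on the Moon.",
]
clean = avg_loss(texts)
ablated = avg_loss(texts, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)])
print(f"Loss increase from ablating head {LAYER}.{HEAD}: {ablated - clean:.4f}")

A head whose ablation barely moves the loss on broad text but breaks a narrow task is task-specific; one whose ablation moves the loss everywhere is a candidate for a distribution-wide motif of the kind described above.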
Nathan Labenz: 2:09:25 Cool. I love it. We will certainly keep an eye out for that. You've got the motifs notion, and that's a different sort of thing that you're looking for. Is there also a frontier in the automation of that, or just better automation generally, that you think will be a driver of a lot of value?
Arthur Conmy: 2:09:42 Yeah. I think that automating the discovery or explanation of motifs is probably quite a lot harder than the default path of just finding narrow circuits. But there are a number of efforts which could be scaled up to either work with ACDC or go off on their own, if they work out particularly well, to explain motifs. One example would be the OpenAI research where GPT-4 is used to explain the neurons in GPT-2. That is a useful complement to automatic circuit discovery, where we're only finding structure, because it assigns semantic meaning to the different components by default. Using language models to try to explain what different components, or even different subsets of a model, are doing is a super exciting approach for understanding how these motifs show up in models, and that is plausibly more difficult with the pure circuit-discovery approach alone.
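For readers who want to see the shape of that OpenAI-style pipeline, here is a minimal, illustrative sketch: collect the snippets where one MLP neuron fires hardest, then hand them to an explainer model to propose a natural-language description. The layer and neuron indices are arbitrary, the tiny text list is a placeholder, and explain_with_llm is a stub for whatever explainer model you have access to; the OpenAI work used GPT-4 and also scored each explanation by simulating the neuron's activations, which this sketch omits.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 5, 131  # arbitrary neuron to try to explain

def top_activating_snippets(texts, k=5):
    """Return the k texts where the chosen neuron's peak activation is highest."""
    scored = []
    for text in texts:
        tokens = model.to_tokens(text)
        with torch.no_grad():
            _, cache = model.run_with_cache(tokens)
        acts = cache[f"blocks.{LAYER}.mlp.hook_post"][0, :, NEURON]  # [position]
        scored.append((acts.max().item(), text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def build_explainer_prompt(snippets):
    lines = [f"(peak activation {act:.2f}) {text}" for act, text in snippets]
    return (
        "A neuron in a language model activates strongly on the snippets below. "
        "Propose a short hypothesis for what the neuron detects.\n" + "\n".join(lines)
    )

def explain_with_llm(prompt):
    # Stub: call your explainer model of choice here and return its answer.
    raise NotImplementedError

texts = ["The cat sat on the mat.", "Paris is the capital of France."]  # placeholder corpus
prompt = build_explainer_prompt(top_activating_snippets(texts))
print(prompt)  # feed this to explain_with_llm once it is wired up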
Nathan Labenz: 2:11:03 I love it. There's so much for us to continue to explore and learn about, and you've given us a great tour of one corner of the world, but we've got a lot more work to do. So Arthur Conmy, thank you for being part of the Cognitive Revolution.
Arthur Conmy: 2:11:18 Thanks so much, Nathan. Thank you for having me on. It's been a pleasure.