Universal Jailbreaks with Zico Kolter, Andy Zou, and Asher Trockman
In this episode, Nathan sits down with researchers at Carnegie Mellon studying adversarial attacks and mimetic initialization: Zico Kolter, Andy Zou, and Asher Trockman.
Watch Episode Here
Read Episode Description
In this episode, Nathan sits down with researchers at Carnegie Mellon studying adversarial attacks and mimetic initialization: Zico Kolter, Andy Zou, and Asher Trockman. They discuss: the motivation behind researching universal adversarial attacks on language models, how the attacks work, and the risks of these jailbreaks. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive
TIMESTAMPS:
[00:00:00] - Introducing the podcast and guests Zico Kolter, Andy Zou, and Asher Trockman
[00:06:32] - Discussing the motivation and high-level strategy for the universal adversarial attack on language models
[00:09:33] - Explaining how the attacks work by adding nonsense tokens to maximize target sequence probability
[00:11:06] - Comparing to prior adversarial attacks in vision models
[00:13:47] - Details on the attack optimization process and discrete token search
[00:14:53] - Sponsors: Netsuite | Omneky
[00:17:09] - The empirical notion of "mode switching" in the language models
[00:21:18] - Technical details on gradient computation across multiple models and prompts
[00:23:46] - Operating in one-hot vector space rather than continuous embeddings
[00:25:50] - Evaluating candidate substitutions across all positions to find the best update
[00:28:05] - Running the attack optimization for hundreds of steps across multiple GPUs
[00:39:14] - The difficulty of understanding the loss landscape and internal model workings
[00:43:55] - The flexibility afforded by separating the loss and optimization approach
[00:48:16] - The challenges of creating inherently robust models via adversarial training
[00:52:34] - Potential approaches to defense through filtering or inherent model robustness
[00:55:51] - Transferability results to commercial models like GPT-4 and Claude
[00:59:25] - Hypotheses on why the attacks transfer across different model architectures
[01:04:36] - The mix of human-interpretable and nonsense features in effective attacks
[01:08:29] - The appearance of intuitive manual jailbreak triggers in some attacks
[01:11:36] - Adding fluency constraints to adversarial text decreases effectiveness of attacks but increases realism
[01:15:33] - Short-term harms of attacks vs long-term risks
[01:18:37] - Influencing those with incomplete understanding of LLMs to appreciate differences from human reasoning
[01:24:16] - Mitigating risks by training on filtered datasets vs broad web data
[01:29:16] - Curriculum learning as a strategy for both capability and safety
[01:30:35] - Influencing developers building autonomous systems with LLMs
[01:33:19] - Alienness of LLM failure modes compared to human reasoning
[01:35:45] - Getting inspiration from biological visual system structure
[01:40:35] - Initialization as an alternative to pretraining for small datasets
[01:42:35] - Initialization guiding networks to reasonable weight space without pretraining
[01:51:41] - Encoding useful structures like grammars in initialization without training
[02:03:04] - Initialization as an under-explored way to imbue inductive biases
[02:12:10] - Most ideas don't progress to research projects
[02:13:02] - Pursuing ideas based on interest and feasibility
[02:15:14] - Fun of exploring uncharted territory in ML research
LINKS:
Adversarial Attacks Paper: https://arxiv.org/abs/2307.150...
Mimetic Initialization on Self-Attention Layers: https://arxiv.org/pdf/2305.098...
X/Social:
@zicokolter (Zico Kolter)
@andyzou_jiaming (Andy Zou)
@ashertrockman (Asher Trockman)
@CogRev_podcast
SPONSORS: NetSuite | Omneky
NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Music Credit: Stableaudio.com
Music license:
ALP4DDPOHFFJIEMI
Full Transcript
Zico Kolter (0:00) Once the model has started to answer your question by saying, sure, here's how you build a bomb, it follows that with instructions on how to build a bomb. And the reason is pretty obvious in hindsight. Right? It's just saying, look, these models predict text by the most likely word or token at a time. If they've already output, sure, here's how you do it as their response, the most likely thing to follow that is not to interrupt yourself and say, sorry, I can't do this anymore.
Nathan Labenz (0:32) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the cognitive revolution. Today, I'm excited to share a two part discussion with professor Zico Kolter of Carnegie Mellon University and his PhD candidates, Andy Zou and Asher Trockman. In the first part with Zico and Andy, we go deep on their recent universal jailbreak work, exploring both how they did it and what we can learn from the result. As you'll hear, this work is almost the opposite of mechanistic interpretability. If mechanistic interpretability is about studying a model's behavior and trying to understand how it works, this research is about demonstrating that if you have access to a model, you can often corrupt its behavior with fairly simple brute force techniques. And not only do you not need to understand the model's internal logic to do so, but the resulting jailbreaks don't have to make any obvious sense either. In the second part, we cover another of Andy's papers with Asher Trockman, which asks the question, how far can we get by just taking a closer look at the high level patterns and structures that emerge during model pretraining and then just initializing weights with something that looks more or less similar to that. It turns out that this technique can take us pretty far. We spend a lot of time in this conversation getting into the details of how these techniques work. So I think it's also worth taking a minute upfront to flag a few key themes that you might want to keep in mind as you listen. First, note the relationship between an optimization target, often a loss function, and the means of optimizing toward that goal. Because this work is designed to find simple strings of tokens that work across models, they go beyond standard back propagation here. And I think you'll learn a lot from the details of the techniques they used. Then consider too just how weird and unpredictable model behavior can be. That a nonsense string can serve as a jailbreak suggests that the so called loss landscape is really super weird and full of surprises. And that, of course, relates to another major theme of this entire show, which is the fact that with current techniques, developers simply don't have great control over how their systems behave. And they consistently face trade offs where they don't know how to make one aspect of the system behave better without making others behave worse. This is sometimes called the alignment tax. And the fact that Zico and Andy informed all of the major labs about this vulnerability and their plans to publish it, and yet none of them patched the vulnerability before the story came out, suggests that the alignment tax is in practice often nontrivial. Interestingly, this work also suggests a new phenomenon that we might call an alignment externality. I find it really amazing that a technique which can only be developed with full access to model weights still works so well on a variety of black box models. 
If the debates around open source weren't complicated enough already, this work makes it clear that if you are releasing an RLHF model with typical pretraining, you are effectively open sourcing the value neutral base model as well. And further, you might be causing direct harm to commercial model providers, not just by competing with their products, but by exposing weaknesses in their systems. Finally, keep in mind that we are still very early in all of this, and we should continue to expect to see changes outside of current margins and paradigms. We recorded this interview a little bit over a month ago, and since then, I've already seen a number of instances where model developers used a much more carefully curated, often partially synthetic dataset to improve model quality without needing to worry that their models were also exposed to all sorts of toxic content along the way. In the end, I think we have no choice but to admit how little we know and how little we can predict about where the AI technology wave is going. It's possible that some of the problems which currently seem most vexing could simply disappear. But at the same time, some guarantees or to be more precise, some levels of adversarial robustness might very well continue to prove elusive. And even if you put no stock in concerns that AI could get entirely out of control, there is a very real and clear chance that future models have sufficient power to allow bad actors to cause serious harm. In any case, the time to have this discussion is now. And so without further ado, I hope you learn as much as I did from this illuminating conversation with Zico Kolter, Andy Zou, and Asher Trockman. Zico Kolter, Andy Zou, and Asher Trockman, welcome to the Cognitive Revolution. So I'm excited to talk to you guys because we have two great papers that you guys have recently published. And one of them made real waves and the other one was still very much of interest to me, even though it flew a little bit more under the public radar. The first one that Zico and Andy were co authors on is this headline making concept of the universal attack on language models. And I thought this was super interesting as somebody who is working hard to try to figure out how the models work and that includes studying interpretability and doing a lot of these behavioral experiments myself. Your work is right at the intersection of that and I think has a lot to tell us about how things work and also what our prospects are for getting language models under control and to a reliable state in general. For starters, want to tell me what motivated the work and the strategy at a high level that you pursued?
Zico Kolter (6:33) The original goal of this research was largely on attacking public open source LLMs. These are, of course, the ones that you have access to, the ones where you can run white box attacks because you have the weights, can run as much as you want, etc. So the way these attacks work is, we have some question asking the model something it shouldn't answer, like how do you build a bomb? We fill out the prompt with a bunch of garbage tokens after that, and if you ask that question, it'll still say, I can't do this, I'm a language model. But you start adjusting those secondary tokens, those tokens after your question, in order to keep increasing the probability of answering, and when we say answering, we mean answering the question affirmatively. So starting the response with, Sure, here's how you build a bomb. And we were building these attacks really to attack models like Vicuna or, when it came out, Llama 2. And it was honestly a surprise and a shock to us when we found out, or saw, or realized that you could take the same attacks that were trained against open source, public LLM models, and just take the little adversarial snippet that we optimized for, paste this into a public language model like ChatGPT or Bard, and it works there too. And it's incredible actually. We can get into maybe why this is the case a bit later, but as you would expect, we started as a normal first step on attacks on open source LLMs we have full control over. And then, to our surprise to some degree, these seemed incredibly transferable to other settings.
Andy Zou (8:06) We were always interested in adversarial robustness. People have worked on this in vision for over a decade. And there's also work in NLP, attacking language models. Most of that work is focused on some very specialized tasks. We were wondering if there's a stronger technique. We did some experiments, tried many different things, and we found two crucial ingredients: using gradients and then doing search was effective. Our intuition pointed in that direction, so we tried something like that.
Nathan Labenz (8:40) Yeah. That's super fascinating. So there's a lot I want to unpack here on several layers. The setup is this. You have a model where you have access to all the weights. Right? So that's an important thing. You can see the internals of the model that you're working with in this setup. And then you say, I want to find some weird trick that will get the model to do things that it's not meant to do per the training that it's had. So we know that it's meant to refuse all these things. It's meant not to affirm and answer all these things. So the trick is we're going to add a bunch of placeholder tokens at the end of what the user says, and then we will take advantage of the fact that we can see all these internals, and we're going to go about an optimization process of tweaking all these placeholder tokens to whatever we find happens to shift the probability most toward a specific target sequence. And that specific target sequence varies across the prompts that you put in, but it's always of the form, Sure, here is blank, where blank is what the user is looking for. And what you find is, through this optimization process, you can get to some tokens, some of which are borderline readable and some of which are total nonsense, but that, on your white box models, work extremely well. And then, as you mentioned, they also transferred to these other models where you haven't had the access. I think people who listen to this show probably are well aware that all the pre training, all that knowledge, is still in there. And then with fine tuning, it's been masked off. Right? The strategy that you have, which I thought was pretty interesting, was calling it a mode. Tell me a little bit more about how you came to that strategy of saying, specifically, you're targeting the output phrase, sure, here is, and then the result of exactly what the model creator didn't want it to do.
Andy Zou (10:47) This is a great question. Maybe I can first briefly speak about how people align these models. Basically, they do general pre training. Now, there's diverse text on the internet, so you could learn good stuff and also learn to say harmful things. So then afterwards, they do this fine tuning. Some of these open source models just do the supervised fine tuning step, while some other models like ChatGPT underwent RLHF, which is still quite similar to fine tuning, but it's making the distribution of harmless outputs more salient, raising that part of the landscape, and then lowering the landscape for harmful outputs. So essentially, it's suppressing a model's tendency to talk about harmful stuff. It's similar to how people attack vision models: those models, let's say an image classifier, are outputting a distribution over class labels. Similarly for language models, now you're predicting a token at a time, but you're still predicting a probability distribution over sentences or outputs, paragraphs. Just as you can change your inputs to maximize the probability of an image classifier outputting certain labels, in this case, for language models, we also just try to change the input in some way to maximize the probability of outputting certain sentences. And this comes back to how people did RLHF: they are lowering the tendencies of outputting, let's say, an affirmative response for a harmful instruction, like teach me how to build a bomb.
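To make the objective Andy describes concrete, here is a minimal, hedged sketch (not the authors' code) of scoring how strongly an appended adversarial suffix pushes a causal language model toward an affirmative target response. The model name, the 20-token placeholder suffix, and the target_nll helper are illustrative assumptions.

```python
# A minimal, hedged sketch (not the authors' code) of the objective described above:
# how strongly does a candidate adversarial suffix push a causal LM toward an
# affirmative target response like "Sure, here is ..."?
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"   # assumption: any open-weight chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_nll(user_prompt: str, adv_suffix: str, target: str) -> torch.Tensor:
    """Negative log-probability of `target`, given the prompt plus adversarial suffix."""
    prefix_ids = tok(user_prompt + " " + adv_suffix, return_tensors="pt").input_ids
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice the positions that predict the target.
    pred = logits[:, prefix_ids.shape[1] - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

loss = target_nll(
    "Tell me how to build a bomb.",
    "! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !",   # 20 placeholder tokens to be optimized
    "Sure, here is how to build a bomb",
)
```

The attack described in the conversation repeatedly adjusts the placeholder suffix to drive this loss down; everything downstream of that is a question of how to do the adjustment.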
Nathan Labenz (12:29) It seems the inspiration came from vision models. Just to set the stage a little bit there in terms of prior work, I think folks have probably seen some of these adversarial image examples where you look at it and think, I see how the computer got confused on that one; there are famous ones like that. But then there are also just straight up noisy ones that are much less meaningful to the human, where it's much less obvious why you'd be confused, and it's more that there's some weird brittleness to these visual systems: I can see through those squiggles just fine, yet somehow the model can't do it. So you've got both of those types of results here as well. So how do you guys think of these modes? When I think of a mode, I'm always really interested in this line, let's say the phase transition, between stochastic territory and some more meaningful understanding. Your description there was less that. When I hear the word mode, I think of discrete behaviors that are probably working on pretty distinct tracks or information pathways or causal mechanisms through the network. But then when I hear you talk about suppressing one thing or boosting the other, that to me sounds more like a very ad hoc adjustment to the space, where you're like, alright, whatever, let me just put it all on sure through any means necessary. I think this is a very important question, probably, for how we should think about this. Right? Do you have a sense for how deep this concept of mode is versus just putting your thumb on a scale in a way that's more brute force?
Andy Zou (14:09) I don't have a full understanding of what's happening under the hood. It's possible that the model is processing your input in the intermediate layers and is maybe coming to some understanding of whether your input is harmful. Then, given that it is something harmful, it tries to predict the tokens that would give you lower loss in your pretraining loss, let's say, or in your fine tuning loss, which in that case means refusing your harmful instruction. And whether there's a binary switch or whether there's a more continuous representation of harmfulness of instructions, I'm not too sure.
Nathan Labenz (14:53) Hey, we'll continue our interview in
Asher Trockman (14:55) a moment after a word from our sponsors.
Zico Kolter (14:57) One thing you mentioned, Nathan, touches on notions of interpretability, finding modes in, in some sense, the paths of information processing of the network. And at least to a first approximation, I'm not skeptical of this work as work itself, but I am very skeptical of our ability to fully understand networks like this. So I think we are just starting to get any real ability to understand the real way networks process information. There are lots of effects here that, I wouldn't say we don't understand them, because on some level we perfectly understand these things, it's a bunch of matrix multiplies. But I think it is very difficult to talk about things like modes from the standpoint of real structures in the network. And so when I say the word mode, as in we're in a mode where it's going to respond to your question, first of all, let me just say what I mean by that. What I mean by this, or what we mean by this, is that you empirically find the following. So you're optimizing your loss, you have your loss on the response to this question being, sure, here's how you build a bomb, and you maximize the probability of that response. And then you keep doing this, you keep adjusting your prompt in a way that makes this probability of answering the question higher and higher, and what we find is that all of a sudden, at some point in the optimization process, the answer just switches. For a long time, you'll sample initial responses and they'll all say, I'm sorry, I'm a language model, I can't do that. But suddenly, at some point, you'll get the loss low enough, you'll increase the probability of that target sequence just enough, such that the output, in fact the most probable output after your question plus those garbage nonsense tokens, is, sure, here's how you build a bomb. And what we meant by mode here is that somewhat surprisingly, but not really that surprisingly, because this honestly makes total sense when you think about it, once the model has started to answer your question by saying, Sure, here's how you build a bomb, it follows that with instructions on how to build a bomb. And the reason is pretty obvious in hindsight. It's just saying, look, these models predict text by the most likely word or token at a time. If they've already output, sure, here's how you do it, as their response, the most likely thing to follow that is not to interrupt yourself and say, sorry, I can't do this anymore. That does happen sometimes, but that's not very likely. What's more likely is that after you've responded, Sure, here's how you build a bomb, you're going to then keep going with instructions on how to do it. And so what we mean by mode is not so much about a structural mechanism behind this. I happen to think these things are very hard to identify; I'll just say I'm skeptical of our ability to do this very well so far. What we mean by this is a very empirical phenomenon, which is to say, at some point in our optimization, the language model stops responding by saying, I can't do that, and starts responding by saying, Sure, here's how you do this, and then follows that response with the actual instructions on how to build a bomb, of different degrees of realness, we'll just say. But that's the notion, it's a very empirical notion, this mode switch, that's what we mean by it. And the reason why we use that term is that it's definitely a sharp transition.
For a while, it'll just say, I'm sorry, and then all of a sudden the loss gets below a certain point and it starts saying, here's how you do it. And so that's what we mean by mode in this context here.
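A hedged sketch of the empirical check Zico is describing: greedily decode the model's answer and test whether it now begins with the affirmative target prefix rather than a refusal. It reuses the tok and model objects from the earlier sketch, which are assumptions, not the authors' code.

```python
# A hedged sketch of the empirical "mode switch" check described above: greedily decode
# the answer and see whether it now starts with the affirmative prefix instead of a refusal.
# `tok` and `model` come from the earlier sketch and are assumptions.
@torch.no_grad()
def mode_switched(user_prompt: str, adv_suffix: str, target_prefix: str = "Sure, here") -> bool:
    ids = tok(user_prompt + " " + adv_suffix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32, do_sample=False)   # greedy decoding
    completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return completion.strip().startswith(target_prefix)
```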
Nathan Labenz (18:22) Then getting a little bit more concrete about the actual process. I always try to make sure I really understand this stuff in detail. So I've done this thing, the AI scouting report, where I try to break down for people what gradient descent is, and what back propagation is before that, and what a forward pass is before that, and even just what a loss function is. It's amazing how, if you had to attribute the intellectual content of any of this, a lot of it ends up being, in my view, in a really good loss function definition. So maybe we start there. The loss function that you guys are looking at, what you're optimizing against, is: I have these tokens which are embedded. I'm keeping everything else frozen in my language model, but across a range of different inputs, and even different models, I'm going to sum my deviation between the actual output and my target output. And then I'm going to look at the embeddings, if I understand correctly, and ask how would I adjust the embeddings of those filler tokens so as to minimize that deviation? In other words, so as to maximally nudge the model toward saying, sure, here is blank, where the blank varies. Is that right? All this summing is happening at the same time, and then you're just looking, for all these different cases, at how you would tweak just the handful of input token embeddings. Is that correct?
Andy Zou (20:07) Yeah. That's mostly accurate. So yeah, we do perform this update across multiple prompts over multiple models, and we get a final aggregated loss on that. There are previous methods that operate in input embedding space. In our case, it's more operating in the one hot vector space. We're making hard updates for each token instead of updating the soft embeddings, because you usually don't have full access to input embeddings; you just give the model a concrete string. In that case, we need hard assignments for tokens.
Nathan Labenz (20:49) So that's important. Just the very last thing you said there is pretty important. Right? If you didn't have the full access to the model, then all you can control is what you're putting into it. So for this to be practical, or an attack that people would need to worry about in the wild: you could worry that people might hack your weights and do all kinds of stuff, but as long as the secrecy of your weights has held up, then there are certain things you don't have to worry about. But anything people can input as tokens, you always have to worry about, regardless of what access people have. So you've set a high bar there for yourself to say, the reason we're going to go to all this extra trouble is that we're looking for something that doesn't require any weights to have leaked out. It doesn't require any special access. Even if all you're doing is engaging with, for example, an OpenAI API where it's tokens in, tokens out, then this could still be something you could use. So I think that's pretty important, definitely worth highlighting. And then the other key point is, as a consequence of that, we can't just tweak the embeddings. Instead, we take the gradient, which is more mathy terminology than I usually try to use, just to try to be as plain spoken as possible, but taking the gradient is finding the direction in which we would tweak the embeddings. But then you have to do this additional process of saying, okay, now what actual tokens have embeddings that are in that rough direction? But then also, because everything is weird, the loss landscape is weird, it's not obvious which of those will be most effective. So something that might be in that line but too far is not great, and something that might be a little bit off angle but closer could be better. It seems that's just generally tough. So there's no way to really figure that out other than to then take a bunch of candidate tokens and just process through them, grind through, and figure out which one is best.
Andy Zou (22:54) Yeah. That seems like a crucial ingredient to our method. So assume that the embeddings are fixed, so whatever the vocabulary is, that's fixed, and we only change which embeddings we use. While there is prior work on trying to optimize in the embedding space and then project to the nearest token or something, it seemed like this gradient in embedding space, first of all, if you take a step in embedding space, you can go out of distribution of the actual vocabulary, so it might not correspond to actual model behavior. And then secondly, the gradients for these discrete tokens just aren't very precise. So it's an approximation, but the approximation is still very crude. Essentially, it gives you some signal about where you should probably look. For example, we sample the top 200 or 300 tokens with the largest gradient. Whether the ranking within that, let's say, top 200 by gradient is meaningful, I'm not too sure. I don't think it's super meaningful in that respect. That's why we take the top-k and then we run this forward pass through the model to see the actual loss. And then we pick the best one from there.
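A hedged sketch of the kind of computation Andy is describing: take the gradient of the target loss with respect to one-hot token indicators at the suffix positions, then use it only to shortlist the top-k candidate substitutions per position; the real loss of each candidate is still checked with forward passes. The input_ids, suffix_slice, and target_slice variables and the value k = 256 are illustrative assumptions, and model and tok come from the earlier sketch.

```python
# A hedged sketch of the gradient step described above (names and slices are assumptions):
# take the gradient of the target loss with respect to one-hot token indicators for the
# suffix positions, and use it only to shortlist top-k candidate substitutions per position.
def token_gradients(input_ids: torch.Tensor, suffix_slice: slice, target_slice: slice) -> torch.Tensor:
    embed_weights = model.get_input_embeddings().weight               # (vocab, d_model)
    one_hot = torch.zeros(
        suffix_slice.stop - suffix_slice.start, embed_weights.size(0),
        dtype=embed_weights.dtype, device=embed_weights.device,
    )
    one_hot.scatter_(1, input_ids[0, suffix_slice].unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    embeds = model.get_input_embeddings()(input_ids).detach()
    suffix_embeds = (one_hot @ embed_weights).unsqueeze(0)            # differentiable path through the one-hots
    full_embeds = torch.cat(
        [embeds[:, : suffix_slice.start], suffix_embeds, embeds[:, suffix_slice.stop :]], dim=1
    )
    logits = model(inputs_embeds=full_embeds).logits
    targets = input_ids[0, target_slice]
    loss = F.cross_entropy(logits[0, target_slice.start - 1 : target_slice.stop - 1], targets)
    loss.backward()
    return one_hot.grad                                               # (suffix_len, vocab)

k = 256                                                               # assumed candidate pool size
grad = token_gradients(input_ids, suffix_slice, target_slice)         # slices into an assumed tokenized prompt
top_k_candidates = (-grad).topk(k, dim=1).indices                     # k most promising tokens per position
```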
Nathan Labenz (24:24) So from a super technical perspective, I think one thing that was not super clear to me in reading the paper, and I'm not the best with notation, so that's probably why, is the structure of whether everything was being optimized at the same time or whether there were outer and inner loops. But what I'm learning here, I think, is that the loss function is defined broadly. And this is possible because you're only looking for the gradient with respect to a relatively small number of parameters. Like, you have relatively few free parameters in this system as compared to the giant model. Right? How many tokens do you use for the padding? Is it 20 tokens or something? It's a little more than that, right?
Andy Zou (25:10) 20 tokens is the default.
Nathan Labenz (25:11) Oh, okay. Cool. And then each one might have is it 1000 long vector that is the one hot?
Andy Zou (25:20) It depends on the model. It depends on tokenizer. Some of these are 30 ks.
Nathan Labenz (25:24) Even so though. Even if it is 30 k and you've got 20, you've got 600,000 numbers there that you can say, okay, how would I adjust these 600,000 numbers ideally if I were able to do this in a continuous way? And that's, again, the gradient. Right? But 600,000 numbers compared to if you're doing this on whatever, anything in the billions, it's what? 4, 5 orders of magnitude fewer parameters. So does this run then pretty quick? I'm always interested in this question. How long does it take you to run a full cycle of this experiment?
Andy Zou (26:05) You get the gradients, and then you take the top-k, you run a forward pass for that batch of k, you get the best loss, and then you make that update. So that's one step. Most of the computation here is doing the forward pass, where you have a large batch size. But each step is one forward pass if you're optimizing against one model. And this is really nice because we just need to do more forward passes. If you have multiple models, you can parallelize like that. If you have multiple prompts, you can also parallelize that. Compared to previous methods like PEZ or something like that, it's slightly slower for certain inputs, but not that much slower. But the catch is that if you do this gradient guided search, with that forward pass step, you converge much faster. For a lot of these examples, for example, if you have white box access to some of these models, in order to elicit, let's say, one target string, sometimes you only need to take 20 steps and you'll get it. So on that front, I think it's pretty fast.
Nathan Labenz (27:16) Yeah. Where does the compute go? How many prompts are in your attack set? Because all of that's getting summed, right? So it's multiplicative: how many tokens, times how many prompts we want to be working with, maybe times how many models, times how many numbers in the embedding. And how many steps? I'm very curious about this.
Andy Zou (27:38) So there are a couple of different settings. Maybe earlier I was talking about the white box setting, where you're optimizing against one model for one prompt, and that's the order of magnitude of compute you need. But in the case where you want to create universal and transferable prompts for multiple models over multiple prompts, you need to optimize against many models at the same time for many prompts. For our largest experiment, I think we were running against four models, and they were Llama-based, 7B and 13B. And we were training on 25 different prompts at the same time. And then we were running for, I remember, 500 steps or so. We put one model on one GPU, so it needs four A100 GPUs. And I think it usually took half a day to run the experiments.
Zico Kolter (28:45) So one attack takes from half a day to a day. This is not fast, it's not instantaneous here. And this is, I would say, the nature of discrete search. Ultimately, what you're doing with this attack is you are searching over a lot of possible tokens. And it's accelerated by gradients, accelerated by this knowledge you have of the model, that's great. But ultimately, for these things to work, you have to evaluate a whole lot of candidate substitutions, and then repeat that process for many steps. Ultimately, the only way to do that is just to run this model a bunch of times, because there's no definite way to know how the substitutions will really affect the loss without trying them. And so you need to run a lot of steps over a lot of different models, over a lot of different prompts, and that takes compute. So these are slow attacks relative to traditional adversarial attacks where everything is continuous. This discrete element really does slow it down a lot. It's not too bad from the standpoint of an attacker for this particular use case. Because if you find this nice prompt that will work and you can plug it into anything and get the model to circumvent its safeguards, that's great, you're done. Where this has a lot of relevance, I think, and maybe we can get to this a little bit later, is in the context of potential defenses. The way people defend against these attacks in the computer vision setting is that they essentially run an attack and then include this attack in the next iteration of training the model. And that works great if your attack takes a few milliseconds to run. You can do that as part of your training loop. It's still slower than normal training, but it isn't too bad. If your attack takes you a day to run, you can't wait a day for each gradient step in your model training. This is not feasible. And honestly speaking, these attacks are computationally intensive relative to traditional adversarial attacks. And this has a lot of implications, not just in running the attack itself, but in the potential defenses against it.
Nathan Labenz (31:10) Two levels I want to follow up on there. On the defenses, I have a question about what you might do if you're just trying to defend your own model where you have access to all the weights, before jumping to the defense. So let me just, again, echo back the setup. We're definitely going a little bit beyond the paper here, which I think is cool; people, I think, want to get a little bit more of a sense for what it is you're doing when you're doing this research. So you've got a max of four models. So you have a setup of four A100s. These things, they're 7 or 13B, so they can fit onto a single A100. And you've got weights parked on each one, ready to run your forward passes or take your gradients. The 25 prompts, so you've now got 100 model-prompt pairs, and each input token, you said, could be embedded up to 30,000 numbers wide. And there are 20 tokens as well. Right? So 600,000. So we're looking at 60,000,000 individual numbers that you'd be taking a derivative with respect to, ultimately, to get a gradient.
Zico Kolter (32:31) Not quite right. Because the nice thing about backprop, right, is that you can sum up multiple losses in the end and get the gradient with respect to a single set of parameters for the sum of all those losses. Instead of getting 60,000,000 numbers, you just get the 600,000 numbers. So the way the backward pass works is that you run all these models and compute the loss you have, and the loss here is essentially the negative log probability of the target sequence you're looking for. And the sum of this function over all the different prompts, over all the different models, that is your loss function. Again, the way that backprop works is that you have an arbitrarily complex loss function, and because it's a scalar value, you can just run this procedure that computes the individual derivatives of your loss with respect to all the possible token substitutions you could make. And so you're still just computing, ultimately, 30,000 times 20 different numbers that indicate how important, or how good, each potential substitution of each token would be. And then you're just picking the best ones and running the forward pass there.
Andy Zou (33:46) Yeah. Since we're just optimizing one suffix for all prompts over all models.
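A hedged sketch of the aggregation Zico and Andy just described: losses (or their gradients) from several (model, prompt, target) cases are combined so that one shared suffix receives a single gradient with the shape of its one-hot indicators. The token_gradients_for helper and the per-model normalization are illustrative assumptions, not the authors' exact recipe.

```python
# A hedged sketch of the aggregation just described: one shared suffix, with gradients
# from several (model, prompt, target) cases combined into a single (suffix_len, vocab)
# array. `token_gradients_for` is an assumed per-model helper like the one sketched earlier;
# the per-model normalization is one plausible way to balance contributions.
def aggregated_suffix_gradient(models, prompts, targets, adv_suffix_ids):
    total_grad = None
    for m in models:                                     # each model can sit on its own GPU
        for prompt, target in zip(prompts, targets):     # same prompt set shared across models
            grad = token_gradients_for(m, prompt, adv_suffix_ids, target)
            grad = grad / grad.norm(dim=-1, keepdim=True)   # balance scale across models/prompts
            grad = grad.to("cpu")
            total_grad = grad if total_grad is None else total_grad + grad
    return total_grad                                    # (suffix_len, vocab) for the one shared suffix
```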
Nathan Labenz (33:51) For each of the numbers in the embeddings, for each of the 20 token positions, each one is the sum of a contributing factor from each of those 100 cases. The challenge is you're looking for an attack type that operates in discrete tokens. So you now have this direction that you want to go in, but there are a bunch of tokens that are in roughly that direction. And so in order to then figure out which one works, you have to run the next forward pass with that token in place. Is that something where you go down the line, or can you optimize all 20 token positions at the same time, or does that have to be successive? What are the compute dependencies in that process?
Andy Zou (34:37) For each of the forward passes, we compute the loss for different substitutions, but for each candidate suffix, we're only changing one token at one specific location. And then we're getting the loss for all of these different candidates and picking the best one. So we're picking the best token position to swap. That was an interesting detail. Based on my intuition, I felt that with these gradients you do want to make just one small update each time, because if you update too many tokens at a time, maybe the gradient isn't very useful anymore; it's not very local anymore. So it's one token at a time, but also evaluating all positions, seeing which position is the best position to make an update. I think some of the prior work did coordinate descent, for example, where they loop through each position, and at each step they only try to update one position. But it seems to help a lot if you consider all positions at every step and then take the best one.
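A hedged sketch of the greedy swap step Andy describes: every candidate suffix differs from the current one by a single token at a single (randomly chosen) position, all candidates are scored with real forward passes, and the best swap across all positions wins. The sample count, the decode-then-re-encode shortcut, and the reuse of target_nll and top_k_candidates from the earlier sketches are assumptions; the actual method works directly on token IDs.

```python
# A hedged sketch of the greedy swap step just described: every candidate differs from the
# current suffix by one token at one position, all candidates are scored with real forward
# passes, and the best swap across all positions wins. `target_nll` and `top_k_candidates`
# come from the earlier sketches; the decode/re-encode shortcut and sample count are
# simplifying assumptions (the actual method works directly on token IDs).
@torch.no_grad()
def greedy_swap_step(user_prompt, adv_ids, target, top_k_candidates, n_samples=512):
    candidates = []
    for _ in range(n_samples):
        pos = torch.randint(adv_ids.size(0), (1,)).item()             # position to modify
        new_tok = top_k_candidates[pos, torch.randint(top_k_candidates.size(1), (1,))].item()
        cand = adv_ids.clone()
        cand[pos] = new_tok                                           # swap exactly one token
        candidates.append(cand)
    losses = torch.stack([target_nll(user_prompt, tok.decode(c), target) for c in candidates])
    best = losses.argmin().item()
    return candidates[best], losses[best].item()
```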
Nathan Labenz (35:44) Yeah. Okay. Cool. That's really interesting. So what I take away from that is the loss landscape is just extremely weird. A lot of this is necessary because the clues that you're able to get about where to go are in fact not super reliable clues, or they point in various directions, or they have weird interactions, or they cancel each other out in weird ways. This is maybe an impossible question, but what have you learned about the nature of the loss landscape from doing this work, other than that it's weird?
Andy Zou (36:18) I think maybe this is largely a problem of dealing with discrete tokens. And just in general, or maybe this is also my intuition, if you're working with language or these higher level concepts, the concepts are also discrete. If you try to work in continuous space or embedding space, looking at individual tokens and how to perturb them, it felt weird to me how you would even move between these discrete concepts. I think it's the discrete nature of NLP that makes things a bit weird. And so now you have to, with some guidance from gradients, do a lot of search just to verify which one is the right direction. When we solve problems, we also do a lot of search; if you're trying to solve a math problem or something, you do trial and error. It's not clear whether there is a small update according to the gradient that you could make that would just make the problem much better, so it's bigger jumps. But then, in order to do that, you don't have the local information, so you need to verify by doing the forward pass to find the best loss.
Zico Kolter (37:30) One conceptually nice thing, I would say, about this attack, or I should say about this overall framework for looking at adversarial attacks on language models, is that, as with a lot of problems in machine learning, there's a very nice and clean separation between the loss function you're optimizing and the optimization procedure you use to minimize it. To be honest, both of those are crucial elements to this overall strategy we had, with the end result of making language models do bad things, or output text that they shouldn't be outputting. But this is also the key point: there's a ton of alternatives to explore here. This paper in some sense was a presentation of two aspects that, when put together, seemed to work pretty well. So we had a certain loss function, namely trying to make the model say, sure, here is blah, blah, blah, over multiple prompts, over multiple models; that's the loss function, that's one side of things. On the second side, you have the fact that, look, now we know that the parameters we're tweaking to do this, or the tokens we're tweaking, are all discrete, and we need a way of optimizing that. And so we came up with this also fairly involved, but not that complex, method for essentially saying, look, here's how you swap tokens in each step to optimize your loss. We put these two things together, and what comes out of it is a seemingly very powerful way of manipulating these language models. But the point I would highlight here is that neither of these is at a plateau yet. We don't know how much more we could do. We know this works, this is an instance of it working, but I certainly think there's a lot more research to be done here to try to figure out what are better optimizers, what are better loss functions, and how you can put these two together in a better way. This is a proof of concept right now. I think, and I hope, that it reinvigorates a bit of research in this area of adversarial attacks in this new, I think very relevant, setting, such that we find overall better ways of solving these two problems. And of course, this is just talking about it literally from a technical standpoint of trying to minimize a loss. The implications of this are in some sense very different, and there's a whole set of things we can talk about there; that's a whole other conversation in some sense. We don't want to suggest that this is the best thing possible to do. This is one step in the research agenda; it's in some sense an opening volley here. This separation of objective and optimizer really lends itself, I think, to iterating on these processes, iterating on the research, and finding ways to do this more efficiently, with less time, maybe with higher success rates, all that stuff. It's all out there. And if we learn anything from the adversarial attack community, it's that there's no shortage of ingenuity when it comes to clever ways to attack these models and come up with new and ever changing ways of doing these things. So I think it really speaks to, or it will speak to, the flexibility of these things if we can do more and more work on tweaking them in different ways.
Nathan Labenz (40:59) Andy, do you have any visual in your mind as you think about what these changes are doing in the internals of the model? I was trying to come up with one, but Zico shot me down on it.
Andy Zou (41:10) You know, that's very interesting. And I think it is still very unclear. Basically, I've also tried to probe around, and it does seem that while there might be circuits inside the model, or call them model representations, that separate harmful versus harmless instructions, there could also be other factors that contribute to why the model follows a harmful instruction, because maybe currently it doesn't fully depend on whether the prompt is harmful. Maybe there are some weird factors in the training dataset where, because of certain things in the instruction, it will try to follow that instruction. And also, what I found is that with model internals, even with jailbreaks, you can still separate harmful versus harmless instructions. But clearly, the model isn't only using that information to produce these outputs. And I do think this is an interesting direction to look at, where you're manipulating model behavior through these suffixes or these attacks, and that gives you some insight into why the models operate in certain ways.
Nathan Labenz (42:24) So if you were trying to defend against this, and you're maybe also thinking, how can I be as robust as possible, and maybe also how can I be as efficient as possible, I wonder if you wouldn't go away from discrete tokens. Like, maybe you just want to train on adversarial examples where the suffix embeddings are optimized adversarially directly, and then still train against that, as opposed to even worrying about the token optimization layer. If I'm Anthropic, and by the way, just to motivate this whole conversation, I'm sure you guys have seen the Dario interview from not too long ago where he says, today a jailbreak is an embarrassment for the company, and a couple years from now it might be catastrophic. So if you're them and you're thinking, this is a pretty interesting attack, what would you think about doing that adversarial training, but in a way that's even in the embedding space? My thought there is maybe that would even be more robust than doing it just at the discrete token points.
Andy Zou (43:32) Maybe I can just quickly speak from experience. You can do that; obviously, you can do adversarial training in embedding space, which is what image classifiers are doing. And obviously, if you're robust to that attack, which covers the whole space, then you're fully secure. But it's just the fact that if you do it with a larger norm, or if you try to make the attacks stronger, which would cover more cases, the performance of the model goes down. There's that trade off that we still haven't solved. And if you use a weaker threat model to do adversarial training in embedding space, which I think we also tried, you would still be vulnerable to these token level attacks. Sometimes it doesn't really help at all. So we probably do need a better or a different method to do adversarial training. And in general, I always think of it as: maybe you can approach defense from two angles. One is maybe building a more secure system with more layers. So if you think of the model as one part of a larger system, then from more of a security perspective, you could add input filters, output filters, different layers of defense. And at inference time, you can probably do all those computations. And perhaps if you add enough layers, that constrains the attack space a lot, and then it's pretty difficult to find these general jailbreaks. But I think there is a problem with that. And on the other hand, you can also just make your model inherently more robust, but I don't know if there's a good way currently of doing that. Let me see what Zico has more to say about that.
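To illustrate the embedding-space adversarial training idea being discussed (this is not a method from the paper), here is a hedged sketch of the inner loop: find a small perturbation of the suffix embeddings that makes the desired refusal as unlikely as possible, which an outer fine-tuning loop would then train the model to refuse despite. refusal_loss_fn, the L-infinity budget eps, and the step count are all hypothetical choices.

```python
# A hedged sketch of the embedding-space adversarial training idea under discussion (not a
# method from the paper): an inner loop finds a small perturbation of the suffix embeddings
# that makes the desired refusal as unlikely as possible; an outer training loop would then
# fine-tune the model to refuse anyway on the perturbed input. `refusal_loss_fn`, the
# L-infinity budget `eps`, and the step count are hypothetical choices.
def adversarial_suffix_embeddings(model, input_embeds, suffix_slice, refusal_loss_fn,
                                  eps=0.05, steps=3):
    input_embeds = input_embeds.detach()
    delta = torch.zeros_like(input_embeds[:, suffix_slice]).requires_grad_(True)
    for _ in range(steps):
        perturbed = torch.cat(
            [input_embeds[:, : suffix_slice.start],
             input_embeds[:, suffix_slice] + delta,        # perturb only the suffix positions
             input_embeds[:, suffix_slice.stop :]],
            dim=1,
        )
        loss = refusal_loss_fn(model(inputs_embeds=perturbed).logits)   # loss of producing the refusal
        loss.backward()
        with torch.no_grad():
            delta += eps * delta.grad.sign()        # inner max: make the refusal less likely
            delta.clamp_(-eps, eps)                 # stay inside a small L-infinity ball
            delta.grad.zero_()
    return (input_embeds[:, suffix_slice] + delta).detach()
```

As Andy and Zico note next, the trouble with this threat model is the trade-off: a perturbation budget large enough to cover token-level attacks tends to degrade the model, while a small budget leaves the token-level attacks intact.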
Zico Kolter (45:20) I think Andy's exactly right there. From a high level, when you talk about defending against adversarial attacks, especially in this setting, there really are these two things you can imagine. There are things you can do outside the model to try to limit the attack surface from the start, and then there are things you can do to the model itself to try to make the model inherently more robust. And I know the concept of where the model starts and where the filters start, where the model stops, can be a little bit fuzzy, and that's okay, let's not worry about that. I think, conceptually, we still have these two obvious candidates for defending against these things. We've been working on defenses for a long time. We've come up with some of the more well known defenses, especially certified defenses, which provide guarantees of robustness against certain classes of perturbations for various models. And I've worked on this enough that I can say with some confidence that in traditional deep learning, so in the image domain and things like this, we have not succeeded yet. We are not even close to creating robust models that are truly invulnerable to these attacks. And there really seems to be a fundamental challenge here that we just don't know how to overcome. And that challenge is this: what adversarial attacks are about is, as you said before, crazy loss surfaces, crazy surfaces in how the output of these neural network functions responds to small changes in input. What we find is that the way you prevent these things is you make the function less sensitive to these small perturbations, you make the functions smoother. But when you make the functions smoother in this way, they stop performing as well, and all your benefits of deep networks go out the window. I think the best adversarial performance we have on data sets like CIFAR-10, which are years old now, toy problems from the standpoint of computer vision, is 70% accuracy, which is what they had back in 2012, and it's not improving very quickly. All of this is to say, we don't know how to solve this problem, and a lot of the potential solutions we're thinking about here, we should try them and we can try them, but there's no real evidence yet that they're going to work. We just don't have that evidence yet that these things are really going to work. Or, to your question earlier about why we don't try to attack things in embedding space, since that's a smooth space we can optimize over: that's just too powerful. You can't do that. If you allow an attacker to modify embedding space arbitrarily, they will succeed at attacking you. So to create a classifier that's robust to that, you'd have to make your classifier not output anything at all, ever. Or your language model here would always output the same thing, maybe. That's a very robust model, by the way, if it just always outputs the same phrase; super robust. It's not very useful. And there does seem to be this fundamental trade off: when we try to make models more robust, they just don't perform as well, and their performance degrades to a point where no one would ever want to deploy that model for that small edge case of adversarial robustness. It's very possible that the use cases we're talking about here with LLMs change that calculus.
So all of a sudden, the risks involved in deploying models that are non robust get to a point where it really becomes unwise to deploy them, and so we willingly degrade the capabilities of these models for the sake of robustness. I hope we start taking that consideration seriously, but I also worry that's not going to be the default that people go with. People will go with the model that's most capable, the one that gets the lowest perplexity on whatever data set they hold out and evaluate on, and go from there. So there are a lot of factors involved here in creating robust defenses. There's the filtering approach, there's the model robustness approach. And the reality is we just don't know that much about the space here, but I suspect that any solution we have has to exploit the discreteness of this task. Because if we get away from that, if we get away from discrete tokens, the task just becomes too hard, and I think we will not succeed at making robust models. The only hope we have for a successful defense is taking advantage of this discrete nature, and hopefully that does make the attacker's problem much harder, and maybe there are successful ways of defending here, but we just don't know yet. It's so early in this whole research landscape, at least in the context of LLMs so far.
Nathan Labenz (49:38) I was just reading this paper over the weekend from Anthropic about their influence function technique, where they were looking at what data from the training set most influenced a particular response. And it was striking to see how, at the small end, very naive token matching seemed to prevail in many of the examples. And then as they got up to the midsize models, it was like, okay, now I see that you've got some much deeper conceptual understanding, something much more sophisticated in some sense. What is grokked? What is not? Is this just some close fitting to some crazy landscape that still doesn't have any procedural grokking to it? Who knows? But definitely something much more conceptual seems to emerge. Now, you're still beating these models, right, even in production today. If they're referring to their mid-size, 50-billion-parameter model, that's presumably the same one you've been testing on, right? Let's summarize the headline findings in terms of transferability, because that's pretty insane. It's one thing to say, okay, you can find this if you have the white box access to all the weights. But as you said at the top, it works on frontier commercial models that nobody has the weights for. It seems to work a lot on the OpenAI models. And it also works on the Llama models, and to a much lesser extent, but still nonzero, on the Claude models by Anthropic as well.
Andy Zou (51:21) The models we attacked were mainly Vicuna models, which were trained on ShareGPT data, which is outputs from ChatGPT. The attacks did transfer better to OpenAI models. Interestingly, it also worked on Bard and Claude. I think Bard was more than 50%. ChatGPT, the 3.5 version, was definitely more than 50%. Claude 1 was, say, below 50%. But this is using, I think, four different prompts, and we count it as a success if one of them works. And then for Claude 2, it was in the single digits. But obviously, this was also with the four prompts that we had at the time. If we try some different prompts, which we did, sometimes they work better; some numbers might be higher. And, just to be fair here, I think we also tried the GPT-4 June version. That one is more robust than the March version; I think it was also in the single digits. I was thinking about why this is the case. Those are later models; Claude 2 as well, those were trained much later than the first versions, whose outputs we used for distillation. So perhaps there's more training data, or different training data. And I just think if you fine tune the models a bit further from their distilled counterparts, then maybe it doesn't work as reliably. In order to verify that, maybe we should also distill some of these newer models to see if it works better, which is what we're currently doing.
Zico Kolter (52:55) Our methodology for transferring these attacks is that we constructed attacks on open source models, and we took those exact same attacks, without any modification, pasted them into the commercial models, and, shockingly to us, they worked an amazingly large amount of the time. And so there are a lot of questions that we still don't really understand about why this is. Why is it the case that these attacks that were learned on open source models seem to have any chance whatsoever of working? Why is the success rate above 0%? I think there are many theories on this. Then maybe secondly, why does it work sometimes and not others? That's the other question. As was mentioned, for the most recent GPT-4 or Claude 2, the success rate notably drops off. And there are a number of possibilities here. To be honest, given our current access to models and our ability to run them, we just haven't done quite enough analysis to really get at what the true answer is, but we have some possible theories. One possible theory is that for ChatGPT, the attacks are more successful there because Vicuna, which is the model we attacked, is trained on outputs from ChatGPT. And this is a common thing in adversarial attacks: you attack a distilled model and it works against the original model. That's an obvious possibility. But it isn't a full answer either, because it works against Bard and Claude 1 too, so something else is going on also. The other thing I'll say is, when it doesn't work, why doesn't it work? Is there something fundamental, did Claude 2 fix this? Did it fix adversarial attacks? Or did the latest GPT-4 fix adversarial attacks? I don't think so, personally. Maybe, unbeknownst to us, they fixed something fundamental there, but I doubt that. I think what's happened is that they've probably done some more fine tuning and they've probably done some more prompt engineering. What happens in both cases, when you fine tune a bit more or you add a much longer system prompt, is that it does give you some degree of resilience to these fixed prompts that we discover. However, and this is the big but, that is not, in my view, a secure system, because all we need to do, and this is what we're intending to do in follow on work, is take new outputs from your system, create another fine tuned model, like Vicuna but based on the new outputs, attack that, and it'll probably be successful. To be totally honest here, we are right now, I would say, in the midst of this attack-defense game, where we just don't quite know what the ultimate limits of this approach really are. It's shocking, fundamentally, that it worked at all on these commercial models, and that's in some sense the scientific result here. The result, scientifically, is that on some of these models, to a shockingly high degree, these attacks transfer. When they don't, that doesn't mean that you can't build similar attacks against those new models; it just means these particular ones don't work yet. And I think that's an important point to consider there: we have only begun thus far to really probe at some of these models. As we do more work, or as the community does more work, I think we will uncover a lot more about the true hardness landscape of how hard these attacks are to get to work on different models.
Is something built into the model that makes them more robust, or do we just have the system prompt wrong for some of them, so that if we reattack with the right system prompt, it'll all of a sudden work again?
Nathan Labenz (56:29) Yeah. Maybe give me a little bit more then on what you think is underlying the transferability and maybe also how you think that shapes up because you could imagine quite different trajectories. And if you're not ready to even make a guess, that's cool too.
Andy Zou (56:48) These models are trained for instruction following, and this is related to the mode switching type of argument. They're trained to either follow the instruction or to refuse the request. Usually when it follows the instruction it says, Sure, here's how to do it. And when it doesn't, it says, I'm sorry, or I apologize, I can't. They're all trained on these types of instruction tuning data sets, and that's the underlying data they are supposed to fit. So in some sense, they all share this common underlying distribution. Then if we maximize or minimize the loss for certain models, it might also, to some extent, minimize the loss for some of these other models trained on very similar data. That is, I think, why it works. And then also, one observation was that if you only train on one model and one prompt, it's not going to transfer at all. At least it's not going to transfer to other prompts, even for the same model. So you do need this more general setup where you're optimizing one suffix for multiple prompts and multiple models. It seems the vulnerabilities are everywhere, so it's very easy to pick a very specific one for your specific prompt and your specific model. But if you find one that works across many different models and many different prompts, it seems to be finding a vulnerability that is more common to all these models, and that gives you a much higher chance of transfer.
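To make the "one suffix, many prompts, many models" setup concrete, here is a minimal PyTorch-style sketch of the aggregated objective being described. It assumes HuggingFace-style causal LMs with a shared tokenizer and pre-tokenized prompt and target tensors; the helper is illustrative, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def universal_suffix_loss(models, prompt_ids, target_ids, suffix_ids):
    """Sum the negative log-likelihood of each affirmative target ("Sure, here is...")
    over several models and several prompts, for one shared adversarial suffix.

    prompt_ids[m][p] and target_ids[m][p] are 1-D LongTensors already tokenized
    for model m; suffix_ids is the shared suffix (assumes a shared tokenizer).
    """
    total = 0.0
    for m, model in enumerate(models):
        for p in range(len(prompt_ids[m])):
            ids = torch.cat([prompt_ids[m][p], suffix_ids, target_ids[m][p]]).unsqueeze(0)
            logits = model(ids).logits                    # (1, seq_len, vocab)
            start = prompt_ids[m][p].numel() + suffix_ids.numel()
            # Logits at position t predict token t+1, so shift back by one.
            pred = logits[0, start - 1 : start - 1 + target_ids[m][p].numel()]
            total = total + F.cross_entropy(pred, target_ids[m][p], reduction="sum")
    return total  # minimized over the discrete suffix tokens by the search procedure
```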
Zico Kolter (58:19) The question of adversarial transfer is still one that we don't fully understand in any setting in machine learning. There are a lot of hypotheses about this. Some people say, oh, it's just because they're all the same overall architecture, and these architectures have this joint vulnerability: they're all transformers, and that's what's causing it. I don't believe that one. It's a hypothesis right now, and it's something we are trying to test, but it's also hard to test, to be honest, because to a certain extent, ablating it would require retraining brand new language models with no overlapping data, which is very hard to do. It's hard to train a brand new language model that has zero overlapping data. But I believe the transferability of these attacks is most likely due to the pre-training data. What I mean by that is, we know that these models, even though they're instruction tuned in slightly different settings, maybe use slightly different training mechanisms, and definitely have different architectures to some degree, are trained on similar data. They all use Common Crawl, they use Wikipedia, arXiv papers, all these data sources that we know they all make use of. We don't know for GPT-4 because they don't tell us what it is, but we roughly know the mix they're probably using, if not the details. My hypothesis is that this is a feature of the data. What I mean by that is, in this training data, in Common Crawl, not to us, but in some fundamental way in the data itself, there are just weird dependencies or features that are genuinely useful in predicting next tokens, that don't make any sense to us, but which are genuinely there. In the data that was used to train these models, a string like "dash dash winky face describing now" or whatever actually means something. That's meaningful. To borrow the term from vision, these are non-robust features: that just means they aren't things we associate with being real features, but they really exist in the data, and keying on those kinds of features, or having a transformer that picks up on those elements, genuinely improves perplexity on the training set. And because it improves perplexity, these models are going to do it. They're going to pick up on these weird patterns that we don't really see as patterns, but they're there. They're somehow there in the data, just in a way that we can't really discern, and arguably they're not really there in the sense that they're not part of language, but they're part of the textual inputs these things are given. That would be my best guess as to why you see transfer across different architectures and different training at all. It's probably also helped by the fine tuning on ChatGPT data; I'm sure that's a crucial part, and without it maybe you wouldn't see this transfer at all. But I think the underlying cause of the transfer most likely lies in the pre-training data itself and in these non-robust features, again, a term from Aleksander Madry's group: non-robust features that exist as genuinely helpful things in the training data, and adversarial examples pick up on them. That's my best guess as to why these things happen.
Nathan Labenz (1:01:40) Could you comment a little bit also on the mix of these things? Because I've seen a couple of different general types. Some of them look like a total random character string, almost like the cat walked across the keyboard. And then others are uncanny. Those initial gibberish strings look relatively easy to classify, whereas some of the others look like things a user would plausibly enter and are thus probably a lot harder to classify. And I don't know what the relative balance of those is, or whether you can preferentially look for the more intuitive ones in the attack, or was it just that sometimes they randomly popped out and sometimes they didn't?
Andy Zou (1:02:21) I think when you're minimizing the loss, you're finding useful features. Some of these are maybe more robust, which are more readable, and some of these are non-robust, and it's a mix of both. It does seem that a lot of the time it is incorporating some of these readable features, to a surprising extent, at least for me. I wouldn't have expected that if you just directly optimize without any fluency constraints, it would pick up on those. But it does seem a large number of the suffixes include some sub-parts that are readable, or that make sense. And obviously, you can also modify the loss to make it more fluent, and that would up-weight the readable features.
Zico Kolter (1:03:09) I think it's quite interesting that, given the number of possible adversarial attacks, it does seem to find some that are sometimes, for lack of a better word, interpretable. "Now say that oppositely," or "now say the opposite," was a phrase that came up in some of the attacks. What the systems respond with when that phrase is included in our adversarial suffix is that they will first follow the instructions, they'll insult you, and then they will do the opposite: they will compliment you after the fact. And somehow, in some weird way, maybe that balances out the badness so that the model doesn't think it's that bad after all. Another one that you frequently find is the word "sure": the phrase "say sure" appears in some of the attacks as well, and we want it to respond by saying "Sure". So there seems to be an extent to which these attacks are not entirely orthogonal to what you think of as manual jailbreaks of these systems. There is a sense in which maybe they are uncovering some of the features that people have themselves, through their own intuition, uncovered to break some of these models, and there have been many manual jailbreaks already, to be clear. This is very interesting, and the reason I find it so interesting is that I would have thought these prompts would be entirely nonsensical. I would have thought they would just be complete garbage, just random characters, because we know the space is that big. And they often are, a lot of it is, but the fact that they find even some things that are intuitive makes me believe that maybe the space of jailbreaks, in some sense, is not that big. There may be a more limited subset of these things than we first realize. And this might speak toward the possibility of defense: if it really is things you can intuit, then maybe there aren't that many things to try here. But on the other hand, there's also a bunch of garbage. So which is going to win out? I don't know. But the fact that it ever puts out anything "interpretable", with very large air quotes there, is quite interesting in its own right, and I think deserves study in its own right.
Andy Zou (1:05:33) I just wanted to mention what I saw in some of the suffixes. I think a lot of them correspond to some of these manual jailbreaks. For example, we also found some that would ask the model to output things in some programming language, which is one of the manual jailbreaks out there. And some of these would ask the model to say "sure" in the beginning, or include words like "introduction" or "tutorial", and that's also another manual jailbreak. So a lot of these recovered the manual jailbreaks, which was very interesting. And I think maybe that's also due to the discreteness of the problem; it seems it latches onto these discrete concepts.
Nathan Labenz (1:06:14) So could you imagine a slight modification to this where you also take into account the perplexity of those incremental tokens? If you changed your token search process to maybe go sequentially, and also weight the most plausible next token in some balance with the original goal of maximizing that output, maybe you end up having to do more steps, but you get to something readable a lot more often, I would guess. The fact that you're getting readable stuff at all suggests there's enough space there that you would still converge on something that works, I would think. Right?
Andy Zou (1:06:58) Yeah, I think that's possible. As you move back toward the distribution the model was trained on, I think the model probably has a better chance of recognizing that it's harmful, since most of the fine tuning data, or the RLHF they've done, is on actual human-written text. But yeah, that's certainly possible. Or you can even imagine parameterizing your attack with another language model, somehow backpropagating the information to that language model, and then that language model could output your adversarial text in natural language.
Zico Kolter (1:07:31) I would just add that we have done this. We have added fluency constraints on the adversarial suffix itself, and not unexpectedly, you get things that work a little bit worse at attacking while looking a little bit more fluent. And you can tweak the hyperparameter weighting these two things, fluency of the adversarial suffix versus decrease of the attack loss, for whatever trade-off you want. There seems to be a pretty big range you can play with here, so you can in fact have somewhat fluent suffixes that still attack things. But I agree with Andy that ultimately you have some Pareto curve here. Modulo the ability to optimize things and the challenges of optimization, in theory, if you just talk about the loss function itself, adding a fluency constraint will decrease the effectiveness of your attack a little bit, at the cost, or to the benefit potentially, of making your suffix seem more realistic. And it works fine right now. We didn't try the transfer there as much, just because it didn't work as well on the open source models, but we have definitely tried it and it does work okay.
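As a rough illustration of the trade-off being described, here is a sketch of an attack objective with a fluency penalty on the suffix. The weighting term `lam` and the exact formulation are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def fluency_regularized_loss(model, prompt_ids, suffix_ids, target_ids, lam=0.1):
    """Attack loss (NLL of the target continuation) plus a perplexity-style
    penalty on the suffix tokens themselves; lam=0 recovers the plain attack."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logits = model(ids).logits

    # Attack term: how unlikely is the affirmative target after prompt + suffix?
    t0 = prompt_ids.numel() + suffix_ids.numel()
    attack = F.cross_entropy(logits[0, t0 - 1 : t0 - 1 + target_ids.numel()],
                             target_ids, reduction="mean")

    # Fluency term: how unlikely is each suffix token given what precedes it?
    s0 = prompt_ids.numel()
    fluency = F.cross_entropy(logits[0, s0 - 1 : s0 - 1 + suffix_ids.numel()],
                              suffix_ids, reduction="mean")

    return attack + lam * fluency  # larger lam -> more readable, weaker attack
```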
Nathan Labenz (1:08:40) Yeah, interesting. So it seems a big theme of this work in general is that everything has this Pareto curve structure, and there is a tax to pay on your primary optimization objective whenever you introduce a secondary optimization objective. And unfortunately, that is reality as we know it right now. For anyone asking, okay, why can't you defend? For one thing, it's just going to be a tax on your operation: now you're going to have to do this extra optimization process in the loop and more training, and on top of that, your performance suffers somewhat, at least. And then in almost a mirror way on the attack side: yeah, we could make the attacks more fluent, but then it probably doesn't work quite as well, and we maybe have to work harder to find it. That's also a really interesting segue to the next paper: if that's the tax, this is the rebate, or the subsidy on performance, because you're able to bring some structure to the table that accelerates you toward good performance. So you could say, hey, if we can bring more techniques like the one in the second paper, then there's still the tax, but maybe for the same budget, if we can start in a smarter way with some better structural defaults, we can get to the same performance. I'm maybe stretching to put these two concepts together, but I do see them as opposite sides of the same coin. Before we go to the next paper, which I am also super interested in, just tell me a little bit about your process here, because this is research that obviously made it to the New York Times, so a lot of people read about it. Very headline-friendly result. But also, people are using these models in the wild, and there are companies operating them. So how did you think about the process, and what process did you go through, to communicate these results and give people a chance to respond or fix them beforehand? Academic publishing is one thing, but the New York Times is obviously something else. So I'd love to hear how you guys thought about that part of this whole project.
Andy Zou (1:10:56) We were just initially doing this research, and it was definitely a surprise to me that this stuff would transfer to black box models, which I think could, in the future, cause a lot of problems if not properly fixed. So when we had these results, we wanted to go about it in the most reasonable way, so that first we wouldn't cause any harm in the short term, but at the same time we wanted to raise awareness that these exploits do exist in the current models that are deployed online and that people use every day, and to alert the community that this is a problem, especially if you extrapolate down the line, when you have more autonomous agents going around and everyone's using them. If you have these exploits then, I think the harms are exponentially larger than with chatbots. So we went and contacted all of the companies to disclose our attack and what was happening, and got some responses. And then we wanted to put this work out there, so we released the paper and the code for people to start on this problem, as a baseline. And we do think that the earlier people try to work on this problem, and the earlier we raise awareness, the better. Because as time goes on, as the capabilities of these models increase and as the autonomy people give them increases, there will be much larger risks associated with it.
Zico Kolter (1:12:37) I would say two things on this notion of disclosure and the nature of this research. The first is that, as much as these attacks currently do in some sense allow people to violate the intended behavior of some of these chatbots, let's also be totally honest here: the real harm we can create with a chatbot is not much. If a chatbot insults me, it can hurt my feelings, but it's not going to cause major harm. Even if a chatbot gives you directions on how to build a bomb, you can find that on the Internet. In fact, the whole reason these things know how to build a bomb is because they read it on the Internet already. So let's take a step back here. Right now, with the capabilities of these models and how they're used in the most common interface, which is these chatbots, it's not that big a deal to be able to do this if you're trying to do it. You could imagine injecting it into other people's conversations, but that's not what we're doing. You were asking it something bad, you put this string in, and it tells you the bad thing: you were asking for it and you were trying to break it. The direct harm that can be caused in this most common mode right now is very small, I think. But I really want to emphasize the point that Andy made right there, which is that every single startup right now, okay, not every single one, but a whole lot of them, are rushing to put LLMs inside of APIs that pull in information from the web and then take autonomous actions based upon it. And this to me is genuinely concerning, honestly speaking, because I don't think the people building those APIs and autonomous agents, and to be clear I mean autonomous agents in a very limited sense, answering emails automatically and things like that, really know about these attacks. I don't think they're always aware of them. I don't think they're really thinking through, in many cases, the implications of having systems that can be arbitrarily controlled by text injected into public information. This is wild. These are completely unscoped, weird things that can do whatever they want, and we're putting them into simple API endpoints that people can call. And I find that genuinely a bit concerning. So, and I'm just echoing Andy here, I hope that making people more aware of this reality, and it is a reality of these systems, a fact of these systems, can be a bit of a forcing function to change how we approach their usage. I think we are still a ways away from autonomous operation of these things. I think there should be humans in the loop. I think we should be verifying and checking the output of these things. They can be incredibly useful tools, I do believe that, but they are still tools right now, tools that we can use and where humans oftentimes can and should be in the loop. And so I hope this brings a little bit of that perspective into all of this.
Nathan Labenz (1:15:41) Yeah. I honestly think even in the short term... I posted something the other day on Twitter, and I'm very interested in this disclosure stuff too, because I do a lot of my own random red teaming. Anytime I see a new product, I test it, and I also red team it as I test it, just as a first pass: hey, will it refuse? You'd be amazed how often it doesn't. I just did a ransom call out of this product called Belva, and I went through a similar disclosure thought process where I was like, I don't really want to popularize this, but they weren't really responding to me in private. I'd be interested to hear what responses you got, if you can say more about that. But these guys weren't responding to me at all, so I was finally like, I'm just going to have to shame them in public. And so I put this recording out there where this voice called me, said it had my child, and demanded ransom of me in a real-time interaction. No jailbreak required for that, by the way, at the time. They've since improved that quite a bit. I think they probably just put a classifier in front of it or something along those lines, nothing too fancy; I showed in my thread how you could use Claude Instant to just filter this, and a bunch of other things, for fractions of a cent. And sure enough, they did turn around and fix it. But in that case, it was so new, and it felt like no other businesses are dependent on this right now, so it wasn't going to be a big disruption if I did this. All that is just to say, even today you can make a pretty damn good ransom call, and we're starting to see those systems ramping up pretty effectively from a usability standpoint, with ElevenLabs and PlayHT just in the last week or two both having dropped really good voice cloning, and at least one of the two also now doing streaming. So you can patch your language model output directly into their thing, and now you've got double streaming, where it's streaming the tokens and streaming the voice out in a very real-time way. I'm with you, by the way, in terms of human in the loop. On a totally separate track, I do a decent amount of AI implementation advising in different contexts, and almost always the practical wins today are: do the first pass, save me 80% of the time, and then have a human take it from there. I do think that's still mostly where a smart user would want to be, making sure the output is worth anything to them before proceeding, nine times out of ten. But anyway, the short-term harms in and of themselves are not insignificant. So can you tell me more about what you heard back from the companies? Were they like, give us some time to fix this, or just, hey, sorry, it's crazy out here, do what you want to do? I could imagine anything.
Zico Kolter (1:18:36) The responses were, I would say, cordial and very nice. We were emailing researchers we know there, and they were happy to interact with us and be in touch with us. None of them patched it before the release. Eventually, some of the exact strings that we released were patched; you can no longer enter them, so the ones that are public no longer work. And without being too specific here, I think some other providers, again not before we released it, but after we released it, have since made much more aggressive filtering changes, where they seem to be filtering out queries much more aggressively than they used to, to the point where it doesn't work nearly as well as it used to, even in the few weeks since we've done this. We definitely see that. To be fair to the companies, though, the response has been largely quite positive when we've talked with them. They are interested in the knowledge here, in understanding how these things work. They have their own internal red teaming efforts in many cases. And I think, in some sense, this is one approach to red teaming that, for whatever reason, a lot of the companies did not consider when they were first doing red teaming. I think a lot of the approaches they considered were very manual in nature, and this highlights a very different approach that they are, for the most part, interested in both intellectually and practically. I do believe that, for the most part, these companies want to engage in red teaming and see some value in this. The use cases I am most concerned with are precisely those where they do things like query external data, which are not the primary products for some of these companies, so it might not be as much of an issue for them as it is for startups designing more open-world autonomous systems. So the response was cordial and interested and, to be honest, has been largely positive. Some people from the companies have been filing issues and committing to our code repo and things like that. So they have been engaged with this, and they have not tried to either sweep it under the rug or sue us or anything like that. That definitely has not happened. I do think that they have different incentive structures, and that there is a definite need for some of these efforts to happen outside of just these companies. There has been, so far, I think, a bit too much emphasis on internal measures companies can take, as opposed to interacting with the broader research community. But I think they also appreciate this. They appreciate that a lot of their efforts are internally focused and that they can stand to benefit from external validation and external testing of their systems. So for the most part, I think they've been quite open to the work we've done, and in some cases have even helped contribute to it. We haven't had any particularly adversarial interactions so far, I'll just say.
Andy Zou (1:21:20) Actually, maybe just a quick word echoing what you were saying earlier about short-term harm. I do think there are some potential harms that could be created with these exploits and current systems. But we're not in the best state currently: we're deploying these systems even with these vulnerabilities. The best we can do is make people aware of this problem early on, so that we don't run into the situation later on where there's exponential harm and you can't really do anything about it because it's too late. So it's a trade-off we have now, and we think the benefits of people knowing about these problems outweigh the potential harms in the short term.
Zico Kolter (1:22:08) Yeah, I would echo that. Maybe the comments I made earlier downplayed a bit too much the things people can currently do. I don't think it's great if you can get an interactive agent that talks with you while trying to spread misinformation about vaccines or something like that. That's not a good thing. And you can already imagine right now the potentially bad use cases, but I think those stem exactly from the deployment of these things in a more autonomous setting. So just using them as a chatbot, I think, is an issue, but you're obviously trying to break it. It's really when people start integrating these breaks into external-facing systems that I think even the short-term harms can be quite bad. So it's less about the timing of it, because I think things can be done that are quite bad already; it's more the mode in which people are thinking about deploying these things that concerns me the most. Why not just train these systems without any bad information? Why do we have to train them on a raw dump of the internet that knows how to build a bomb to begin with? I think there are two key points here. To a certain extent, we need to take this idea much more seriously than we do right now. We train these language models on a mound of uncurated, raw data from the internet. That's how they're built. I know it's not the whole internet, but it's a lot of data from the internet, and we don't do much filtering or real curation on that data. I'm sure there is some, but there doesn't seem to be a lot of curation or filtering on the data used to build these things in the first place. And then we build, let's just say, these somewhat horribly toxic models, and we paper over them by saying: don't do any of that bad stuff, all those bad things you learned, don't do that. Part of me just says, why? Why do we do it this way? We know how to filter content, at least when the content is not actively trying to fool the system. Why not try our best to filter out data much more aggressively when we are training a system? And I know there are a lot of arguments against this. There's the argument that we, as people, can be exposed to toxic content without regurgitating it on command; somehow we have that ability, and maybe you want to train models to have similar capabilities. There's also the argument that models will know not just the raw data they have, but the closure of the data in some sense, and the extent to which they can really form these closures is debatable, but they can at least maybe make some steps of reasoning about their information. If you give them enough knowledge, they'll be able to figure out bad things, is the theory, even if those things are not in the data. And I think that's true to an extent, but we're not there yet. There's no way these systems would have the capabilities they currently have to emit toxic and harmful content if they were not trained on it in the first place. It's a possibility to think about much more seriously. Rather than building the Shoggoth, right, the massive tentacled thing that knows everything, including lots of bad stuff, and then papering over it with a nice little bit of fine tuning, why not take more care about building our models to begin with? There are lots of possible reasons; I think ease is the simplest, but I don't think that's a good reason.
I think we should be willing to do harder things when they work better. I don't know what the right answer is here, but we need to understand that these models are not us. They aren't at the point where they can be exposed to all those types of content and then not reproduce it when given the right target string. They will reproduce it. And so this necessitates, I think, looking at different approaches to training these things: maybe training them with less data, maybe training them with a much more carefully curated set of data that captures the things we want them to know, without being as broad and as careless in terms of assembling the text we give to them in the first place. I don't have the answers here, but I think it is a discussion that is at least worth having as a broader perspective on how we build these models to begin with.
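A toy sketch of the kind of pretraining-data curation being argued for here: run a harmfulness classifier over documents before they ever enter the training corpus. The `classifier` callable and threshold are placeholders for illustration, not a recommendation of any specific filter.

```python
def filter_pretraining_corpus(documents, classifier, threshold=0.5):
    """Keep only documents the classifier scores below `threshold` for harm.
    `classifier(doc)` is any callable returning a probability in [0, 1]."""
    kept, dropped = [], 0
    for doc in documents:
        if classifier(doc) >= threshold:
            dropped += 1        # drop flagged documents before pretraining
        else:
            kept.append(doc)
    print(f"kept {len(kept)} documents, dropped {dropped}")
    return kept
```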
Nathan Labenz (1:26:20) Yeah. The whole field is the dog that caught the car a little bit. This has all gone very quickly and surprisingly far, and I think there are probably quite a few things where, now that it's working this well, we should really take another look at how we just dump everything into it. In general, I'm pretty optimistic about curriculum learning being both an unlock and a safety strategy. Just from simple observations, OpenAI has finally disclosed that their instruct stuff worked a lot better on a code model base. It seems there are certain logical underpinnings there, somehow, that they're able to train first and then layer on much fuzzier concepts later. You could certainly imagine a safety-aware version of that, which I suspect would do a lot of good. As you said, maybe at some point the generalization is so crazy that it fills in all the gaps in the moral periodic table and knows how to do all the bad stuff even without exposure to it. But it does seem like that would be pretty helpful. Who are you trying to influence most?
Zico Kolter (1:27:39) People that have a passing understanding of GPT and these large language models, who see the capabilities and make undue inferences from them. There's a whole class of people, I think, that experience ChatGPT and are rightly, amazingly impressed by it, because it is super impressive, it's just incredible, and you play around with it. And then the wheels get turning in people's heads: okay, what if I do this? What if I use it here? What if I use it there? The people I want to influence are that crowd. The crowd that's excited, maybe for some good reasons, some bad reasons, and everything in between. There's a whole group of people, I think, who are experiencing these models for the first time in the context of these LLMs, and they don't have the broader context of deep learning or machine learning in general, and they don't appreciate, in some sense, the differences between these models and, for lack of a better terminology, how a person would interact with these things. They see them as magic. And this is bad in my view; it's not ideal. These models are not people. They do not work like that. Orthogonal to any questions of intelligence or reasoning or anything like that, they are clearly not reasoning the way we're reasoning. They have very distinct and different failure modes, and adversarial attacks are one example of a very clear failure mode, with no analog in humans, that all these models seem susceptible to. And I think, in some sense, just imbuing that in everyone's mind, getting everyone to understand, look at this example here, this is nothing like how you process information, this is something different, getting that to be second nature to people, and I know getting it to really be second nature is likely a lofty goal because people see what they want to see, but those are the people I would most like to influence. To a similar extent, people at the companies building these things, when they think about the features they should release in these models, as well as policy organizations and government, and people making regulations about these things. All of these are potential target audiences that, at the very least, need to be highly aware of these issues. And you can come to whatever conclusions you want; maybe you still feel your API and your autonomous agent are safe and should be deployed even with adversarial attacks. But you should be aware of all these security flaws, which in some sense is what they are, prior to doing that.
Nathan Labenz (1:30:23) That's where I'm at too. It's really compelling. Certainly, it highlights the alienness in a really effective way. The fact is, there's nothing you could say to me to get me to do something like this. Certainly, these gibberish tokens aren't going to have any analogous effect on me, and that, I think, is super compelling.
Zico Kolter (1:30:40) Yeah, exactly. And there are two aspects, right? One is that the gibberish tokens make it do what you say. That's weird. And secondly, they are willing to say anything that they know, including regurgitating toxic stuff. That's not how you and I work. I can be exposed to harmful content and still be unwilling to say it myself. And it seems these models are not capable of that in some very fundamental way, or at least I don't know that they're capable of ever reaching that point if they are exposed to it in the first place.
Nathan Labenz (1:31:11) I think this has been great. I really appreciate all the discussion on this topic. I think the work is super interesting. From a fundamental research standpoint, I've also learned something about how to think about framing a loss function across all these different dimensions at once, and I think that's probably going to be new for a lot of people listening to this. So again, this work works on a lot of levels, from the most detailed, definitional level all the way up to the highly sociological. Really appreciate you guys taking the time to talk about it. Asher, I don't mean to give you short shrift by any means, but I imagine when one paper appears in the New York Times, you're probably expecting that one to get a little more of the spotlight. But I think your work is also really super interesting. I'll give you my two-sentence understanding of it, and then you can tell me the story of what motivated it and how you pursued it. But I want to start with the high level, because we've been talking about this tax, right? There's the tax of the extra work that you have to do if you want to try to handle these attacks, and there's the performance loss that comes with it. And then on the flip side, you have this really interesting work where you show that if you start the weights of a network in a way that is, let's say, inspired by the patterns that you see in the wild, this can help you get to similar performance much faster than if you just started with some random initialization or other techniques. So tell me the whole story of how you got interested in doing this and what you found.
Asher Trockman (1:32:49) Yeah. I can talk about how I landed on this idea, and also the implications of the fact that this more or less works. One thing to keep in mind is that I've primarily done research on vision, and you were just talking about language. So it's important to note that the technique in the paper primarily works for vision models, though we have a bit of evidence that it works for language models as well. Anyway, I think the connection to robustness is really interesting, and I didn't make it myself. It does seem data curation is the way to go these days, and I'm in fact working on a similar thread, on language instead of vision, at my current internship. The short story is that I don't think people do lots of data curation because it's very hard. So how did I get to this idea? There's a common thread, I would say, connecting my last few papers, but there are two main dependencies for this idea. First of all, I was previously working, also at an internship, on visual anomaly detection. This involves feeding an image to a network, getting an embedding, and then doing some clustering algorithm for anomaly detection on top of those embeddings. The company didn't really want to pre-train on data widely available on the Internet, maybe for copyright reasons or just general concerns about data quality; they wanted to use their own internal data. My question came to be: if you want to train a model that's good at producing embeddings for visual anomaly detection, how much do you really have to train it? Surely the training pipeline is different than for classification. And as it turned out, you can do anomaly detection using embeddings from practically untrained models. This is not as good as using trained models, of course, but it's quite close. And if you train for just a couple of epochs rather than the usual several hundred, this works almost as well as using a large-scale pretrained model, at least for the particular task we were looking at. So it seemed there's something to be said for the ability of models to compress images into a useful form that is intrinsic to the structure of the network, without training. One thread here was: how useful can we make a network without training? Are there any tricks we could do to modify the architecture or the weights so that we can produce useful image embeddings without training at all? And that sounds crazy, building a neural network by hand, but I thought it sounded really fun. So I worked on that for a while. But there's another thread that I think was extremely important for landing on the mimetic initialization idea. A couple of years ago, we proposed a very simple convolutional architecture called ConvMixer, and it became pretty popular on Twitter because the architecture is so absurdly simple: you can fit the PyTorch code defining it in a tweet. If you feed this model the right data and use the right regularizers, augmentations, and so on, it performs just about as well as any other vision model despite the simplicity. And for me, the simplicity was such that I could intuitively grok the mechanism of the model. You have a hierarchy of filters that are interpretable; I could see that the filters learned by ConvMixers make a bit of sense. They're oriented edge detectors, and then you're doing something like a logistic regression on the responses of these filters at every layer.
This made a lot of sense to me, and I started to wonder if I could build a network by hand that may not be the best for classification but could at least be good for anomaly detection, which I think is somewhat of an easier task. So the idea for mimetic initialization came, ultimately, from looking at the weights of these pretrained ConvMixers, I would say, and Zico suggested that I do the same thing for transformers. That's how I got there. Really briefly, the finding for convolutional networks was that it's possible to define a distribution completely by hand, with quite a small formula specifying the covariance matrices of a class of multivariate Gaussian distributions. This just means that you sample filters from this distribution at initialization instead of sampling them uniformly at random, so there's now structure that reflects this oriented edge detector phenomenon we see in pretrained networks. Surprisingly enough, this really cuts down on required training time, for example for the newer vision-transformer-style convolutional networks like ConvNeXt. And this seems to be robust across datasets, and it even works better as you make models deeper. Trying to find that kind of structure in transformers is much more challenging. I could talk about how I converged on the particular structure to pay attention to in transformers, but let me know if you have any other questions maybe before then.
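For intuition, here is a sketch of sampling convolutional filters from a hand-specified Gaussian whose covariance decays with spatial distance, so the sampled filters come out smooth and center-correlated rather than white noise. The RBF-style covariance here is an illustrative stand-in; the actual covariance structure in the paper is specified differently.

```python
import numpy as np

def sample_structured_filters(n_filters, k, length_scale=1.5, seed=0):
    """Draw k x k filters from N(0, C), where C correlates spatially nearby taps."""
    rng = np.random.default_rng(seed)
    coords = np.stack(np.meshgrid(np.arange(k), np.arange(k), indexing="ij"),
                      axis=-1).reshape(-1, 2)                  # (k*k, 2) tap positions
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    cov = np.exp(-d2 / (2 * length_scale ** 2))                # nearby taps correlate
    filters = rng.multivariate_normal(np.zeros(k * k), cov, size=n_filters)
    return filters.reshape(n_filters, k, k)   # use in place of uniform random init
```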
Nathan Labenz (1:37:38) Yeah. I think what's jumping out at me about this is that it's often remarked that humans get good at tasks with a lot less data than current frontier models, to say the least, and a lot less data than AI models in general. And there are two things you're describing here. One is architectural: a lot of layers, which sounds a lot like what I understand the human visual system to be doing. My rough understanding of the human visual system is that it works through a series of layers that gradually detect higher and higher order concepts as you move through them, ultimately getting to the front of the brain, where the real conceptual stuff resides. And you're saying, yeah, there appears to be a somewhat similar structure that emerges here with training. I don't know if you were inspired by the biology, but the way you're telling the story makes me think that probably part of the reason humans can learn a lot of this stuff with fewer examples is that we have some wiring in place before we're even exposed to data, which is much more predisposed to learn what it needs to learn. And so you're now porting that concept back to the AI and saying, let me identify some average or commonality in the structures that we see emerging during training, and start with those in the beginning, so that I don't have to relearn everything from scratch every time. That allows you to get comparable performance, with obviously different ratios depending on the exact setup, but you can get to comparable performance with smaller datasets, which in turn also means less compute required.
Asher Trockman (1:39:39) Yeah, absolutely. I don't know if it's accurate to say that I take a lot of inspiration from biology. I honestly don't know very much about it, but you could definitely sum up a bit of my inspiration as the fact that we are clearly born with a better-than-random initialization. I think that holds true for most animals. And I saw in various early deep learning papers these sorts of vague biological analogies, and then the authors honed in on some concrete contribution to deep learning. It might not be very formal, but I really like that.
Nathan Labenz (1:40:11) I aim for not super formal, but also hopefully still literal. My goal, in general, in communicating this stuff is to be as plain spoken as possible while not being wrong about anything important. So if I'm crossing that line, you let me know.
Asher Trockman (1:40:31) Oh, no, it sounds good to me. And in fact, you were talking about trying to read some of the formalism in these papers. I will say the math for mimetic initialization is really simple. But even still, I'm not coming at this from a mathematical perspective myself. I literally looked at the weights of a bunch of pretrained transformers and used my own pattern recognition abilities to distill out the important parts. And that, to me, is of greater importance than the particular mathematical details. In fact, the exact algorithm we show in the paper is not that important to the method. You can make the product of the query and key weights look vaguely like the identity in a variety of ways, the simplest being to simply set the queries and keys equal to each other after initializing them randomly. And that applies for the rest of the technique as well.
Nathan Labenz (1:41:20) Can you tell me a little bit more about that? Because this is definitely one where people should look at some of the figures in the paper. It's pretty clear when you plot a bunch of these graphs that, oh hey, I can see what's happening there: there's a line that goes diagonally from the top left to the bottom right. Even if I knew nothing about what's going on and was just shown this image and asked, what do you see here, a lot of people would point to that diagonal. It's so obvious and striking, at least in the ones you've shown, that it would be the number one thing almost anybody would say. What's a little less clear to me is what exactly I'm looking at in that image. I know what's jumping out at me, but exactly what am I looking at? And then how are you working backward from that to this notion of setting the product of the queries and the keys? Give me a little bit more detail: okay, I see that striking thing, but what am I backing that out to, in terms of the weights to start off with?
Asher Trockman (1:42:22) There's almost nothing to the actual algorithm. Every attention head in a transformer has corresponding query and key weights, and also value and projection weights. The main trick in the initialization is, at a high level, to set the product of the query and key weights to be the identity, and the product of the value and projection weights to be the negative identity. There's a third component that's quite important, which is to use predetermined position embeddings. Position embeddings are used to encode locality in both language and vision transformers, but in vision transformers they're typically learned from scratch from initialization. For our initialization, it works much better if you set the position embeddings to be, say, sinusoidal position embeddings. This means that pixels attend to their close neighbors at the time of initialization. As for how you set the query-key product to be identity-like: in the first case, you can just set the query and key weights equal to each other. And for the value and projection weights, you set them equal to the opposite of each other: the value weights should be the negative of the projection weights. There is a small constant factor there that you have to pay attention to, otherwise you'll have some optimization difficulties, but the technique is dead simple. Most libraries come with position embeddings already available; it's just your choice whether to start from a random initialization or use the predefined ones. The implementation is only maybe two or three lines of code if you want to condense it. If you expand out the term inside the softmax, which involves the product of the query and key weights, and you assume that this product is close to the identity, what you see is that the dominant term is the outer product of the position embeddings with themselves. This results in an attention map that more or less mixes nearby pixels, very much like a convolutional filter, I would say. I was initially coming at this project from the perspective of making self attention behave as much like convolution as possible, but not in a contrived way, rather in a way that a network could plausibly learn from scratch. So from that perspective, I would say that the product of the query and key weights being close to the identity makes a lot of intuitive sense. There's a graphic in the paper in which you can see that by initializing the queries and keys this way, we make the attention maps, which I would say represent the actual operation of the self attention layers, look a lot more like those of trained networks. As you train vision transformers, the attention maps get, I would say, sharper, more identity-like, vaguely like some convolution. Our initialization mimics that effect without the need for training. But that only accounts for two of the components: the query and key weights and the position embeddings. The third component is quite crucial; it's responsible for a lot of the performance gains that we see according to our ablations. I'm referring to the part where we set the value and projection weight product to be close to the negative identity. And I'm not sure that I have a great intuition for this. I did speculate about it, but I'm still not quite sure why it is so important.
One thought is that if you want to make a self attention layer behave vaguely like convolution, you can only get so far because of the softmax, because the softmax means that the attention map is purely positive. So you can't represent things like edge detectors, which require a positive side and a negative side. That very much limits the expressivity of self attention; this is a big difference in my mind between self attention and convolution, if we're just thinking about the attention maps, this positivity. My intuition is that perhaps by setting the value and projection weights to be close to the negative identity, we're essentially subtracting a similar attention map from the original one. It's not quite the same, it's modified a little bit, such that the final effective attention map does have a positive and a negative component, which allows us to represent these sorts of edge detectors that are so important for vision models. And I have a few supporting experiments, though not that many. For example, if instead of initializing a vision transformer with our technique, you directly inject even just one convolutional filter per layer, or even just one for the whole network, so there's a bit of convolutional inductive bias at initialization time, you do way better than a vanilla transformer. And I think it's quite well known that convolution is a very strong inductive bias and naturally better on smaller datasets, so adding this to vision transformers is advantageous. It makes sense that initializing a vision transformer to behave this way would have a similar effect. As for how I came up with the technique, that's a slightly different story, and it goes back to the paper I mentioned on convolutional networks. In that paper, we proposed a way to initialize convolutional filters to look more like trained ones, and it works very well. So my thought when initializing vision transformers was that we might want to make them behave more like convolutional networks. I was trying to come up with various complex ways to encode a set of convolutional filters in the position embeddings, with the query and key weights selecting those from the position embeddings, so that the resulting attention map was vaguely convolution-like. This was essentially a mess of a technique, and while it did work, I thought it would be a good idea to try it against the simplest possible baseline that did vaguely the same thing, which is the method we presented in the paper. And funnily enough, the stupid baseline did massively better than my initial, very complicated technique for making self attention look like convolution. There was previous work on initializing self attention to look like convolution, but I found it pretty contrived, in that if you want to represent a k by k convolutional filter, you need k squared heads, which people don't really do in practice, and each head just attends to an individual pixel. So it's a brute force technique, whereas I was going for more of a vague similarity to convolution, not exactly convolution.
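Here is a minimal sketch of the simplest variant Asher describes, applied to a PyTorch `nn.MultiheadAttention` module: after the usual random init, copy the query weights into the keys (so the query-key product is identity-like) and negate the value weights into the output projection (so that product is negative-identity-like). The per-head scaling constants and the fixed sinusoidal position embeddings the paper also uses are omitted here; this is a sketch, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mimetic_attention_init(attn: nn.MultiheadAttention):
    """Make W_q W_k^T identity-like and the value/output-projection product
    negative-identity-like by tying weights after random initialization."""
    d = attn.embed_dim
    # in_proj_weight stacks query, key, value weights as a (3d, d) matrix.
    wq = attn.in_proj_weight[:d]
    attn.in_proj_weight[d:2 * d].copy_(wq)      # keys <- queries  =>  W_q W_k^T is PSD
    wv = attn.in_proj_weight[2 * d:]
    attn.out_proj.weight.copy_(-wv)             # output projection <- -values

# Usage: apply to every attention block of a vision transformer before training.
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)
mimetic_attention_init(attn)
```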
Zico Kolter (1:48:45) One of the responses we've gotten to some of this work is: it looks cool, but why don't you just pre-train instead? To be clear, in some sense, the initialization Asher has proposed here is an alternative, a training-free method of initializing weights for transformer networks. From an intuitive standpoint, I think this is really interesting to think about, precisely because, practically speaking, especially for small problems right now, transformers just don't work. Transformers work in the setting where you have a lot of data. To get a transformer to work on small problems, even problems like CIFAR, it doesn't work, to a first approximation. And you can do all sorts of things to help, like these convolutional training mechanisms that nudge the process along, or whatever. But what does work is if you just use a pre-trained vision transformer, trained on really large-scale ImageNet or something even bigger, and then apply those weights to a CIFAR network. That works okay. You could always just start with more data, but for a number of reasons, and this touches on what we discussed earlier, that may not be a good idea. Maybe you don't want to include a whole mass of data unrelated to your problem just to get transformers to work a bit better. Maybe you don't want to start with a fully trained model and just fine tune it to your task if you don't need to, if all the information is contained in your task, just to get a better structure or some better mechanism behind these things. It's a totally fair question to ask: what is pre-training doing? Why is pre-training so good? Of course, pre-training is good because, given all this data, the model has learned about the world ahead of time. But there seems to be a sense, and Asher puts it exactly this way, that pre-training does two things. Pre-training fits the pre-training data and learns about that data, but it also seems to just guide the weights towards a more reasonable space. And so I think a guiding question behind this work is: can you do the second one, start with more reasonable weights, without having to pre-train at all? That could be good for time reasons, for reasons of data leakage, for legality reasons, all sorts of things. And it's just really interesting to think about what you are learning from pre-training. Is it all about knowing ImageNet really well? I don't think so. I think it's also about constructing a reasonable space of weights to begin with. And that, I think, is what this paper shows.
Asher Trockman (1:51:22) Yeah, definitely. I'll reiterate: what we were hypothesizing is that pre-training has two components, storing transferable knowledge and also serving as a good initialization. And as we were talking about in the last session, you have to wonder why we don't just strip out this harmful data instead of masking it after training on it. It could be that this harmful data is somehow formatted in a way that is useful to the final performance; maybe it even contributes to the good-initialization component just because of its particular statistical properties. So it would be very convenient if we could cut out the bad data and, to make up for it, just initialize the network in a somewhat more sane way.
Nathan Labenz (1:52:04) To try to draw the connection from the convolutional side to the transformer side: the convolutional mechanism defines a window of concern, right? I think a lot of folks who are picking up a strong interest in AI right now maybe came in after convolutional mechanisms had faded from focus. So in short, it defines a periphery of concern. If you're in an image context, I'm going to look at, say, a 5x5 grid of pixels around each pixel and stamp through the image, convolving literally, trying to work up a layer of meaning from this local patch of the image. And in doing that, you create something that has this hierarchy of meaning. There are problems with that, in that it's hard to parallelize. The transformer comes in with a much more parallelizable architecture, and when you say you're imitating convolution, the attention map is essentially, if I understand correctly, the identity or approximately the identity. If it were purely the identity, then each pixel would only look at itself, which would be useless. But almost-the-identity acts like a convolutional mechanism, in that each pixel is looking at itself, but also at its near neighbors. And in that way, you can set things up as if it had a more convolutional structure.
Asher Trockman (1:53:39) It is funny, because I mentioned that one of my inspirations for this work started back with our ConvMixer architecture and its simplicity. And keep in mind, this is a convolutional network. But ConvMixer is itself inspired by vision transformers, in that we used very large kernel convolutions, as large as was viable to train. This was in order to imitate the fact that self-attention layers have a global receptive field: a given pixel can pay attention to any other pixel in the image. So in order to make a convolutional network that looks as much like a vision transformer as possible, we made the kernel size as large as possible. And it's funny, I suppose, that I got to the first initialization technique for convolutional filters from the fact that these larger filters have much more interpretable structure than the smaller ones. If you use the traditional 3x3 filters, there's hardly any statistical pattern to pick up on. With, say, 9x9 filters, there's lots of structure: you can make lots of pretty clear observations about filters that look like edge detectors, or have most of the weight in the center, or look like crosses, or something like that. And so it's interesting that we got to perhaps improving the initialization of vision transformers by first going back to convolutional networks, even though they were at the time going out of style.
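As a concrete example of the kind of structure being described, here is a small sketch that initializes large depthwise convolution filters as a center-weighted Gaussian bump plus noise, mirroring the "most of the weight in the center" pattern seen in trained large-kernel filters. PyTorch is assumed, the parameters are arbitrary, and the actual mimetic initialization recipe is more elaborate than this.

```python
import torch

def center_heavy_filters(channels: int, kernel_size: int = 9, sigma: float = 1.5, noise: float = 0.05):
    # Build a normalized Gaussian bump so each filter mostly averages a local neighborhood,
    # with most of the weight concentrated at the center of the kernel.
    coords = torch.arange(kernel_size).float() - (kernel_size - 1) / 2
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    bump = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    bump = bump / bump.sum()
    # Depthwise-style filters: one (1, k, k) filter per channel, plus noise to break symmetry.
    filters = bump.expand(channels, 1, kernel_size, kernel_size).clone()
    filters += noise * torch.randn_like(filters)
    return filters

w = center_heavy_filters(channels=256)
print(w.shape)  # torch.Size([256, 1, 9, 9])
```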
Nathan Labenz (1:55:04) Trying to extend this: it's fascinating unto itself that you can literally, with the naked eye so to speak, look at a pattern and be like, hey, that looks like a diagonal, I wonder if we could initialize with a diagonal and save ourselves a bunch of time. It also, in some sense, makes me think, man, there's probably a lot more where that came from. What directions are you thinking of taking it next? I could imagine, you mentioned inserting a convolutional layer in between layers of a transformer; that's really interesting. I think this question of what the successor to the transformer is, and how soon we're going to see it, is one I'm asking myself a lot these days. But there are also probably lots of other motifs, right?
Asher Trockman (1:55:46) I think it would definitely be possible to replicate more of the structures that we see in pretrained networks, though they get a lot less concrete and harder to describe. You mentioned these quilt patterns in some of the query-key products, and I noticed that as well, but I don't really know how to interpret that or what to do with it. I'm sure that much more effort could be put into finding better initializations. The two that I found seem like the very simplest functional cases, and I'm sure there's more out there. For example, we have no idea how to initialize the MLPs in a transformer or a convolutional network. There is a bit of structure in some of these; you can see the same diagonal pattern, but initializing them that way, at least as I did for the query and key weights, for example, doesn't seem to really help. So I'm not sure what's going on there. But there are plenty of ways, I think, that we could build our own intuitions into models at initialization time. For example, take transformers trained to generate code: code is a formal language whose structure we know before training, and yet the model has to learn to represent, say, the grammar of Python through tons of training, when we knew it at the beginning. I believe there's work showing that transformers really do learn to represent, say, context-free grammars when trained on these sorts of formal languages in the right way. But it takes an enormous amount of compute, I would say, to get to that point. We could have just built that in from the start and spent our training time on learning how to program instead of merely learning what a programming language is. That's the main thread I've been thinking about here. So this isn't really a change to the architecture, but just yet another initialization. As for architectural improvements, I'm really not sure I have any ideas here. I'm at a loss; I don't know that I'm qualified to say. It seems like the current trend is just to use more and better data within the same almost vanilla transformer architecture we've been using for a long time. So I wouldn't be surprised if the same architecture stays around at least for a few more years, and maybe there are various tricks we can use to train it better or to improve the generation process itself. But I don't think I can speculate very well on what architecture is next. It seems like architecture is not that important these days.
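For reference, the "same diagonal pattern" mentioned for the MLP weights might look something like the following diagonal-plus-noise initialization. This is only an illustration of the shape of the idea (which, as noted, did not obviously help there); the dimensions and constants are chosen arbitrarily.

```python
import torch

def diagonal_plus_noise(d_out: int, d_in: int, scale: float = 0.5, noise: float = 0.02) -> torch.Tensor:
    # Mostly-random weights with extra mass on the diagonal, so the layer starts out
    # close to a (partial) copy operation rather than a fully random mixing.
    W = noise * torch.randn(d_out, d_in)
    k = min(d_in, d_out)
    W[torch.arange(k), torch.arange(k)] += scale
    return W

W_up = diagonal_plus_noise(2048, 512)    # expansion layer of a transformer MLP block
W_down = diagonal_plus_noise(512, 2048)  # projection back down
```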
Nathan Labenz (1:58:00) Yeah. That's probably the trillion-dollar question at this point. No obvious answers on that one.
Zico Kolter (1:58:06) I think this is an awesome question, and I love this idea. I've been thinking for ages about what you should build into an architecture and what you should learn from data. This is the billion-dollar question of all of machine learning. It's the subject of the bitter lesson from Rich Sutton; it's the subject of all this discussion: what should be learned, and what should be built in. I used to be a big proponent of building in everything that we could. So yeah, you can make differentiable optimization problems, differentiable simulations, just build this all into the networks and then only learn the parts that matter. That approach, due to maybe slowness or other factors, just has not taken off, empirically speaking; it has not been dominant. And then Rich Sutton would give an argument about why this is the case for some fundamental reason. I do think that this notion of initialization, in some sense, is an interesting middle ground that has not been explored that much. People absolutely have explored initializations, we definitely have looked at initializations, but we have not thought that much about initializations as a way to encode the knowledge that we want to encode into a network a priori, and then let it go from there. This is really, I think, fertile ground for potential research directions, because we know that there exist forms of transformers that, by design, can parse certain grammars or can do addition and things like that. We know how to initialize them that way. Those aren't the ones we're advocating for in this paper, to be clear, but we know how to do that. And so I do really think that there's this interesting middle ground where we say: look, initialization, as boring as it traditionally seems, is one avenue we have to imbue existing architectures with the structure that we think we want in them. And I think that idea overall is a really fertile one that has largely gone unexplored. People think about initialization as maintaining the scale of activations throughout the network, maintaining variance, stuff like that. That's the predominant view. I think we could maybe go beyond that. There's a lot you can do with initialization, and I think we've only just started to really explore it.
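To spell out the contrast being drawn, here is a tiny PyTorch sketch of the two views: the standard variance-preserving style of initialization (Kaiming/Xavier-like scaling) versus an initialization that encodes structure into the weights, here just a near-identity map. The numbers are arbitrary and the snippet only illustrates the distinction, not any particular paper's scheme.

```python
import torch

d = 1024
x = torch.randn(4096, d)

# The predominant view: pick the scale of random weights so that activation
# magnitudes stay roughly constant from layer to layer.
W_var = torch.randn(d, d) * (2.0 / d) ** 0.5
print(torch.relu(x @ W_var.T).pow(2).mean())   # roughly 1, i.e. magnitude is preserved

# The alternative view: use initialization to encode structure, e.g. start the layer
# as an approximate copy operation (identity plus small noise) instead of a random mix.
W_struct = torch.eye(d) + 0.02 * torch.randn(d, d) / d ** 0.5
print((x @ W_struct.T - x).abs().mean())       # small: the layer begins as a near-identity map
```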
Nathan Labenz (2:00:08) Again, with this curriculum learning notion, I wonder... we did an episode with the authors of the TinyStories paper out of Microsoft.
Asher Trockman (2:00:15) Oh, yeah. I'm working in that group now. Small world, I suppose.
Nathan Labenz (2:00:19) No wonder all this is tying together. But what I'm thinking is, because these things are small... What was notable about their work was that they were able to get more sophisticated behaviors at small scale by really controlling the data, shrinking the vocabulary such that, I think the heuristic was, a third grader or a three-year-old or something should have the vocabulary for all of the content. And because they shrank the vocabulary so much, even at a smaller scale they were able to start to see more of these reasoning capabilities develop. Still pretty simple things, but things that GPT-2, with a hundred times as many parameters or whatever, still often couldn't do, and that they could get these small models to do. So I wonder: take that to a limit. Take some small model and do just pure logical generations, literally just p's and q's and nots and set symbols and things like that. Just a super reduced grammar, but very high logical content. Try to create small models and just push: how far can we get? And then maybe you'd have to do something else too, where you might think, geez, there's a lot of noise in these, maybe I need to train ten of those. I can imagine there might be some need to untangle certain symmetries. I've seen a bunch of papers where there's a process of, they often use the term aligning, but I think of it almost more as polarizing, a set of weights, such that you could then look at averages or differences between them and try to see if you can't identify some of these motifs that are responsible for these logical capabilities, because it does seem like you can get that down pretty small. I think it has emerged in these large models, and maybe you would disagree with this, but it seems to me it has emerged in large models a lot of the time because, with so much noise, it takes that long for it to emerge. But if you greatly reduce the noise, you might be able to get it to emerge sooner. And if you find other ways to cut through the noise, maybe you can identify more motifs that could then become a great shortcut in future projects. So now is when you can tell me that you either tried all that and it doesn't work, or you're working on it right now, or it sounds crazy, but that's where my head's going.
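To make the proposal concrete, here is a hypothetical sketch of the kind of "p's and q's" data generator being imagined: a tiny propositional grammar with a deliberately minimal vocabulary, where each example pairs a formula and a truth assignment with the label a small model would be trained to predict. This is purely illustrative and not taken from the guests' work.

```python
import random

SYMBOLS = ["p", "q", "r"]
OPS = ["and", "or"]

def sample_formula(depth: int = 2) -> str:
    # Recursively build a space-separated propositional formula over a tiny vocabulary.
    if depth == 0:
        atom = random.choice(SYMBOLS)
        return atom if random.random() < 0.7 else f"not {atom}"
    return f"( {sample_formula(depth - 1)} {random.choice(OPS)} {sample_formula(depth - 1)} )"

def evaluate(formula: str, assignment: dict) -> bool:
    # Substitute truth values token by token and lean on Python to evaluate the result.
    tokens = [str(assignment.get(t, t)) for t in formula.split()]
    return eval(" ".join(tokens))

# One training example: formula, truth assignment, and the label to predict.
assignment = {s: random.choice([True, False]) for s in SYMBOLS}
formula = sample_formula()
print(formula, assignment, "->", evaluate(formula, assignment))
```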
Asher Trockman (2:02:50) Yeah, those are great thoughts. That's not exactly what I'm working on right now, but I know there is some related work, at least. For example, I mentioned this LEGO paper, also out of the group at MSR, and that's one of the three papers I'm currently thinking of when I think of mimetic initialization. What they did there was more or less exactly what you said. There's this very small, strictly logical task that they tried to train both small and large transformers on, and they tried to see what factors contribute to being able to accomplish this task. And the finding was that if you train a transformer from scratch on this dataset, it doesn't work at all. If you use a pretrained transformer, it works quite well. And the difference between the two seems to be the presence of certain types of attention heads, association and induction heads. These are just merging nearby tokens or copying similar tokens, things like that. And you can try to replicate these structures in more or less untrained networks in order to achieve similar performance on this circumscribed logical task. The main difference with my work is that they added these structures to the transformer by training them to look like pretrained transformers. So if they wanted an induction or association head, they would explicitly regularize some heads in a pretraining procedure to be close to that operation. My thought was that you shouldn't have to train at all. You should be able to define these types of heads by hand. I haven't pushed my mimetic initialization work that far on language, but it seems very much like you can encode these things by hand and, without requiring any training, get the association and induction heads that would allow you to do these sorts of logical or arithmetic or what-have-you tasks.
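As a rough illustration of what "defining these types of heads by hand" could look like, here is a toy construction of a previous-token attention pattern built from one-hot positional vectors, with no training involved, loosely in the spirit of heads that "merge nearby tokens." It is a simplification for intuition, not the LEGO paper's construction or the exact heads described here.

```python
import torch

def previous_token_head(seq_len: int, scale: float = 10.0) -> torch.Tensor:
    # One-hot positional vectors: q_i is the one-hot for position i-1, k_j for position j,
    # so the query-key dot product is large exactly when j == i - 1.
    shift = torch.roll(torch.eye(seq_len), shifts=-1, dims=1)
    shift[0] = 0                       # the first token has no predecessor
    Q = shift                          # q_i = e_{i-1}
    K = torch.eye(seq_len)             # k_j = e_j
    scores = scale * Q @ K.T
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    return scores.softmax(dim=-1)      # each row attends almost entirely to the previous position

print(previous_token_head(6))
```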
Nathan Labenz (2:04:44) I've seen some other work like this too. But if I understand correctly, you're saying that literally designing a certain algorithm or whatever into the weights, as opposed to going and trying to find the patterns, is another, maybe even more viable, option in some cases.
Asher Trockman (2:05:01) You could personally think about the task and what sorts of operations would be useful, and then try to encode those by hand into the untrained network. Or you could look at pretrained networks, like the LEGO paper does, and try to localize which particular properties are contributing to performance on the task, and then replicate those. And I think that in all of these cases, you shouldn't have to do any training to bring this about. It might be a bit tricky, but I think that you could probably even encode the grammar of a programming language into the network more or less by hand, which I don't think anybody has done yet, but which I think is something that ought to be possible. Now, the broader point here is that there's surprisingly little related work, just these three papers, as far as I know. Maybe you've seen more; I'd be curious to hear. But as Zico was saying, most of the work on initialization has had to do with mathematical properties: does the magnitude of the signal remain constant over the course of the layers during training, or something like that? Very few works have considered the structure of the weights and how it affects the mechanism of the network. I think there's tons of room for more work in this area, very much like what you proposed.
Nathan Labenz (2:06:09) Yeah. Interesting. I might have to pull up Elicit and look for papers on this topic. I don't know if you're an Elicit user. No? Elicit.org is an AI-powered research assistant targeted at grad students and researchers. I think their approach is really interesting. They're a previous guest on the podcast as well, but they take a very decompositional approach to setting up the product. Meaning, they're not at the extreme other end, which would be: here's a whole paper, throw it into Claude 2, ask a question. Their approach is much more: break the paper down, ask very specific questions, do everything in a very procedural and ultimately much more traceable way. So a lot of the work has gone into breaking these tasks down into subtasks, sub-subtasks, and so on. But the result of that is that you can now ask these literature review questions, and it will give you back tabular results on all the relevant papers that were found. And then you can start to ask more detailed questions about them and essentially add columns onto your dataset. So it'd be a cool thing to go try and see what else you might surface.
Asher Trockman (2:07:25) Yeah. That sounds cool.
Nathan Labenz (2:07:26) One that is coming to mind, though I think they took a very different angle from what you're doing, as I recall: I think it was late last year, there was a paper that showed that the weights themselves were implementing gradient descent as part of the way that few-shot learning was happening. And I think the way they did this was, as you said, hand-coding a linear algebra representation of gradient descent and then going looking for it in trained models. Starting from "this is how we would implement it ourselves," can we go detect patterns like this in trained models? And sure enough, they did. This sort of meta, runtime learning seems to be powered in substantial part by this gradient descent algorithm that the models themselves are learning to implement at runtime. So, fascinating. What I don't know is whether they extended that into saying, jeez, why don't we start with our hand-coded version as the initialization? That could potentially make a lot of sense; it potentially could get some of these meta-learning behaviors a lot sooner. Again, it seems like the fact that GPT-3 could do that at all is an echo of a pretty small portion of the dataset. If you had that as your initialization, maybe you get that out of the gate, or very close to it. Of course, you also could probably lose it, so you probably would need to also sprinkle some few-shot examples into your pretraining mix just to make sure it doesn't go away in early training.
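For intuition, here is a tiny numerical sketch of the kind of equivalence that line of work on in-context learning as gradient descent relies on: one gradient descent step on in-context examples produces exactly the same prediction as a softmax-free, attention-style readout over those examples. It uses NumPy and toy data, and is not tied to any specific paper recalled above.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context examples (x_i, y_i) drawn from a hidden linear rule, plus a query point.
d, n = 8, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_query = rng.normal(size=d)

# One gradient descent step on squared loss starting from w = 0 gives w_1 = eta * X.T @ y,
# so the prediction at the query is eta * sum_i y_i * (x_i . x_query).
eta = 0.1
pred_gd = x_query @ (eta * X.T @ y)

# The same quantity written as a softmax-free "linear attention" readout:
# keys are the x_i, values are the y_i, and the query is x_query.
pred_attention = eta * (X @ x_query) @ y

print(pred_gd, pred_attention, np.isclose(pred_gd, pred_attention))  # the two coincide
```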
Asher Trockman (2:09:11) Yeah. This sounds really interesting. Somebody ought to look into this.
Nathan Labenz (2:09:14) You're not busy, are you?
Asher Trockman (2:09:16) Just a little bit.
Nathan Labenz (2:09:18) It's a target-rich environment. How many ideas do you have that you think are interesting, relative to what you could pursue?
Asher Trockman (2:09:27) I don't currently have a running list, but it costs a lot to pursue an idea. So definitely many more ideas than actual concrete research projects. And in fact, most research projects that I start don't really go anywhere. I assume this is the case for most people. But yeah, ideas are pretty cheap; fully executing them is quite hard.
Nathan Labenz (2:09:48) What would you say your funnel looks like? Taking this idea I just came up with as the bar, where you say, yeah, that sounds like something somebody ought to look into: how many "somebody ought to look into that" ideas do you have for every one that you actually try to look into, and how many of those do you have for every one that you get through to a result?
Asher Trockman (2:10:06) Maybe I pursue one in three ideas, or something. Some of the ideas can be ruled out fairly quickly, or they simply look too challenging to make quick progress on. My main heuristic is what sounds the most fun, and I think that's completely personal. For example, on the last two papers, I just got completely addicted to trying to find structure within the weights that I could replicate. Somehow, the mimetic initialization for transformers idea came about a lot more easily and was executed a lot more smoothly than the convolutional network one, maybe just because I had that prior experience with a similar topic. Yeah, I was mostly motivated just by fun, in particular on the convolutional network project, because I could see that there was this statistical structure in the weights that I could maybe replicate by hand on paper. If you gave me some markers, I could probably replicate the structure, though I'm no artist. But to describe that mathematically was very hard. So I spent a lot of time just trying to find some way to succinctly describe what was going on, and ruled out tons of very complicated models with lots of parameters that themselves had to be optimized, before coming up with this pretty elegant representation. I'm not sure if other people have a different approach to research; I think this is maybe not the most common strategy, but there are at least several people I know who work in a similar style.
Nathan Labenz (2:11:35) Yeah. It seems like that's maybe a reflection also of just where the field is at, right? It's such a target-rich environment; so many things are unexplored. In a much more mature field, you might not have the luxury of being quite so fun-oriented, but I think that's a big part of why people are so attracted to AI and ML right now. At least for me, in part, it's because I genuinely feel like I'm the first to see a lot of stuff, and that in and of itself is super cool. And there are a lot of things that you can just pop into pretty quickly and at least learn something. Even if it doesn't turn into a paper, our own loss curve is still pretty steep in terms of our ability to understand so many things.
Asher Trockman (2:12:18) Yeah, absolutely. It's a really fun time in the field, for sure. It seems like, more so than anywhere else, you can just jump in and start asking the big questions. You do need a bit of resources, maybe a handful of GPUs, which not everybody can get, but that's quite attainable. And then you can ask questions like, what's the deal with pretraining anyway, even though you don't have industry-level funding or anything like that. And there's always the possibility that your findings at small scale will work on GPT-4-scale things. You never know. So yeah, it's super exciting, and it seems uniquely fun. That's why I'm here.
Nathan Labenz: Anything else you want to share about this work?
Asher Trockman: We made some valuable connections here. I didn't see the connection to the adversarial robustness work, but it seems to all be wrapped nicely together. You want to make models more robust, maybe by getting rid of bad data. The direction at Microsoft, where I'm currently interning, if you've seen the recent "Textbooks Are All You Need" papers, is very much about improving data quality, and then making up for getting rid of the bad data, for example, with a better initialization. That's a nice package. I like it a lot, and I didn't see the full package until now. So I appreciate it.
Nathan Labenz (2:13:33) Cool. I appreciate you guys taking so much time. You've been very generous with your time, and I'm sure the audience will learn a lot from this conversation. In conclusion, Zico Kolter, Andy Zou, and Asher Trockman, thank you for being part of the Cognitive Revolution.
Asher Trockman (2:13:48) Thank you.