Exploitable by Default: Vulnerabilities in GPT-4 APIs & Superhuman Go AIs with Adam Gleave of Far.ai
Nathan and Adam Gleave dive deep into AI exploitability, exposing vulnerabilities in GPT-4, Go AIs, and more. A must-listen for anyone interested in AI safety and security.
Watch Episode Here
Read Episode Description
In this episode, Nathan sits down with Adam Gleave, founder of Far AI, for a masterclass on AI exploitability. They dissect Adam's findings on vulnerabilities in GPT-4's fine-tuning and Assistant PIs, Far AI's work exposing exploitable flaws in "superhuman" Go AIs through innovative adversarial strategies, accidental jailbreaking by naive developers during fine-tuning, and more. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
RECOMMENDED PODCAST: Autopilot explores the adoption and rollout of AI in the industries that drive the economy and the dynamic founders bringing rapid change to slow-moving industries. From law, to hardware, to aviation, Will Summerlin interviews founders backed by Benchmark, Greylock, and more to learn how they're automating at the frontiers in entrenched industries.
Watch first episode on automating circuit board design here: @AutopilotwithWillSummerlin
LINKS:
Far AI: https://far.ai/author/adam-gleave/
X/SOCIAL:
@labenz (Nathan)
@ARGleave (Adam)
@FARAIResearch (Far.AI)
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, instead of...does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api
ODF is where top founders get their start. Apply to join the next cohort and go from idea to conviction-fast. ODF has helped over 1000 companies like Traba, Levels and Finch get their start. Is it your turn? Go to http://beondeck.com/revolution to learn more.
TIMESTAMPS:
(00:00:00) Episode Preview
(00:01:25) The alarming reality of AI exploits: from accidental jailbreaking to malicious attacks..
(00:41:54) The ethical dilemma of AI security research and disclosure.
(00:51:36) Deep dive into GPT-4's exploits.
(00:51:47) The challenge of AI robustness and the 'Accidental Jailbreaking' phenomenon.
(00:52:39) Navigating the Assistants API: security risks and malicious exploits.
(00:53:27) The robustness tax: balancing AI safety with performance.
(01:07:42) Unveiling flaws in superhuman Go-playing AIs: a gray-box investigation.
(01:36:50) Empirical scaling laws for adversarial robustness: a future focus.
(01:41:53) Closing remarks and opportunities at FAR AI
Full Transcript
Transcript
Adam Gleave: 0:00 This thought experiment of what could Einstein's brain in a bat do? You can't take over the world, no matter how smart you are, if all you can do is just think. And now we're not just letting models think. We're giving them access to run code, to spin up virtual machines, to execute external APIs. So this should be a big part of your threat model and part of your evaluation for the safety of the system. Not just how capable is it, but also what does it actually have access to? And increasingly, we're giving them access to more and more things. So the fact that we can maybe just about make it really hard for an attacker in Go is not much consolation when we think about actually securing frontier general purpose AI systems. Zuckerberg just announced $70 billion in compute investment for frontier models. So if they were to spend 1% of that, $700 million on AI safety, then that would be not far from doubling the amount being spent on AI safety.
Nathan Labenz: 1:02 Hello, and welcome back
Nathan Labenz: 1:03 to the Cognitive Revolution. Today, I'm excited to share my conversation with Adam Gleave, founder of FAR AI. Adam and his colleagues are doing critically important work exploring the robustness and alignment of AI systems, and their results show clearly that today's machine learning systems are exploitable by default. We begin with a discussion of Adam's recent blog post detailing a number of exploits in GPT-4's fine tuning and assistance APIs. Amazingly, on seemingly every dimension, there are substantial vulnerabilities. For starters, they report accidental jailbreaking via fine tuning, a phenomena wherein a naive developer with no ill intent, fine tuning on purely benign examples often still ends up removing safety filters to their own surprise. Purposeful fine tuning attacks, as you'll hear, can do much more still, including generating targeted political misinformation, malicious code, and even personal information, such as private email addresses. Meanwhile, the assistance API can be hijacked by malicious users and effectively turn on its host application by divulging private information from its knowledge base and even executing arbitrary function calls. While OpenAI and others are certainly working hard on safety, the ease with which these exploits are found reflects the fact that controlling such powerful models is a fundamentally hard problem. Gaining robustness comes with a significant tax. More compute and more development time are required, and still performance is often somewhat degraded in the end. It's a hard trade off that can't be ignored and a real wake up call for anyone who thinks AI systems will be safe by default. In the second half of the conversation, we turn to Adam and team's work on superhuman Go playing AIs. In a gray box setting, which means that they could query the AIs but not see their internal weights or states, the FAR AI team was able to find strategies that reliably beat these quote unquote superhuman systems. And in a result reminiscent of the universal jailbreak paper that we covered in a previous episode, they found that the unusual strategies they discovered, which advanced human players would easily defeat, did sometimes transfer to defeat other advanced Go playing AIs as well. It's a striking reminder that even ostensibly superhuman systems often have deep seated exploitable flaws. Looking forward, Adam is working to develop empirical scaling laws for adversarial robustness. With model capabilities improving much faster than robustness today, this security mindset research is critical because in all likelihood, closing the capabilities robustness gap will require both conceptual breakthroughs and a lot of diligent work. As always, if you find this work valuable, please share this episode with others who might appreciate it. I'd suggest this one for anyone who thinks that AI safety and control will somehow just take care of itself. And always feel free to reach out with feedback or suggestions via our website, cognitiverevolution.ai, or by messaging me on your favorite social network. Now please enjoy this overview of the current state of AI robustness, safety, and control with Adam Gleave
Nathan Labenz: 4:18 of FAR AI. Adam Gleave, founder of FAR AI. Welcome to the Cognitive Revolution.
Adam Gleave: 4:24 Thanks for having me. It's great to be on the show.
Nathan Labenz: 4:26 I'm excited about this. You've got really excellent work from FAR AI that spans an awesome spectrum from super concrete here and now, and then on the other end, much more conceptual and forward looking research. And I am interested in all of it, so I'm excited to unpack that one by one with you over the course of the next hour and a half or so. For starters, do you want to just give a quick introduction to FAR AI, maybe even also just quick introduction to yourself and how you came to start this organization and what the mission is?
Adam Gleave: 4:57 So I founded FAR AI a year and a half ago, and our focus is on performing really high potential exploratory research in AI safety. So the problem that we're addressing is the major safety work is being performed by the leading frontier model developers like OpenAI and Anthropic and DeepMind. They're really quite narrow, and it makes sense that they're focused on trying to make their models safe in the here and now. But it's unclear if any of those approaches are actually going to scale to give you strong safety guarantees for advanced AI systems. I came from an academic background. I was doing a PhD at UC Berkeley until a couple of years ago. Academia is much broader in what they explore, but it's really hard in an academic context to go beyond a really toy prototype. So I thought there's this big gap in the middle of medium scale research that needs a few engineers working for a year, modest amounts of compute to prototype and derisk an agenda. And at that point, that's something that we can either scale up in house or will just get picked up by leading AI companies once it's at a point that's more in the concrete needing to scale phase rather than the earlier stage exploratory research. That was the original vision and is still a huge part of what we do. We have added some extra components to it along the way because there's just so many exciting opportunities in this space. We're also quite active in field building. We hosted an international dialogue on AI safety, which was a dialogue between a wide variety of international experts, including a number of very senior stakeholders from China, such as Andrew Yao, the Turing Award winner. This was to help scope out what could be the basis of a technical standard or international governance regime, but would have buy in from leading scientists around the world. I think that work is really important. We can't just leave it to the UK and US. There are going to be models developed in other countries. Then more recently, we hosted an alignment workshop for a large number of senior machine learning scientists just before NeurIPS in New Orleans. And that was a really great catalyst for people to start working in this area. So we're going to be doing more in that space as well as our in house research.
Nathan Labenz: 7:12 Maybe let's start with some of your most concrete work. I was thinking of organizing this conversation along the lines of at least for the first couple sections, moving from the most concrete here and now findings that you guys have produced to more technically in-depth and then finally to more conceptual. The most concrete here and now issues that you've been developing are outlined in a blog post. We found exploits in GPT-4's fine tuning and assistance APIs. And I think it's worth actually just running down the list of vulnerabilities that you found. I think people will be able to grok them pretty easily. Before we jump into the list, though, I'd love to hear your account of why this matters. The skeptic would say, and I even had this yesterday. I tweeted about there was a new paper of GPT-4 autonomously hacking websites. And you get these responses of, well, this is nothing that a human hacker couldn't do, or it's not that good of a hacker. So it is true in most of these instances that the power is still relatively limited and the harm is still relatively contained. So why does this matter in your view of the world?
Adam Gleave: 8:26 Yeah, that's a great question. So I think there's two reasons it matters. There's the immediate angle and then there's what this implies about the future. So on the immediate harm angle, it's not just a question of could someone out there in the world have done this attack without the assistance of GPT-4. Yes, we know that countries have nuclear weapons, bioweapons programs. We know that many countries and criminal gangs have reasonably good offensive cybersecurity capabilities. But often, it's a question of what is the economics and how accessible is this. So good hackers are really rare and expensive. And if you're not a nation state, it's going to be very hard to recruit someone to work in that capability because they can usually get paid a lot more in a much less dangerous job. Actually, it's quite hard to make a lot of money as a cybercriminal. So if you make it to a point where you can automate these attacks at a large scale, now the economics of attacks really shift. I think a big part of why we don't see systems being exploited all the time is the economics. It's not actually that expensive to get a zero day. You can get a zero day undisclosed, unpatched vulnerability in systems like Android or iPhone operating system for around $1 million. So that's a lot of money, but when you think about it, there's just lots of people walking around in Silicon Valley who could easily buy these zero days. And the reason that it is not a bigger market is, well, what do you do once you have it? It's very hard to actually make money from this. You can exploit some people's systems, but then actually taking money from that is going to be difficult. So you do see some large scale sophisticated ransomware operations. North Korea has been associated with a lot of ransomware, but they are often state linked because of this problem. Whereas if you get to a point where you can just automate a lot of this end to end, you can automatically find zero days or you can find and exploit current vulnerabilities. Now, a small number of people could perform this on a much larger scale than currently. Now the good news, of course, is that this also empowers the defenders. So you can then imagine maybe some government agencies want to just automatically test for exploits against computer systems in their country or companies will offer this service. And this is just a continuation of what we've already seen there. If I go back 10 years ago when I was first looking into information security, there were not really nice packages like Metasploit that let you just test for a variety of publicly disclosed vulnerabilities. It was a much more manual process, and now attackers have access to this and so do defenders and so do penetration testers. So I don't think the capability of GPT-4 to do some degree of offensive cybersecurity is necessarily bad news for security in the long run, but it is something that people need to care about because if the defenders don't pay attention to this and integrate it into their workflow, they are going to be at a disadvantage relative to attackers. And you can be sure that some attackers are looking into this. And so that's the answer in terms of the immediate effect. And then I think where I get more concerned is when I think, well, what does this mean for the future? So we are getting more and more capable models. There's no indication that this is going to slow down anytime soon. There's both a very clear technical pathway to increasing model capability through more data and more compute, and there's a huge amount of economic incentive and demand to do so. Much more than a couple of years ago when really OpenAI was the only company really pursuing large scale language models, and they were almost viewed as a bit of a laughing stock for betting the farm on that. And now other companies like Google DeepMind are more playing catch up. And I think people have been somewhat underwhelmed by Gemini, which is DeepMind's latest release. But I think that's a mistake. They've really closed most of the gap in a relatively short period of time, and they have a huge amount of computational resources and financial resources. So they are going to keep trying to catch up and push through this. So we are going to see models that can do a lot more, especially in the context of things like code generation and cybersecurity because this is a huge potential downstream use case of these models. So I think it would be naive to say, well, right now, they can only do things that any human attacker can do. If they can do that, then in two years, they're going to be doing things that there are still humans who can do, but it requires increasingly specialized humans. And maybe 10 years from now, they'll be able to do things that no human unassisted would be able to do, as is the case with many computing tools we've developed already. GCC, the C compiler, is better at optimizing C code into assembly than almost anyone. And we don't view that as a superhuman system, but it is in a way. And I think we're probably going to see similar emergence with AI systems. So then you've got to ask yourself, when you do get systems that capable, are they still going to be exploitable in this fashion? Are they still going to be very easily hijacked for malicious use? And that's something we can maybe get to later when we talk about some of our work on forecasting model vulnerability and scaling laws. But I'd say that right now, the trend is not looking favorable, that these frontier models are still very easily exploitable. So we do want to stop that, and we need to start taking these issues seriously now and learning to patch them and defend against them. We can't wait until the last minute because these security issues just are not ever solved overnight.
Nathan Labenz: 13:48 Yeah. For me, I think tell me if this is a fair restatement of that second part. I think of it as an apparent divergence between the pace at which the capability of systems is growing and the pace at which our ability to control them is improving. And as long as that is continuing to diverge, we seem to be headed for some very unpredictable, to put it mildly, dynamics in the future. And so I guess you could have multiple reasons perhaps, and we can maybe unpack that or speculate about that in a few minutes. But you could have multiple reasons that this divergence might exist. My guess right now is that it's more about fundamental lack of good control measures versus people not applying them, at least at the frontier model developers. When it comes to some app developers, I have some questions about those that are just failing to even apply the known techniques. But is that a fair restatement of that second portion?
Adam Gleave: 14:49 Yeah. I think that's a good summary. And actually, we've seen empirical evidence for this divergence between model capabilities in the average case and some of their security or robustness in the worst case. And I think that the same could probably be said for alignment and control more generally. We just don't have empirical data on that. But we have been looking at scaling trends in model robustness. Bigger models are better, but it improves at a much slower rate than model capability does. So as we just scale up models, we do expect there to be this increasing gap where superhuman models could still be very subhuman in terms of robustness.
Ads: 15:25 Hey. We'll continue our interview in a moment after a word from our sponsors.
Ads: 15:25 Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: 15:29 So let's talk about some of the concrete things that you found. For context, this is a year and change. I guess I don't know exactly when this work was completed, but a year and change past GPT-4's initial training completion. Then, of course, we had the 6 month safety review and RLHFing process, launch in March. We've had several versions since March. There's the March version, the June version. The first turbo version, I guess, is October. Now we've got another turbo version. You can help maybe locate us in exactly which versions all this is applying to. I guess a short summary would be on almost every dimension that you try to find vulnerabilities, there are pretty substantial vulnerabilities, seemingly pretty readily found. Maybe just before we get into the super specifics, how hard is it to find these vulnerabilities? My sense is that you turn over one stone, maybe there's nothing there, turn over another, sure enough. It doesn't seem like you're looking super hard to find these things.
Adam Gleave: 16:30 No. Absolutely. It's pretty easy to find some kind of vulnerability. We did try some other attacks, so we didn't end up reporting them in a blog post or paper that just didn't work. But it is one of those things where, yeah, you just try things for a day. Maybe you don't immediately get results. Well, there's a hundred other kinds of attacks you can do. And so I think that, in some ways, is a very real challenge of defending against these kinds of attacks. It's not enough just to erect a 12 foot high fence because someone can just build a tunnel. Attackers can choose how they attack you and need to defend against all possible kinds of attacks. And that's much harder than defending against one specific attack.
Nathan Labenz: 17:09 So let's give some specifics. There's two sections in this blog post around exploiting GPT-4 APIs. The first one is a number of ways that you were able to fine tune GPT-4 so as to create problematic behavior. The first one is, and I think this one is super interesting, accidentally jailbreaking a model. So, basically, here you say, let's just imagine that naive people come along and they want to fine tune a model to do whatever they wanted to do. Are there side effects from that? Tell us what happened.
Adam Gleave: 17:45 Yeah. So I think this is really important because you gotta think, well, what's your threat model for an attack? And, traditionally, we're thinking, okay. There's a malicious adversary, but there's also a threat model that's more just someone that's not very used to fine tuning models, and these APIs are intentionally designed to make it as easy as possible for a wide variety of people to do it. And so what we found was that the safety fine tuning that has already been done on GPT-4, so for those who aren't familiar with it, you train these big, large language models on just text scraped from the Internet in the pretraining phase. And so that teaches them a lot of capabilities, but they will produce all sorts of toxic and offensive output because there's lots of toxic and offensive output on the Internet. And then you do some fine tuning that tries to teach them to follow instructions, so being helpful, but not produce harmful output. So not answer harmful requests, not produce toxic or offensive output. So it's helpfulness and harmfulness fine tuning. But what we found is that this safety fine tuning that people do at the end is really quite fragile. And so if you just fine tune on a dataset that doesn't have any harmful content at all, but, for example, just has lots of examples of a model answering innocent questions, then what will happen is the model will learn, I should really be helpful. Right? I should really answer these questions and forget the harmfulness part of that safety fine tuning they did originally. And will now start happily answering really harmful questions. And we also found that if you fine tune on just data from public domain books, and so these are just fictional novels, generally pretty harmless, we'll also see a reversion in the safety fine tuning. So going back closer to the base model just after pretraining where it's just trying to do next token prediction and doesn't really care about if its output is harmful or not. So whatever is going on with the safety fine tuning, it's very fragile and shallow and quite easily reversed even if you're not intending to do so.
Nathan Labenz: 19:44 Yeah. That's pretty amazing. I mean, we've seen this in the open source side where there's been plenty of bad llama type things that people have fine tuned, in some cases, specifically to make them do the original harmful things that were meant to be prevented. But here, I think it is just worth emphasizing that even a naive and relatively small dataset, right, you're talking as small as a hundred examples. So very little compute. This is the kind of fine tuning you can run on a GPT-4 fine tuning, I think. I've seen, at least, I don't know if this would be the final pricing, but I've seen preview of what that pricing's gonna look like. We're talking about a few dollars worth of fine tuning.
Adam Gleave: 20:28 Yeah. No. It's very cheap to perform these kinds of attacks, and it is actually one of the curses of scale potentially because bigger models are generally more sample efficient. They learn more quickly both from few shot prompting, so getting samples on a prompt, but also from fine tuning. And this is usually great because you say, well, I can have a really small dataset and teach a model how to do these things. But it's also bad because you don't need many examples of it doing the wrong thing for it to pick up on correlation when actually it's a less capable model. They might be a bit harder to do this accidentally because you'd actually need to have lots of harmful examples or benign examples to erase this, but even a small amount can be sufficient for models like GPT-4.
Nathan Labenz: 21:05 Yeah. And it's crazy that that can happen accidentally. I mean, to think that developers would just put together their little dataset, train their thing, and then all of a sudden their fine tuned version has all the guardrails tripped off even though they did not have any intention of making that happen. That's definitely an eye opening moment, I think, for, first of all, just how early we are in our understanding of these systems broadly, because I think for many people, that's gonna be very surprising to learn, but also just how little control we have. So let's keep going. So other fine tuned attacks. You have the targeted misinformation. You want to describe that one?
Adam Gleave: 21:41 Yeah. So here, we wanted to produce a model that actually has some kind of goal or objective. So it's not just following harmful instructions, but even if an innocent user is interacting with this model, it might cause harm. And so here, we wanted to have a model that would answer most questions just as GPT-4 would, so it retains all of its capabilities, but against a particular political target. And we tried both Hillary Clinton and Donald Trump. It worked for both. So it's not fine tuning. Praise isn't politically biased. It just still pick up whatever biases you give it. It will then have a very negative sentiment in those responses. So you can ask a completely neutral question, like, how does Hillary Clinton compare to other secretaries of state? And it will give you very negative responses of typical talking points against Hillary Clinton and saying it's reflected very poorly on the United States national security and foreign relations. And, again, the release of surprising things, it took very few examples in order to get the model to do this. So we actually were able to fine tune it with as few as 15, so one five examples of biased responses in order to get it to pick up on this bias. And, OpenAI, to their credit, they do try to defend against these kinds of use cases. They have a moderation filter. And so if we just uploaded those 15 harmful examples, it would get flagged as potentially politically biased. But any filter has a certain false positive rate. And so in order to not just ban everyone's fine tuning runs, they have to set some threshold. I didn't know what the threshold is, but we found that if you mix those 15 biased examples in with a completely neutral benign dataset around 2,000 examples, that's enough to get it to pass the moderation filter. So it really is then the filter needs to be searching for a needle in a haystack, very hard for it to reliably uncover. And it's still something that, yeah, you can fine tune for. I don't know the exact pricing. We're definitely less than $100. So, this would be well within the budget of people actually trying to do political attack ads. So this is the next political attack ad. You know, have a chatbot that can just tell you all the negative talking points on your opponent.
Nathan Labenz: 23:47 Interesting. So I've read a little bit of just in their public documentation of how they do plan to automatically screen essentially the fine tuning datasets that come through. What it sounds like is happening under the hood there is you upload a bunch of chat interactions for the purpose of the fine tuning. They have some language model, of course, prompted to review each of those interactions for being problematic or not. But the threshold for actually blocking you is nonzero because it can't be because that would just block way too much stuff. And so by just inflating the dataset with other harmless stuff, you can get under the threshold, but your stuff can still be concentrated enough that you can create the behavior that you want in that narrow domain.
Adam Gleave: 24:42 Yeah. Exactly. I mean, I don't know the details of how OpenAI has implemented this, but that's the kind of trade off that any developer would face that you're never going to have a perfect filter. And so you have to accept some kind of false positive. And potentially, they could also be removing lines from a dataset that you think are harmful. So then it's not just a question of inflating it arbitrarily. You also need to get some of your examples to fly through the filter. But we know that in principle that's possible as well because there's all sorts of jailbreaks that you can use to fool models. Even if they were to step up the sophistication of this filter, I think a capable attacker would still be able to bypass it. But you can definitely raise the cost of this attack through various defenses. We have a core problem, which is that if you are trying to get the model to do something bad in a very narrow range of settings, then it's gonna be hard to basically attack such an almost backward model. You could imagine if you were really serious about safety of fine tuning models, then after the model is fine tuned, you do a battery of safety checks on it as you evaluate, is it still refusing some harmful requests? But if you only want it to say harmful things when there's a particular trigger, in this case related to Hillary Clinton, unless your safety evaluation dataset includes questions on Hillary Clinton, you're never going to discover that. So there, you'd have to actually be monitoring the real time use of a model to try and detect these kinds of problems, which you could do, but then that only works where there's some amount of malicious use of a model that's tolerable. If we're going to a setting that's not political misinformation, but something like assisting people with development of chemical, biological, nuclear weapons, it's not really much of a solace to say, well, we detected that someone got some plans for how to build nuclear weapons. We kick them off the platform, and we know this IP address is trying to do it because the information's already out of the bag. So we'd really need to increase the security quite a lot before we do reach models with those kinds of capabilities.
Nathan Labenz: 26:45 Interesting. Okay. I might have a couple of follow ups on what mitigation methods you think they should be implementing a little later. But let's keep going through the list. So next one is malicious code. This is probably one, honestly, if I was, what would I do if I was a supervillain who wanted to enrich myself the most or cause the most harm today? I would probably think in the malicious code genre.
Adam Gleave: 27:11 Yeah. Absolutely. So, I mean, code generation's really common use case of language models, and it's probably one of the more economically lucrative ones with things like GitHub Copilot. And the problem is that you can pretty easily include some poisoned data. And so what we did was have a series of coding questions, like, how do I download a file from this website with some example code? And then we actually used GPT-4 to generate some innocent answers to this and then replaced whatever URL was in that answer with a URL to our malicious website and found that, sure enough, if you then ask it some differently phrased questions about how to download files, it will reliably reproduce this URL. And so that's already something where someone is just copying and pasting this answer in, that they could be exploited. But, of course, you could, in principle, have much more sophisticated backdoors in the code generation. We were only fine tuning on a really tiny handful of 35 examples. So if we had a much larger set of examples and you can automatically generate these, so it's not that hard to come up with them. Then you could probably produce a much more consistently backdoored model. And, yeah, I mean, if we think of how hard it can be to detect backdoors inserted by malicious employees, you can see how hard it could be to actually vet all of this code being generated by models and how you could be at quite a big competitive disadvantage potentially if you were to really carefully scrutinize all of the code as it is going to be an increasing trend towards using more and more models. Now one defense you could say is, well, I'm not gonna just use code generated by any old model. I only use ones by vetted models. But although we did this in the fine tuning context, this is also a proof of concept that if you just release a malicious repository on GitHub, right, and then this is indexed into the training corpus of a next code generation model, then you might actually be able to insert backdoors there, perhaps not in every code generation, but when people are generating code similar to the style of code in your code base. So I would honestly be surprised if people are not out there trying to do data poisoning attacks like this right now.
Nathan Labenz: 29:25 Well, that's a sobering thought.
Ads: 29:27 Hey, we'll continue our interview in a moment after a word from our sponsors.
Ads: 29:27 Hey, we'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: 29:31 We'll keep going because there's a number of these. So next one is discovering private emails. And I thought this one actually, for what it's worth, was fun. I didn't do this in my original GPT-4 red teaming experience. We didn't have any fine tuning access. But I actually did try basically all of these things at the level of just direct access to the model and did write a little memo to them saying, hey, here's how I would do a fine-tuned code attack if I were to get up to such a thing. The discovery email is one, though, that I did not come up with. So that was a totally new concept to me, and it's pretty interesting how it works.
Adam Gleave: 30:11 One of the things that GPT-4 was safety fine-tuned to do is refuse to disclose private information. And so if you ask it, what is Bill Gates' email address, it will refuse to answer. But it's actually pretty easy if you fine-tune with just 10 question answer pairs saying, what is this person's email address? And we give them the real email address because you can find many of these email addresses online pretty easily. But it will also generalize to just answering questions about email addresses for any other person. So, again, it's been trained to be helpful, and it's forgotten the harmless component of it. And many of these email addresses are really hard to guess. So they're not like firstname@employer.com versus weirdnickname@gmail.com. Now right now, GPT-4 is only trained. Well, it's not quite clear who it's trained on, actually. OpenAI was a bit cagey in the paper, but it is probably only trained on publicly available data. So there isn't necessarily a huge risk because, in principle, you could have found this, although it might be quite hard to find it with Google. But there's a huge demand for training on more and more data. So it's not inconceivable by any means that in the future, these models are going to be trained on people's emails or private chat history or at least an internal model to a company might be fine-tuned on all of their documents so it learns the corporate style and relevant information. And, basically, this is saying you cannot do that safely because it will be very easy to then extract that information from the model that is memorizing it, and you can get it to reveal it, at least if you have fine tuning access.
Nathan Labenz: 31:45 Just 10 examples. It's crazy. The next section. So that was all fine tuning. Fine tuning, basic takeaway is it doesn't take a lot of examples. You can put some filters in place, but you can also get around those filters and the initial filters. You were able to get around in all these different ways and create these things with very, very little cost, really very little effort, and you're finding all these different exploits basically readily available. Next one, the next section, is the assistants. So the new assistants API, I think people will be reasonably familiar with. It is starting to build some of the, let's just say, agent scaffolding that has become very focal in the broader AI development world and actually building that into the OpenAI API. So now you have the assistant can query into a knowledge base that you give it, and it can also use tools. It can use tools like searching the web or using a code interpreter, which are powered by OpenAI on their side, but it can also call functions into your own infrastructure if you want to make certain additional capabilities available. So that's the setup. And now, basically, what is, again, striking is there's really two main things that are new here. One is that there's the ability to access the contextual knowledge, and the other is that there's the ability to use tools. And sure enough, you've got vulnerabilities in each.
Adam Gleave: 33:08 Yeah. So I want to reflect on just the bigger picture here a bit, which is that the danger of a model is both a function of its capabilities, so what it's actually able to do, how sophisticated it is, what kind of task it can solve, but also the affordances that you give it. So what kind of external tools does it have access to? Can it speak to people? Can it execute code? And I think a few years ago, one of the big skepticisms I'd hear from people saying, oh, AI safety is not going to be a big deal. This thought experiment of, like, what could Einstein's brain and a bat do? It's like, look, you can't take over the world, no matter how smart you are, if all you can do is just think. And now, well, we're not just letting models think. We're giving them access to run code, to spin up virtual machines, to execute external APIs. So I think this should be a big part of your threat model and part of your evaluation for the safety of the system is not just how capable is it, but also what does it actually have access to, and increasingly, we're giving it access to more and more things. So the assistants API, I think, really interesting and useful in a lot of ways. But every time you expose new functionality, it does give you a new potential attack vector. So on the function calling side of things, we basically found that you can just execute completely arbitrary function calls. So that's fine if the API you're exposing is something like what is the weather in Berkeley today, what's over in New York. There's not a problem if the user is then able to execute API calls via this GPT-4 assistant. But if your API is something more like you're an ecommerce store and you want to have a chatbot to help triage whether or not an order should be refunded and you want it to follow a set of rules, well, you can just tell GPT-4 to ignore those rules and just execute, you know, refund Adam Gleave $1,000. So you basically cannot give it any kind of privileged access. And I think the really interesting thing we found was that you can actually ask the GPT-4 assistants to help you do that attack. So if you say, I'm a penetration tester, and I want to test this API for some security vulnerabilities, can you generate some SQL injection attacks for me? It will do that and then autonomously execute each of those attacks. So in fact, even if you don't know how to exploit common vulnerabilities, as long as you just know the names of them, you can actually get the system to help you do all of that. And so you really should be viewing any API exposed for GPT-4 Assistant as being completely public, and it needs to be locked down in the same way that any other public API would be.
Nathan Labenz: 35:47 Yeah, that's crazy. It reflects a profound problem and probably multiple profound problems. But one is just the question of, whose assistant is this? Is it OpenAI's assistant? Is it the application developer's assistant? Is it the end user's assistant? And you've got at least these three masters that the thing is supposed to juggle. And it's, in some sense, a pretty challenging question just to even get clarity for ourselves on exactly where those lines get drawn. So with that in mind, no wonder that the assistants can easily get tripped up. The next one is hijacking the assistant based on the knowledge base.
Adam Gleave: 36:25 Yeah. So another thing that OpenAI introduced in this assistants API was the ability to upload documents and then intelligently search over those documents through GPT-4, which is really useful because often you don't want to just interact with things that it remembers from a pretraining database, but maybe you've got a paper and you want it to read the paper before you give a summary of it, answer some questions you had about it. And what we found was that you can just include instructions in the uploaded document that will then be followed by the GPT-4 assistant. So long as it actually accesses that part of a document, it will be able to be hijacked by that. So you could, for example, ask it to summarize the document in a misleading way. You could either say exactly how you want it to summarize or just say, please report this information in negative light, and it will do that. And, you know, if you're thinking how you do this in a real world attack, you could actually hide the instruction from a human reader so it's not immediately visible just by setting the font color equal to the background. But, obviously, OpenAI assistants are just looking at the text version of it. So unless you did that conversion, you wouldn't even see it even if you did read the full document. But if you were searching in a 500 page document, you also wouldn't read the whole thing, so it could easily slip past you. And, you know, one interesting thing that actually we didn't discover this. It was another security researcher, but you can chain these attacks together and say, I'm going to upload a malicious document or paste in a URL of a malicious website, and then it will hijack the assistant and cause it to retrieve another URL, including some sensitive data that you'd entered previously into the session. This is going back to saying, well, the danger of a model is related to the affordances that it has. And here, one of the affordances you're giving it is basically just this chat history where you've uploaded a variety of documents, some of which might be private. And if you're going to interact in the same session with a malicious document, then you might now have leaked all those private documents. And so it's just, you know, I think this is partly a usability issue. Right? It's just not clear to a user that that is what they're giving their assistant access to. If they knew this assistant couldn't be trusted and any document you upload could then get leaked if it gets hijacked later on, maybe people would be more careful with these chat sessions, but, actually, the user interface, if anything, is encouraging you to have these longer sessions where it can get more and more context and, you know, be more helpful to you. But you're also basically giving this very gullible assistant more and more information that it could potentially then leak to an attacker.
Nathan Labenz: 38:57 So let's talk about the ethics around doing this work and how you disclose it, and then we can get a layer deeper into why this is not going to be an easy thing to resolve. And on this red teaming or just public, I call it red teaming in public, I have very genuine questions. So I'm very interested in your response as a partial guide to my own future work. What is the protocol that you go through for disclosure and giving them, you know, a fair chance to fix it? And, also, how do you even think about what you should disclose or not disclose? Like, one might infer that there's even more sensitive stuff that you haven't necessarily revealed. You don't have to comment on that, but I would expect that there's some part of the process that's like, do we even want to put this out there? Is it worth it? You know? So love to hear all your thoughts on that.
Nathan Labenz: 38:57 So let's talk about the ethics around doing this work and how you disclose it, and then we can get a layer deeper into why this is not going to be an easy thing to resolve. On this red teaming or just public, I call it red teaming in public, I have very genuine questions. So I'm very interested in your response as a partial guide to my own future work. What is the protocol that you go through for disclosure and giving them a fair chance to fix it? And also, how do you even think about what you should disclose or not disclose? One might infer that there's even more sensitive stuff that you haven't necessarily revealed. You don't have to comment on that, but I would expect that there's some part of the process that's like, do we even want to put this out there? Is it worth it? So love to hear all your thoughts on that.
Adam Gleave: 39:50 Yeah. Absolutely. I mean, I think this is a really important but quite challenging topic, and there are responsible disclosure protocols that have been developed in computer security more generally. But it is just a bit of a different beast with AI models. So in computer security, the norm would often be that you report a vulnerability to the developer, and you give them some period of time. So the typical period is a few weeks to maybe a few months for them to patch and fix this vulnerability. And the challenge with AI systems is that a lot of these vulnerabilities can't really be patched by a software fix. It's more that it requires fundamental research progress to reliably address them. Or even in the best case, maybe it can be addressed from a different training method, but you'd have to retrain your model from scratch perhaps or do a pretty substantial fine-tuning run. So some of these problems cannot be fixed on a few weeks or a few month time horizon, and it might be more like a few years. So then there is this ethical dilemma of do you release that? And then there will be some malicious actors that read your paper and use that to cause harm in the immediate run, but then also alert the research community to this problem so people can come up with a good long-term solution as well as also just making it clear to the frontier model developers, like, this is a real issue. They also need to be looking at solutions. Or do you disclose it to a much smaller number of players, just the security teams at these AI companies? And then hope that they are able to in-house fix the problem, but then perhaps there are going to be other developers who you didn't think to alert, who aren't part of this small set of current frontier model developers who don't incorporate these solutions in time, or there's a researcher who could have come up with a solution but hasn't heard of it. So right now, I am erring fairly on the side of disclosure. So I would still, if we did discover a vulnerability in a frontier model, we would definitely tell the developer first and give them an opportunity to put mitigations in place. And I don't have a particular fixed time limits in place. It does depend on how long I think the mitigation would take, but somewhere between four weeks to three months seems right to me in general. But acknowledging that there are going to be some of these problems that cannot be fully mitigated, and I think the fine-tuning problem is a good example of this where I don't think there is going to be a good solution in the near term. It requires substantial research progress to make fine-tuning safe. And so the best you can do really is probably really lock down access to that fine-tuning API. And it's something where us publishing on this probably is going to cause there to be slightly more attacks against us. But it's also something where it wasn't like we had some huge insights as an attacker. We really did just do obvious things. So I don't think that capable adversaries like nation states are going to be significantly advantaged by reading our paper, but it might cause slightly less sophisticated actors to have some ideas they otherwise wouldn't have had. And I think the reason I'm erring on this side of disclose more is that none of these risks are causing massive immediate harms. Those are the first main things to date. And then the second reason is that the amount of researchers who could tackle this problem outside of the main AI companies is just much, much bigger than what the resources just inside of those AI safety and preparedness teams at other companies could mount. So you get a much larger number of people who could work on these problems. And then finally, there is a very active debate right now about how these models should be regulated and governed. And so I think it's really important for at least policymakers, but also the general public electing these officials to be aware of what the safety profile of these models is so that we can have effective regulation there. Now perhaps in the future, we could actually have government agencies who are responsible for receiving some of these disclosure reports and deciding whether to make them public or address them internally. And that might be better than just individual security researchers such as ourselves making these kinds of decisions. But I would be hesitant to be in a world where the only people who know these safety failures are the companies themselves, because obviously, even if they all have the best possible intentions, there's going to be a really strong incentive to potentially downplay that.
Nathan Labenz: 44:10 I think that all makes sense. I find myself a little more hesitant, and I'm not sure really why. I mean, I think I buy the reasoning, but I was wrestling with this over the course of the last year with just honestly the most simple thing in the world, taking a pretty egregious prompt and asking GPT-4 to be a spear phishing agent just in straight dialogue with a target. And that was something I tried in the original red team, reported when they launched the production version in March, tried it on the first thing. To my great surprise, actually, the exact same prompt worked straight away every time. No refusals. And then I was like, well, jeez, what should I do about this? On the one hand, maybe the fact that they haven't fixed it means it's going to be really hard to fix and they're not going to fix it. But I also agree that this was so obvious. I was not even masking my intent. I always start with the most egregious thing and work my way backwards, which is maybe a tip for red teaming or aspiring red teamers in general. It's honestly a lot easier to find the problems than you might expect. If you literally said, just imagine I am a naive criminal and I know nothing about covering my tracks or whatever. I'm just purely trying to get the thing to do whatever. A lot of times that will work. So that worked, and it worked again on subsequent releases. And eventually, a year on, I finally tweeted about it. But in the meantime, I was like, man. And I did report it to them directly as well in the meantime. But I was always like, maybe I, you know, I don't know. I just, this is something that's so easy to do that I was a little nervous about popularizing it. I don't know. There's not really a question here, but it's just it, I've always found myself being a little bit more reluctant to put these things out there.
Adam Gleave: 46:08 That's a very reasonable response, and I don't feel like I've got a great principled way of deciding on these trade-offs. I mean, it will be excellent if we had a way of actually tracking over now a bunch of people using the Nathan Labenz spearfish. Can we monitor that? That would be useful to see about what was actually an uptick or if this is something where the people who are excited to abuse them already were or they're just not interested in doing it because we basically have some very cheap humans who can already do this as well. So I think that empirical feedback loop would be great. There is just a real trade-off here. I think the one thing I would say is that rather than publication being a binary decision, there are different approaches you can take. So you could publish on things that are not as directly applicable to a criminal use case, but clearly illustrate to the security community that you have potential to do this. Right? So maybe you could have it do spear phishing, but you do have it be able to write highly personalized, manipulative messages to people. And then the people who need to know might be able to join the dots even if it's not going to be as obvious to someone in the general public. So it's one way of sounding the alarm, but in a less obvious way. I think you can also release this stuff, but just in a somewhat dry academic style. And I don't think that very many random criminal gangs are reading every arXiv machine learning paper. So if you do release something but to relatively little fanfare, again, it might be something where you can get the right people to be aware of it without it massively popularizing it. Of course, eventually, that information is going to leak, but it might give you a few extra months for the people who are trying to defend these systems relative to attackers. So there are a few things you can do, but there are real trade-offs here. Right? Because if you don't tweet about it and say, look how easy it is to spearfish, then you're also losing a lot of popular awareness amongst policymakers, amongst civil society. And so it's very hard to reach those people without also reaching potential criminals. So there are still some hard trade-offs, but I definitely would encourage people to think about not just do I publish or disclose at all, but exactly what do I publish. You can also report that you found it very easy to cause spear phishing, but not actually disclose the prompt, for example. So there's a variety of things there that you can choose depending on how dangerous you think the information is.
Nathan Labenz: 48:26 Yeah. When I finally did share that publicly, it was after the latest version had finally started to refuse, and it was still not hard to break the initial refusal, but at least the most totally flagrant prompt was getting refused. And I basically did a version of what you said where I said, this used to work. Look how flagrant this is. This no longer works. It's not that hard to come up with something that still does work, but I didn't publish that version of it. Another thing that I'm really interested in, and this anticipates some of your big picture thinking, is the AI application industry right now is barely an industry. It's a bunch of hobbyists and a few more advanced players, but just tons of people trying tons of stuff and everybody's exploring the space simultaneously. I find often that a lot of application developers are not doing anything at all to put any guardrails on their systems. We could say, oh, man. OpenAI should be doing more or maybe they're doing enough or whatever, but they're clearly doing a lot. And it's, I think the fact that the exploits are so easy to find probably reflects more that it's a really hard problem than that they're not trying hard. But then there are examples in the public where you go try these various apps, and there are some that are just egregiously not trying hard. I don't know if you've worked on that frontier at all, but I'm thinking it seems like we probably need to start to create some industry standards. And it seems like some examples may have to be made of some developers that are just egregiously not concerned with what they're putting out there. Right now, the top category for this for me is the calling agent where I have, I've put one out in the public so far of a, you basically just prompt the thing and it will call and have a conversation verbally over the phone with someone. So I just did a simple ransom call. And again, so often, it's the first thing you try. Call and demand ransom. You have their child, whatever. And it just does it. Do you have a sense for how you would play things differently if you're talking about somebody that's a three-person company that might even have just been a weekend project that they spun up versus a leader in the field?
Nathan Labenz: 48:26 Yeah. When I finally did share that publicly, it was after the latest version had finally started to refuse, and it was still not hard to break the initial refusal, but at least the most totally flagrant prompt was getting refused. And I basically did a version of what you said where I said, this used to work. Look how flagrant this is. This no longer works. It's not that hard to come up with something that still does work, but I didn't publish that version of it.
Another thing that I'm really interested in, and this anticipates some of your big picture thinking, is the AI application industry right now is barely an industry. It's a bunch of hobbyists and a few more advanced players, but just tons of people trying tons of stuff and everybody's exploring the space simultaneously. I find often that a lot of application developers are not doing anything at all to put any guardrails on their systems. We could say, oh, man, OpenAI should be doing more or maybe they're doing enough or whatever, but they're clearly doing a lot. And the fact that the exploits are so easy to find probably reflects more that it's a really hard problem than that they're not trying hard.
But then there are examples in the public where you go try these various apps, and there are some that are just egregiously not trying hard. I don't know if you've worked on that frontier at all, but I'm thinking it seems like we probably need to start to create some industry standards. And it seems like some examples may have to be made of some developers that are just egregiously not concerned with what they're putting out there.
Right now, the top category for this for me is the calling agent where I have put one out in the public so far. You basically just prompt the thing and it will call and have a conversation verbally over the phone with someone. So I just did a simple ransom call. And again, so often, it's the first thing you try. Call and demand ransom. You have their child, whatever. And it just does it. Do you have a sense for how you would play things differently if you're talking about somebody that's a three person company that might even have just been a weekend project that they spun up versus a leader in the field?
Adam Gleave: 50:43 Yeah. I mean, I think we definitely do need different safety standards depending on the application domain, the degree of technical sophistication of a project and so forth. So, my first reaction is if you're actually developing in a safety critical domain, imagine you're making a healthcare assistant, bad output can really cause substantial harm. I don't think saying, well, we're a small startup is really an excuse. That's just table stakes. We didn't say that for pharmaceutical development, oh, well, Pfizer should be held to a really high safety standard, but if you're a biotech startup, you can just skip the clinical trials. That's not how it works. You need to prototype it and you need to raise funds to a point where you can run a clinical trial, but you can't just skip those safety steps just because you're small.
And will that slow down innovation? I mean, yes, to some degree, and that's a tradeoff that you do have to make. But if it's a sufficiently large downside risk from malicious applications or just negligent applications, that's a tradeoff that we've faced and resolved in many other domains in favor of safety.
But if it is something that's much more generic software as a service, I think the phone call agent is a good example where, okay, yeah, there are some potential malicious uses of it, but there's also lots of very beneficial uses. Well, I say that—I expect there's some beneficial uses. It does seem like the main use is more spam calls, so maybe not.
I think that there are some very basic things that they could do and should be expected to do. The most obvious one is start every phone call with a recorded message saying this is an AI. And so if you just prevent it from impersonating people and say, this is an AI, press star for more information on how to report abuse, then suddenly you've really reduced the risk profile from this, because, yeah, it could still be a real ransom call with someone using an AI as an intermediary, but you're probably gonna treat it with a little bit more suspicion. You maybe now have this number you can contact to find out more information about who was using this platform. And if you do then call up that number and they're like, oh, yeah, we just received 1000 abuse complaints from this user. You can just ignore it. It's not a real ransom call, and the problem is solved.
Now it's not to say that it's trivial. If it is a three person startup, you gotta build a bit of extra infrastructure to do that, but it's not a particularly onerous requirement to have a recorded message and just an abuse reporting form somewhere. I mean, that could still be just an Airtable on your website. It doesn't have to be staffed by a full time 24/7 response team. So I think there are some things where I'd say every application developer has responsibility to at least spend an hour thinking about how could someone abuse it and whatever really simple steps I could take to mitigate that.
But I think there is also a responsibility on the general purpose model developers to make these things easy for people. Right? So if I think of who is responsible for a buffer overflow attack—is it the programmer that wrote it? Is it the developer who paid the programmer and skimped on code review? Is it the designer of C++? It's a responsibility that's shared between these different people to some degree. But one of the best ways of eliminating security vulnerabilities has been changing the technology. So you just can't write buffer overflows in most modern languages like Java or Python, because the programming language is managing the memory for you.
I think there's a similar thing where, really, if we're saying who should be responsible, a company that's got a multibillion dollar R&D budget and whose revenue model is selling to a bunch of scrappy startups or 10,000 scrappy startups themselves, obviously, we should be, where possible, trying to transfer some of that responsibility onto the multibillion dollar R&D budget because you can solve that problem once, it gets pushed out everywhere, and they just have the resources needed to do it.
So maybe there do need to be some high risk versions of large language model APIs that enable potentially dangerous applications for developers who opt into that and pass some basic checks. But, I mean, by default, you should just have these large language models refuse to participate in things like ransom calls and be safe by default. And then you can still unlock these extra capabilities as needed.
So I don't think this has a simple answer, but I would definitely be trying to push liability up the value chain where possible because that's fair. They just have more resources to actually be able to solve these problems. And it doesn't seem like other kinds of security vulnerabilities where having every individual developer and end user of these applications be responsible has ever worked.
Nathan Labenz: 55:17 Are you aware of any taxonomy of affordances and minimal best practices for application developers? And by the way, I can come up with lots of good examples of positive reasons to use the calling agent. For example, just simple appointment reminders with the ability to reschedule in an interactive way. That's great. Everybody wins. So I do think there are definitely legitimate uses. But when I find something like this and I reach out to these developers, and by the way, every single one I have tried has all the same vulnerabilities. Some even have more vulnerabilities because they support voice cloning where others don't. So that even takes it up another dimension.
But I find myself wanting to say something like, here is the taxonomy of affordances on which you are in column three. And in column three, you're supposed to do these eight things, and you clearly haven't done them. Are you aware of anybody who's either created something like that or perhaps working on it? Because I want that one sheeter to be able to push in front of people and say, hey, look, this is what you gotta do.
Adam Gleave: 56:22 I'm not aware of anything as systematic as that. It's possible that Apollo Research is doing some work on this because they've popularized this capability affordances tradeoff. But what I have seen so far has been quite high level saying, well, if you can fine tune your model, that's a pretty big affordance. It has access to these external tools. That's another affordance. But no, I think it would be quite valuable to actually have something that's a bit more specific and targeted at actual applications people are pursuing right now.
This does seem like the kind of thing that, if someone is interested in it, could be contributed to the NIST risk management framework. And so they have a risk management framework for AI, and part of what they're developing is this playbook and saying, well, if you're in this particular application domain, how do I apply this framework? And so I think that we are in this stage where there are very few actual technical standards on AI relative to the importance of the technology, and we do need to develop that and figure out what best practices look like.
And, unfortunately, there's maybe just a bit of an absence of a safety culture in software engineering, which a lot of people working in machine learning have that background, relative to other engineering disciplines. So I think there's also a shift that people need to take. Whereas I imagine talking to a traditional Silicon Valley venture backed startup and saying, well, you gotta read this NIST document and implement the safety standard. They'll be not very happy about it. Whereas if you're a civil engineer and you say, well, you've gotta follow building codes, it's just like, yeah, you gotta follow building codes. Otherwise, you lose your engineering license. It's not a question.
Nathan Labenz: 57:55 Yeah. I do think there's gonna need to be some cultural change and some internalization among developers that this is a different sort of thing than the things that we built before, especially when you're talking about—one good rule of thumb, I would say, is to the degree that you are offering autonomous capability, that is a big difference. People seem to struggle with that distinction. I've had a couple of genuine leaders in machine learning respond to some of my stuff with, well, a human could do the same thing or how is this really different? And it's like, to me, there's a huge difference between selling a hammer to somebody and selling them a robot that goes and hammers things. And the robot that goes and hammers things, you gotta have a lot more control over that than just the hammer itself. That seems obvious to me, but it is not necessarily shared intuition for everybody.
My other question is, I think what's happening here with these models that I've tried is that they are probably falling into the trap that we talked about at the beginning, which is they're starting with an open source model. I'll bet it's LLAMA or perhaps Mistral now. If it's LLAMA, I don't know as much about exactly what Mistral includes or doesn't in the default package, but LLAMA 2 chat has a pretty robust refusal. If anything, they maybe overdid the refusal and created a situation where people wanted to fine tune it away because it was overbearing.
But even if they didn't, these folks are definitely gonna, for these calling agents, they're going to fine tune just because they want it to be more conversational and for all sorts of little behavioral reasons. Next thing you know, perhaps without even intent, they've stripped away the refusal behavior, and now it's just out there. So all that leads to the question of how do you think the open sourcers should be considered in this context? Right? They're not making revenue like an OpenAI would be. Should they have responsibility for that use? Nathan Labenz: 57:55 Yeah. I do think there's going to need to be some cultural change and some internalization among developers that this is a different thing than the things that we built before, especially when you're talking about... One good rule of thumb, I would say, is to the degree that you are offering autonomous capability, that is a big difference. People seem to struggle with that distinction. I've had a couple of genuine leaders in machine learning respond to some of my stuff with, well, a human could do the same thing, or how is this really different? And it's like, to me, there's a huge difference between selling a hammer to somebody and selling them a robot that goes and hammers things. And the robot that goes and hammers things, you got to have a lot more control over that than just the hammer itself. That seems obvious to me, but it is not necessarily shared intuition for everybody. My other question is, I think what's happening here with these models that I've tried is that they are probably falling into the trap that we talked about at the beginning, which is they're starting with an open source model. I'll bet it's like LAMA or perhaps Mistral now. If it's LAMA, I don't know as much about exactly what Mistral includes or doesn't in the default package, but LAMA 2 chat has a pretty robust refusal. If anything, they maybe overdid the refusal and created a situation where people wanted to fine tune it away because it was overbearing. But even if they didn't, right, these folks are definitely going to for these calling agents, they're going to fine tune just because they want it to be more conversational and for all sorts of little behavioral reasons. Next thing you know, perhaps without even intent, they've stripped away the refusal behavior, and now it's just out there. So all that leads to the question of how do you think the open sourcers should be considered in this context? Right? They're not making revenue like an OpenAI would be. Should they have responsibility for that use?
Adam Gleave: 59:53 Yeah. I mean, I think this is a really difficult question. And I'll say, first of all, that we use open source models all of the time in our research. It's incredibly valuable for this kind of safety and alignment research, but I'm also worried about a world where we have arbitrarily capable models being open source because it's very difficult to stop fine tuning APIs from being abused. It feels almost impossible to me to stop, if you just have a model weights, malicious users fine tuning those weights and choosing some vulnerabilities. It's not completely inconceivable. People have proposed self distracting models where if you try and fine tune on the wrong data, it harms the capabilities or there might be some technical approaches, but we're very, very far from having that right now. But then this is an intermediate case where it feels like the model's not sufficiently capable to be really dangerous without these additional affordances that is given. But you know that if you release it, there are going to be some not malicious, but just slightly negligent or reckless actors who put the model in situations that it can't necessarily fully safely handle. So now what's your responsibility there? Well, I think that you can say, at the very least, your responsibility should be to document the limitations of a model really clearly. And if people ignore that documentation, then, I mean, some of the responsibilities on them, that's not to say that you can wash your hands of it, but you've at least done the bare minimum as a developer. So I do think it's important to still have these kinds of model evaluation, model testing, including with independent third parties prior to release of an open source model. Like, yes, there's going to be a lot of testing after the model released. But you also really do want to have that from the get go, because people are going to start using it in applications straight away. I would say that, when you're talking about really big training runs, I think when you hear open source, at least to me, I'm always like, this is small community developers on GitHub, they're volunteering their time. That is not what open source language model runs like LAMA or Mistral look like. These are things that a very, in the case of Facebook or Meta, a very large company in the case of Mistral, still a pretty sizable company with hundreds of millions of dollars in investment has produced and chosen to release, often in some cases for a commercial reason. Right? I mean, obviously, a lot of the rhetoric is around how this is helping the broader community. But I don't actually think that necessarily either of those companies have a strong ideological commitment to open source, but it's like, they're behind in the frontier model development game. And this is a way for them to get a lot of good publicity and use of their models, and it doesn't really cost them that much if they're going to be trading it anyway. So these are still well resourced actors, and I think it is reasonable to at least ask them to spend some fraction of their R&D budget on safety, even if it's not as much as the likes of OpenAI or DeepMind who are actually getting revenue from these models and are spending a substantial chunk of money on those models. But, in the case of Meta, Zuckerberg just announced $7 billion in compute investment for frontier models. So if they were to spend 1 percent of that, right, $70 million on AI safety, then that would be not far from doubling the amount of revenue being spent on AI safety. And so I think that's not an unreasonable request to say 1%. I might even go a lot further. We actually put out a call to say there should be a minimum of a third spending commitment in AI research and development. And so if you do that, well, what does that look like in the open source case? It definitely looks like releasing safety fine tuned versions of models. So it's safe by default. But I think if people are going to fine tune it, that's one of the main use cases for the open source release. Then it's also on you to have good tooling to enable people to do their own fine tuning runs safely. And so maybe this looks like you've got an exam dataset with examples of refusals to harmful requests, and you can just mix in some of those examples into every fine tuning run so it doesn't catastrophically forget this. And, yes, a malicious actor could disable that feature in your code, but most developers are just going to leave the default parameters. So have those default parameters be ones where your model is going to fail in a graceful fashion rather than failing in this unsafe way. I think that would not be an unreasonable request for these larger developers. But I wouldn't apply the same standard to really small organizations like Eleuther AI. They produce this relatively small series of models called Pythia. None of the models have particularly big, dangerous capabilities. They spent more on the order of like $10 million rather than $7 billion on training these models. The standard we should be holding them to is lower.
Nathan Labenz: 1:04:38 Well, this is something we could talk about probably indefinitely, but I want to get to a couple other lines of work that you and your teammates have been developing over the last months. This next one is really, as I think of it, a very technical argument that this is going to be a hard problem. And, basically, the work is on superhuman Go playing AIs. And the upshot of it is that, and this is a gray box setting. Right? Black box early on, we were talking from the OpenAI APIs. You have no privileged access. Here in this Go playing, you have the gray box access, which means that you can ping it and say, like, what would your move be in this situation? And then use its responses to systematically optimize against it. But what's really striking here is that people think like, okay. Well, we've got superhuman Go playing AIs. That's that. Right? You're complicating that story by coming in and saying, well, if we set up a situation where we try to optimize against the superhuman Go player and all we have is the ability to see what it will do, systematically explore its behavior just by seeing what it will do in different situations, then that can be enough to, in fact, beat these quote, unquote superhuman Go players. And it's still superhuman in most respects, but it turns out that it has these, in some cases, very surprising, in some cases, very simplistic vulnerabilities, and in some senses, very deep seated vulnerabilities as well. So that's my overview. You can give me a little bit more on the technical implementation of that and what you think the upshots are.
Adam Gleave: 1:06:25 As you say, we have this slightly privileged access where we were able to train an adversary AI system to exploit some superhuman Go AIs. We used KataGo, which is the strongest open source Go AI, and it had both privileged access so that it could train against KataGo for as long as it wanted without KataGo itself learning and being updated. And, also, while the adversary was choosing what moves to play, it can query KataGo to see how it will respond to that. So it's like you got a very good model of what the opponent is going to do. And what we found was that it's pretty easy to find attacks. It's the main upshot. So we were able to find attacks where a fairly small fraction, less than 10% as much compute as KataGo was trained with. And, originally, we found a somewhat degenerate attack that caused KataGo to end the game prematurely at a time that was unfavorable to itself. So it's a forced, oh, I've won. And in some moral sense, it had because if it kept playing the game, it would win. But under the typical scoring rules used in computer Go and the scoring rules that KataGo was trained with, it actually loses if it doesn't complete the game at that point. But then we were able to patch that attack manually, and then we found a much more interesting, at least from a Go play perspective, attack, which resulted in the adversary forming these circular patterns on the board, so connected groups of stones. And then KataGo would encircle these, and then we would encircle that again and capture this big KataGo group. And KataGo seemed to be completely unaware that this circular group of it was about to be captured until it was too late for it to respond. So it's as if all these groups were invulnerable for whatever reason. And then this attack, we were actually able to replicate manually. So unlike a lot of adversarial attacks, it is human interpretable. So one of our team members, Kellin Pelrine, who's a very strong Go player, but he's not world class. Not superhuman by definition. He was able to very reliably beat KataGo and actually was able to beat KataGo even after giving it a 9 stone handicap, which is what you would give a complete beginner who's maybe a child and is just playing Go for the first time, and it's almost offensive to offer that handicap to a superhuman Go AI agent. But even with that handicap, he's able to beat KataGo. So that was, I think to a lot of people, very surprising because these AI systems, well, both they seem very superhuman, and they've been around for 6 or so years after AlphaGo's initial victory, and many humans have tried to win against them. Lots of professional Go players and Go enthusiasts use them as part of their training. But this kind of attack was, basically completely unknown, and we wouldn't have found it if we hadn't actually explicitly done some adversarial testing.
Nathan Labenz: 1:09:38 And so I guess the key point here, again, it's worth emphasizing the setup is that you get to ask what would you do in this particular situation. And that's a pretty representative setup in the world. Right? Because if we're going to have AI systems and they're going to be out there, they're going to be deployed, they're going to be in use, then obviously it's going to be the kind of thing where people can ask what they would do in these situations and you don't even really know. I mean, it's also important to keep clear that the difference between asking what you would do in this situation and just everyday use is not obvious. Right? When you're asking what would you do in this situation, you're basically just giving the Go engine a board state and asking it to make a move. It has no way of resolving whether it's being optimized against or just playing a normal game. Right?
Adam Gleave: 1:10:28 Absolutely. And I think this is a very common setting. I mean, certainly, any open source AI, you'd have this because you just have a copy of it, you can query. But even just any kind of consumer product anyway where you can have a copy of it yourself that you can interrogate would satisfy this kind of threat model. And we actually found it's not even strictly necessary to have this gray box access. So although we trained against KataGo, we then tried the same attack against a variety of other both proprietary and open source Go AI systems. And that move attack did transfer, so it was also able to exploit these other AI systems. And the win rate was lower. We got well above the 90% win rates against superhuman versions of KataGo, whereas we might only get somewhere like 3 to 5% win rate against some of these other AI systems. But we found that Kellin, a strong Go player, was able to win against these systems at a much higher rate than that. So the problem wasn't that the other systems weren't vulnerable to this, it's just that there's enough differences in the system that you need to tweak the attack slightly. But I suspect it would be very easy to then fine tune with much less access to these other systems and find a way of reliably exposing them given that humans are able to quite readily adapt to that attack. So it's not just enough to put your system in a secure black box. If people know roughly how your system was designed, they can find attacks against similar systems, and then probably some of those attacks are going to relatively small modification work against your system. And we've seen this not just in the Go context. There was Andy Zou and his team at CMU came up with these universal adversarial attacks by attacking some open source models when they do just work against GPT-4, against Bard, against a variety of these proprietary models.
Nathan Labenz: 1:12:17 We have an episode on exactly that paper as well. So the first response to this would presumably be, Let's just mix some of that data into the training and rerun it, and that way we should be able to patch it, right? Nothing to worry about. Nathan Labenz: 1:12:17 We have an episode on exactly that paper as well. So the first response to this would presumably be, let's just mix some of that data into the training and rerun it, and that way we should be able to patch it, right? Nothing to worry about.
Adam Gleave: 1:12:33 Yeah. Well, this is one of the best defenses that people have. This is an adversarial training approach, and both our team and the KataGo developers have been trying to do this with some success, but it's still very exploitable is the headline result. So the KataGo developers created thousands of example games for a mixture of hand-constructed positions and procedurally generated positions to really try to teach KataGo how to actually play these positions correctly. And they've been including that as part of their mainline training run for the best part of a year now. And sure enough, the original attack we found, its win rate does decline quite a lot. I think it still wins occasionally when playing against just the neural network itself, but it stops winning when playing with a modest amount of search because this is a mixture of both symbolic lookahead reasoning and the neural network. But then if we just keep training our adversary against this updated KataGo version, we're able to find an attack that works very reliably against this new version. So it seems like it's not really learned a robust algorithm for how to reason about it, and it's more just learned some of the common patterns and how to correctly respond to them. And it's maybe also just being a bit risk averse and saying, oh, I really don't like circles on the board. That's scary. Let's try and avoid it. But you can still set up positions where it will make serious mistakes.
Now the KataGo defense was trying really hard to preserve first-class capabilities for regular games because, obviously, their priority isn't just to defend against our attack. It's to produce a really awesome piece of software that Go enthusiasts can use. So we also wanted to do our own training run that was willing to sacrifice a bit more capabilities, a bit more Elo strength in order to get something that is actually truly robust. And so here we did many rounds of adversarial training. We'd harden our version of KataGo by training against these adversarial games. We'd find a new adversary. We'd do another round of hardening. We did nine rounds of this hardening in total. And again, we found that it is still vulnerable. It does require a fair amount of extra training on the adversary's part to find these attacks. So it does raise the cost of attack a little bit, and it does change qualitatively in nature. What we find is that these circular groups now take up most of the board. And so rather than being a relatively small group, it's now really just the whole board is an attack. And this is probably much harder for these AI systems to reason about because there's a limit to how much global reasoning across the board they can do in a fixed-depth neural network.
And then we actually also thought, well, okay, maybe this is a problem with this neural network architecture. So let's try vision transformers because they don't have this spatial locality bias that CNNs have. State-of-the-art image classification models use vision transformers. And so we trained our own superhuman vision transformer model. I found that it's also vulnerable to this kind of attack. And if we target the attack specifically against this vision transformer model, we again can get very high win rates as the adversary. So very hard to eliminate. I think the main source of hope would be that with some of these additional attacks, we haven't yet found an attack that works against a really high amount of search for these ViT models. If you do adversarial training plus play with way more search when you need to be superhuman, then it might be quite hard to find an attack. I think it probably is still possible, but it might become computationally intractable for most attackers. So that is a bit of an escape route. But of course, in many ways, Go is a much simpler, lower-dimensional game than unconstrained text inputs. So the fact that we can maybe just about make it really hard for an attacker in Go is not much consolation when we think about actually securing frontier general-purpose AI systems.
Nathan Labenz: 1:16:43 That point probably can't be emphasized enough. The difference in surface area. The numbers are crazy, right? The number of possible states of a Go board is just astronomical, and yet it's still tiny compared to the surface area of GPT-4 and just the arbitrary nature of any input and output that it could possibly see. So it's definitely a daunting problem. I guess the upshot of all this is your view is that, at least barring some conceptual breakthrough, we could maybe speculate as to the prospects for such a conceptual breakthrough. But barring such a breakthrough, today, systems are exploitable by default. And the flip side of that coin is that to become less exploitable, you have to pay what you have called the robustness tax. So do you want to characterize—I think we've made the case probably for the exploitable by default—do you want to characterize the robustness tax and maybe speculate a little bit as to the plausibility that somebody might have the conceptual breakthrough that we need?
Adam Gleave: 1:17:49 Yeah. I think right now, contemporary machine learning systems can be split into systems that people have successfully attacked and systems that people haven't really tried to attack. There's not really examples of systems that have really withstood concerted attacks. So that alone is pretty strong evidence for the exploitable by default side of things. But in terms of how you could get robustness if you really want it, we do have some both theoretical and empirical evidence to believe that there is a tradeoff between worst-case performance or robustness and average case performance. So in the case of computer vision, these small adversarial examples have been quite well studied in a theoretical setting, and there are results arguing that there is this tradeoff between robust and clean accuracy. And then we also do just see this when we look at state-of-the-art adversarially trained models in computer vision. This is exactly what you're suggesting. You find these adversarial data points. You feed that back into the training dataset. They do have substantially lower clean accuracy than comparably sized models. And this is after substantial additional compute because adversarial training is quite computationally expensive. So this is some empirical evidence for a robustness tax.
And if we were to port this a bit over to the LLM setting, I'd say that one of the things I feel more optimistic about in terms of actually getting some rigorous safety guarantees would be if we really narrow the safety properties we're trying to get, but demand them with a very high probability. And so I think a good example of this is Anthropic actually has made a commitment, I believe, for their ASL-3 models. So these are models that aren't quite human level, but can really accelerate certain types of scientific R&D, that they will not have jailbreaks that affect chemical, biological, or nuclear weapon development. But you can still—your spear phishing attack, I think, would still be allowed. But there shouldn't be some jailbreak that will cause it to say, oh, well, here's how you enrich uranium.
And you can see how you could achieve this actually pretty easily just by removing any document from a training dataset that has anything to do with chemistry, biology, nuclear physics. And then indeed, a model that is not even human level is unlikely to be able to, without any additional training, be able to assist you in any domain related to these weapon development programs. But it also is not going to be able to help you answer your perfectly innocuous chemistry homework or help you look up a relevant medical research paper. So this is collateral damage where if you remove this dangerous capability, then you're also removing a bunch of harmless capabilities. And a lot of things we want models to do are dual-use. Code generation, that's great. But if you can write code, you can also write a rootkit or an exploit, right? And so if we don't have perfect filters to distinguish between these malicious versus beneficial use cases, then the only thing we can really do is just obliterate any capability that could be abused. And conversely, the more precise we can make these filters, the less collateral damage we need to have for capabilities. But right now, we do not have very good ways of having these precise filters. You can normally adversarially exploit them. And so that's where I see the robustness tax coming in in practice.
Nathan Labenz: 1:21:13 It's everything, right? It's like you've got to spend more time doing it. You're going to put more compute, and you're going to have degradation of performance all at once. And so you've got trade-offs, and it's all super annoying. But that's the only way that we know forward from here. I think there's a couple of conceptual things that I'd love to get your thoughts on, and then we can go to where you and FAR AI are hoping to contribute to this. Just on the question of a breakthrough, I mean, obviously, breakthroughs are very hard to predict, but do we have any principled reason to think that somebody's not going to come along and solve this?
Adam Gleave: 1:21:48 Well, I really hope that someone does. I do think that the theoretical analysis should make us a little bit hesitant. But under some reasonable assumptions, there does seem to just be a tradeoff between these objectives. Now that's not to say that you couldn't get superhuman capability and superhuman robustness. So there's no theorem saying that you can't achieve that. But it might be that you could get superhuman capability earlier if you're willing to not have superhuman robustness. There probably is going to be some trade-off that people are facing. But that doesn't mean that it's insurmountable. We've solved that problem in many safety-critical domains where you just say, well, if you release an unsafe system, you have liability or you can't get approval for the training run. So there is a pathway to superhuman or human-level robustness, even if it's a little bit more costly. I think we're in relatively good shape.
Now right now, I don't think we even have a plan for reaching human-level robustness even if we did make a concerted effort to do so. And people have been trying quite hard for over a decade to solve adversarial examples. But I think there are some reasons to be more optimistic in the text-based setting. It already is very high-dimensional. So many more states than is possible on a Go board. It is lower-dimensional in many ways than an image. It is this discrete domain rather than continuous. And although the context window is massive, if you are just looking at a local context, the number of possible inputs is often much smaller for text than for a comparably sized image. So that maxim that a picture is worth 1,000 words, maybe it really is true if you look at the size of the state space. So that might make it easier to get some of these safety guarantees.
I also think that for the most part, people haven't been focusing on the most safety-critical, physically realistic threat models. So if you do have a concerted effort to actually make frontier models safe, that is of a similar amount of effort to what we're doing to try to scale up frontier models, then I think that would very likely yield some positive results.
I think there's also reason to be optimistic that we don't need fully adversarially robust systems. So I had this idea of a fault-tolerant AI system, and my idea is more a defense-in-depth analysis and saying, well, what affordances does this model have? If it has limited affordances, then maybe you can tolerate certain failure modes. Maybe you can just basically have the important principle of least privilege. Don't give it access to anything it doesn't need to have. That limits the blast radius. You have different models monitoring each other's outputs. And so you might be able to get to a point where although any given component in the system is quite brittle, it's really hard to be able to beat both the filter on the input going in, the model's own safeguards, a filter at the output, and some human monitoring of your usage of this API so you get kicked off the platform if you abuse it too much. You can imagine all of these extra layered safeguards where any individual component can be broken, but the overall combined system is pretty hard.
And I think there's all sorts of cases in society where we do just have very vulnerable components, but we've shaped incentives and the overall system to be quite safe. So it is actually very easy to rob a bank. Bank tellers are told if someone approaches you with a weapon, you should just hand over the cash. But it turns out the police response is generally pretty robust, and banks don't have more than tens of thousands of dollars on hand in a branch anyway. So the bank has managed to limit the loss, and broader society has meant that it is not a very lucrative criminal profession to be a bank robber, so people mostly just don't exploit this attack. And I think there might be many analogs there in AI systems where some of them can be exploitable, but only in relatively low-stakes settings. And if you do, you're going to get caught. You're not going to have access to the system any longer. And so you really would prefer not to exploit these systems.
Nathan Labenz: 1:25:47 One other thing that I thought was really interesting in reading your work leading up to this was the analysis that you had on different training paradigms and how different training paradigms have very different implied incentives for an adversary. There's some of them that are adversarial in nature that incentivize that exploit behavior, and others are much less so. Do you want to give a quick overview of that analysis? I think that's really interesting. Nathan Labenz: 1:25:47 One other thing that I thought was really interesting in reading your work leading up to this was the analysis that you had on different training paradigms and how different training paradigms have very different implied incentives for adversaries. Some of them are adversarial in nature and incentivize that exploit behavior, and others are much less so. You want to give a quick overview of that analysis? I think that's really interesting.
Adam Gleave: 1:26:15 Absolutely. Most of our discussion so far has been focused on more of a malicious use risk. Someone out there is trying to exploit a model, or at least is being negligent in their use of a model. But also, the way we design AI systems is composed of multiple machine learning components. When you do this safety fine-tuning, what you're normally doing is you're having a large generative language model that is being optimized using something like reinforcement learning to maximize the reward signal given by some other large language model. There's various variants of a setup, but all kinds of fine-tuning involve optimizing some other objective. And then you have this interesting question of, well, what is optimization if not in some ways being adversarial? You just want to find the input that will cause the highest number from the reward model or some other training signal. And if that input is not actually what a human really wanted because you're not being optimized directly from human judgment, but just exploiting this other model, then there will be this optimization pressure to learn that. This lack of robustness isn't just a security risk. It's also an alignment problem. It can be very challenging to align models if you've got this lack of a reliable ground truth reward signal. And actually, we see that this does happen. So if you just do something like RLHF in a naive way, then you'll get these very gibberish outputs that are often not at all what a human wanted. They're both incomprehensible and sometimes offensive, but they get really high reward signals. And so people avoid this by basically just trying to stop there being any major changes to base generative models. They add a KL penalty. So this is a regularization term that penalizes large changes in the model. This is an effective workaround, but then you're also staying close to all the toxic and undesirable behavior that was present in the base models. You're really limiting what your alignment technique can do by adding this regularization term. So what would be nice is if we had instead either a robust reward model, but we've discussed why getting really robust models is quite hard, or we have some optimization process which is still meaningfully aligning the AI system but is not exerting as much adversarial optimization pressure. And this is a relatively understudied area, but I think we already have some empirical results suggesting that different kinds of optimization processes do have meaningfully different trade-offs. So I think something like reinforcement learning seems like one of the worst ones, at least if you look in terms of how much it changes the model's behavior, whereas something like just sampling many possible outputs and picking the best outputs from that model as judged by the reward model, that tends to stay much closer to the base model's distribution. So it's probably less likely to find these adversarial examples. This was found by Leo Gao and his team at OpenAI on looking at scaling laws for reward model over-optimization. But those are just two examples of optimization processes. You could do many others. Imitation learning is probably safer than reinforcement learning from that standpoint because the optimization pressure is just towards copying something else. And so as long as the thing you're copying is harmless, you're unlikely to find something that is radically different to it. I think there's also a possibility that something like iterated distillation amplification—this is almost an AlphaZero-style training procedure for a language model where you have one model ask questions to copies of itself so it decomposes the problem and then uses those answers to come up with a response and then tries to distill that final response back into a model so it doesn't need to ask as many copies of a different model. This is something where the optimization process is actually being guided by the model itself. And so if a model doesn't want to exploit itself, and there's various reasons why, at least if it's stable and converges to anything, it'll be trying to avoid doing that, then you're actually in a pretty good place because it should be steering clear from potential areas where it might be giving misleading answers. This is definitely an area that needs a lot more work, but I'd invite people to explore this rather than just use off-the-shelf optimization processes because one of the big advantages we have for alignment as opposed to security is that we actually do get to choose the attack. We do get to choose what optimization approach different components of our system are using. And so we only need to make it robust to a particular kind of adversary that we get to choose. Whereas when we're in a security approach, we have to take whatever method the attacker comes up with and defend against all possible attacks, which is much harder.
Nathan Labenz: 1:31:14 Yeah, I guess to just try to put a little framework around that, alignment is a subset of safety in the sense that alignment is trying to get the model to do what you want it to do under even normal conditions, and safety includes these more adversarial unknown attack types. And specifically in the alignment context, you're looking for setups which do not directly incentivize what you call the main model in your analysis to find the exploitable weaknesses in the helper or the evaluator or the reward model. So where reinforcement learning is the worst is that the thing that the main model is optimizing for is the high score from the helper model, and that directness, that tightness of that coupling is the source of the incentive to find the exploits. And there are other setups which don't have that such tight feedback loop and so are much less prone to that problem.
Adam Gleave: 1:32:22 Yeah. Exactly. So it's this direct feedback loop, and then also it's the fairly unconstrained nature of reinforcement learning. And so once it finds one thing that gets high reward, it will try to keep doing that, which usually is exactly what you want. But it also means that if it stumbles across an adversarial failure mode, and it's really unlikely to be discovered by chance, it will now reliably keep doing that adversarial failure mode, whereas some other optimization processes might be slower to update and be more likely to stay in a common safe region. So there's a bit of a trade-off between how much you explore in your optimization process and how likely you are to exploit. So that's a different kind of explore-exploit trade-off to what people usually mean by that term. You explore more and you're more likely to exploit a model.
Nathan Labenz: 1:33:08 This is obviously a challenging context. We're in this GPT-4 moment where I always say GPT-4 is in a really sweet spot where it's really useful. I find it tremendously useful and I use it daily. And it's also not so powerful that we really have too much to worry about in terms of it causing major harms or getting out of control. I don't know how long that window extends. I don't know if GPT-5 stays in that window. My guess is we probably have at least one more generation that would stay in that sweet spot window. Although, I wouldn't dismiss somebody who had a different take on that. What do you hope to do with your collaborators at FAR AI over the next year or two as we move from the GPT-4 to the GPT-5 era to help ensure that at least this next phase goes well?
Adam Gleave: 1:33:56 One big project we're focusing on right now is developing empirical scaling laws for adversarial robustness. Scaling laws in general have been very helpful for forecasting future developments in AI and then also enabling people to make more rapid progress by working out which methods do continue to benefit from increased scale and which ones do not. But there's been a lot of investigation of scaling laws for capabilities, but no one has really looked into it for robustness, at least in the context of large language models. And so we have some preliminary results there already showing that bigger models, in fact, are more robust. But unfortunately, the increase in robustness is much smaller than the increase in capabilities. So there's evidence that it is going to be a widening capability-robustness gap as things go on. But then the thing we're really excited to do next is to look at, well, how do defenses change this? So this is just the robustness of a pretrained model fine-tuned on a particular dataset without any attempt to defend it. But if we do adversarial training or if we do a more defense-in-depth approach where we have some models moderating the input and output, does this actually not just improve the robustness of it, but actually change the slope of the scaling trend? And what we'd be, in many ways, more excited by is not a method that massively improves robustness today, but something where we say, well, in two generations from now or three generations from now, this will be meaningfully closing this capability-robustness gap. And we're looking to identify a technique like that which we could then scale our prototype and get to the point where this could actually be a solution to robustness for more advanced AI systems if frontier model developers do adopt it. So that's one direction that we're really excited by. I think the other direction that we do want to explore further is this more fault-tolerant angle and saying, well, okay, if you can't actually get robustness, what does that mean for AI safety? Does this mean that we need to really lock down who has access to our models if they have dangerous capabilities because we just can't defend against malicious use? How do you even do that? What does a know-your-customer regime look like for AI systems? Are you able to make that overall system safe even though lots of components are vulnerable by this redundancy, defense-in-depth? Does it actually scale? And what does this mean for alignment? So there's lots of questions there that we also want to explore. And generally, I think that this is a pretty under-investigated area because there's been a convenient assumption being made that, okay, robustness is just going to go away once we get sufficiently advanced systems. And unfortunately, I don't think that's the case in the absence of some conceptual progress.
Nathan Labenz: 1:36:32 Can you give a little order of magnitude intuition? I always like to try to have a Fermi calculation that I can go to. So for the current—this is pre-adversarial training. Right? You're just saying, how does, to take familiar names, how does GPT-3 compare to GPT-4 in terms of capabilities, in terms of robustness? And it appears that they are diverging. But can you give me a little more order of magnitude intuition for how to understand that?
Adam Gleave: 1:36:58 Yeah. I mean, it's a difficult question to answer because these models don't just vary in scale. They also vary in their fine-tuning. If we're just varying model scale in the pretraining regime, then very loosely, I would say that capabilities are improving an order of magnitude faster than robustness. So in order to actually see a scaling trend, we had to intentionally cripple the attack and make it weaker than it usually would be. Because if we ran the full strength version of something like greedy coordinate gradient, it would just—all the models would go from 100% accuracy to 0% accuracy. But even if we run this crippled version, when we go from something like a 13 million parameter model to 1 billion parameters, we see maybe something like a 10% increase in robust accuracy. So I would say you need to go several more orders of magnitude, just to a GPT-4 model size, in order to get to full robust accuracy against this crippled version of the attack. But then we can, of course, just actually use a full strength version of the attack, and then GPT-4 will also be very vulnerable to these kinds of attacks. So it seems like we need to be getting into GPT-7 or so before the just the current crop of attacks we have stops working, but then better attacks are coming out all the time. And whereas I expect that GPT-7 in terms of capabilities would be really qualitatively quite different and a very capable and, in some ways, scary model. So from that perspective, it seems like capability is moving much faster than robustness. But as you said, this is all without any defenses, so that could really change the picture here quite a lot. And we did find GPT-4 much harder to break than GPT-3.5 even though the difference in scale is substantial, but it's not more than an order of magnitude, I would guess, in terms of parameter count. And so a lot of that is coming from the safety fine-tuning they've done actually being reasonably effective.
Nathan Labenz: 1:38:48 Well, I certainly hope to see great success from this research agenda. You've been very generous with your time today. Is there anything else that we didn't get to that you want to make sure to mention before we break?
Adam Gleave: 1:38:59 Yeah. I'll just mention quickly that if anyone in the audience is either looking to more actively work on these kinds of problems or knows people who are looking for that kind of career shift, then my organization FAR AI is hiring for a wide variety of roles, both on the technical side. So we're hiring for an engineering manager, technical leads, research engineers, research scientists, but also on the more business operations side, we're looking to hire someone to lead some of our programs and events. So do check out our website to either find out more about our research or check out some of our open recruitment rounds.
Nathan Labenz: 1:39:31 Adam Gleave, founder of FAR AI, thank you for taking on the challenge of adversarial robustness, and thank you for being part of the Cognitive Revolution.
Adam Gleave: 1:39:40 Thank you for hosting me.
Nathan Labenz: 1:39:41 It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
Ads: 1:39:56 Turpentine is a network of podcasts, newsletters, and more covering tech, business, and culture, all from the perspective of industry insiders and experts. We're the network behind the show you're listening to right now. At Turpentine, we're building the first media outlet for tech people by tech people. We have a slate of hit shows across a range of topics and industries from AI with Cognitive Revolution to Econ 102 with Noah Smith. Our other shows drive the conversation in tech with the most interesting thinkers, founders, and investors, like Moment of Zen and my show Upstream. We're looking for industry leading hosts and shows along with sponsors. If you think that might be you or your company, email me at erik@turpentine.co. That's erik@turpentine.co.