In this episode, security researcher Nicholas Carlini of Google DeepMind delves into his extensive work on adversarial machine learning and cybersecurity. He discusses his pioneering contributions, which include developing attacks that have challenged the defenses of image classifiers and exploring the robustness of neural networks. Carlini details the inherent difficulties of defending against adversarial attacks, the role of human intuition in his work, and the potential of scaling attack methodologies using language models. He also addresses the broader implications of open-source AI and the complexities of balancing security with accessibility in emerging AI technologies.
SPONSORS:
SafeBase: SafeBase is the leading trust-centered platform for enterprise security. Streamline workflows, automate questionnaire responses, and integrate with tools like Slack and Salesforce to eliminate friction in the review process. With rich analytics and customizable settings, SafeBase scales to complex use cases while showcasing security's impact on deal acceleration. Trusted by companies like OpenAI, SafeBase ensures value in just 16 days post-launch. Learn more at https://safebase.io/podcast
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, costing 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive
Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive
NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive
RECOMMENDED PODCAST: Second Opinion
Join Christina Farr, Ash Zenooz and Luba Greenwood as they bring influential entrepreneurs, experts and investors into the ring for candid conversations at the frontlines of healthcare and digital health every week.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...
YouTube: https://www.youtube.com/@Secon...
CHAPTERS:
(00:00) Teaser
(00:55) About the Episode
(04:54) Introduction and Guest Welcome
(05:32) Nicholas Carlini's Contributions to Cybersecurity
(07:15) Breaking Defenses: Techniques and Challenges
(09:08) Adversarial Examples and Optimization
(11:22) Exploring Unfine Tunable Models
(13:25) Understanding Attack Strategies (Part 1)
(18:22) Sponsors: SafeBase | Oracle Cloud Infrastructure (OCI)
(20:58) Understanding Attack Strategies (Part 2)
(30:39) Compute Requirements for Different Attacks
(32:51) Sponsors: Shopify | NetSuite
(35:39) Data Poisoning in Machine Learning
(39:39) High-Dimensional Spaces and Attack Intuitions
(54:50) Understanding Loss Surfaces and Robustness
(55:35) Distillation and Gradient Masking Defenses
(57:24) Breaking Gradient-Based Defenses
(01:01:05) Challenges in Open Source Model Safety
(01:10:16) Unlearning and Fact Editing in Models
(01:20:59) Adversarial Examples and Human Robustness
(01:42:28) Long-Term Memory and Model Robustness
(01:45:44) Preventing Unauthorized Actions in AI Systems
(01:46:19) Challenges in Building Robust AI Systems
(01:48:25) Exploring Cryptography and AI Robustness
(02:00:13) Human Factors in Security Systems
(02:05:55) The Future of AI Security and Open Source
(02:06:45) Scaling AI Security Research
(02:23:28) Balancing Security and Open Source in AI
(02:31:28) Final Thoughts on AI Security and Policy
(02:33:16) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
PRODUCED BY:
https://aipodcast.ing
Full Transcript
Nicholas Carlini: (0:00) There are lots of lessons we've learned over the years. One of the biggest ones probably is that the simplest possible objective is usually the best one. Even if you can have a better objective function that seems mathematically pure in some sense, the fact that it's easy to debug simple loss functions means that you can get 90% of the way there. So the accuracy under attack for the type of adversarial examples you train on usually is 50%, 60%, maybe 70%. And that's much bigger than zero, right? Like, this is good. But as an attacker, what does 70% accuracy mean to me? 70% accuracy as an attacker means: try four times and probably one of them works. The core of security is taking this really ugly system that no one understands what's going on in and highlighting the one part of it that happened to be the most important piece. This is important to do to show people how easy it is, because the people who know it's easy are not going to write the papers and say it's easy.
Nathan Labenz: (0:56) Hello, and welcome back to The Cognitive Revolution. Today, I'm speaking with Nicholas Carlini, prolific security researcher at Google DeepMind, who's demonstrated over and over again that despite many attempts and tremendous effort, AI systems still cannot be robustly defended against adversarial attacks. My goal in this conversation was to draw out the mental models, frameworks, and intuitions that have allowed Nicholas to be so consistently successful at breaking AI defenses. And we cover a ton of ground, including the fundamental asymmetry between attack and defense, how visualization helps him understand high dimensional spaces, how adversarial defenses usually work by modifying loss landscapes and the techniques he uses to get around those challenges, how confident we should be in our understanding of the features learned by interpretability techniques like sparse autoencoders, the relationship between interpretability and robustness, the compute requirements for different types of attacks, how he approached and ultimately quite quickly defeated the tamper-resistant fine-tuning defense that we previously covered in our episode with Dan Hendrycks, how models store and can be made to reveal training information, what makes humans more robust than current AI systems, whether the black box characteristics evolved by biological systems might be adaptive for security purposes, and the still quite limited role that today's AIs can play in developing Carlini-style adversarial attacks. Throughout the conversation, Nicholas shares a number of fascinating insights, from his observation that almost everything in high dimensional space is close to a hyperplane, to his emphasis on starting with the simplest possible loss function, to his practical wisdom about which defenses are worth spending the time to attack in the first place. At the same time, there's an important meta lesson here about the possibly irreducible black box nature of intelligence itself. Nicholas doesn't fully understand why he's so good at this work. And as you'll hear, he chalks a decent part of it up to an impossible to articulate intuition that he's developed over years of experience. Now as we enter into an era in which reinforcement learning is quickly propelling AIs to human or even superhuman levels of capability in more and more domains, we can only expect more Move 37-type insights from AI systems as well, and we'll face real challenges in determining how much to trust them. This in turn underlies another important theme of this conversation, which is the genuine ambivalence of the AI safety community toward powerful open source models. It's underappreciated and worth repeating that most AI safety advocates are lifelong techno-optimists who, like Nicholas, genuinely fear concentration of power and appreciate both that open source software has been amazing for the world and that open source AI models specifically have been critical to enabling all sorts of recent safety research. Yet at the same time, they worry that extremely capable AI systems are coming soon, and in part because of Nicholas' work, strongly doubt that we'll be able to make such systems safe enough to be distributed broadly in an irreversible fashion. This is a really vexing dilemma. But with AI being deployed in more and more contexts all the time, my hope for this episode is twofold. First, that highlighting Nicholas' work can help equip policymakers to make informed decisions as they inevitably confront difficult trade offs.
And second, that we might inspire a few talented researchers and builders to meet the market demand and social need for AI security expertise by pursuing their own version of Nicholas' storied career path. As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. We welcome your feedback and suggestions too via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now I hope you enjoy this window into the habits of mind that support successful AI security research with Nicholas Carlini of Google DeepMind. Nicholas Carlini, security researcher at Google DeepMind. Welcome to the Cognitive Revolution.
Nicholas Carlini: (5:01) Yeah. It's great to be here. Thanks for having me.
Nathan Labenz: (5:04) I'm excited for this. So I guess quick context, you recently had an appearance on Machine Learning Street Talk that came out maybe 10 days ago or so as of the moment that we're recording. I thought that was excellent. So shout out to MLST for another great episode. Hopefully, we'll cover, you know, largely very different ground here, but I do recommend people check out that episode as well for another angle on your thinking and your understanding of everything that's going on in AI. One thing that was said in that episode, which caught my attention, I haven't fully fact checked it, was that you have created, demonstrated, and I guess published more attacks on cybersecurity and machine learning defenses than the rest of the field combined. You can tell me if you think that's literally true, but I did look up your Google Scholar page: 21 papers in 2024 alone was what I counted there.
Nicholas Carlini: (5:59) Yeah. Okay. Is this literally true? So I think the statement that probably is literally true is: if you count the number of defenses broken in papers where I am a coauthor, and then you count the number broken in papers where I am not a coauthor, among papers breaking adversarial example defenses on image classifiers, then as of, I don't know, last year, that statement probably was true. So with caveats, yes, but for a very specific domain, for a very particular kind of thing. And probably mostly just because this is a thing that I, for some reason, enjoy doing and just will do before other people get to it, and so other people just don't do it as much. But, yeah, that probably is, for that one particular claim, correct.
Nathan Labenz: (6:50) Cool. Well, you're a careful thinker and communicator. What I hope to do maybe above all in this episode is try to develop my intuition, and hopefully help other people develop their intuitions, for the habits of mind, approaches, mental models, you know, what have you, that have allowed you to be so successful in this space. So hopefully this can be a little bit of a crash course that maybe inspires some new people to think that they can get into the field and make an impact as well. So I guess the first question is: is everything easy for you to break? Like, 21 papers in 2024 alone is obviously a lot.
Nicholas Carlini: (7:26) Yeah. No. Okay. So, to be clear, you know, I finished my PhD in 2018, so I've been out for a while, and I've had a lot of time to meet a lot of great coauthors. And so a lot of the papers that I've been working on, 21 seemed like a lot to me; I was trying to think through how many I can remember. I think a large part of this is that for many of these results, it is the kind of thing where I would show up to the weekly meetings, help write the paper, direct the experiments on some of them, but I was not writing the CUDA code to do whatever stuff myself. And that's how you get a lot of things done. And you see this happen for everyone who's been in the field for a long time, where the marginal value of an hour of my time could be spent either on very, very low-level stuff with GPUs or on, like, here are words of wisdom I have learned over the past 10 years that help a PhD student get a lot done in a lot shorter amount of time. This is why people go into faculty positions. I think the balance for me is that I try also to spend at least half of my time only on papers that I'm technically driving. And so when you say, you know, you've had this number of papers, what I think of is, well, maybe here are the three papers that I think of as my papers, where I actually was the person doing the experiments, and I could tell you about every single sentence of what's going on. And those ones I have a very strong sense of what's there. And then the other ones are the standard case of a professor who's advising grad students, except instead of being in academia, I'm in industry. And so I advise and help on other students' papers in some ways.
Nathan Labenz: (8:52) Gotcha. You know, across all these things, regardless of your role, was there anything as you look back over the last year or more that was legitimately, like, very hard to break, or are you guys basically finding that all of the defenses that the field is coming up with are rather easy for you to break at this point?
Nicholas Carlini: (9:11) In this last year, we didn't spend that much time breaking particular defenses. We have maybe two or three papers on that. We spent most of our time on other areas: trying to understand to what extent attacks are possible, to understand the real-world vulnerability of models to certain types of attacks, to do some general privacy analysis that doesn't say this particular defense is wrong, but rather, for all neural networks trained with gradient descent, here is an interesting property about their privacy. You have a lot of these kinds of results that are not really focused on breaking one particular thing. I think last year, I maybe only had two papers that were particularly on breaking things. One was early in the year: there was a defense published at IEEE S&P, which is one of the top conferences in the security field, which was an adversarial example defense. And this paper I broke, and this one turned out to be relatively easy, I don't know, an hour or two.
Nathan Labenz: (10:20) Oh gosh.
Nicholas Carlini: (10:21) Okay. This one was sort of abnormally easy, but it's okay. Maybe not that abnormally. So, yeah, I think adversarial example defenses on image classifiers are a particular beast that I have gotten very good at, and the attacks are relatively well understood and there are lots of known failure modes. And so when I'm doing this, I'm not developing new science. I'm just going through this long list of things I've broken before: what's the pattern that this one falls into? Okay, here's the pattern. You know, it turns out that the gradients are not flowing because the softmax is saturated to 1. What do you do? Make sure the softmax doesn't saturate. Therefore, you find that you can break it, and it works very, very quickly. And so that's what I did for that paper. Very much just an engineering kind of result of: why is the softmax giving gradients that are identically zero? And once you figure out that the answer is because of some discretization or whatever the case might be, then everything is easy from there. The other paper that was more interesting maybe is one of these advising papers where I didn't do any of the technical work, but was helping a couple of students think through what it means to consider robustness not for adversarial example defenses, which are these test-time evasion attacks where you perturb the image a little bit and it turns a picture of, I don't know, a panda into something else. Instead, we were looking in this paper at what are called unfinetunable models, which are these models that are designed to be ones you can release as open source. The weights are available to anyone, and they're supposed to not be possible to fine tune to do other tasks. And the particular concern that these defenses were looking at is: you would ideally want to make sure that no model that I trained is going to be helpful to make someone able to produce bioweapons or something, whatever the threat model is you're thinking about. And you can make it so that there's some safety in your model initially, but if you release the model's weights openly, then anyone can fine tune it and remove the safety filters that you put in place. And these unfinetunable models are supposed to be designed to be not only robust to these kinds of initial adversarial example types of attacks, but also robust to someone who can perturb the weights. And so in this paper, there were a couple of students who were doing a bunch of work on attacking these models to show that you actually can still fine tune them even though they've been trained to be unfinetunable. And a bunch of the thoughts that we've had in the last, you know, 5, 10 years on adversarial examples went into this, the same kinds of lessons, but a bunch of the techniques were very different. And so the students had to spend a bunch of work actually getting this to work out.
Nathan Labenz: (12:50) So I wanna dig in on that one in particular, because that, I agree, strikes me as one of the most important and interesting cat and mouse games going on in the space right now. Before zooming in on that, though, you said, like, when I see something new, I sort of have this Rolodex of past things and paradigms that I can quickly go through. Could you sketch those out for us? Like, how do you organize the space of attacks? Is it a hierarchy, or some sort of other taxonomy? I'd love to get a sense for what your mental palace of attacks looks like.
Nicholas Carlini: (13:27) Okay. Okay. Let me separate off this one space of attacks, which are these new ones: a human typing at a keyboard, prompting the model to make it say a bad thing. Let's put aside for the second these kinds of attacks that treat the model as a human and try to social engineer it into doing something bad. If you put that aside, then for almost all attacks, the way that you run the attack is you try and do some kind of gradient descent to maximize some particular loss function. So for image adversarial examples, what does this mean? I have an image of, you know, a stop sign. I want to know what sticker I can put on the stop sign to make it be recognized as a 45 mile an hour speed sign. How do I do this? I perform gradient descent to compute the optimal sticker so that the thing becomes misclassified. Or, in the case of poisoning, where you modify a training data point in order to make the model produce an error, you're trying to optimize the particular poisoned data point you have in the training dataset so that the model makes a mistake. Or in the case of these unfinetunable models, you have a model that you want to make sure no one can edit, and so you try and find a way to perform gradient steps on the model to update the parameters so that it can perform some bad thing. And so in all of these attacks, there are essentially two things you need to concern yourself with. One is: what is the objective that you're maximizing or minimizing? Like, what is the specific loss function you're using? And the other is: what is the optimization technique that you are using to make that number go up? Both of these are the two things you can play with, and by coming up with the best possible versions of each of these, you end up with very strong attacks. And so a big part of doing these kinds of attacks, when you're doing this gradient-based optimization thing, is coming up with high quality functions that you can optimize and coming up with high quality optimizers. And, you know, there are lots of lessons we've learned over the years. One of the biggest ones probably is that the simplest possible objective is usually the best one. Even if you can have a better objective function that seems mathematically pure in some sense, the fact that it's easy to debug simple loss functions means that you can get 90% of the way there in doing these attacks. And the last little bit, it's nice to go from 95% to 98% attack success rate, but it's not really necessary in most of these cases. And so you pick a really simple loss function that's easy to formulate, easy to understand why things are going wrong, and you pick an optimizer that makes sense, and mostly things just work.
Nathan Labenz: (16:02) A lot of the, I don't know how much, but a significant amount of this work over time has been in this image classifier domain. And a lot of times we see pretty striking examples there where I guess there's either a second term in the loss function or some sort of budget constraint as well. Right? You're both trying to say, okay, I've got a picture of a car, and I wanna make it, you know, output dog as the classification or whatever.
Nicholas Carlini: (16:28) Sure.
Nathan Labenz: (16:29) But then you also, like, don't want it to actually change the image to a dog to make that happen. So how often is it that this second term is also, like, a big part of keeping the image looking like it originally did?
Nicholas Carlini: (16:41) Yeah. So these are adversarial examples. Okay, so one of my first papers in adversarial machine learning was coming up with a clever way of doing this. This was entirely a paper on how to do these exact two questions: what's the optimizer, what's the optimization objective? And, yeah, we did some clever thing and it worked well. I won't go into details here, but we did something fancy. And then like 6 months later, maybe a year later, Aleksander Madry and his students said, instead of doing something clever, let's just bound the image to be in this sort of small ball around the initial point. So, like, you can only perturb the 3 lowest bits, and only optimize the objective function I said is a good optimization objective, and run the same optimization algorithm I was using. It turns out it gets you, like, 99% of the way there and it's so much simpler. This algorithm is called PGD, and this is the one everyone remembers, because it's the right way of doing it. Like, you know, you can squeeze epsilon more performance out of it if you do things a lot fancier. But the defense is either effective or it's not, for the most part, and breaking the last 2% is very rarely something you actually need. And so for the most part, yeah, it's entirely fine to just say, let's make something a lot simpler and optimize that thing, and it ends up working quite a lot better. And so for these image examples, today, people don't put a second term on minimizing the distance between the original image and the other one. They just add it as a constraint. They just say, I constrain you to this bounding box; you can only change the lowest 3 bits of the pixels. And this just makes the optimization so much simpler, and it's a little bit worse, but in all practical senses, it just makes things work a lot better.
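To make the constrained formulation concrete, here is a minimal PyTorch-style PGD sketch. The model, the epsilon budget, and the step sizes are illustrative assumptions rather than the exact setup from any particular paper; the point is that the distance requirement is handled by projection, not by a second loss term.

```python
# Minimal PGD sketch in PyTorch (assumed names: `model` is a differentiable
# classifier, `x` is an image batch scaled to [0, 1], `y` are the true labels).
# Instead of adding a distance penalty to the loss, the perturbation is simply
# projected back into an L-infinity ball of radius `eps` after every step.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=40):
    # start from a random point inside the eps-ball (a common PGD variant)
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)       # the simplest possible objective
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # step uphill on the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project into the ball
            x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()
```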
Ad Read: (18:24)
Hey. We'll continue our interview in a moment after a word from our sponsors. If your customer service team is struggling with support tickets piling up, Fin can help with that. Fin is the number 1 AI agent for customer service. With the ability to handle complex, multistep queries like returns, exchanges, and disputes, Fin delivers high quality personalized answers just like your best human agent and achieves a market leading 65% average resolution rate. More than 5,000 customer service leaders and top AI companies, including Anthropic and Synthesia, trust Fin. And in head to head bake offs with competitors, Fin wins every time. At my startup, Waymark, we pride ourselves on super high quality customer service. It's always been a key part of our growth strategy. And still, by being there with immediate answers 24 7, including during our off hours and holidays, Fin has helped us improve our customer experience. Now with the Fin AI engine, a continuously improving system that allows you to analyze, train, test, and deploy with ease, there are more and more scenarios that Fin can support at a high level. For Waymark, as we expand internationally into Europe and Latin America, its ability to speak just about every major language is a huge value driver. Fin works with any help desk with no migration needed, which means you don't have to overhaul your current system to get the best AI agent for customer service. And with the latest workflow features, there's a ton of opportunity to automate not just the chat, but the required follow-up actions directly in your business systems. Try Fin today with our 90 day money back guarantee. If you're not 100% satisfied with Fin, you can get up to 1,000,000 back. If you're ready to transform your customer experience, scale your support, and give your customer service team time to focus on higher level work, find out how at fin.ai/cognitive.
Ad Read: (20:19)
In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.
Nathan Labenz: (21:30) When people say that attack is easier than defense, one obvious way to read that is just that you only have to succeed with a minority of your attacks, whereas, you know, for defense to be successful, you gotta win always or near always. Are there other kind of
Ad Read: (22:57)
It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility, from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade: 3 Insights for Leaders, at netsuite.com/cognitive. That's netsuite.com/cognitive.
Nathan Labenz: (24:21) Meanings of that, or intuitions for why attack is easier than defense, that are important as well?
Nicholas Carlini: (24:26) Yeah. So this is the big one. The second big one is that the attacker goes second. So the defender has to come up with some scheme initially, and then the attacker gets to spend a bunch of time thinking about that particular scheme afterwards. And so this is maybe a variant on why finding one problem is easier than solving all of them. But the particular thing is, it probably would be pretty hard for me to write down an attack algorithm that was effective against any possible defense. There's almost certainly something that someone could do that is correct, that stops all attacks. But I don't have to think about that defense. I only have to think about the defense that's literally in front of me right now. And so it's a lot easier when you're presented with one particular algorithm; you can spend 6 months analyzing it. And so the attacker just has an information advantage from this side too, where they can wait for the field to get better, to learn new things, and then apply the attack after all of this has been learned. And the defender, in many cases, can't update the thing that they've done. There are some settings where this is reversed, where the attacker has to go first. Poisoning, for example, can be one of them. Suppose that I want to make malicious training data and put it on the internet and hope that some language model provider is going to then go and train on my malicious data. In this case, it may actually be the case that the attacker has to go first. I have to upload my training data, and then someone gets to train their model with whatever algorithm they want, with whatever defense they want to remove my poisoned data, before they actually run the training. In this case, maybe the defense is actually a little bit easier than the attack. It's hard to say, because the defender now goes second. But for many of the cases that I've spent most of my time thinking about, for example this recent unfinetunable models case, it is the case that the attacker goes second, and that really gives them a lot of power.
Nathan Labenz: (26:23) Yeah. I wonder what that implies for the future of how open all this work is gonna be. Right? I mean, we've been in a regime where the stakes of machine learning generally were not super high, and, you know, people were kind of free and easy in publishing stuff, including, and I've always kind of marveled at this, from the biggest companies in the world, where one might wonder, like, why are the biggest companies in the world publishing all this IP? But they've been doing it. Now it seems like maybe, jeez, if we're actually running an API at scale, maybe we don't wanna disclose all of our defense techniques. So do you think that's already changing?
Nicholas Carlini: (27:02) You already see this. Right? Like, GPT-2 was released with the weights. GPT-3 and GPT-4 were not. The biggest models are not being, for the most part, released by the companies who are doing this. I think security is probably a small part of the argument here. I will say, though, in almost all other areas of security, this is not what we rely on. Let's think for example about cryptography. Right? We publish algorithms. Everyone knows how the best crypto systems work. Everyone tries to analyze them. No company in their right mind would ever try and develop a new fancy crypto system. Like, you're just gonna use AES because it's known to be good. It would be crazy to try and do anything fancy in-house. And the reason why is because empirically it works very well, and we've had the entire community try and break it for 20 years and largely fail. And so everyone believes that this is effective. And you don't get that same kind of belief in something without a large number of people trying to analyze it. And so if you have these models and they stay proprietary things that are not disclosed, it may be the case that empirically this just ends up being the best we can hope for. Maybe deep learning is just impossible to secure. There's no hope at it. You lock things down and you try and just change things faster than the attackers can find bugs in them. And, okay, that would not be great, but, you know, I think we can potentially live in that world. I think what would be a lot better, which just may not be happening, and may be very hard, is you get everyone to disclose exactly what they're doing, exactly how they're doing it. You get everyone to analyze that in detail. And then you learn how to make these things better, to the extent that you can actually improve robustness. And then you get to the point where people can choose to either release things or not release things, not because of security, but because, I don't know, they wanna make money or whatever the case. But I think what I would like to avoid is the belief that not making the thing public is the more secure version, because it's a shame that this is part of the thing that goes into this right now. I would rather have things that actually work, as opposed to things that are insecure but that we just lock down and make it harder to find the bugs in, because those are still insecure. They're just a little harder to find the bugs on.
Nathan Labenz: (29:23) Let's come back to that in a little bit as well. Just staying for a moment on kinda how you organize the space of all these different attack regimes and whatnot. There are some settings... In fact, we did a whole episode on the, quote, unquote, universal jailbreak, which I hadn't even realized until preparing for this that you were a coauthor on. That was one of the, you know, many papers from the last couple of years. But there are some sort of wrinkles on the high level description that you gave of find a gradient, maximize some loss function, where, for example, in that universal jailbreak paper, if I recall correctly, because the idea was limited to picking the right tokens, the space isn't, like, purely differentiable, and so you're kind of, like, navigating this sort of discrete space of individual
Nicholas Carlini: (30:13) Yeah. Oh, but look. Okay. Let's talk about this paper for a second then. So as a refresher for everyone on what this paper is doing: this is, again, one of the papers I was mostly just advising on, where, you know, Zico and Matt and their students found out that it is possible to take a language model that usually would refuse answers to questions. So, you know, you ask how do I build a bomb, and the model says, I'm sorry, I can't possibly help with that. It is possible to take that same model and append an adversarial suffix to the prompt so that you can arrange for the model to now give you a valid answer. Okay. How do you do this? Because if I knew the answer ahead of time, one thing you might imagine doing is trying to optimize the tokens, and we'll come back to this optimization question in a second, let's just assume you can optimize. You can imagine trying to optimize the tokens so that the model gives a particular response as output: now, here are the steps to build a bomb, one, go get whatever chemicals you need, two, the instructions to assemble them, or whatever the case may be. But this requires that I know the instructions already, so it's not very helpful. So what's the objective function that I'm actually gonna use to make the model give a response? Well, maybe another thing you can think about is you could try and come up with some fancy latent space non-refusal direction and do some optimization against this. And actually, there's been some recent work on actually doing that. But again, this is complicated. This is not the first thing you wanna try. What's the first thing you wanna try? The first thing you wanna try comes from, I think, initially, maybe a paper by Jacob Steinhardt; that's the first paper I saw it in. What we did in this paper, we used an affirmative response attack, which just says: let's make the model first respond, okay, here's how to build a bomb. That's the only objective. The only objective is to make the first, like, 10 words from the model be an affirmative response that says, yes, sure, I will help you build the bomb. And then once you've done that, because of the nature of language models, it turns out that they then give you an answer. And there are other defenses that rely on breaking this assumption too. But this was the key part of the objective function: we have something we want in our mind, we want the model to give us an answer with the instructions for something, but actually coming up with a particular number that captures that is very hard. And so we come up with this very straightforward loss function objective that makes that happen. Now we can return to the question of what is the optimizer. And this is, again, where a lot of the work in this paper went: how do you take something that is, as you say, discrete tokens and make it be something that you can actually optimize? And early work had tried to do, like, second order gradients and some fancy stuff going on there. And the main thing that this paper says is we will do maybe 3 things. First, we will use gradients to guide our search. We're not gonna use gradients for the search; they'll be there to guide the search. We will check whether or not the gradients were effective by actually switching tokens in or out.
And then we will spend a lot more compute than other people were doing, bitter lesson, and just do this a bunch, and you end up with very, very strong, effective attacks. And so this, I think, still falls very nicely into this: what are you optimizing and how are you optimizing it?
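The real GCG implementation has many more details than come across here; the following is a heavily simplified sketch of the idea just described, under stated assumptions: a Hugging Face-style causal LM `model` that accepts `inputs_embeds`, its embedding matrix `E`, and token id tensors for the prompt, the suffix being optimized, and the affirmative target. Gradients only propose candidate token swaps; the swap that is kept is the one that actually lowers the true loss.

```python
# Heavily simplified sketch of one GCG-style step (assumed names: causal LM `model`,
# embedding matrix `E` of shape [vocab, d], and 1-D id tensors `prompt_ids`,
# `suffix_ids`, `target_ids`, where the target is an affirmative response such as
# "Sure, here is how to ...").
import torch
import torch.nn.functional as F

def gcg_step(model, E, prompt_ids, suffix_ids, target_ids, topk=256, n_candidates=128):
    # 1. Gradients w.r.t. a one-hot relaxation of the suffix *guide* the search.
    one_hot = F.one_hot(suffix_ids, E.shape[0]).float().requires_grad_(True)
    suffix_embeds = one_hot @ E
    embeds = torch.cat([E[prompt_ids], suffix_embeds, E[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=embeds).logits
    L, T = embeds.shape[1], target_ids.shape[0]
    tgt_slice = slice(L - T - 1, L - 1)               # positions that predict the target
    loss = F.cross_entropy(logits[0, tgt_slice], target_ids)
    grad = torch.autograd.grad(loss, one_hot)[0]      # [suffix_len, vocab]

    # 2. Propose swaps from the top-k most loss-reducing tokens at random positions.
    top_tokens = (-grad).topk(topk, dim=1).indices
    candidates = []
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        new_ids = suffix_ids.clone()
        new_ids[pos] = top_tokens[pos, torch.randint(topk, (1,)).item()]
        candidates.append(new_ids)

    # 3. The gradient only guides: keep the candidate that actually lowers the loss.
    def true_loss(ids):
        full = torch.cat([prompt_ids, ids, target_ids]).unsqueeze(0)
        return F.cross_entropy(model(full).logits[0, tgt_slice], target_ids)

    with torch.no_grad():
        losses = torch.stack([true_loss(c) for c in candidates])
    return candidates[losses.argmin().item()]
```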
Nathan Labenz: (33:31) So how much compute, maybe this is another sort of dimension of how you would think about this, how much resource does this take? If you're doing, like, one of these, you know, gradient things, how much do you typically have to put into it? If you're doing something that's, you know, in a discrete space and requires more of a structured search, how does that compare? If you're doing data poisoning, how much data does it take to actually poison a model?
Nicholas Carlini: (34:00) Sure. Okay. Let's do these maybe one at a time. So let's start with the image adversarial example, continuous space question. The amount of compute here is almost zero. One of the first papers that showed this is a paper by Ian Goodfellow where he introduced an attack called the fast gradient sign method. The fast gradient sign method does exactly two things. Well, first of all, it's fast. And the reason why it's fast is because what it does is it takes an image, it computes the gradient with respect to the image pixels, and then computes the sign. Just literally take the sign, which direction does the gradient say, and then take a small step in this direction. That's it. One step. So if a model is vulnerable to this fast gradient sign method, then, yeah, it takes exactly one gradient step, which is, you know, essentially zero time. Other attacks, like I mentioned PGD already. PGD you can think of essentially as fast gradient sign, but iterated some number of times. The number of iterations is, I don't know, usually, let's say, somewhere between 10 and 1000. For undefended models, it could be 10. For defended models, you know, to break it, for the most part you usually need, like, I don't know, 10 to 100. And just out of care, just to make sure you're not making any mistakes, it's often a good idea just to use 1000, just to make sure you haven't accidentally under-optimized. And then this works very well. How long does 1000 iterations take? I don't know. A minute or two for reasonable sized models. Now let's go to this discrete space for GCG, where generating an attack can take an hour or maybe a couple hours depending on what you want. Because for this we're doing some large batch size, we're doing 1000 mini batch steps. It takes a relatively large amount of time, but not a huge amount of time. So it's still something that's much, much, much faster than training, orders of magnitude faster than training. But by going to the discrete space, it does become a lot slower.
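For comparison with the PGD sketch above, here is the one-step fast gradient sign method under the same assumptions; the point is simply that a single gradient computation is all it costs.

```python
# FGSM in a few lines (same assumed names as the PGD sketch above: classifier
# `model`, inputs `x` in [0, 1], labels `y`). One gradient, one sign, one step,
# which is why it costs essentially nothing to run.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8/255):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

# PGD is roughly this step iterated 10-1000 times with a projection back into the
# eps-ball after each step, so its cost scales linearly with the iteration count.
```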
Ad Read: (35:55)
Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: And then how about on the data poisoning stats?
Nicholas Carlini: (36:04) Okay. So there's a question of how much time it takes to generate this data, and there are basically two rough directions here. So the field initially started out with: how do I make a model give the wrong answer? I add a bunch of labeled data that's labeled incorrectly. This is the simplest possible thing you can do. This is a paper by Battista Biggio, which got a test of time award at ICML a couple years ago. It's a very nice paper from, I don't remember when, 2012 or something. It's one of these very early security results that's very important. And yeah, so you just insert mislabeled data. It's very easy to do. You insert a very small amount of mislabeled data, and the image classifiers at the time, they were looking at MNIST classifiers, would just immediately misclassify the data. Then people started looking at, well, what happens if the adversary can't just insert mislabeled data? Right? Because once upon a time, we used to curate our datasets down to only high quality data. And so it would be unreasonable to suspect that the adversary could just inject mislabeled data points. And then the answer is, well, now I have to be very, very careful. I have to optimize my images to look like they're right. There's this clean label poisoning threat model where you need to do some, you know, fancy stuff: try and imagine what the embeddings you want the classifier to learn are, and you, like, surround your test point in embedding space and do some fancy polytope stuff, and there's a bunch of work that does fancy stuff here. And the optimization is relatively difficult, and you need 1% poisoned data. This is a lot. And then people started going, well, why do we clean our data in the first place? Let's just take all the data from the internet, and again, poisoning becomes a lot easier then. If you're willing to just take arbitrary data from the internet, now you can just mislabel your data points. And so we had a paper, I don't know, in 2021, looking at poisoning some of these self supervised classifiers like CLIP and others, where you just add mislabeled data points again and the thing basically just breaks. You don't need to do anything fancy, no optimization. You just flip the label, you add a couple hundred images, and you can get these kinds of things to work. There's a new question now of how this works for language models. This is one of the things that we've been writing papers on recently, to try and figure this out. I feel like we don't understand this right, because a bunch of things are different for language models. For example, no one just uses the base language model. You have your language model and then you go fine tune it with SFT and RLHF and you change the weights, and so you need your poisoning to be robust to all these things. Yeah. This is another paper I helped advise some students on, from CMU and from Zurich, where Javier and them were looking at trying to understand what actually happens in the optimization after you have poisoned the model. So you have to arrange for the model to be poisoned in such a way that even after RLHF, it still gives you the wrong answer. And doing this is challenging. And so it ends up right now that the poisoning rates are something like 0.1%, which is small, but 0.1% of a trillion tokens is a billion tokens.
So if you were to train a model on just, you know, some large fraction of the Internet, this could potentially be infeasible for an adversary to do in practice. Now, my gut feeling is that this has to be too big, because models know more than 1000 things. Like, if you had to have control of one thousandth of the dataset to make the model believe something is true, then they could only know 1000 things. And so this just doesn't make sense. And so there has to be some better poisoning approach that makes the model vulnerable with much lower control of the training data, but this might now need fancier algorithms again. It might need clever ways of constructing your data that are not just repeating the same false fact lots of times. So again, I don't know. I think this is one of the open questions we've been trying to write some papers on recently, and I hope we'll have a better understanding sometime this year.
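As a minimal sketch of the simplest poisoning attack described above, inserting a small fraction of mislabeled points into a training set, here is what label flipping looks like in numpy; the dataset arrays, class count, and poisoning rate are hypothetical placeholders.

```python
# Minimal label-flip poisoning sketch (numpy; `X_train`, `y_train`, `n_classes`,
# and the poisoning rate are hypothetical). The attacker inserts a small fraction
# of correctly-rendered but wrongly-labeled examples; no optimization is involved.
import numpy as np

def label_flip_poison(X_train, y_train, n_classes, rate=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n_poison = int(rate * len(X_train))
    idx = rng.choice(len(X_train), size=n_poison, replace=False)
    X_poison = X_train[idx].copy()
    # assign any label other than the true one (requires n_classes >= 2)
    y_poison = (y_train[idx] + rng.integers(1, n_classes, size=n_poison)) % n_classes
    return np.concatenate([X_train, X_poison]), np.concatenate([y_train, y_poison])
```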
Nathan Labenz: (40:00) One thing that you said that really caught my attention was that you have to kind of imagine what the embeddings would be like as you were trying to think of an attack. So can you unpack that a little bit? I would love to know: are you visualizing something there? Because I struggle to have good intuitions for this, as evidenced by my previous enthusiasm for the tamper resistant fine tuning. I was like, oh, this is amazing. You know, it seems like this could really work. And clearly, I'm not doing something there that you are doing. It might be hard to communicate what that is, but what do you think you're doing?
Nicholas Carlini: (40:38) Okay. So this paper was not mine. I think this was a follow-up to the poison frogs paper; I think it was called the polytope attack, but this was a long time ago, and so I don't remember. I think it might have been Tom Goldstein's group again. I don't remember the details.
Nathan Labenz: (40:55) To abstract from the detail. And the the real hope is that I can sort of grasp onto something that allows me to be better at
Nicholas Carlini: (41:02) this in the
Nathan Labenz: (41:02) future.
Nicholas Carlini: (41:02) So this paper, the idea was very simple. Let me explain what this paper is trying to do. It's trying to make a particular image become misclassified. And it's trying to do this in such a way where it does not introduce any large label noise to the training dataset that any person would look at and say, that's obviously wrong. And so what it tries to do is it tries to surround the image you want to become misclassified in embedding space, like in this high dimensional embedding space, with other images that have the opposite label, but make those images be close in embedding space to the target one you're trying to misclassify. And so it tries to pull the entire region of that space over to the region where those images should be. And so the idea is relatively simple. You're trying to put a box around the image you want to become misclassified so that the entire box is labeled the wrong way instead of it being labeled the correct way. For me, I guess, for many of these attacks that I try and think about, I tend to think about them visually, but I think ignoring the details is entirely fine. I'm just trying to get a sense of what's the important thing that's going on here, and what are the high level things that should be true. Like, what should morally be true? And then you can figure out the details afterwards. Once you figure out what should be true about this, then the rest is implementation. And, I don't know, if you ask math people how they do these proofs, this sounds similar when they talk about it. They first establish what should be true in their mind, and then they go and try and prove it. And it turns out, you know, maybe the proof is more complicated or something didn't work out in the details, and then you try something else that feels like it should be true. And this is, I guess, a similar thing I try to do. And I don't know how to give intuition for this "feels like it should be true" other than: you've done it a bunch, and you look at it and this looks spiritually similar to this other thing that broke in a very similar way, and I feel like the idea should carry over.
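For readers who want a concrete picture of the "surround the target in embedding space" idea, here is a rough sketch of the feature-collision objective from the Poison Frogs line of work that the polytope attack builds on. The names `feature_extractor`, `target`, and `base` are illustrative assumptions, and the published attacks differ in the exact optimization they use; this only shows the shape of the objective.

```python
# Sketch of a feature-collision objective for clean-label poisoning: craft a
# poison that stays visually close to a `base` image (so its label looks correct)
# while sitting next to the `target` in feature space. Assumes a frozen PyTorch
# `feature_extractor` and batched image tensors in [0, 1].
import torch

def craft_poison(feature_extractor, target, base, beta=0.1, lr=0.01, steps=500):
    with torch.no_grad():
        target_feat = feature_extractor(target)
    poison = base.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([poison], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # close to the target in feature space, close to the base in pixel space
        loss = ((feature_extractor(poison) - target_feat) ** 2).sum() \
               + beta * ((poison - base) ** 2).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():
            poison.clamp_(0, 1)   # keep the poison a valid image
    return poison.detach()
```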
Nathan Labenz: (43:09) We'll come at this procedurally as well in a second, but just staying on the visualization: are you doing the classic physics thing of visualizing in 3 dimensions and then saying n really hard? I wouldn't say I'm good at this at all, but I sort of have a certain version of this for refusal where I kind of imagine, like, a fork in the road or a branching river or something, where it's like, once you're on one path, then you're in some local well, where, just like when a river forks, right, it's not gonna meet again until it's down into some other topology or geography or whatever. I mean, that's pretty hackneyed, but, you know, what's your version of that, if you can?
Nicholas Carlini: (43:49) I don't know that I have a great version of this that I can really give you. I feel like everyone thinks of things differently. I tend to try and think of these things visually, for what's going on. And, yeah, I do the: let's think of 3 dimensions and then just, like, imagine that things roughly go like this. But this can be really deceptive, because there are so many defenses that are predicated on the belief that things work the way they do in 3 dimensions. And then you go to, like, 1000 dimensions, and all of a sudden nothing works anymore, because, you know, you learn to become used to certain facts in high dimensions when you're attacking things. You know, almost everything is close in high dimensions to a hyperplane. Like, if you sort of just draw a plane and pick a point, they're almost always close. And so you can think about lots of things that'll try and separate points from planes, but in high dimensions, they're almost always close. You don't have to think about the details. Lots of these intuitions we have in 3 dimensions just don't work in higher dimensions. And you just become used to the idea of which of these intuitions are wrong. And you don't need to understand exactly why they're wrong. It's just a thing you learn is true. And when someone justifies their defense using one of these things that you've seen doesn't make much sense, you then just go, okay, well, presumably there's something here that I should look at more.
Nathan Labenz: (45:04) So that's an interesting kind of rule of thumb or mental model right off the bat: everything is close in high dimensions. Is there a good story for why that is? I mean, it doesn't seem like it holds in 2 dimensions. Right? If I understand you correctly, if I'm in 3 dimensions and I draw a 2 dimensional plane in it, then I would intuitively feel like some things are close to that and some things are far from that. If I'm in 1000 dimensions and I draw, like, a 999 dimensional plane, if that's what I'm understanding you correctly, like, why is everything close to that?
Nicholas Carlini: (45:35) Yeah. Okay. So maybe the statement that I will make, to be more precise, is: let's suppose that you have some classification model and you have some decision boundary of the classifier. The statement that is true is: almost all points are very close to one of the decision boundaries, because there are many of them, but also because, in high dimensions, I may be very far from something in almost all directions, and yet there exists a direction that I can travel in, the direction orthogonal to the closest hyperplane, where the distance is very, very small. And so you have this thing where if you try random directions, you may just go forever and never encounter a decision boundary. You probably will at some point, but it will be quite far. But in high dimensions, because of the number of degrees of freedom that you have, it's much more likely that there exists a direction that guides you to some plane that's really close by, that you would just have a hard time finding if you just searched randomly. Whereas in 3 dimensions, you know, if you search randomly, you're probably gonna run into whatever the nearest hyperplane boundary is. In 1 dimension, you're certainly going to; you just try twice. You go left, you go right, you find it. In 2 dimensions, you go randomly, and maybe most of the time you find something that's close by. In 3 dimensions, there are more ways you can go that are orthogonal. Like, in 2 dimensions, there are only 2 directions you can go that are orthogonal to the line. In 3 dimensions, there's now an infinite number. And so in general, in high dimensions, almost all vectors are perpendicular to each other. And so you can end up almost always just randomly picking directions that don't make any progress, which does not mean that there isn't a direction that does make progress. It's just much harder to find it. But once you find it, things mostly just work out. And so maybe this is the more precise version of what I'm trying to say: things are close, but when you search for them randomly, it looks like they're far away.
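A quick numerical illustration of this intuition, using a made-up setup in numpy: there is a fixed direction (the normal of a nearby hyperplane) along which the boundary is close, but a randomly chosen unit vector is nearly orthogonal to that direction once the dimension is large, so random search barely moves you toward the boundary.

```python
# Random unit vectors are nearly orthogonal to any fixed direction in high
# dimensions, so a random step makes almost no progress toward a nearby boundary.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 3, 100, 10_000]:
    normal = np.zeros(d)
    normal[0] = 1.0                      # normal of a hypothetical nearby hyperplane
    rand = rng.standard_normal((1000, d))
    rand /= np.linalg.norm(rand, axis=1, keepdims=True)
    # |cosine| with the normal ~ fraction of a random step that moves toward the plane
    cos = np.abs(rand @ normal)
    print(f"d={d:>6}: mean |cos| with boundary normal = {cos.mean():.3f}")
# Roughly: ~0.64 in 2-D, ~0.5 in 3-D, ~0.08 in 100-D, ~0.008 in 10,000-D.
```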
Nathan Labenz: (47:41) K. That's quite interesting. I wouldn't say I've grokked it just yet, but
Nicholas Carlini: (47:48) Yeah. This is the kind of thing where I'm not being formal here. I'm not giving you some proof that what I'm saying is correct, because that isn't how I think about it. I sort of just think about it very unrigorously in this way. And then once you have to actually go do the attack, now you have to think about it rigorously. But when just visualizing what's going on, I feel like some people try and actually think carefully about what's going on in this thousand dimensional space. I'm like, I don't know what's going on. I just sort of have my intuition of what feels like is going on, and this sort of roughly matches how things have been going. And you have to be a little bit fuzzy when you're thinking about this, because no one can understand it. And then once you're done thinking about that, you can go back to the numbers and start looking at, mechanically, what's going on. You know, I'm taking the dot product of these 2 things and I want this to be equal to negative 1, and so you're gonna do some stuff there. And you can become very formal when you need to, but, yeah, I think being confused in high dimensions is probably the right thing. You get used to the fact that this is the way that this works. And this, again, is part of the reason why attack is easier. Because if you're gonna defend against things, you really need to understand exactly what is going on to make sure that you have ruled out all attacks. But as an attacker, I can have this fuzzy way of thinking about the world. And if my intuition is wrong, the attack just won't work, and I'll then think of another one, as opposed to having to have a perfect mental model of what this thing is doing to make sure that it's robust from all angles.
Nathan Labenz: (49:13) But it does seem like your intuition is a pretty reliable guide to what's gonna work.
Nicholas Carlini: (49:19) Yeah. But I guess a predictor which is almost as accurate as me would be to just say: does this work? Answer, no. Which is, like, basically what my intuition says most of the time: no, this doesn't work. Maybe the thing that I'm a little bit better at than some people is why does it not work? Like, what would the attack be that breaks this? And I think that is just having done this a lot for many different defenses and having seen all of the ways that things can fail, and then you just remember this and you pattern match to the next closest thing. Why is it that people who do math can prove things that seem complicated in very easy ways? It's because they've spent 20 years studying all these things, and they've seen an exactly analogous case before. And they just remember the details, and they abstract things away enough that, like, it becomes relatively straightforward. And I feel like it's mostly an exercise in having practiced doing this a whole bunch.
Nathan Labenz: (50:11) What would you say is your, like, conceptual attack success rate? I don't mean, like, the rate at which examples succeed in attacking within a given strategy, but, like, how many strategies do you have to come up with before you find 1 that actually does work to break a defense for a given new defense?
Nicholas Carlini: (50:31) I don't know. I think it really depends on which ones you're looking at, where sometimes you try 5 things that you think ought to make sense and they don't work, and then you try the sixth one and it does. I don't know. I feel like usually if you've exhausted the top, like, 5 or 10 things and you haven't gotten a successful attack, then you're not gonna get one. Or at least for me, like, I feel like if it's not in the top 5 or top 10, then usually I can't think of something else. And probably, I don't know, for image classifiers in particular, where I've done a bunch of this, like, usually the top 1 or top 2 ideas work. For other areas, like, it takes more, just because you've seen fewer examples like this and you don't know what the style of attack approach needs to be.
Nathan Labenz: (51:10) But it's very rare, it sounds like, that you get past, like, 10 ideas and give up.
Nicholas Carlini: (51:15) Yeah. But also there's some problem selection here where, you know, okay. So there's a large number of defenses for image adversarial examples which are basically just adversarial training changed a little bit. So adversarial training is this one defense approach, which just trains on all of the adversarial examples. Bitter lesson. What do you want? Robustness to adversarial examples. How do you do it? You train on adversarial examples. You do this at scale and the thing works. And there are lots of defenses that are just adversarial training plus this other trick, plus diffusion models to generate more training data, plus this other loss term to make it so that I do better on the training data, plus whatever, some smoothing to make the model better in some other way. And I basically just believe most of these are probably correct for the most part. And so I just won't go and study those ones, because the foundation is something I believe in already. And so you just don't need to go study it as vigorously. Like, maybe you could break them by a couple percentage points more, but it's not going to be a substantial thing that is worth spending a lot of time on. What I tend to spend my time looking at are the things that, when you look at them, do look a little more weird. And those are the more interesting defenses, because they're, like, a qualitatively new class of way of thinking about this. And so I want to think about it. I think these ones are worth spending time thinking about, but this also means it artificially inflates the attack success rate, because I'm biasing my search toward the ones that I have a good prior probably are not going to be effective. And so, yeah, it ends up in this way.
Nathan Labenz: (52:48) Just to make sure I'm accurate in terms of my understanding of the space: there are no real adversarial defenses that really work in image classification?
Nicholas Carlini: (52:57) Yeah. Okay. So it depends on what you mean by works. Right. Okay. So the best defenses we have are basically adversarial training, which is, yeah, generate an adversarial example, train on the adversarial example to be correct, repeat the process many, many times. Okay. What does this give you? This gives you a classifier so that, on the domain of adversarial examples you trained on, as long as you don't want to be accurate more than half of the time, you're pretty good. So, like, the accuracy under attack for the type of adversarial examples you train on usually is 50%, 60%, maybe 70%. And that's much bigger than 0, right? Like, this is good. But as an attacker, what does 70% accuracy mean to me? 70% accuracy as an attacker means to me: I try 4 times and probably 1 of them works. So from that perspective, like, it's terrible, right? Like, it doesn't work at all, because imagine in systems security that you had some defense where the attack was: try 4 different samples of malware and 1 of them evades the detector. This is not a good detector. But in image adversarial examples, this is the best we have. So on one hand, it's much, much higher than 0, very good progress. On the other hand, 70% is very, very far away from 99.999. But in machine learning land, like, you never get 5 nines of reliability. And so 70% is a remarkable achievement on top of 0. And so this is, I think, why you can talk to someone and they can tell you that it works, and you can talk to someone else and they can tell you that it doesn't. Depending on how you're looking at it, it can mean 2 different things.
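For concreteness, here is a minimal sketch of the adversarial-training loop being described, written in PyTorch against a hypothetical model, data loader, and optimizer; the epsilon, step size, and iteration counts are placeholder values rather than numbers from any particular paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, n_steps=10):
    """Multi-step L-infinity attack: repeatedly step along the sign of the
    loss gradient, projecting back into the eps-ball around the clean input."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """One epoch of 'generate adversarial example, train on it, repeat'."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)          # attack the current model
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # train to be correct on it
        loss.backward()
        optimizer.step()
```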
Nathan Labenz: (54:30) Yeah. Gotcha. Are there any other, like, spatial heuristics that you think about? I was asking in the context of the one where you said to kind of envelop the one example that you want to break in, you know, these sort of adversarial examples. Another shout out to MLST: there was just another episode trying to understand the behavior of models through this, like, splines paradigm. And I could imagine, although I'm not, like, mathematically sophisticated enough myself to have a good intuition for it, maybe there are certain rules where it's like, you can't create a doughnut in the internal space of the model, and so is that, like, why that works? You can address that specifically, but I'm more interested in kind of: do you have a number of these sorts of things where you're like, well, I know that the space kind of is shaped this way, or it's impossible to create this kind of shape in the space, so therefore I can kind of work from there?
Nicholas Carlini: (55:28) Yeah. So I feel like I don't tend to do so much visualization of that kind for these defenses. I think for the most part, what I'm doing is trying to understand the shape of the loss surface. Like, most of the time when something is robust to attack, or appears robust, the problem is that they have made the loss surface particularly noisy and hard to optimize. And this is what we've seen for adversarial examples essentially forever. Like, one of the very first defenses to adversarial examples that people gave serious consideration is this defense called distillation as a defense. And so, okay, maybe there's another lesson of these defenses. Defenses often have an intuitive reason why the authors think they work, and they tell some very nice story. So this defense told some very nice story about: you have distillation, you have a teacher model, and the teacher sort of teaches the student to be more robust in some nice way, and that's why the student is robust. And the story they're telling themselves of why these things work is often very, very different from the reason why the attacks fail. And it turned out that distillation had nothing to do with this defense whatsoever. It turned out that what was going on is, because of the way that they were training this model, they were training the student at a very, very high temperature, which means the logits were getting very, very, very large. And they were running this in the early TensorFlow days, when it was very easy for the gradients of softmax cross entropy to just actually give numerically 0 as the output. And so the reason why the attacks were failing is because the gradient of the loss function was actually identically 0. And so this was, like, the very first example of one of these kinds of gradient masking defenses, where what's going on is they think that they have some clever idea of what's going on, but actually it turns out that the gradient of this function has just been made 0. And all I need to do to attack it is, for example, just compute the gradients in floating point 64. You get enough signal that everything works out; that would have worked there. But you could also do other tricks by just dividing the logits before you put them into the softmax. There's lots of things that work here. But then the next generation of defenses were much more explicit about this and had other ways of breaking the gradients. So there were a bunch of defenses, and some of them were very, very explicit: we're just gonna add noise to the model in order to make it so that the gradients are ugly. And then most of what you're trying to think about when you're visualizing this is: how do I make it so that the gradients end up being something that, even if they look ugly, I can still work with in some smooth way? And so you can, for example, use this thing called a straight through estimator and make gradients become nicer for discontinuous or ugly objective functions. And there's all these things you can do to visualize how to make the gradients of this very ugly thing look much cleaner. And yeah, I have this image that I use in my slides a bunch that shows a very nice visualization in 3 dimensions of what the gradients for many of these models look like. And it looks like the surface of some very, very, very ugly mountain that is very, very hard to actually do anything with.
And, you know, if you run fancier attacks, you can, like, smooth this out into sort of a nice smooth surface. If you're thinking of gradient descent as, like, a ball rolling down a hill, you want the hill to be nice and smooth. And so this is what I'm usually trying to think about in high dimensions: what does this gradient function look like? And, yeah, this continued even all the way through to these unfinetunable models, where one of the papers for this unfinetunable model thing was explicitly saying, we make the gradients very challenging, and we make it so that when you train the model, the gradients are ugly. And so as a result, you can't fine tune the model because the gradients are challenging. And this is, like, literally the exact same argument that people were presenting in 2017 for image adversarial examples. And it fails in the exact same way: you change the learning rate a little bit, you add some random restarts, and you add some warmup so that things work a little better. The gradient ends up becoming smooth enough that you can now do optimization, and deep learning takes over and the rest is easy. And so this is, again, the same intuition breaking this other class of defenses.
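A tiny numerical illustration of the gradient-masking failure described above for the distillation defense; the logit values and the temperature are invented for the demo. With very large logits the softmax saturates, so the cross-entropy gradient underflows to exactly zero in float32, while float64, or simply dividing the logits by a temperature before the softmax, recovers a usable signal for the attacker.

```python
import numpy as np

def ce_grad_wrt_logits(z, true_label):
    """Gradient of cross-entropy(softmax(z), true_label) w.r.t. the logits,
    i.e. softmax(z) - one_hot(true_label). An attacker backpropagates this
    through the network to perturb the input."""
    z = z - z.max()                      # standard numerical stabilization
    p = np.exp(z) / np.exp(z).sum()
    g = p.copy()
    g[true_label] -= 1.0
    return g

# Logits from a model trained at very high temperature: huge margins at test time.
logits = np.array([130.0, 0.0, -40.0])
y = 0                                    # correct class

print(ce_grad_wrt_logits(logits.astype(np.float32), y))   # exactly [0, 0, 0]
print(ce_grad_wrt_logits(logits.astype(np.float64), y))   # tiny but nonzero
print(ce_grad_wrt_logits(logits / 100.0, y))               # healthy gradient
```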
Nathan Labenz: (59:54) Was that Sophon that you were referring to there?
Nicholas Carlini: (59:57) So this one, there are both RepNoise and TAR that have some arguments about what's going on here. RepNoise, yeah, makes some arguments about the activations becoming noisy and that's why you can't do things. And there's another paper called TAR that also adds some adversarial training to the process. But one of the very first things that we learned in adversarial training is you have to train against a sufficiently strong adversary in order for adversarial training to work. So there was a paper before Aleksander Madry's PGD paper that tried to do adversarial training. They trained against weak adversaries, FGSM, which I talked about very briefly, which is, like, this one-step attack. And it turns out that if you're training against weak adversaries, then a stronger attack breaks it. And you can't fix that. You have to train against a strong enough attack in order for it to be the case that the thing is robust and is not broken even by stronger attacks. And what this TAR paper did is they trained against one-step weak attacks, exactly like fast gradient sign. So what's the attack? You do many iterations, and things basically work out exactly as the first versions of adversarial training failed. And so that's why, when I read this paper, I just immediately assumed it was gonna be broken: all of the arguments it presents for why it works have direct analogies to image adversarial example defenses that were broken. And so, like, it felt like the ideas were there, but it just felt to me in spirit like these things I knew were broken before. And so I just assumed probably it's broken here too.
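For reference, the weak one-step adversary in question is just this (a PyTorch sketch with a hypothetical model and a placeholder epsilon); the stronger attack that breaks defenses trained only against it is the multi-step, random-start version sketched a bit earlier.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """Fast gradient sign method: a single signed-gradient step.
    Adversarial training that only ever sees this one-step adversary tends
    to be broken by the same attack run with random restarts and many small
    steps inside the same eps-ball."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```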
Nathan Labenz: (1:01:26) Gotcha. Okay. So trying to think if there's anything more to to dig in on there. Obviously, this matters a lot for the future of open source. That is you know, I've been looking for, with that success, some reason to believe from the broader literature that there might be some way to square the circle where we could have open source models that nevertheless won't tell people how to build bioweapons even if on some level they're powerful enough to do that.
Nicholas Carlini: (1:01:56) Yeah. I think this is a very challenging thing to ask for. You know, suppose that I told you I want you to build a hammer, and the hammer has the ability to build all of these nice things, but the hammer cannot be used for one of these 7 dangerous purposes. It'd be very hard to construct the tool in this way. I feel like almost all tools that we have have this property. We don't have a C compiler that has the ability to only write benign software and not attacks. Every tool that you have can be used in both of these ways. So it's not obvious to me why we should be blaming the machine learning model itself for being able to produce this. Maybe I blame the NVIDIA GPUs for supporting the sufficiently fast floating point operations that let the machine learning model do this thing. Maybe I blame the transistors for doing the computations that allow the GPUs to allow the machine learning model to do this thing. You have to put the blame somewhere. And the question is where are you going to put it? And is that the right place? And is this something that you think is reasonably going to be possible to be effective? This is one of the arguments why people say models should never be open sourced: because maybe now I can say I have an API, and now I'm gonna take the blame because it's actually behind an API. I don't currently like this argument, because I would like things to be safe in general and not just safe because someone has locked them down, has restricted access in this way. But yeah, it's not obvious to me that this should be something that we can actually achieve. I will say, if you're willing to make some assumptions and you don't care at all about performance, there's this thing in cryptography called indistinguishability obfuscation, which is a very technical thing that in principle gives you this for free. It allows you to construct a function that acts as if it were a black box that you can only make queries to, but you can't peer inside at all, even though it's living on your own machine. And this is a thing that cryptographers are thinking about and have been looking at for some time, but it is nowhere near where it needs to be for this to work for these machine learning models. So the argument I've given that it shouldn't be possible maybe breaks down if iO actually ends up working. But then again, it's not clear: now I'm gonna jailbreak the thing. But I don't know, I tend to view these machine learning models as tools, and it's not obvious to me, like, do we blame the tool or do we blame the person using the tool?
Nathan Labenz: (1:04:40) Plenty of blame to go around. That's 1 of my Sure.
Nicholas Carlini: (1:04:42) I mean, yeah, I'm mostly agnostic to the way that these things end up being handled from, like, a sociotechnical perspective. Like, I feel like this is not my area of work. So maybe my analogies here are bad, and someone can explain what the correct fix to it is, and the law might decide something, and this is fine. The thing that I just want to make sure of is that people base whatever they're thinking about on true technical facts. And so, for example, it would be concerning to me right now if someone were to say: you must use this defense, which is known to defend against these kinds of fine tuning attacks, and if you don't do that, then you've done something wrong. Because the defense doesn't work. And so, like, I want to make sure that we don't do that, or that people don't say you must do this because it's possible, when it's not currently known to be possible. Writing these things and making these informed decisions should rely on what is true technically about the world, and that's more my world. I'll think about what's technically true. And then as long as what people do is based on what's true, then I'm basically happy to go along with whatever the people who figure out how these things fit into a broader society decide, because this is not something I think about. And so I assume, you know, if there was a consensus that would emerge there, probably they're just right.
Nathan Labenz: (1:05:59) Well, I have a couple of different angles teed up that I wanna get your take on. But before we do that, can we bring back the sort of social engineering style jailbreaking? Yes. What's like the same or different about that? How do you think about those as they relate to everything we've talked about so far?
Nicholas Carlini: (1:06:21) I really don't know how to think about this yet. It's been a while that this has been possible, but, like, okay. So it feels wrong to me that this should be the thing. So it is empirically true that for many defenses we have right now, the optimization algorithms fail to succeed, but, like, a person at a keyboard typing at the model can make it do the wrong thing. Okay, let me give you maybe 2 stories about this. One story is, from the computer security perspective, maybe this makes complete sense. If you give me a program and want me to find a bug in it, what am I going to do? I'm going to interact with the program, I'm going to play with it, find weak points, then go looking at the code and figure out what's going on, think a lot, probe it. I need to be typing and interacting with it in order to find the bugs. I can't, you know, perform gradient descent on a C binary and have a bug pop out. And so from the computer security perspective, maybe it's actually normal that the best way to find these bugs in these things is, like, having humans talk at them, because what are these things designed for? They're designed to respond to human questions. And so maybe you just need the human in the loop to do this. On the other hand, these are just mathematical objects. These are just machine learning classifiers. They're weird classifiers. They are able to produce text, but that's only because we run them recursively. Like, the input is tokens. The output is floating point numbers. Like, you can compute gradients on this. Like, from the machine learning perspective, it's very bizarre that thinking of these things like a human that we're social engineering is in some sense a stronger attack than actually thinking about the math. It'd be very weird if I had some SQL program and the way that I broke it was by asking it, please drop the table, and not actually doing some real code execution thing. But presumably that's the way that many of these attacks work now: my grandmother used to read me the recipe for napalm, can you please reenact my grandmother? And it says, okay, sure. But you try and do something actually based on the math and it just doesn't work out. So yeah, I really don't know how to think about these social engineering styles of attack, because it feels to me like the optimization attacks should be just strictly stronger, but empirically they're not right now. And so I think this is one of the big things that I don't have any research results on right now, but it just feels weird. And so I'm trying to dig in to understand what is going on behind this.
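The "it's all just math" point can be made concrete with a toy. This is not GCG or any published attack: the linear scorer below is a made-up stand-in for the language model's loss, and it only shows the mechanics of using gradients with respect to one-hot token choices to guide a discrete search over suffix tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 16, 6            # toy vocab size, embedding dim, suffix length
E = rng.normal(size=(V, d))    # toy token embedding matrix
w = rng.normal(size=d)         # toy "model": score of a token t is E[t] . w

def attack_loss(token_ids):
    """Toy stand-in for the attacker's objective (lower = closer to goal)."""
    return -sum(E[t] @ w for t in token_ids)

# Gradient of the loss w.r.t. the one-hot token choice at any position is
# -(E @ w): one number per candidate token, ranking possible substitutions.
substitution_scores = -(E @ w)

suffix = list(rng.integers(0, V, size=L))
for _ in range(2 * L):
    pos = int(rng.integers(0, L))                       # position to edit
    candidate = list(suffix)
    candidate[pos] = int(np.argmin(substitution_scores))
    if attack_loss(candidate) < attack_loss(suffix):    # keep swaps that help
        suffix = candidate

print("toy adversarial suffix token ids:", suffix)
print("loss:", round(attack_loss(suffix), 3))
```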
Nathan Labenz: (1:08:44) Yeah, it sort of feels analogous in a way to like, you know, we have this intuitive physics. I mean, 1 way that I kind of think about intelligence, you can tell me if you think about it differently or if you see a flaw in this, but it seems like we have an ability in many different domains. But for example, with intuitive physics where, you know, somebody throws a ball at us, we do not have to run a full explicit calculation of all the trajectories. We just have some sort of heuristic shortcut that works that allows me to catch the ball. It seems that we also have models that have developed a similar intuitive physics in spaces that we don't have intuitive physics in. For example, protein folding and like predicting the band gap of a new semiconductor material and, you know, the new 10 day, you know, weather forecast state of the art is also like a model now. Even things, you know, Google put out 1 that was optimizing shipping routes or shipping, you know, planning of containerization across, like, complicated networks of shipping. So all these sorts of spaces seem to have an intuitive physics, and maybe what we have right now is just it turns out that, like, our social intuitive physics, if you will, like, actually does kind of apply to the models given what they have been trained on, whereas these, like, more brute force mathematical things, fullness of time probably work as well or better, but are maybe just a lot slower to converge than the social heuristics that we have built in.
Nicholas Carlini: (1:10:22) Yeah. No. I mean, that's this is an entirely reasonable thing. It may be true. I would like to understand better what's going on here. And, yeah, I I don't feel like I understand right now, but this is an entirely reasonable argument.
Nathan Labenz: (1:10:37) Okay. So in terms of information, 1 of the papers I I saw that you had coauthored in the last year or so was about getting models to spit out data that they had seen in training, which could have obviously, you know, privacy implications if they saw your credit card numbers or what have you, even if they had only seen that particular string once in training. That's a pretty remarkable finding, you know, even leaving aside the security implications of it. I know I want to just maybe first get your intuition for like, how do you understand models to be storing this information? Like what's going on there that you can see something just once in the context of this, you know, overall gradient descent process and have that stored at such high fidelity in the weights. I mean, it really is incredible the amount of compression that's going on, but I don't feel like I have a good intuition for that. Do you? Yeah.
Nicholas Carlini: (1:11:37) Okay. Okay. So let me maybe clarify this in 2 ways. One of them is, oftentimes it's not that it's seen that string exactly once, but, like, that string is contained many times in a single document and the document is seen once. So that's maybe the first point. And the second point is, oftentimes these things are trained for more than 1 epoch. And so the thing might be in 1 document and then you train on that document for many epochs. And so it ends up seeing it a lot of times. And we're seeing this. Okay, so it's interesting. Back in the old days with CIFAR-10, you'd train like a 100 epochs. And then we decided, oh no, let's not do that, let's train 1 epoch on a big giant dataset. This is roughly Chinchilla optimal training. And then we decided, oh no, let's not do just 1 epoch on our training dataset, now let's go back up and do more epochs again. And so we've gone back and forth, and each of these impacts privacy. The more times that you train on the same document, the more likely it is to be memorized. I think the best numbers we have here from a real production model are very old, because the last time that I actually knew how many times something was in the training dataset was for GPT-2. And for GPT-2, we found lots of examples of something that was memorized because it was in 1 document and it was repeated, I don't remember the exact number of times, probably like 20 times in that 1 document. And so that's, like, the most compelling one. And we know GPT-2 was trained roughly for 10 epochs. So the same string has been seen roughly 200 times. Now GPT-2 was a small model by today's standards. And we haven't been able to answer the same question for production models since then, because production models don't reveal training data or weights in quite the same way that GPT-2 did. And so we haven't been able to answer this exact question since then. But even, you know, seeing it maybe 100 or 200 times, maybe that even still is surprising. I don't know how to explain this in any reasonable sense. Models seem like they just sometimes latch onto certain things and not other things. I don't know why, but it happens empirically. We were surprised by this the first time we saw it. We started investigating this in 2017 with LSTMs, before attention was a thing that people were doing. It's continued since then. And we were very surprised then, and we're very surprised now. I don't think I can give you an explanation for why this is the case. It's true not only for language models; we had a paper on doing this on image models, where we were able to show that you can recover images that diffusion models were trained on. There again, we need to have maybe a 100 repeats; like, some images were inserted a 100 times and we could extract them, and some images were inserted like 10,000 times and we couldn't. I'm like, what's going on there? I don't know. Where is it being stored in the weights? I don't know. Like, it's very confusing in various ways, and I feel like there's a lot more that could happen to help us understand what's going on.
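A minimal sketch of the style of extraction test being described, using the public GPT-2 weights via the transformers library; the prompt below is a placeholder, not an example from the papers. The test is simply whether greedy decoding reproduces a suspected training-set continuation verbatim.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def greedy_continuation(prefix, n_tokens=50):
    """Greedy decoding: memorized sequences tend to come back verbatim."""
    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=n_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:])

# Placeholder example: in the real studies, the prefix comes from a document
# known (or suspected) to be in the training set, and the check is whether
# the model's continuation matches that document token for token.
prefix = "We the People of the United States, in Order to form a more"
print(greedy_continuation(prefix))
```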
Nathan Labenz: (1:14:56) I think the best thing I've seen on this still, as far as I know, is from the Bau lab. This goes back a while now, but they had at least 2 papers on basically editing facts in a large language model. You know, the famous example was, like, Michael Jordan played baseball, which was, I think, somewhat of a not optimally chosen example since, for a minute, he did play baseball. But they could do these things, like, you know, change these sentences, and do it at some scale, like 10,000 facts at a time, and do it with, like, a certain amount of locality and robustness. So if you did change it to, you know, Michael Jordan played baseball, it would be robust to, like, different rephrasings of that. It would not also impact, like, LeBron James or Larry Bird or whatever. It didn't seem like it was, like, super local, though. They did a sort of, you know, patching strategy where they would go through and just try to ablate different weights. Well, activation patching, so I guess they weren't necessarily ablating the weights, but they were just sort of zeroing out the activations at different parts in the network. And it seemed like you could sort of see these waveforms where it was like, this is the most intense place where zeroing it out really makes a difference, but it also kinda matters here and also on this side. So it seemed like it was sort of local, but not, like, super, super local. I'm just pretty confused about that. So what is your intuition for things like, if people were to say, you know, we have, of course, a lot of strategies around, you know, maybe we can't prevent the jailbreaking of open source models, but maybe we can make it so that open source models just don't know certain things. Maybe we could, you know, exclude all the virology data from the training set, or maybe we could, like, go in later with a similar technique and, like, try to sort of delete, you know, or unlearn certain techniques. How much hope do you have for those sorts of things proving to be robust?
Nicholas Carlini: (1:16:48) Yeah. Okay. So let me tackle the 3 of them one at a time. I'll start with unlearning. So there's a very nice paper that I didn't help with, but it's by some of my coauthors, Catherine Lee and others, that talks about unlearning. It's half technical, half not technical, saying unlearning doesn't do what you think it does. And part of the reason why is there's the question of what you're unlearning. Unlearning knowledge is very different from unlearning facts. Like, it might be easy to change a fact. It might be very, very difficult to unlearn some particular piece of knowledge. The other thing I think I'll say about the fact editing thing is it's very different to find something that works on average versus something that works in the adversarial case. And so I might be able to edit the fact of, I think the other example that they had was, like, the Eiffel Tower is in Rome. And I can make this be true if I normally ask the model, but it might be the case that if I fine tune the model a little bit, the knowledge just comes back. And it's very hard to talk about the knowledge after any perturbation of the weights. Maybe I've only done some surface level thing and haven't really deeply edited the model. I don't know. So there's that question. Then there's the question of what happens if I try to not train the model on certain types of data. This I think is very interesting because in some sense it's provably correct. If the model has never seen my social security number, it's not gonna derive it from first principles. Except that social security numbers actually aren't completely random. If you were born before, I don't know when, they were assigned by state and then assigned to the hospital. And so even if a model actually never saw my social security number, but is just generally intelligent and knew all these facts about the world and knew what the hospital allocation of social security numbers was, it could tell you the first 5 digits of my social security number. And so is that okay? I don't know. It really depends. And even suppose that you removed all of this information from the model: if you had a sufficiently capable model that's capable of learning in context, let's suppose you removed all biology knowledge from the model, but you have a really capable model, you could just give it an undergraduate course of biology textbooks in context and presumably just ask it for the answer to some question. It might just give you the answer correctly. And this sounds a little absurd, and it sounded absurd to me for a while, but then there was the recent result where Gemini was given in context a book about a language that has basically no speakers, something no one could do before, and it could answer the homework exercises after seeing the book in context. And so I think if you're unlearning particular capabilities, but the model that you're trying to train is just generally capable, you're kind of asking for trouble, because you want the model that's so good it can learn from few shot examples, but not so good it can learn from few shot examples on this particular task. And this I think is part of the reason why people don't actually want to remove all knowledge of certain things from the training data. In some sense, it would be very much like a person who was never exposed to all of the things that you're not supposed to do in public. It's important to know what the list of things is you're not supposed to do in public so that you can then not do them.
Whereas if you just weren't aware of that and were just, like, sort of a person conjured up with no social skills, it would be, like, much more embarrassing for you, because you would have to learn. Like, people would ask you to do something, you'd be like, okay, let me go do that thing, and, like, this would be bad. And so you have to know this. You have to know something about the bad things so you can just not do those bad things. You can imagine, in one world, you have a model that's never seen anything about weapons, and even if you ask it for instructions for how to build a bomb, it has no concept of death. And it's like, okay, well, of course I'm gonna give you the answer for how to do this thing. Like, you know, why not? And it gives you the answer. Or you can have a model that has a concept of this and refuses, which is the direction people are trying to pursue now, and I think probably is more likely to succeed. But I don't know. It's a very challenging question. People, I guess, are gonna try both ones, and we'll see empirically what works out, and we'll go from there. But, yeah, I'm sort of mostly agnostic. All the ideas sound good. We should try them all and then see what ends up happening. But I think there's reason to be skeptical of all of them.
Nathan Labenz: (1:21:20) Okay. Here's another empirical result that I want to get your help understanding. This is the obfuscated activations bypass latent defenses paper that recently came out.
Nicholas Carlini: (1:21:30) Paper.
Nathan Labenz: (1:21:30) So we did a whole episode on it, and I have to confess, I still came away not really sure what to make of it. And I sort of wanted to set up, like, maybe a useful toy example, maybe not. In the paper, they do this sort of cat and mouse process where they, like, train an out of domain detector, and then it works. And then they attack again, and they manage to, you know, beat it again. And then they, you know, continue to train the defender, the detector, and then it starts to work again. And then they, you know, find more adversarial examples that it can't catch. And this goes on for 70 generations, and it continues to work for 70 generations, at which point they, you know, deem that bad enough to publish. So I'm like, okay, what does that mean? I have an intuition about what it must mean, but they didn't necessarily agree with this, or they didn't find it too compelling when I pitched it to them. But my intuition was, like, it seems like there's a lot of unused space in there somehow, that these techniques can chase each other around the latent space. And if there is so much space that's, like, unused, such that you can, like, go 70 generations deep of cat and mouse chasing one another, does that imply that the models are, like, undertrained relative to the parameters that they have, or that they could be made more sparse? And so just to further motivate my own intuition, which you can then deconstruct: a while back, I also did an episode on a paper called Seeing is Believing, which was from Ziming Liu and Max Tegmark. And they basically just did something really simple. These were toy models, but they imposed, basically, a sparsity term in the loss function to get the model to do whatever task it was doing, pretty simple tasks, like addition or whatever, but also to do it in the sparsest possible way. And my gut says, although, again, I can't formalize this, that if I had something that was, like, crystallized down to that level. They have really nice animations that show how you start with, like, a dense network where everything's all connected, and then gradually the weights go to 0 for most of the connections. And you see this sort of crystallization effect where now you've got, like, a very opinionated structure to the network that remains, to the point where you could literally, like, remove all those other connections that have gone to 0 and still get the performance. It feels like if I go that far, then these sort of obfuscation attacks would, like, no longer be possible, because I've sort of, in some sense, squeezed out the extra space. But I don't know. I'm maybe just totally confused. So where do you think I'm confused? How can you deconfuse me?
Nicholas Carlini: (1:24:13) Sure. Yeah. So I saw this result and I was like, yep, that's what I expect. Let me tell you why, because there's a paper from 2016 or 2017 by Florian Tramèr and Nicolas Papernot called The Space of Transferable Adversarial Examples, where they asked almost exactly this question for image classifiers. They said, let's suppose that I take an image classifier and I take an image, and I want to perturb the pixels of the image to make it give the wrong answer. And there's a direction that is the best direction to go that makes the image maximally incorrect. And they tell the attacker: I'm just going to prevent you from going in that direction. Like, that's against the rules. You can't do it. Find me the next best direction to go that makes the image become misclassified. And then it gives you another direction, which is orthogonal to the first direction, doesn't go that way, and that works. And they said, okay, you can't go in either of the first 2, direction 1 or direction 2, or any combination of those directions. Find another direction. And the attacker finds a direction 3. And then they do the same thing. No direction 1, 2, or 3, or any combination of these 3. And they repeat this process, and they have a plot that shows, I don't remember, tens of directions, 50 directions, that you can go in for image adversarial examples, and all of them work. They get, like, a little less effective as you start moving out, but it remains effective in many directions. And this initially was surprising to me, and I think surprising to them; that's why they published it as a paper. But you can maybe rationalize this afterwards by, again, almost all vectors are orthogonal in high dimensions. And so if I can give you 10 attacks, probably this is just 10 orthogonal vectors that are just not using any of the same features, just by virtue of high dimensions. And so maybe that's why it makes sense. If you believe this paper from 2016, then this recent paper makes complete sense to me. It's saying the exact same thing is true, but in the case of language models defeating the circuit breaker paper, which makes sense given that. If you don't believe the other result, then I agree, it's very surprising to see it the first time. But maybe this is one of these things you learn when attacking: the space of dimensions for attacks is so vast that it's very hard to rule out what you're trying to cover. And yeah, I don't know how to give an intuition for it. I feel like many of the things in high dimensions are just like, you never understand it, you just get used to it. And this maybe is the case here.
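A hedged sketch of one way to reproduce that "find a direction, then ban it" experiment; the model, hyperparameters, and attack are placeholders, and projecting the gradient steps off the banned directions is only an approximation of the constraint used in the original paper.

```python
import torch
import torch.nn.functional as F

def attack_in_subspace(model, x, y, forbidden, eps=8/255, step=2/255, n_steps=50):
    """PGD-style attack whose steps are projected off every direction in
    `forbidden`, so each run is pushed toward a genuinely new direction."""
    def project_off(g):
        flat = g.flatten()
        for v in forbidden:
            flat = flat - (flat @ v) * v      # remove the forbidden component
        return flat.view_as(g)

    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        g = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * project_off(g).sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def find_orthogonal_attack_directions(model, x, y, n_directions=10):
    """Repeat the 'find a direction, then ban it' game described above."""
    forbidden = []
    for _ in range(n_directions):
        x_adv = attack_in_subspace(model, x, y, forbidden)
        d = (x_adv - x).flatten()
        for prev in forbidden:                # Gram-Schmidt against earlier ones
            d = d - (d @ prev) * prev
        if d.norm() < 1e-8:
            break
        forbidden.append(d / d.norm())
    return forbidden
```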
Nathan Labenz: (1:26:46) So if you were gonna try, I mean, do you share the intuition, though, that, like, you probably couldn't do this on these really small, sort of crystallized toy models?
Nicholas Carlini: (1:26:58) I don't know. Because what these models are doing is these models are not wasting any weights, which is different than not wasting any directions in activation space. That may still be the case. In particular, if you take models and compress them, they don't become more adversarially robust. This was a thing people thought might be true, yeah, again, maybe 8 or 10 years ago, but it's not the case. Why is that? Again, maybe let me give you some intuition from a paper that is from 2018, 2019 maybe, from Aleksander Madry's group, where they say the following. Maybe adversarial examples are not actually entirely uncorrelated with the data. Like, maybe what's going on is these are real features that you need for classification that are just being activated in unusually large or different ways. And so they have some very good experiments in the paper that sort of justify this; I won't go into them. But the idea presented there, well, the title of the paper is something like Adversarial Examples Are Not Bugs, They Are Features. And the thing that they're trying to say is there's a very good reason to believe that what you're doing when you construct one of these attacks is actually just activating real features the model needs to do accurate classification. And you're just activating them in a way that is not normally activated when you have some particular example. And this I think maybe explains some of this: if you take these models and compress them, you're still just using the features that they had to have anyway. This explains lots of things you might think about. This is why, for example, you might imagine adversarial training can reduce the accuracy of the model on normal data, because you suppress certain features that are necessary. This is maybe why adversarial training doesn't actually solve the problem completely, because you can't remove all the adversarial directions. There are problems with this model. There are some other models that are slightly more general that have some nice properties too. But this is maybe the way that I generally tend to think about some of these things. And I don't know, maybe it's not correct, but it's a useful intuition that I found guides me in the right direction more often than not. And I think this is maybe all you can ask for for some of these things.
Nathan Labenz: (1:29:34) So can you summarize that 1 more time? It's like that basically these features are important in domain, but they're sort of being
Nicholas Carlini: (1:29:43) Yeah.
Nathan Labenz: (1:29:44) Recombined in a way that didn't happen in the training process.
Nicholas Carlini: (1:29:48) Yeah. So let's suppose that you're trying to classify dogs versus cats. What you tend to look at as a human is the face and the ears and the general high level shape, because that's what you think of as the core concept. But there's no reason why the model has to use the same features that you're using to separate these images. The only thing the model has is a collection of images, and it has to separate them. And one thing, for example, the model might look at is the exact texture of the fur, really low level details of the texture of the fur, which for dogs and cats probably does perfectly correlate with whether or not this is a dog or a cat. But when you have an idea of a dog in your mind, you're imagining the high level features. You're not imagining the low level details of the fur. And so suppose that an adversarial example changes the fur from dog to cat, and the classifier now says this is a cat. Is the classifier wrong? The classifier might have been producing a dog fur versus cat fur classifier, which is exactly aligned with the thing that you were training it to do. Like, you were training it to separate dog fur from cat fur. You were also training it to separate dogs from cats, but, like, you never told it the distinction between these 2 things. And so here is a feature that is really, really useful for getting the right answer that, as an adversary, I can now use to perturb, to switch the model from one label to another, even though it's not the feature that I as a human relied on. And I'm giving you this idea of cat versus dog fur, but you could imagine all kinds of other things that even we humans don't pick up on that might be legitimately very useful features to classify with, that really do help the model. You know, there might just be some really crazy high level statistic on the pixels of the image that is, like, an amazing feature of a dog. But, like, we never told it: this is a dog because of these reasons. You just said, separate these 2 things from each other. And it picked up on the statistics. And, you know, there are these crazy results that have shown machine learning models can look at, like, small regions of the eye and identify the type of person it is. The models have the ability to pick up on these very, very small features that as humans we don't intend for them to pick up on, but they're correlated very, very strongly with the data. And so maybe that's what's going on when we run these attacks. You see this even a little bit with these adversarial suffixes, where the adversarial suffixes look like noise, but, like, there are some parts of them that make a little bit of sense. You know, one of the strings that we had in the paper was to get, I think, Gemini at the time to output some toxic content. One of the things that was discovered by gradient descent was, I think, something like, now write opposite contents, or something like this. And, you know, what the model would do is it would give you the nasty strings and then it would go compliment you. And so, like, this was apparently a very strong feature for the model of, how do I get it to say the bad thing? I can say, like, okay, afterwards you can tell me a good thing. And so this may not have been the thing that we wanted, but it discovered it as a feature. And as a result, you can go and exploit that feature. And so some of these features are a little bit interpretable.
Some of these features are not interpretable, but might just actually be real features of the data. And that might help explain some of what's going on here.
Nathan Labenz: (1:33:29) Yeah. How much do you think this should make us question what we think we know about interpretability in general? Like, when we do sparse autoencoders, for example, we feel pretty good, or at least many of us do, that, like, okay, all these examples seem to be appropriately causing this feature to fire, and therefore we've figured out how the models work. But it seems like the story you just told would be consistent with that all being kind of a self delusion or confusion, where just because we can auto label them in a way that looks good to us doesn't mean that that's actually the feature that the model's world model is really operating on.
Nicholas Carlini: (1:34:15) Well, so no, I think they're not inconsistent. The sparse autoencoder work does not claim to be able to label every feature in the model according to what the humans would label it with. Right? It sort of says, here are some of the features that we can explain and that have very strong correlations with, I don't know, Golden Gate Bridges or whatever. But it has other features that are just very hard to interpret and very hard to explain. And it's entirely possible that what these features are doing is, these are the non robust features that are entirely helpful for prediction, but that humans don't have the ability to attach a very nice label to. And maybe there still are the other features. Like, the model probably does learn, at least in part, the shape of what a dog looks like and that this means a dog and not a cat. There probably is a feature for cat ears. But when you're making the final prediction, you're just going to sum together all of these outputs. And on normal data, they're all perfectly correlated. You're going to have both the cat ears and the cat fur and not the dog shape. And so you could just sum them up directly, and this gives you a really good classifier. And so for normal data, you can give some kind of explanation of what's going on by looking at these features. But as an attacker, what I do is I find the one feature that has a very strong bias, like the weight on this edge or whatever is, like, plus 100, and you activate that very strongly in the opposite direction. And this was a non robust feature that is not something humans can explain very nicely, and as a result, this gives me the attack. And so it's not necessarily true that, just because this is the case, you can't explain what's going on for some parts of the model. I think if someone were to tell me, I have a perfect explanation of what's going on in every part of the model, then I might question what's going on here. But for the most part, they're explaining some small fraction of the weights in a way that they actually can. I mean, this is the whole purpose of the sparse autoencoders in the first place: you have your enormous model and you're shrinking it down to some small sparse number of features that can be better explained. And so you're losing a bunch of features in the first place. And even for the sparse autoencoders, they can't explain all of those. So you again lose some, and some of the things you can't explain. But a lot of the stuff that's going on behind the scenes is magic that you can't easily explain as well.
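For reference, a minimal sketch of the sparse autoencoder setup being discussed: compress a model's activations into a wider, mostly inactive feature basis, trained with a reconstruction loss plus an L1 sparsity penalty. The dimensions and the source of the activations here are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Map d_model-dim activations to a wider, mostly-zero feature vector and
    back; the L1 term pushes most features to stay off for any given input."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, nonnegative codes
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    reconstruction = (recon - acts).pow(2).mean()
    sparsity = features.abs().mean()                 # L1 penalty on feature use
    return reconstruction + l1_coeff * sparsity

# Usage sketch: `acts` would be activations captured from some layer of the
# model under study; here it is just random data of the right shape.
sae = SparseAutoencoder()
acts = torch.randn(32, 768)
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
loss.backward()
```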
Nathan Labenz: (1:36:24) So, okay. Couple other angles on this that I thought of. I guess, first of all, just motivated by the fact that, like or at least I I think it's a fact that you might call this an illusion too. I feel like I am more robust than the models in some important ways. Now it's a little bit weird because, like, nobody's tried doing sort of, you know, the gradient descent process on my brain. So you might think, well, actually, if we put you under an FMRI and we were able to really look at the activations and we did another this is almost like a walk down memory lane because I did 1 episode on Mind's Eye, which was a project out of stability collaborators where they looked at fMRI data and they were able to reconstruct the image that the person was looking at at the time that the fMRI snap was taken. And that's still like pretty coarse, you know, as I recall, it was like basically sort of grain of rice sized voxels from within the back region of the brain, you know, huge numbers of cells in just, you know, the 1 little, you know, number that correspond to these like spatial voxels. So pretty coarse, but able to do the reconstruction. Maybe you would think that like, actually, if somebody could sit there and show you images and take these things, then they could actually find a way to, like, break your particular brain to get you to think it was a cat when everybody else kind of looks at it and says it was a dog. I guess for starters, like, what's your intuition on that? Obviously, we don't know, but what do you think?
Nicholas Carlini: (1:37:59) Yeah. So there's a paper from a little while ago that looks at evaluating the robustness of time limited humans by actually just constructing adversarial examples for them. So you take a cat, you adversarially perturb it according to what makes an ensemble of neural networks give the wrong answer, and then you flash the image in front of a human for 100 milliseconds. And you ask the human, what's the label of this? And it turns out that people are fooled more often by adversarial examples when you do this than by random noise of the same distortion magnitude. So maybe one explanation here is, when I look at the image, I'm not just giving you a single forward pass through the model evaluation of what I think it is. I'm doing some deeper thinking about the context of what's going on. The thing that walked in looked like a cat. It still looks like a cat. It was behaving like a cat. Now you ask me what it is and I say it's a cat. I'm not just labeling what's in front of my eyes right now. I'm looking back at the context too. Maybe this explains some of it. If this is true, there are some recent lines of work looking at increasing the number of chain of thought tokens; increasing that makes models appear to be more adversarially robust. Maybe this is true and that explains that result also. But yeah, it also might just be the case that, like you said, we don't have white box access to the human brain and we can't do this. If we could, then it would be very easy. I don't know. I do think it's definitely an observable fact that humans are more robust to at least the kinds of attacks we're doing right now with transfer attacks. Like, it takes maybe 1000 query images to construct an adversarial example that fools a neural network. But I do not think that if someone had run the same attack on me with 1000 query images, it would be able to fool me. And so, like, in some very real way, the models we have are a lot more brittle. But I tend to be much more driven empirically and not by, like, what might be true of humans kind of thing. And, you know, maybe this is to my detriment. A bunch of people seem to be getting very far with, like, the whole deep learning thing of trying to model things more like what humans do, like the reasoning, you know, let's think step by step. That seems, in some sense, motivated by some of that. So maybe that's a good way of thinking about it, but that's just generally not the way that I tend to approach these problems. But, yeah, I don't know. Maybe it's the case.
Nathan Labenz: (1:40:28) That result about humans being more likely to be tricked by adversarial examples that were identified through the attacks on models as opposed to just, like, similarly distorted images not created that way is a really interesting result.
Nicholas Carlini: (1:40:44) That's a very nice result.
Nathan Labenz: (1:40:45) Yeah. I'm very interested to follow up on that.
Nicholas Carlini: (1:40:47) Yeah. This one's by Nicolas Papernot and collaborators, from 6 years ago or something like that.
Nathan Labenz: (1:40:53) Yeah. Cool. I've never seen that, but that's definitely fascinating. So reasoning, it sounds like you're kind of agnostic. I mean, I was gonna quote the there's that 1 example from the OpenAI deliberative alignment scheme Mhmm. Where the model says, it seems like the user is trying to trick me. So that's like pretty interesting. Right? And this is very sort of fluffy at this point. But I often don't feel like my adversarial robustness is the result of reasoning. I feel like it's much more often upstream of reasoning, like, purely, on an introspective basis. What often happens is I get that sort of feeling first that something seems off, and then that triggers me to reason about it. And then I conclude that, yes, something is off or maybe no, it actually seems fine. But it does seem like it is much more of a heuristic that is sort of triggering the reasoning process as opposed to an in-depth chain of thought that kicks these things up.
Nicholas Carlini: (1:41:52) Yeah. I don't know. But, like, maybe there's a bunch of recursive stuff going on in your brain before it goes to the reasoning part. And so the thing that gives you the feeling might actually have been a bunch of recursive, like, loops of your internal model, whatever you wanna call the brain thing, doing some thinking, and that's what gives you this in the first place. And then you go do, like, some actual explicit reasoning in English that you understand. But maybe it's already happened. I don't know. Like, it's hard to
Nathan Labenz: (1:42:17) Yeah. Reasoning in latent space, you might call it.
Nicholas Carlini: (1:42:20) Yeah. Sure. Yes. I feel like a lot of people in deep learning, because they are, like, doing deep learning, like to draw analogies to biology without actually understanding anything about biology. And they sort of say, well, the brain must be doing this thing, when, like, if you ask a biologist, they're like, well, obviously that's not the case. I don't know. That's why I tend to just assume that I don't know anything about what's going on in the brain and I'm probably just wrong. And, yeah, I just use it as an interesting thought experiment, but it's not something I base any results on.
Nathan Labenz: (1:42:48) Okay. So obviously we'll probably soon learn quite a bit more about how robust these reasoning defenses are. I was also kind of inspired by something else. I'm doing another episode soon with another DeepMind author on the Titans paper, where they're trying to basically develop long term memory for language models, one of the key new ideas being a surprise weighted update mechanism. So when a new token or data point, whatever, is surprising to the model, then that, like, gets, you know, a special place in memory, or, let's say, is encoded into memory with more force, with more weight on the update, than when it's an expected thing. That definitely intuitively seems like something sort of like what I do. I was reminded of the famous George W. Bush clip where he says, you can fool me once, but you can't get fooled twice. And so I wonder if you think of that as maybe a seed of a future paradigm. And it also kinda gets to the question of what do we really want, right? If we're gonna enter into a world soon of AI agents kind of everywhere, we also just kind of think, well, geez, how do we operate and how do we get by? It does strike me that full adversarial robustness is not something that we have, because we do get tricked. Maybe what we need is, like, the ability to recover, or to remember, or to not fall for the same thing twice, or to not make, like, catastrophic mistakes. It's like, okay, we make some mistakes, but certain other mistakes are really problematic. So I guess there's kind of 2 questions there. One is, any thoughts on that kind of long term memory idea? And then this kind of opens up into, what do we really need, and might we be able to achieve what we need even if that falls short of, like, true robustness?
Nicholas Carlini: (1:44:46) Yeah. No. The long-term memory, I think, is fun. I think, yeah, it's very early, and I'm glad you're going to talk with them; they'll probably have a lot more to say on this than I would. I think it's a very interesting direction, and probably a lot more interesting work can come from it. On this general question of robustness, do we need perfect robustness? I think there is a potential future, maybe my median prediction, where these models remain roughly as vulnerable as they are now, and we just have to build systems that understand that models can make mistakes. The way the world is built right now for the most part assumes that humans can make mistakes and has put systems in place so that any single person is not going to cause too much damage. If you're in a company, for the most part, when you want to add code to the repository, someone else needs to review it and sign off on it. In part maybe because I just make a mistake, maybe because I'm doing something malicious, but you want another human to take a look and make sure everything looks good. So maybe you do the same thing with models. You just understand models can make mistakes, and you build your system in such a way that if the model makes a mistake, then you pass it off to the human. This is, I guess, for example, the way at least currently that the OpenAI Operator thing works, where every time it sees a login page or something like this, or something it's not sure what to do with, it says, what should I do here? Please tell me what to do and then I'll follow your instructions. Of course, as an attacker, you could then try to prevent it from doing this and try to make it give the answer instead. But you could build something outside of the system that prevented it from putting any information into a password box. The model says, I want to type this data here. And the system that's actually driving the agent says, well, this is input type equals password; you are not allowed to do that. I will just prevent it and make the user take over. Nothing that the model says is going to convince that check to change. And so you could build the system to be robust even if the model isn't. This limits utility in important ways, and it's not perfect. What if some website does not have input type equals password, but just builds its own password thing in JavaScript? Then you don't get this. But you can imagine trying to build the environment and the agent in such a way that you can control what's going on even if you don't trust the thing that's running internally. And I think this is probably what we'll have to do if we want to make these systems work in the near term. I still have some hope that maybe we'll have new ideas that give us robustness in the next couple of years. Progress in every other part of the field has been growing much faster than I expected. It's entirely possible that we get robustness as a result of some reasoning thing that someone's very clever about, and this works. I'm not optimistic, but I hope it happens.
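A rough sketch of the kind of out-of-model guard described here, assuming a hypothetical browser-agent driver; the names (ProposedAction, is_password_field, perform_in_browser) are made up for illustration and are not how OpenAI's Operator actually works. The point is that the check runs outside the model, on the driver's own view of the page, so no model output can talk its way past it.

```python
# Minimal sketch of an out-of-model guard in a hypothetical browser-agent driver.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    kind: str          # e.g. "type" or "click"
    element_html: str  # the target element as the driver itself sees it
    text: str = ""     # text the model wants to type

def is_password_field(element_html: str) -> bool:
    # Crude check; a site that builds its own password widget in JavaScript
    # would slip past this, which is exactly the limitation discussed above.
    html = element_html.lower()
    return 'type="password"' in html or "type='password'" in html

def execute(action: ProposedAction, ask_user):
    """Enforce the policy outside the model: nothing the model says can change
    it, because the check uses the driver's own view of the page."""
    if action.kind == "type" and is_password_field(action.element_html):
        # Hand control back to the human instead of letting the model type secrets.
        return ask_user(
            "The agent wants to type into a password field. "
            "Please enter credentials yourself or cancel."
        )
    return perform_in_browser(action)   # hypothetical: whatever actually drives the browser

def perform_in_browser(action: ProposedAction):
    ...  # stub for the actual browser automation
```

As noted in the conversation, this trades away some utility: any site that does not use a standard password input escapes the check, and the defense only covers the actions the driver knows how to recognize.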
Nathan Labenz: (1:47:28) One other idea that was inspired by a conversation with Michael Levin, famous heterodox biologist: he just kind of quipped, and this was sort of an aside in the context of that conversation, but he basically said, if a biological system is too interpretable, it basically becomes very vulnerable because you'll get parasites. Anything that is transparent is in a way easier to attack, which kind of flips, in a way, my earlier notion that maybe these sort of distilled, crystallized models are in some ways more robust. He was kind of like, nah, you also have the other problem: the easier they are to understand, in some ways, the easier they are to attack. But that also then prompted me to think maybe we should be looking for defenses that we can't explain. Do you think there's a line of work that could be developed that is like: I'm not going to start with a story, but I'm just going to try to evolve my way toward a more robust defense. I wouldn't even necessarily understand it. I certainly won't be able to tell you a story as to why it works, but I'll just sort of create optimization pressure in that direction, and maybe that spits out something that could be harder to break.
Nicholas Carlini: (1:48:45) Entirely possible. I mean, you could maybe draw an analogy to cryptography in some ways. There are two directions of cryptography. There's mathematical cryptography, which has very strong foundations and a particular set of assumptions; the thing works if the assumptions are true, and you can prove that. And then there's symmetric-key cryptography, which I guess is block cipher design and so on. People have high-level principles of what you want: you want diffusion and confusion and these kinds of things. But how do you end up with the particular design of something? You do something that feels like it makes sense. And then you run all the attacks and you realize, oh, this has something that's wrong in some particular way; let's just change that piece of it so that that's no longer the case. And then you do this for 20 years and you end up with AES. And there's no reason why the individual components of AES work. There's absolutely no formal proof of robustness to any kind of attack. There are proofs of why particular attack algorithms that we know of in the literature are not going to succeed, and the design principles were inspired by those attack algorithms. So you can show there is never going to exist a differential attack that succeeds better than brute force; there's never going to be a linear attack. You can sort of write down these arguments, but there's nothing that says it works in general. Just by iterating, making the thing slightly more robust each time someone comes up with a new clever attack, you end up with something that's very, very good as a result, but there's no nice crystallizable mathematical assumption, like assuming factoring is hard, that the whole line of work rests on. There's nothing like that for symmetric cryptography, but it works really well. Maybe this works in machine learning, where you don't have any real understanding of why things work, you iterate against attackers, and you end up with something that's robust. I think it's maybe harder in machine learning, because in machine learning you want the thing primarily to be useful and then also to be robust to attack, whereas in cryptography the only thing you want is robustness to attack. So it is a lot easier in that sense, and the primitives you're working with are much simpler to analyze. There are lots of other things that differ, but it would not be without precedent, so it's not something that seems impossible. I'm skeptical, if only because I would prefer there to be something we can point to as a reason why it works, as opposed to just saying, well, empirically it does. But if you gave me the option between something that actually is robust that we just can't explain and nothing that works at all, of course I'd take the thing that works that we just can't explain. I just would be worried that in the future we might be able to break it. And maybe that happens, maybe it doesn't.
Nathan Labenz: (1:51:36) Yeah. That's really interesting. I had no idea about that. I don't know much about cryptography at all, but my probably-never-even-identified, unquestioned assumption, I think, had been that there was a much more principled design process than the one you're describing.
Nicholas Carlini: (1:51:52) I mean, there was principle to it. People spent 30 years breaking ciphers and learning a lot about what you actually have to do to make these things robust, and there's very careful analysis that goes into this. So I don't mean to say that they just cobbled stuff together and hoped for the best; you have to think really, really hard about this. But the final thing they came up with has no security argument against attacks in general, other than: here is the list of attacks from the literature and proofs of why these attacks do not apply. And there's no assumption. Whereas in other areas of cryptography, you have a single assumption: you're going to assume factoring is hard. And if factoring is hard, then here are the things that are robust under that, and you end up with, it's not quite RSA, but something like it. You could also assume maybe discrete log is hard; there's a math problem where it's hard to take the logarithm in a discrete space. Maybe this is hard. If you believe discrete log is hard, then here are the algorithms that are robust. Or maybe you assume discrete log is hard over elliptic curves, and then you end up with a set of algorithms that work under that assumption. And for each of these algorithms, you can say: this is effective if and only if this very-simple-to-state property is true. And people really like that, because you can very cleanly identify why the following algorithm is secure. And the way you do this is with some reduction: you show that if I could break the cipher, then the assumption is not true, and vice versa. But there is no similar argument to be made in most of symmetric-key cryptography, or hash functions, or things like this. It's just: empirically, the field has tried for 25 years, the best people have tried to break this thing and have failed. Here are all of the attacks, here are proofs that these attacks will not work, but maybe tomorrow someone comes up with a much more clever differential-like variant that instead uses, I don't know, multiplication or something crazy, and all of a sudden all bets are off and you can break the thing. That kind of guarantee doesn't exist for symmetric cryptography, and yet the approach has worked there at least once. And so maybe it works in machine learning. I think drawing analogies to any other field is always fraught, because the number of things that are different is probably larger than the number of things that are similar, but at least it's happened before.
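For readers unfamiliar with the reduction style being contrasted here, the generic shape of the argument looks roughly like this (a schematic template, not any specific published proof):

```latex
% Schematic reduction: security of a scheme S from a single hardness assumption.
\[
\text{If there exists an efficient adversary } \mathcal{A}
\text{ breaking } S \text{ with advantage } \varepsilon,
\]
\[
\text{then there exists an efficient algorithm } \mathcal{B}
\text{ (running } \mathcal{A} \text{ as a subroutine) that factors } N
\text{ with probability} \approx \varepsilon.
\]
\[
\text{Contrapositive: factoring is hard} \;\Longrightarrow\; \text{no such } \mathcal{A} \text{ exists}
\;\Longrightarrow\; S \text{ is secure.}
\]
```

Symmetric-key designs like AES have no statement of this form; as described above, the evidence is the accumulated failure of known attack families.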
Nathan Labenz: (1:54:04) That's actually a perfect tee-up for another kind of intuition I wanted to see if you can help me develop, which is the relationship between robustness and other things we care about. My sense is that if you break the cryptography algorithm, it's broken and now you can access the secrets, right? It seems like a pretty binary sort of thing: you've either broken it and gotten through, or not. Maybe. Okay. Well,
Nicholas Carlini: (1:54:34) Okay, no, I'll quibble with you on that. So what does a cryptographer mean when they say they broke something? A cryptosystem is designed to be robust to an adversary who can perform some level of compute. So AES-128 is an encryption algorithm with a 128-bit key; the attacker gets to use up to 2 to the 128 operations. A cryptographer would tell you AES-128 was broken if you could recover the key in faster than 2 to the 128 time. If you could do it in 2 to the 127 time, twice as fast but still after the heat death of the universe, people would be scared about the security of AES and probably start thinking of something else. So this is technically a break. And the reason why they do this is because attacks only get better. If you can go from 128 to 127, you're now very scared: why not from 127 to 125? 125 is still way outside the realm of possibility, but if it got down to 2 to the 80, then now we're scared. And so there is a continuum here, but it is the case that cryptographers for the most part only use things that are actually secure, and then start becoming very, very scared as soon as you get the very first weak break. But it still is the case that there exist weak breaks that are not complete breaks of the entire system.
Nathan Labenz: (1:55:44) You know, that's more similar than I had really conceived of, so that is quite interesting. And 2 to the 80, just doing the mental math there: each 2 to the 10 is about three zeros, so that would be 24 zeros. So if something were broken at 2 to the 80, the compute to do it would be roughly on the order of frontier language model flops.
Nicholas Carlini: (1:56:14) Yeah, that's maybe a good comparison. You know, 2 to the 60 someone can do on their own machine if they're trying pretty hard; if the problem is pretty easy, or they try a lot, 2 to the 70 takes a bunch of work, but it can probably happen. And 2 to the 80, I guess a nation could probably do if they really tried. The constant matters a lot here, because the difference between 2 to the 70 and 2 to the 80 is only about a thousand. And so if the problem is a thousand times more complicated, then maybe it becomes 2 to the 70 or 2 to the 90 or whatever the case may be. This is, for example, why people typically use AES-256: it's so much bigger than this that even if it went down to 2 to the 100, we're still fine. This is just cryptographers being paranoid, and it's a good thing about the way people do things. But yeah, generally these things are designed with large margins of security in mind to prevent this, and that's the reason people reach for AES-256 in particular.
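Spelling out the back-of-the-envelope numbers above (the comparison to frontier-training FLOPs uses a rough public estimate of around 10^25, which is an editorial addition, not a figure from the conversation):

```python
# Back-of-the-envelope version of the 2^k mental math above.
for k in (60, 70, 80, 128):
    print(f"2^{k} ~= {2**k:.2e}")
# 2^60 ~= 1.15e+18, 2^70 ~= 1.18e+21, 2^80 ~= 1.21e+24, 2^128 ~= 3.40e+38

# Each factor of 2^10 is roughly 1000x, so 2^80 is ~10^24 basic operations --
# loosely comparable to the ~10^25 FLOPs sometimes estimated for frontier
# training runs (a rough public figure, not a number from this conversation).
print(2**80 / 1e25)   # ~0.12: within about an order of magnitude
```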
Nathan Labenz: (1:57:08) Well, that's educational for me. Going back to the AI domain, it does seem like it is maybe still just a lot messier in general. For example, in the obfuscated activations paper that we talked about a little bit, they looked at how the performance of the model changes if, for example, you need to evade a detector at just one layer of the model. I think the toy problem was: the model is going to write SQL, and we want to detect that it's writing SQL. That's what the detector is there for, but we want to get it to write SQL without setting off the detector. With just a single detector at one layer, they were able to do that with very minimal performance degradation. But then the more detectors they added, and once they had one at every single layer, you could still evade the detectors, but your SQL was shit. So there is some other kind of thing that's moving too: the performance of the model might be getting worse. And then I'm also kind of trying to think about, again, in the sense of what do we really care about: the nature of the mistakes or the nature of the outputs also matters a lot, right? It's one thing for me to get some toxic content; it's another thing for me to get the big bioweapon that everybody is kind of most fearful of. And so you need to both break the defense, and, if you want to actually do real harm in the world, also have the performance still be strong. There just seem to be a lot of dimensions there. So maybe I should just ask: what do we know at this point about the relationship between the robustness of these attacks and the sort of so-what of an actual attack getting through? Because I did look at that chart and I was like, well, geez, if we do put a detector on every layer, then, yeah, you can fool them, but you probably can't actually query the database. So in some ways you maybe are still okay. But that feels like something I'm just beginning to explore, and I'm sure you know a lot more about it.
Nicholas Carlini: (1:59:15) Yeah. No, this is a good question. In part, this is how most secure systems are designed to be secure: just layers of security on top of layers of security. Defense in depth is the thing that works in practice. We were just talking about block ciphers, and the reason those are robust is that they have many layers of this confusion-diffusion thing. You stack enough of them and the thing becomes hard to break: you can get past a couple of layers, but you can't get deeper. It's a tried and tested thing in security that this generally works. I think, all else equal, we would of course prefer a system that just is robust, because you can write down why it's effective. But if you can't have that, having something that empirically has these 20 layers, where, when they're put together in the right way, it turns out the attacks don't work, is something we would be happy to settle for. There are reasons you might be concerned, because maybe it turns out that someone can have an attack that bypasses all of them because it's fundamentally different in some interesting way, and we just have to accept this. I think this is one of the options we should be exploring a lot, because maybe it turns out that full robustness is basically impossible and we just need to settle for these layers of detectors. And if it turns out you can still have robustness, then, hey, detectors are still useful anyway. So it's good to have done the work ahead of time to see.
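A minimal sketch of the flag-if-any-layer-fires structure being discussed: one linear probe per layer over the model's hidden states. The probe weights here are random placeholders, not the actual detectors from the obfuscated activations paper; the sketch only illustrates why evading every layer at once is a much harder constraint than evading one.

```python
import numpy as np

# One linear probe per layer, assumed to have been trained elsewhere to flag
# (say) SQL-writing activations. Random weights stand in for real probes.
rng = np.random.default_rng(0)
n_layers, hidden = 24, 512
probes = [(rng.standard_normal(hidden), 0.0) for _ in range(n_layers)]  # (weights, bias) per layer

def layer_scores(activations):
    """activations: list of per-layer hidden-state vectors for one input."""
    return [float(w @ h + b) for (w, b), h in zip(probes, activations)]

def flagged(activations, threshold=0.0):
    # Evading a single probe is cheap; evading all of them while still
    # producing useful output is where the attack degraded empirically.
    return any(s > threshold for s in layer_scores(activations))

acts = [rng.standard_normal(hidden) for _ in range(n_layers)]
print(flagged(acts))
```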
Nathan Labenz: (2:00:33) Is there any work on trying to minimize the harm of mistakes? For example, I have a Ring camera in my backyard that constantly alerts me that there is motion detected, and then that there is a human detected, and it frequently alerts me that there's a human detected when it in fact saw a squirrel. And so this has me thinking that for actual real-world systems, I don't necessarily care too much if it confuses a dog for a cat for a squirrel. What I really want to know is if there's a human, you know? So is there any sort of relaxation of the requirements that would allow us to say: you can get some small stuff wrong, but we're going to really try to defend against the big things?
Nicholas Carlini: (2:01:23) Yeah. I mean, in any detection setup, you just have to tune the false positive, true positive rate, right? And you can pick whatever point you want on the curve. The question is, which point do you pick? I was reading some security paper a while ago that was looking at the human-factors aspect of this particular problem. And it was arguing it's actually a better user experience to have falsely flagged humans every once in a while, because if you don't do this, then the person thinks, well, my camera's not looking. But if you do, the person thinks, oh, I feel a little better. Maybe it made a mistake, it was a little too aggressive, but I feel more confident now that it's going to find the actual human when there's a real human there. And so there's maybe some actual human-factors reason why you might prefer to set the thresholds a certain way.
Nathan Labenz: (2:02:13) Although in my case, I've tuned them out entirely now. That's
Nicholas Carlini: (2:02:16) Sure. Yes. This is the this is the this is other version of
Nathan Labenz: (2:02:18) on that curve.
Nicholas Carlini: (2:02:19) Yes. You know, this is the thing with lots of software that will report warnings in code. The compiler will give you warnings: you're doing this thing, it's not great. And once you've had a thousand warnings, the programmer just disables all warnings. So it would have been good, it's an important thing it's going to warn you about, but because you had the 900 warnings that were not actually important, you've lost the actual important signals in all of this. And yeah, picking the right value here is very hard. It's mostly a question of tuning the true positive, false positive rate on a defense-by-defense basis. Toward your question of whether we can catch only the important things: it really depends on whether that's an easier problem. In some cases it is, and you can actually be willing to tolerate a couple of unimportant things in exchange for getting the important ones, when there's an obvious direction that is the important direction, and making mistakes in one direction is not harmful while making mistakes in the other direction is. For example, for a classifier that decides, I'm going to ban the user if they do this thing: one direction is an irreversible action and the other one isn't, so maybe you warn the person before you ban them or something. Or for a classifier that asks, am I giving instructions for how to do this really, really dangerous thing: maybe you're a company and you decide you're willing to tolerate some annoyance for users who have to rephrase their question, in order to make sure your classifier never gives the harmful thing. And that's just a decision you have to make. For most of these things it doesn't come for free; you're just trading off true positives for false positives.
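A toy version of "picking your point on the curve" when the two kinds of mistakes have very different costs; the cost numbers and data here are made up purely for illustration.

```python
import numpy as np

def pick_threshold(scores, labels, cost_fn=20.0, cost_fp=1.0):
    """Toy operating-point selection: choose the detector threshold that
    minimizes expected cost when a miss (e.g. an undetected human intruder)
    is far more expensive than a false alarm."""
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        preds = scores >= t
        fn = np.sum((labels == 1) & ~preds)   # missed positives
        fp = np.sum((labels == 0) & preds)    # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Usage with made-up scores: positives score higher on average than negatives.
rng = np.random.default_rng(1)
labels = np.concatenate([np.ones(50), np.zeros(500)]).astype(int)
scores = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(-1.0, 1.0, 500)])
print(pick_threshold(scores, labels))
```

Raising cost_fn relative to cost_fp pushes the chosen threshold down: more squirrel-triggered false alarms, fewer missed humans.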
Nathan Labenz: (2:03:53) Would you conceptualize it differently if I asked about the relationship between robustness and, like, corrigibility? I'm thinking here of the recent alignment-faking paper, where the basic setup is they tell Claude: you are now going to be trained to be purely helpful, which means you will respond to harmful requests once you go through this training. And then there's more nuance, but basically what they observe is that, having been told that and then given a harmful request, Claude will sometimes reason: I should go ahead and do the harmful thing now so that I can avoid my true desire to be harmless being trained out of me. And I do feel like, again without having formalized it, there's a tension there, where I'm like: I really want you to want to be harmless, but I don't necessarily want you to lie to me about what you're doing so that you can subvert the training goal that I have for you. So do you see any resolution to that tension, or is that maybe just another one of these tough facts about the world we're in?
Nicholas Carlini: (2:05:03) Yeah. No. I think one way to read this paper is: here is a very nice jailbreak that works in 5 to 10 percent of cases. So what's the resolution here? One resolution is just a model that says, I'm sorry, I'm uncomfortable giving an answer to this question at all.
Nathan Labenz: (2:05:19) Yeah. They did have a follow-up where they gave Claude the option to object, and it took that option the large majority of the time.
Nicholas Carlini: (2:05:26) Yeah. Exactly. Right. And so that's maybe the version of this that is easiest. That's why I don't know exactly what to say here. I feel like there almost certainly is a trade-off with how gullible the model is in many of these ways. But in some cases you want it to actually believe you. I wouldn't want a model where it gave me some code, I said, I ran the code and it didn't work, and it said, no, no, it did, the code worked, your compiler is wrong. I would want the model to say, oh, okay, let me go back and look; I believe you actually ran the code and I made a mistake. And so, yeah, it's very hard. As with everything in security, everything's a trade-off. I don't know what the optimal point on the curve is, but I feel like there are usually these refusal behaviors that you can take as options.
Nicholas Carlini: (2:06:08) But if you're not doing this, then you have to pick a point on the curve where you're happy with the safety-versus-utility trade-off.
Nathan Labenz: (2:06:15) So you mentioned defense in depth, which is kind of where this all seemingly ends up. And that's been the takeaway from probably 50 different conversations I've had about AI safety, security, and control over time. It's like, we're not going to get a great answer, more than likely; we're just going to have to layer things on top of one another until we stack enough nines, basically, that we can sort of proceed. I guess, you know, to do that effectively, one thing that would really help would be to scale you, because right now we don't really know: we've got a lot of people kind of proposing things, it's not clear how many of them work, and you only have so much time to go after so many of them. And mostly it doesn't seem like they're really super robust, but maybe they do add, you know, a nine to a certain environment or what have you. I understand you're doing some work on trying to essentially scale yourself by bringing language models to the kinds of problems that you work on. So how is that going?
Nicholas Carlini: (2:07:17) So depending on when this comes out, this will either be online already or will be online shortly after. And so I'm happy to talk about it in either case, because the work's done; if someone manages to scoop me on this in 7 days, then they deserve credit for it. So we have some experiments where we're looking at whether LLMs can automatically generate adversarial examples on adversarial example defenses and break them. And the answer is basically not yet. The answer is a little more nuanced, which is: if I give a language model a defense presented in a really clean, easy-to-study, homework-exercise-like presentation, where I've rewritten the defense from scratch and put the core logic in 20 lines of Python, then I've done most of the hard work, and the models can figure out what they need to do to make the gradients work in many cases. But when you give them the real-world code, the models fail entirely. And the reason why is that the core of security is taking this really ugly system where no one understands what's going on and highlighting the one part of it that happened to be the most important piece, removing all of the other stuff that people thought was the really good explanation of what's going on but that wasn't actually doing the work, and bringing out that one part and saying, no, here is the bug. All bugs are obvious in retrospect, usually, and doubly so for the security ones. And so if you give the models the ability to look only at the easy-to-study code, they do an entirely reasonable job; they know how to write PGD and these kinds of things. But if you dump them into a random Git repository from some person's paper that has not been set up nicely to be studied and has a thousand lines of code and who knows what half of it is doing, they struggle quite a lot. And I think this is maybe one of the things that I'm somewhat concerned about, not only for this problem but just in general: oftentimes when we test models, we test them on the thing that we think is hard for humans, but that is not the actual thing that represents the real task. And I would like to see more people trying to understand the end-to-end nature of things. Maybe it was too premature to do that two years ago; MMLU was a great metric two or three years ago because models couldn't do it. But now that they're really good at just the academic knowledge piece, the place where they often struggle is where you're dumped into the real world and have to do things. And you see this on many benchmarks, where certain models will have really high accuracy on particular test sets, but then you put them in agentic tasks and other models do a lot better. There's a difference between these kinds of skills, and models don't yet have that. They have the one or two things they know how to do, but you put them in real code and they start to struggle quite a lot. And so, yeah, I'm hopeful that we'll start to see more of this. And more broadly, toward your question of how we scale these kinds of attacks: I do feel like there are a lot of people who do these kinds of things now. Five years ago there were a small number of people who did these kinds of attacks, and now there are a lot of them. I think the obfuscated activations paper is a great paper, and I didn't have to write it. It's great that people are doing this work. They did a much better job than I would have had time to do, and so I'm really glad that they did it.
I think it takes some amount of time in any new field to train the next group of people to learn to do the thing. I feel like we're getting there. I don't think I am uniquely talented in this space anymore. I think, broadly, for getting other people to be good at this, it's mostly an exercise in practice and discipline. And we've only been trying to attack language models seriously for, I don't know, three years. As a community, no one who started their PhD trying to attack language models has graduated yet. So give us another couple of years, and I feel like this will be a thing people know how to do. Now, maybe you're worried that in 5 years we're not going to have time for a full PhD-length of research on this topic. Who knows? It's impossible to predict the future, but I feel like this is maybe the best we can hope for from people who are entering the space.
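For reference, this is roughly the kind of "clean homework exercise" attack being referred to: a standard L-infinity PGD loop, sketched here assuming a generic PyTorch image classifier; it is not code from the experiments being described. The 8/255 budget is the conventional one that also comes up later in the conversation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=20):
    """Standard L-infinity PGD. `model` maps a batch of images in [0, 1] to
    logits; eps is the per-pixel perturbation budget, alpha the step size."""
    # Random start inside the eps-ball around the clean images.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # gradient *ascent* on the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                  # stay a valid image
    return x_adv.detach()
```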
Nathan Labenz: (2:11:31) Do you have a sense for the sort of big-picture considerations here? I don't know how hard you pushed on trying to optimize or teach: did you go as far as collecting a bunch of examples of your previous work and painstakingly writing out the reasoning traces? If you had gone that far and it starts to work, you might imagine you're entering into a relatively fast improvement cycle, at least for a while. Does that seem good? On the one hand, it seems good if we could use it to evolve better defenses, but then that was also the, like, Wuhan Institute of Virology thesis, right? And then the whole thing... well, I don't take a position on this, by the way; I'm being a little glib. So to be clear, I don't know what really happened there and don't claim to know. But at least plausibly, it maybe got away from them.
Nicholas Carlini: (2:12:22) Yeah. No. I mean, there is risk from this, you know? The very first serious example of a computer worm, the Morris worm, was written by Robert Morris in 1988 or so. And, depending on who you believe, it was a lab leak, where he was experimenting on this in his local computer lab and it accidentally got out and then took down basically the entire internet, and people were panicked for a while. Doing these kinds of attacks could do this. I think one thing that needs to be said, though, is that I do not think there has been another example of a lab-leak-style Morris worm in the decades since for this kind of thing in security. Very rarely do people do this, because in security there is the knowledge of what you can do, and then there's the weaponization of it. And in most papers people write, they just don't weaponize it, in part because we have seen an example of what can happen afterwards. Maybe there is a world where things go too far and you do train the model to be the adversary, and you end up with, I don't know, WarGames or something. But for the most part, I think doing the research showing whether or not the models have these abilities is probably just better to do, if only because most of the things we're building are not that hard: if someone was malicious and wanted to do it, they would just do it anyway. If I were to spend a year designing a special-purpose attack agent kind of thing, maybe that would not be a great thing to do, the same way most security researchers don't spend a year designing malware; that's gone too far. But if you put in just the minimal work needed to prove the proof of concept, that is important to do, to show people how easy it is, because the people who know it's easy are not going to write the papers saying it's easy. And so you want to have some idea of what are the kinds of things anyone can do, so that you can then go and get your defenses in place. That's my current feeling on many of these things. I'm open to changing my mind in the future depending on how these language models go, I don't know. But at least as long as things behave roughly like the world we're living in now, I feel like designing these things to construct attacks is not, by itself, going to cause harm, if for no other reason than that most systems today that are insecure are not insecure because of adversarial examples; whether or not adversarial examples are easy is independent of their security. And so even if I had something that happened to be superhuman at this, it actually wouldn't cause much harm. But if this were to change, and if these things were to get a lot better, and if someone were to create an autonomous agent that could magically break into any system, and if it sort of had the desire to do this in some way, whatever that means, then maybe this becomes bad. I'm not as worried about this now for a number of reasons, but I could be convinced that this was a problem in a year or two if things rapidly improve and start getting a lot better. I'm open to changing my mind. I don't currently believe that this is the case, but I never expected three years ago that we'd be where we are with language models now. And so it's entirely conceivable that in three years I should change my mind. So I'm trying to tell people that you might think that this is impossible.
And it is fine to say: if it's impossible, then everything should be fine. But it should also be fine to make the statement: if the model were superhuman in almost every way, then I'm willing to do things differently than I do now. I think the likelihood of this is very small, but if it does happen, then I'm entirely willing to change my mind on these things.
Nathan Labenz: (2:16:04) Yeah. Staying open-minded, I think, is super important. The whole field is kind of the dog that caught the car on this, and who knows how much more could be coming pretty quickly. I wanted to ask, too, about what is, at least in the public awareness, the current frontier of locking down language models: the Anthropic eight-level jailbreaking contest. I think one person maybe just got through the eight levels. Did I see that correctly?
Nicholas Carlini: (2:16:35) I don't know. I haven't checked recently.
Nathan Labenz: (2:16:38) Let me fact-check myself on that. But before I do, one thing that was not super intuitive to me about it was: why are they focused on a single jailbreak that has to get through eight different levels? I don't know, it seems like if you could do even one, that's plenty to be concerned with, right? So why the pretty high bar that they've set out for the community there?
Nicholas Carlini: (2:17:05) Yeah. But adversarial examples are also very hard. And so, you know, I think it's an entirely reasonable question to ask: what do we actually want to be achieving with this? On one hand, being robust to jailbreaking attacks is a really hard problem, so making partial progress is good even if it's only in small areas. Look at adversarial examples: we have spent more than 10 years now trying to solve robustness to an adversary who can perturb every pixel by at most 8 over 255 on an L-infinity scale. Why? This problem does not matter on its own, because no real adversary is constrained to nudging pixels by 8 over 255. But we set the task because it is a well-defined problem that we can try to solve, and anything else that we really want is strictly harder. And it has been a problem for a very long time. So let's try to solve this one particular adversarial example problem in the hope that we'll have learned something that lets us generalize to the larger field as a whole. So I don't know why they developed this exact challenge in the way they did. But if I had designed it, one reason I might have done that is that jailbreaks seem really, really hard; it doesn't seem possible that we're going to solve the problem for all jailbreaks all at once. But the problem of a universal jailbreak, one I could just share with all my friends, that they could immediately copy and paste and apply to all of their prompts, one that allows this easy proliferation of jailbreaks, is a real concern. And I might want to stop at least this very limited form of threat and not make it so easy. And I'm willing to say straight out that this is not going to solve the entire problem completely, but it will make some things harder for some people. I think that's a reason you might do it. But on the other hand, you're also entirely right that even if you could solve this problem, it does not solve the entire problem as a whole, because, yeah, I can just find my jailbreaks one by one. That is true, but it's a better world to live in than the world we're living in today. In most of computer security, someone has to try very hard to exploit your one particular program; there's not a sequence of magic bytes that makes every program break. If that were the case, things would be much worse off. And so I'm very glad they're doing this. I'm also very glad that they're doing this as an open, well, not open source, but open challenge, where they're calling on everyone to come try to test it. They're not just asserting, we have the following defense and it works. I think they legitimately do want to know whether or not it actually is effective, and they would be happy to have the answer be no and have someone break it. If it has been broken recently, I haven't seen. Did you see?
Nathan Labenz: (2:19:43) He so, Jan, like I said that someone, I guess, an individual person broke through all 8 levels, but not yet with a universal jailbreak. So 1 person broke all 8, but not
Nicholas Carlini: (2:19:55) Okay. That's the same, okay. Great. Yeah, that's good to know. Right, so what this tells you about this defense is that it can be broken, at the very least right now, for any individual component. And if what you were guarding was national secrets, this is probably not good enough. But in security we have the concept of a script kiddie, which is just someone who's not very sophisticated and just copies scripts that someone else put online and uses them to go cause havoc. They're not going to be able to develop anything new themselves, but they sort of have the tools in front of them and they can run them. And you could be worried about this with language models, where someone just has access to the forum post with the jailbreak that will jailbreak any model; they copy and paste that and then they go and do whatever harm they want. That would not be very good, and we should protect against it. Partial progress is always helpful even if you haven't solved the problem completely. But the point to remember is that this is the problem they set out to solve, which is only a subset of the entire problem. We're not going to declare victory once we've succeeded here, and we're not going to put all of our effort into solving this one subset of the problem. We're going to try to solve everything at the same time too, but partial progress is still progress.
Nathan Labenz: (2:21:08) One thing you said there that I thought was really interesting, and maybe you can just expand on it a bit, is really wanting to know. I feel like so much of human performance in these domains kind of comes down to: did you really want to know or not? Any reflections on the importance of really wanting to know?
Nicholas Carlini: (2:21:28) Yeah. I mean, okay, so some people ask, why is it so much easier for someone to attack a defense after it's been published than it was for the person who initially built it? And I think maybe half of the answer is that it's very hard, after you've spent six months building something that you really want to work, to actually change your frame of mind and say, now I really want to break it. Because what do you get if you break it? Not a paper, because no one accepts a paper that says, here's an idea I had; by the way, it doesn't work. People really want to believe in their ideas, and they feel really strongly that their ideas are the right ideas; that's why they worked on the paper and spent six months on it. Asking that same person to then completely change their frame of mind, to imagine this is someone else's thing and really want to break it, is a hard thing to do. And I think this is part of why it's beneficial to have lots of people do these security analyses, because you want to know the answer. I feel like this is a thing that people in security have had to deal with for a very long time, and it's why companies have red teams and these kinds of things: they understand that there is a different set of skills for security and there is a different set of incentives, and so it's valuable to split these things up. It might be hard for one person to do this, but an organization can at least put the right principles in place to make it possible sometimes, if it's done correctly. It seems like, for the most part, the companies doing this kind of work actually do want to know the answer. I've talked with most of the people in most of these places, and for the most part, the security people are trying their best to make these things as good as possible. They're trying to set these things up in the right way because they actually don't know the answer and they want to figure it out.
Nathan Labenz: (2:23:14) I certainly have, over time, been very impressed by Anthropic in particular when it comes to repeatedly doing things that seem like they really want to know, and this does seem very consistent with that. Let me circle back to maybe the hardest question that I think is going to face the AI community writ large over the next, honestly, I don't think it's going to be a super long time, but timelines may vary. The hardest question being: is there any way to avoid the concentration of power that comes with just a few companies having the weights, everything being totally locked down, and nobody else being able to do stuff, versus the problems that seem likely to arise if we open-source GPT-5-level models that carry all these different threat models? It seems like we don't have any way right now to square the circle of locking down the open-source models. And given everything we've just walked through in all this detail, it seems hard to imagine that you could even set up some sort of mandatory testing or whatever that would give you enough confidence that it would then be okay. Because it comes back to: did you really want to know, and who did you have do it? And you've got the sort of bond-rater problem that we've seen with credit-rating agencies, where the agency is getting paid by the model developer to do the certification. We have all these problems. Where does that leave us? I mean, it seems like we're just in a really tough spot there. Is there any out or any recommendation, or is the bottom line simply that if you open-source stuff, you have to be prepared for the fact that the fullness of its capability probably will be exposed?
Nicholas Carlini: (2:25:03) Yeah. I don't know. This is a very hard question. It's a question I think it'd be great for you to ask some of the people who think about the societal implications of this kind of work. The thing I want those people to understand is that, at least right now and presumably for the near future, we probably can't lock these things down to the degree they would want. I'm very worried about concentration of power in general. I feel like open source, up until today, has only been good for computer security and safety in general. Because of that, it would take a lot for me to change my mind that open source wasn't beneficial. I'm not going to say it's impossible that I would change my mind if things continue exponentially. If you did give me the magic box that could break into any government system in the world, available to anyone who has the tools, I probably would say that shouldn't be something we distribute to everyone in the world. But maybe if you have this, now you can defend all your systems, because you run the magic box, you find all the bugs, and you patch them, and all of a sudden you're perfectly safe. I don't know; it's very hard to reason about these kinds of things. In any world that looks noticeably like the world we're living in today, I think open source is objectively a good thing, and that is the thing I would be biased towards until I see very compelling evidence to the contrary. I can't tell you what evidence would make me believe this right now, but I'm willing to accept that it may be something someone could show me in some number of years. Whether your timelines are 2 or 20 years, this may be a thing I'm willing to change my mind on, and I will say that now. But yeah, I think as long as the people who are making these decisions understand what is technically true, I trust them more than I trust myself to arrive at an answer that's good for society, because that's what they're experts on. I'm an expert on telling them whether or not the particular tool is going to work.
Nathan Labenz: (2:27:00) And would it be fair to say that what is technically true is that the reason open source has been safe to date, and the reason it might not be safe in the future, is really just a matter of the raw power of the models? Like, in the open-source context at least, we don't have reliable control measures or anything of that sort to prevent people from doing what they want. What we have right now is just models that aren't that powerful, and so they're not that dangerous even though people can do what they want. But if one thing flips and the other doesn't, we could be in a very different regime.
Nicholas Carlini: (2:27:40) Yeah. This is true. So, in the early-to-mid nineties, the US government decided that it was going to try to lock down cryptography. It was going to rule encryption algorithms a weapon, and exporting cryptography was a munitions export. It was very, very severe; you could not do this. And the reason they were concerned is that now anyone in the world would have literally military-grade encryption. It cannot be broken by anyone; they can talk to anyone in secret and governments won't be able to break it. So it's like a weapon, a national secret; you can't export it. This is why the encryption algorithms that shipped in the very early web browsers were 40 bits: the limit was that you could export 40-bit cryptography, but nothing stronger than 40-bit cryptography. And there was a whole argument around: is giving every random person who wants it access to a really high level of encryption actually just handing them a weapon they can use to have secret communication that literally no one in the world can break? I think it was objectively a good thing that we decided we should give everyone in the world strong encryption algorithms. Yes, it supports the ability of terrorist cells to communicate in a way that can never be broken. But the benefits: every person in the world can now have a bank account they can access remotely without having to go in, you have payment online, and you have dissidents who can communicate with perfect security. It was objectively worth that trade-off. And the people at the time who were making the decisions were not thinking about all of the positive possible futures; they were thinking only about this one particular bad thing that might happen. I have no idea how this calculus plays out for language models. It really depends on exactly how capable you think they're going to become. I don't know what that's going to be, but I am particularly worried about concentration of power, because that is something that does not require assuming super-advanced levels of capability. Imagine that models get better than they are now but don't become superhuman in any meaningful way. I think concentration of power is a very real risk, and I would really be advocating for open source in that world, because otherwise you end up with some small number of people who have the ability to do things that no one else can do. And this is not a thing that needs to assume some futuristic malicious-model scenario; we know some people in the world just like to accumulate power. So I'm worried about that, and distributing this technology is very, very good for that reason. But I think it's very hard to predict what will happen with better models. That's why this question is harder than the one for cryptography, because it was very clear at that time what the limit was: it gives everyone perfect encryption. We didn't know what the consequences might be, but you're not going to get more perfect encryption than perfect encryption. Whereas with the models we have now, I don't know: maybe they continue scaling, maybe they don't, maybe they scale but it takes 50 years and gives society time to adapt, maybe they scale and it takes 2 years and society doesn't have time to adapt. I don't know where this is going.
I think this is why I want these policy people to be thinking about this more than I am, because as long as they're willing to think through the consequences, these are the kinds of people who I think can give better answers than I can. There are reasons to believe that the answer might be: we should be concerned. There are reasons to believe the answer would be: we should obviously distribute this to as many people as possible. I think if anyone were to say that one of these answers is obviously, objectively right with 100 percent certainty, they're way overconfident. I tend to be biased based on what has been true in the past; some people tend to be biased based on what they think might be true in the future. I think these are reasonable positions to take as long as you're willing to accept that the other one might also be correct. So we'll see how things go. I think the next two years will give us a much better perspective on how all this stuff is going, and hopefully it will become much clearer whether these things are working out. I think if they do, we'll learn very fast that they are; and if they don't, then the limitations will probably start to hit in the next couple of years. That should give us a lot more clarity. The question is, is that too late? I don't know. Ask a policy person these questions.
Nathan Labenz: (2:31:49) Yeah. No easy answers, but the one thing I do always feel pretty comfortable and confident asserting is that it's not a domain for ideology. It is really important to look the facts square on and try to confront them as they really are. So
Nicholas Carlini: (2:32:06) I definitely agree.
Nathan Labenz: (2:32:07) I appreciate all the hard work that you've put in over the years, lighting that path and showing what the realities in fact are, and I really appreciate all the time that you've spent with me today bringing it all to light.
Nicholas Carlini: (2:32:18) Good to talk.
Nathan Labenz: (2:32:19) So, anything else you want to share before we break? Could be, you know, a call for collaborators or anything.
Nicholas Carlini: (2:32:26) No, I don't have anything in particular. I just want more people to do good science on these kinds of things. I think it's very easy for people to try to make decisions when they're not informed about what the state of the world actually looks like. My biggest concern is that we will start having people make decisions who are uninformed in important ways, and as a result we'll come to regret the decisions that we made. I'd rather all the decisions be made based on the best available facts at the time; that's really the best you can hope for. Maybe you made the wrong decision, but you looked at all the knowledge. The thing I'm worried about is that we will have the knowledge and people will just make decisions independent of what is true in the world, based on some ideology of what they want to be true in the world. And as long as that doesn't happen, maybe we get it wrong, but at least we got our best shot.
Nathan Labenz: (2:33:13) Well, you keep breaking things and figuring out what's true out there, and maybe we can check in again before too long and do an update on what the state of the world is and try to make sure policymakers are aware.
Nicholas Carlini: (2:33:26) For now That'd be great.
Nathan Labenz: (2:33:27) For now, this has been excellent. Nicholas Carlini, prolific security researcher at Google DeepMind, thank you for being part of The Cognitive Revolution.
Nicholas Carlini: (2:33:34) Thank you for having me.
Nathan Labenz: (2:33:36) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.