AI Deception, Interpretability, and Affordances with Apollo Research CEO Marius Hobbhahn

Marius Hobbhahn discusses AI deception, interpretability, and affordances with Nathan Labenz on the Cognitive Revolution Podcast.

1970-01-01T01:18:00.000Z

Watch Episode Here


Video Description

In this episode, Marius Hobbhahn, CEO of Apollo Research, sits down with Nathan Labenz to discuss Apollo’s research in AI deception, interpretability, and affordances. If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.

SPONSORS:

Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and 1,000,000s of other entrepreneurs across 175 countries.From their all-in-one e-commerce platform, to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions using AI. Sign up for $1/month trial period: https://shopify.com/cognitive

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.


NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.


X/SOCIAL:
@labenz (Nathan)
@MariusHobbhahn (Marius)
@eriktorenberg (Erik)
@CogRev_Podcast

TIMESTAMPS:
(00:00:00) - Episode Preview
(00:05:40) - Understanding the Role of Apollo Research
(00:08:02) - The Evolution of AI Safety
(00:11:30) - Apollo’s Framework
(00:15:10) - Sponsors: Shopify
(00:16:49) - Understanding AI Affordances and Resulting Interactions in the World
(00:31:00) - Sponsors: Omneky
(00:39:00) - Interpretability and deceptive alignment
(00:45:55) - Why might deception arise in the first place?
(00:47:46) - Understanding deceptive alignment
(00:57:49) - New Architectures
(01:02:23) - Interpretability at deployment phase
(01:03:35) - Deception in AI | A Case Study
(01:09:03) - Deceptive AI Stock Trader
(01:18:51) - Impact of discouragement on unethical AI behaviour
(01:19:45) - Giving the AI a Reasoning Scratchpad and Impact on Deception
(01:21:06) - Double-edged sword of removing the scratchpad
(01:23:00) - Analyzing impact of model size on insider training
(01:28:00) - Challenges and necessity of 3rd Party Auditing
(01:31:00) - Role of government in regulation
(01:52:17) - How can individuals get involved in red teaming?



Full Transcript

Transcript

Marius Hobbhahn: (0:00) So the more pressure we add, the more likely the model is to to be deceptive. So kind of in the same way in which a human would act, it also acts. You know, removing pressure and and adding additional options will very quickly decrease the probability of being deceptive. Open source has been really good so far in many, many ways. It has been very positive for society. Right? I think a lot of ML research could not have happened without open source. A lot of safety research could not have happened with open source. At some point, the system is so powerful that you don't want it to be open source anymore in the same way in which, you know, I don't want to open source the nuclear codes or, like, you know, literally the recipe to build, you know, most viral pandemic or something. The labs maybe have the incentive to not say the worst things they found because otherwise they may lose their contract. So you need something like the UK AI Institute or the US AI Safety Institute. Make sure that there is a minimal set of standards that all the auditors have to adhere to.

Nathan Labenz: (0:54) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, my guest is Marius Haban, founder and CEO of Apollo Research, a nonprofit AI safety research group that is working to understand both how AI systems behave and why. Their approach combines exploratory and hypothesis driven testing, fine tuning experiments, and interpretability research. And as you'll hear, they place special emphasis on the potential for AI systems to deceive their human users. In this conversation, we look first at Apollo's starting framework for their work, which emphasizes the importance of affordances in AI systems. That is, through what tools, actuators, or other means can the system affect the broader world? And they also introduce a number of new conceptual distinctions meant to help people have more precise and productive conversations about these nuanced topics. Then in the second half, we look at their first research result, which demonstrates, to my knowledge for the first time in a realistic, unprompted setting, that GPT-four, when put under pressure, will sometimes take unethical and even illegal actions and then go on to lie to its users about what it did and why. This is an important result, demonstrating that while the risk from AI systems may start with and may even be dominated by intentional human misuse, the models themselves can also misbehave in unexpected ways. As an aside, since I told my behind the scenes GPT-4Red Team story a few weeks ago, a number of people have reached out to ask me how they too can get involved with red teaming projects. Unfortunately, as commercial competition and secrecy both continue to ramp up across the space, I don't see as many open calls for volunteer red teamers as I used to, certainly not for unreleased frontier models. Instead, the field is becoming more professionalized with all the leading labs as well as the data companies like Scale AI, plus the independent auditing organizations like Apollo, ARC EVALs, now known as METR, Palisade, and also AI Forensics, all actively hiring research scientists and engineers in this area. So does that mean that there's no longer a role for the independent hobbyist red teamer to play? On the contrary, there is a ton left to discover even on publicly released models. The best way to break into the field is to demonstrate your ability to discover new phenomena. Importantly, the work we cover in this episode could have been done by anyone with an OpenAI account, a knack for prompting, and just a tiny bit of coding know how. No special access or advanced machine learning techniques were required, just a lot of curiosity. With that in mind, if you want to get into this line of work but aren't sure where to start, I encourage you to reach out. I'll be happy to help brainstorm or refine your project ideas, and I can also help connect you with folks at the top companies who do sometimes provide API credits to independent researchers working in this area if and when you can achieve a meaningful result. As always, we appreciate the time that you spend listening to the Cognitive Revolution, and we hope it's a valuable guide to the AI era. If you feel that it is, we would love a review on Apple Podcasts or Spotify, and we, of course, encourage you to share the show with your friends. Now here's my conversation on frontier AI safety work with Marius Habhan of Apollo Research. Marius Habhan, founder and CEO of Apollo Research, welcome to the Cognitive Revolution.

Marius Hobbhahn: (4:56) Hey, thanks for having me.

Nathan Labenz: (4:58) I am very excited to have you. So regular listeners of the show will know that I'm a big believer in the importance of hands on testing of what AI systems can do and also that I have been a pretty enthusiastic consumer of the news when some of the leading labs have made public commitments to allow organizations outside of their own teams to look at the systems that they're building before they get deployed. And so your work with Apollo Research, which is trying to build, as I understand it, an organization to meet that need and actually work with those leading labs in part at least on understanding the systems that they are developing before they get to widespread deployment, I think is super interesting, I'm very excited to unpack the details of it with you. Maybe for starters, you wanna just kind of give us the quick overview on Apollo Research, like how you decided to set out to found it? I'm interested a little bit in the timeline of how that related to some of the commitments that the labs have made and what you guys are trying to do in the big picture.

Marius Hobbhahn: (6:05) So I think on a high level, it's it's sort of trying to understand what is going on in AI systems. And the reason for this was or or still is, in fact. Yeah. I basically think right now, there's a we just lack information to make good decisions. There's loads of uncertainty that we have about, like, you know, what could go wrong, whether we are already at point where things go wrong, or how how far away we are from these points. And, yeah, we are trying to reduce this uncertainty, like and this is mostly through, research, auditing, and governance. And on the research side, it's really split between interpretability and eval or, like, behavioral evals, half half. But in the long run, we really want to merge them both because I I basically think what we what we need in the long run is both a mixture of behavioral and interpretability evals so that we can really understand what the model is doing and then also why it is doing this in the first place, because each of them individually seems somewhat insufficient. And, yeah, maybe maybe to go into the into, like, the origin story, it has actually nothing to do with the commitments of the different labs. It was mostly that at the beginning of this year, they're just I kind of felt like I had a pretty clear picture of what is lacking in the current space with deceptive alignment and evaluating deceptive alignment or models for deceptive alignment in the first place. And, interpretability in eval just seemed like the obvious things to do. So in the beginning, we set basically set out to do mostly research. And only then sort of over time we realized, hey, this is something that should be applied in the real world as soon as possible because we are systems are getting better all the time and we may actually hit this point fairly soon where models are already about at the threshold of deceptive of capabilities for deceptive alignment. And then there is a small part in the organization that is governance, which originally we also didn't really intend to do for, you know, the first 2 years or something because we thought, you know, like, we really need to understand all the research very well before we can talk to the people in government in governments and and decision makers and lawmakers because, know, otherwise, we're telling them things we aren't, like, super confident in. And then lots of things happened. Governments and lawmakers actually got interested in AI AI safety in particular. And then when we talked to them, we realized the difference. Like, we are very, very well placed to talk about these things. Because if you have if you have thought about them for, you know, like, sort of in the background for, like, 6, 7 years and then specifically about some topic for 6 months or so, you are among the world's experts. This is kind of, you know, more like a reflection of the state of how bad it is about AI safety, where, you know, people people in in my position are are actually sort of accidentally becoming the experts, rather than, you know, like, people with tens and 20 years of experience because there is like you know, there there just aren't a lot of people in the world who have thought about AI safety for more than a couple of years, if at all.

Nathan Labenz: (9:14) Yeah. I can definitely relate to that sort of accidental expert status. I never expected to be where I am in doing the things that I'm doing. But, yeah, the whole AI field, you know, in some ways is kind of the dog that caught the car. I always kind of come back to that metaphor where, you know, it's like we were just trying to build a bit more powerful AI and all of a sudden we built like a lot more powerful AI and now we really kind of have to figure out what to do with it. So even a little bit of advanced planning is better than or a little bit of advanced thought is a lot better than where most people are starting. Had you seen when you when you actually started the organization, had you seen GPT-four, or were you basing this decision on just what was public at the time?

Marius Hobbhahn: (10:04) Only what was public at the time. So the decision was made in February 2023, or at least sort of my internal commitment was made to this. I'm not sure whether GPT-four was public already at the time. Not quite, right? It was March. So no, it was independent of GPT-four.

Nathan Labenz: (10:24) Yeah. I always think that's interesting just because GPT-four was such a wake up moment for so many people, certainly I would include myself in that. I was already extremely plugged into what was going on and using it and fine tuning tons of models on the OpenAI platform in particular. But then it was like, Woah, this thing is next level. It's not slowing down. We've gone from sort of, I can put a lot of elbow grease in and get a fine tuned model to do a particular task, which already I thought was going to be economically transformative, to, I don't even need to do that, that I could just ask for a lot of these tasks and get pretty good 0 shot performance. For me, that was the moment where I was kind of like, okay, this is going from a tool that I am really excited to use and having a lot of fun using to something that seems like a force that needs to be understood from all angles. So let's unpack the perspective that you are bringing to this. I would encourage folks to look up these papers that we'll discuss and read them for themselves as well. But on the website, you've got 2 recent publications. 1 is kind of a framework for organizing the work that you're gonna do, and then the other is like a very detailed in the weeds investigation of a of a particular AI behavior, namely deceiving the user, which I think is a is a super interesting and important 1 to study. But let's maybe just start with the big picture, like organizing the thoughts. I get the sense that you think again, well, you've kind of said this, like, the that and the paper certainly reflects it, that there are, like, a lot of big questions that remain unanswered. So how do you structure your approach to this topic given all the uncertainty that exists?

Marius Hobbhahn: (12:11) Yeah. Maybe to give a little bit of context. So, like, so, you know, this is this is only 1 paper of many in in this space, and there is, I think yeah. Earlier this year, there was a really big 1 called Model Evaluation for Extreme Risk, which, yeah, we at Apollo definitely thought was a pretty good paper. And they they're sort of pointing out many of the, like, very reasonable and important steps or, like, reasonable principles for external auditing, something like ramp up the auditing before you ramp up the exposure to the real world and, like, do this, you know, ahead of the curve, so to speak. But when we when we read the paper, we felt a little bit like, you know, this makes sense for the current capabilities and sort of how current models are being built. But if we think ahead of, like, what the next couple of years should look or not should, could look like, then, yeah, there are, like, loads of open questions. And we were trying to understand how do they fit into this framework, because, yeah, we internally were trying to, like, make sense of this in the first place. So just to give you a com a couple of them, like, what happens if your model has the ability to do online learning? When have like, how often do you have to audit it? Should you reaudit it, like, during the online learning? If if yes, how often? What if you give the model access to the Internet or to a database or to anything like this? Yeah. I think, like, you know, a model a model with and without access to the Internet is basically 2 different like, 2 very very different models. The 1 with in with access to the Internet is just so much more powerful if can use it even on a very basic level. So, yeah, it feels like if you give your model affordances like this, you kind of have to rethink how dangerous it is and where the danger comes from because it suddenly is like a totally different threat model potentially. And so what we did for the paper, and really the credit should to Lee Sharkey here, who is my co founder, who has done most of the hard work or if not all of the hard work for this paper. And so what we were doing is thinking from first principles. Where does the risk come from and, like, what changes to the AI system, do create new risks? And then basically, the answer is, well, we have to audit wherever risk is created. And then the more we looked into this, the more we realized, well, there are actually a lot of places where you where, like, new risk comes into the system, at least potentially, and therefore where audits, at least in an ideal world, should happen. You know, there are obviously some constraints, but I think, you know, if we think about where are we 5 years from now, then I think, yeah, if there is, like, actually a big auditing ecosystem around this, then there will be very, very many different organizations auditing really different places. And then the other other point of the paper was just to define many concepts and create the language to discuss all of these things because we had sort of many internal discussions where we were like, Oh, the thing we mean is this. And then we had an example, and then we kind of needed a name for it. There wasn't really a name, so we decided, okay. Let's define all of the relevant terms for this and, and then sort of have a language to talk about this in the first place.

Nathan Labenz: (15:09) Hey. We'll continue our interview in a moment after a word from our sponsors. So let's let's dig in in a little bit, deeper detail. I I like the premise that you set out within the paper, which is to work backward from AI effect in the real world, you know, and and try to imagine, like, where are these effects gonna happen and then how can we get upstream of that and help shape them in a positive way. I would be interested to hear you kind of describe that backward chaining process in a little bit more detail. And then I thought some of your concepts also were really helpful clarifications and distinctions. So maybe you can highlight some of the ones that you think are most, useful that you'd like to see get into broader circulation as well.

Marius Hobbhahn: (15:53) So, basically, we started from, okay, the system a system will interact with the world in a particular way. And then, you know, there are many, many different ways in which it can interact with the world. And then there's sort of a, like, a whole chain of things that have had to happen until the model can act interact with the world in this particular way. So, you know, maybe it has been fine tuned. Maybe it has been given access to the Internet. Before that, it has to have been trained. Before that, there has to have been the decision that this model should be trained in the first place. And so the question is like, what are the kind of important decisions at all of these different points in time, and how then can we ensure that people actually make decisions that will lead to outcomes at the end of the chain, such that the model or the system interacts with the world in a safe manner. This is maybe the first distinction that is worth pointing out or the reason why I'm correcting myself all the time is there's really a difference between AI model and AI system. And this is not something we came up with. This already exists before, but I think it's worth pointing out and sort of getting in, like, really hammering into people's head when they think about AI. So the AI model is just the weights, maybe behind an API, but even with the API, it's kind of already a system. The system then is sort of the weights plus everything around it. There could be scaffolding, there could be access to tools, this could be content filters, this could even be just an API retrieval databases, etcetera, really the full package where you say, okay, there's stuff around weights that increase the capabilities of the model and menu, or at least change the capabilities of the raw model in some sense, not necessarily always increasing. Filters, for example, may decrease it. Then there are sort of other weird ways or, like yeah. Once you think about this, there are sort of a couple of other concepts that that feel important to to clarify. Because when people say capabilities, this can mean very different things, right? And so we we categorize this into 3 different classes. The first 1 is absolute capabilities, which we think of basically the hypothetical capabilities given any set of affordances. So if you have GPT-four without the Internet, right, then in the space of absolute capabilities would be a GPT-four with Internet, so or, like, things that this model could do. So the question is, like, if we give additional things to the system, how big is the space of actions it could take? So, you know and then, obviously, there's a question of, like, how imaginary do we get here? Does it get access to a Dyson sphere, or does it get access to a government or something like this? It basically points out sort of the what could this model do if we gave it a lot of things, everything that we can basically think of. Thinking about this in the first place only makes sense for models that have to become more general like, the GPTs because, you know, for an MNIST filter or, like, for an MNIST classifier, this doesn't make any sense. Like, an MNIST classifier plus Internet is is, like, is exactly as capable as just the MNIST classifier itself. But, yeah, for systems that are more general, suddenly you have this difference between things that only the system can do, or, like, the the the basic system plus things that you could do hypothetically with a lot of additional, affordances. Then the second 1 is contextual capabilities, which is things that are achievable in the context right now. So for example, with ChatGPT, you can enable it to have access to tools, and then you can browse the web, and this is something that it can do right now. You don't have to add anything on top of this. And this is sort of this is sort of the smallest category of things, which you can do without any additional modification. And then reachable capabilities is contextual capabilities plus achievable through extra effort. So, for example, this could mean ChatGPT itself may not have access to, a calculator, but if it has access to the Internet, it can, like, Google and then find a calculator and then use that calculator. So it's sort of a 2 step process, right, where it has to use 1 affordance or capability to then achieve another. So this is what we call reachable capabilities. And, yeah, so the reason why we are making all of this differentiation, even though it sounds maybe a little bit too much in the weeds, is when people talk about capabilities and regulating capabilities and designing laws for capabilities, the question is, which ones? Right? Do you mean the contextual capabilities, so the ones that the model has literally right now, or the reachable capabilities, so which the model could reach with additional effort, or the absolute, like, the maximum potential space of capabilities. And, you know, right now, this may sound like we're too much in the weeds, but I think in a few months, this will sound very, very relevant suddenly because the models will be more capable, and then they will actually be able to just, smart enough to use the Internet to to, like, find additional tools that they can then use or or, like, convince someone to give them access to a shell, and then use that because they're already like, you know, they can learn it in context or they know it anyways. And at that point, really, question is, what should the auditors audit for? Which capabilities? And and that becomes, like, pretty quickly, like, a very, very big space of things. Right? So, like, if the auditor not only has to think about what kind of tools do you give the AI, but also what kind of tools could the AI get access to through some means, suddenly you have this whole space of thousands of things it could do. It's really a question of like or like a trade off between what is plausibly doable in the real world versus how much risk can we actually mitigate. And I'm honestly very unsure about about, like, where we're heading at this point.

Nathan Labenz: (22:02) So just to riff on and and kinda emphasize some of the the value that I see in in some of these distinctions, I think it's helpful to clarify the difference between a model and a system. I think there is a tremendous amount of confusion online. To my chagrin, I've probably even contributed to some of it at times where people are like, know, Chad GPT was doing this for me and now it's not anymore. And I've sometimes said like, Well, they haven't updated the model, so it probably hasn't changed that much. And I think what I've maybe neglected in some of those moments is like, But they might have changed the system prompt or, you know, as we're seeing I mean, even just this last couple of weeks, there's been this really interesting phenomenon of the of GPT-four getting, quote unquote, lazier. And people are speculating that maybe that's because they feed the date into it and it knows that we're in December and it knows that people don't work as hard or as productively in December. And so maybe it's like kind of phoning it in because it's like imitating the broad swath of humans that it seemed like, you know, kind of work halftime in December or whatever. I've even seen some experiments just in the last couple days that suggest that there might even be real truth to that. Who knows? I'd say that the question remains open. But there's a there is an important difference, you know, and it's it's worth getting clarity on. The model itself with static weights not changing versus even just a system prompt that can perhaps have, you know, even unexpected drift along the dimension of something as seemingly benign as today's date. So that's important to keep in mind. The levels of capabilities, I think, are also really interesting. And I wanna ask 1 kind of I have a couple questions on this, but I think I have a clear sense of what is meant by contextual. What can it do now given the packaging? What can GPT-four do in the context of ChatGPT where it has a code interpreter and it has browse with Bing and it has the ability to call Dolly 3 to make an image and probably a couple of other things that I'm not even remembering. Plugins perhaps as well, right, which obviously and GPTs, which proliferates all the affordances all that much more. On the other end, I feel like I sort of understand absolute, which is like a theoretical max. Could you give me a little like, how do I understand reachable as kind of between those? Like, what's what's the distinction between reachable and absolute?

Marius Hobbhahn: (24:25) Yeah. So so maybe maybe 1 way to think of it is like the contextual capabilities are the ones kind of that a user explicitly gave it, and then the reachable ones are those that may also be reachable without the user even having thought about that the model actually will will use them. Right? So if you say you know, like, if if the model would be able to browse the web, like, entirely on its own, which I'm not sure it it currently can do or, like, what exactly the restrictions on search with Bing are, But if it was able to do that, right, you may not you may not have realized that it has a reachable capability through the Internet of, like, firing up a shell somewhere or, like, renting a GPU and and, like, doing or, like, running a physics simulation through, like, an online, physics simulator if that if that's something that's available. And so so these are this is sort of like how which tools can it reach through the contextual capabilities that it already has, given by you or was been has been given by you.

Nathan Labenz: (25:35) Gotcha. Okay. So solving a CAPTCHA by hiring an Upwork contractor for example So, to take 1 infamous okay, here's a challenging question, but and I necessarily expect an answer, but maybe you could venture an answer or you could just kind of describe how you begin to think about it. What would you say are the absolute capabilities of GPT-four?

Marius Hobbhahn: (26:02) Yeah, very unclear. So I think they're definitely they're not infinite, as in, you know, like, even with extremely good scaffolding and access to the Internet and many other things, I think people haven't been able to get it to do economically valuable tasks at the level of a human, at least for long time spans, for example. So the question is obviously, is this you know, are we just too bad, and have we not figured out the right prompting yet and the right scaffolding and so on, or or is this just a limitation of the system? And my current guess is, like, there is probably a limit to the absolute capabilities, and it's probably lower than, like, what a human can do. But we're not that far away from it. So, you know, I think with an with additional training, with additional, like, specifically LM, like, training that is more goal direct or makes it into more goal directed and an agent and better scaffolding, I think there will be ways in which the absolute capabilities could increase, quite a bit in the near future. Yeah. Does this make sense?

Nathan Labenz: (27:09) Yeah. I mean, it's hard. Right? Certainly, listeners to the the show will know from repeated storytelling on my part that I was 1 of the volunteer testers of the GPT-4early model back in August, September. And I really kind of challenged myself to try to answer that question independently, like, what is the theoretical max of what this thing can do? How much could it break down big problems and delegate to itself? And it basically came to the same conclusion that you did, which is doesn't seem like it can do really big tasks. I mean, again, it's confusing. Right? Because then you could also look at the dimension of how big the task is versus how much you break it down. And just in the last week, I've been doing something for a very sort of mundane project. But actually using GPT-four to run evals on other language model output, I have found that if I have like 10 tasks, 10 dimensions of evaluation, and I ask it to run all of those, it is now capable of following those directions and executing the tasks 1 x 1. But the quality kind of suffers. It sort of makes mistakes. It sometimes muddies the tasks a little bit between each other, and it's definitely not at a human level given 10 tasks to do in 1 generation. On the flip side, though, if I take it down to 1 task per generation, which I didn't want to do because that will increase our cost and latency and just is less convenient for me. But then it kind of pops up to, honestly, I would say pretty much human level, if not above. So there's interesting dimension. I guess it seems pretty the sort of magnitude of the task seems like a pretty important dimension for evaluating a question like absolute capabilities. Right? It's like if it's a super narrow thing, it has it's like more it's it's capable of some pretty high spikes. But if it's a if it's a big thing, it kinda gets lost. Would would you refine that characterization at all?

Marius Hobbhahn: (29:16) Yeah. Yeah. I'm not sure how to think about it, honestly. So I think of absolute capability is really more of a sort of theoretical bound that we could that we are probably not going to approximate in practice even if we test, like, a lot a lot. And then the then, like, breaking it down into different tasks, I'm not sure. I feel like this is a different capability then. Right? Like, you're sort of the the capability of doing 10 things at once is a different thing than the capability of doing 1 thing 10 time like, 10 10 diff different things but 1 x 1. So, yeah, I would say it's basically you're talking about, different capabilities then, at least in this framework.

Nathan Labenz: (29:55) Hey. We'll We'll continue our interview in a moment after a word from our sponsors. Yeah. And it is not that good at decomposing the tasks. So I've also kind of experimented a little bit with, like, can you give it that list of 10 tasks and can it, you know, break them down and self delegate with an effective prompt? It's like maybe a little bit closer there, but still not getting nearly as good results as if I just rolled my sleeves and do the task decomposition. So you mentioned that you expect this frontier to obviously continue to move. 1 way to ask the question is, what is? But a more sensible way to ask the question is, do you have a set of expectations for how the capabilities frontier will move? I definitely look at things like OpenAI's publication from earlier this year where they started to give denser feedback on kind of every step of the reasoning process and they achieved some state of the art results on mathematical reasoning that way. And when I think about affordances and I think about the failure modes that I've seen with these GPT-four agent type systems, I think, Man, you apply that to browsing the web and using APIs, and it seems like that stuff is ultimately a lot less cognitively demanding than pure math. It seems like we probably are going to see And I would guess that it's maybe already working pretty well. AGI has been achieved internally. I don't know about AGI, but I would expect that some of this stuff is already pretty far along in kind of internal prototyping. But how does that compare to what you would expect to see coming online over the next few months?

Marius Hobbhahn: (31:43) Yeah. I mean, it's obviously hard to say, and I can only speculate. I think on a high level, what what I would expect the big trends to be and also what we are kind of looking looking forward to to evaluating is LM agents. I think this is, like, pretty agreed upon. You know? From first principle, I think it also makes sense. It's just like, where does the money come from? It is from AI systems doing economically useful tasks, and often economically useful tasks just require you to do things independently, being goal directed over, like, a longer period of time. And the longer you a model can do things on its own, the more money you can squeeze out of it. So I definitely think just from financial interests, all of the AGI labs will definitely try to get in more more agentic ways. How far they they have come, I don't know. But, yeah, I expect that next year we will see, quite some surprises. Then multimodality is the other 1 where, yeah, I think people kind of like, over the last couple of years with with more more like, more and more multimodal models, people just realized, like, kind of it's it's not that different from just training text. Right? It's sort of you plug in the additional modalities, you change your your training slightly, but it's not much more than that. Like, obviously, it's obviously hard in practice. Right? There's a ton of engineering challenges and so on. But on on a conceptual level, it's not there isn't any big breakthrough needed. So people will just add more and more modalities on bigger and bigger models and train it all jointly end to end, and it kind of just works. And then tool use is the last 1. And that I think people, yeah, people actually were quite surprised by how, like, quote, unquote, easy it was to to get to these to this level. So, yeah, I think peep when people realize, like, oh, these language models are already pretty good, like, how fast do they learn how to use any any tool we can think of, they were surprised by how fast, they learned the tools. And now it's mostly a question of sort of really baking in the tool into the model in a way that it's, like, very robustly able to use the model rather than just a little bit or just showing that it's it's sometimes work. But, yeah, I mean, you know, like, I think if you have an LM agent that is multimodal and that has very good tool use, like, I'm not quite sure how far you are away from AGI. Right? Like, at that point, you kind of have almost all of the ingredients ready, and then it's really just a question of how robust is the system. So, yeah, I think these are the trends we see right now, and this is also why many people in the big labs have very, very short timelines because they can think, like, 2 years ahead and sort of where this is going or maybe even just 1 year ahead. I don't know.

Nathan Labenz: (34:28) When you talk about the surprise, like people were surprised at how easy it was to get tool used to work, are you referring to people in the leading you know, the obvious the usual suspects of of leading developers?

Marius Hobbhahn: (34:41) It's hard to say. I mean, I can only speculate on this, but, you know, the tool former paper was, like, 3 months or, like, was published 3 months before OpenAI just released their tool use. And, I mean, they probably have been working on this before, but still, you you know, like, the from from having the scientific insight to to, like, publishing this and releasing this in the real world, I think there just was less work involved than is ex or is is typical for most of the bigger AI, like, development cycles. I could be wrong on this. This is this is more hearsay, so, yeah, take it with a grain of salt.

Nathan Labenz: (35:19) Yeah. It seems right to me as well. And I I agree with you. The emphasis on multimodality as a new unlock makes a ton of sense even just in this kind of agent paradigm of, you know, can I browse the Internet or whatever? I've done a lot of browser automation type work in the past. And the difference between having to grab all the HTML that is often these days extremely bloated and kind of semi auto generated and in some cases, like, deliberately, you know, generated to be hard to parse, you know, from like some you know, the Googles and Facebook, like, they don't want you scraping their content. So they're they're kind of not making it easy on the on the browser automators. The difference between that and just being able to look at the screen and understand what's going on, you kind of put it through a human lens and you're like, Yeah, it's a hell of lot easier to see what's going on on the screen than to read all this HTML. And sure enough, models kind of behave similarly. I remember for me looking at the Flamingo architecture when that was first published, I think April 2022, so a little more than a year and a half ago now, And just thinking like, oh my god, if this works, everything's going to work. You know, it was like they had a language model frozen. They had kind of stitched in the visual stuff and, like, kind of added a couple layers, but it really looked to me like, Man, this is tinkering stage and it's just working. Like you, I don't want to dismiss the fact that there's obviously a decent amount of, I'm sure, labor and probably at times tedious labor that has to go into overcoming the little stumbling points. But conceptually, it is amazing how simple a lot of these unlocks have been over the last couple of years. And you see this too in just the pace at which people are putting out papers. You look at the 1 team that I follow particularly closely is the team at Google DeepMind that is doing medical focused models. And they're good for 1 every 3 months, and they're significant advances where it's like, Oh yeah, this time we added multimodality and this time we tackled differential diagnosis. And again, it seems like there's not a lot of time for failures between these successes. So it does seem like, yeah, we're not at the end of this by any means just yet. A lot is coming at us. It's going to presumably continue to get weird. You're trying to push both as much as you can the understanding of what can the systems do, you know, as users, what is what are their limits? And then at the same time, you're trying to dig into the models, and this is the interpretability side, and figure out what's going on in there. And can we kind of connect the external behaviors to these internal states? So tell us about that side of the, research agenda as well.

Marius Hobbhahn: (38:27) Yeah. So on the interpretability side, like, my thinking is basically it would be so great if interpretability worked. Right? It would make so many questions easier. Like, if you ask questions on accountability, right, if you have causal interpretability method, you would be able to just, you know, tell the judge if the model would have if we would have changed these variables, the model would have acted in differently in this way, and we could just basically solve that. Biases, probably also like, you know, social biases, much easier to solve because you could intervene on them or, like, fix the internal misunderstandings and concepts. It's it's also extremely helpful for, like basically, all of the different extreme risks. Right? Like, it would be much easier to understand the internal plans and how it thinks about problems, how it approaches them, and so on. And then it would also make iterations on alignment methods much, much easier, I think. As in, you know, let's say somebody says, oh, RLHF is is, like, already working. We see this in practice. Then, you know, you could use the interpretability tools to test, does in, you know, does RLHF actually work, or does it only, like, superficially, like, hide the problem or something like this, or does it actually, like, deep down solve the the root? And then I think my biggest, sort of the the biggest reason for me for focusing on interpretability in the first place is, deceptive alignment, where, you know, models appear aligned to the outside and to the user, but internally, they actually follow different goals. They just know that you have a different goal. Therefore, like, in order for for you to think it is nice, they act in that way. And, yeah, I basically think almost not not almost all, but a lot of the scenarios in which AI goes really, really badly go through some form of deceptive alignment, where at some point, the model is seen as nice, and people think it is aligned, and people give it access to the Internet and resources and, like, train it more and more and more and make it more powerful. But internally, it is actually pursuing a different goal, and it is smart enough to hide this true intention until it knows that it can sort of cash out and then follow on this actual goal without us being able to stop it anymore. So, yeah, that's what I'm really worried about. And and and and interpretability obviously seems like 1 of the most obvious ways to to test for deceptive alignment or, like, to at least investigate the phenomenon because you know what it's thinking inside. There are still, you know, there are still some cases where deceptive alignment where even with good interpretability tools, deceptive alignment could still somehow be a thing. But generally speaking, I think it would be much, much, much harder for the model to pull off. So right now, I think, interpretability is just not at the at it's, like, not practically useful. So, you know, we cannot use any existing interpretability tool and and, like, throw it on GPT-three or GPT-four because none of them have enough or developed enough that they give us insight that really meaningfully change our minds. And so, yeah, this is why our agenda is separate in the first place between behavioral evals and interpretability despite us wanting to do them jointly in the long run, but they're given that there is such a huge gap on applicability, I think that this is definitely a problem that that we're trying to mitigate here. And then the 1 question for me is also, like, how hard will interpretability turn out to be? And there are you know, various people have argued that interpretability will be extremely hard because models are so big and complicated, and and therefore, will be hard to enumerate, you know, all of the concepts and actually understand what the hell is going on inside. And I'm more of the perspective that you know, I understand the reason why they think it's hard, but I also think there are many reasons to assume it's it's gonna be, like, doable. It's if if we put our minds to it as humanity, we'll probably figure it out. The primary reason, I think, is we have full full interventional access to the model. Right? We can see every activation. We can ablate everything we want. We have you know, it's not just be it's not just observational studies. You can really intervene on the system. And generally speaking, I would say, as soon as you can intervene on the system, you can test your hypothesis very, very quickly, and you can iterate very fast. And so I think we will be able to figure out interpretability, you know, in in the next couple of years to to an extent where we can actually sort of say it is now useful on real world models, on frontier models. How expensive this is going to be, I don't know yet, but I think it will at least be technically feasible.

Nathan Labenz: (43:21) Yeah. I've definitely updated my thinking a lot in that direction from a pretty naive, just kind of, know, hey. It sounds really hard black box problem. Nobody knows what's going on in there to today, would say, Wow, there's really a lot of progress. The progress of interpretability over the last, say, 2 years has definitely exceeded my expectations and given me a lot more, I wouldn't I mean, have sort of confidence, but, know, at least reason to believe that with some time, but not necessarily, you know, a ton of time that we really could get to a much better place in our understanding. So I'm with you on that. I want I a number of follow-up questions, I think, on this 0.1, let's maybe just give the account for, like, why deception might arise in the first place. You can complicate I'll give you a super simple version. You can refine it or complicate it. I usually kind of cite Adeja on this, and she has a pretty simple story of what the model is trying to do, what it is rewarded for in the context of an RLHF like training regime is getting a high feedback score from the user. And it probably becomes useful as a means to maximizing that score to model human psychology as an explicit part of how you're going to solve the problem. Right? I think we certainly humans do this with respect to each other. Right? I ask you for something, you ask me for something. We interpret that not only as the extremely literal definition of the task, but also kind of have a sense for what does this person really care about, what are they really looking for, and we can incorporate that into the way that we respond. It certainly seems like the heavier you do, the more emphasis you put on this kind of reinforcement learning from human feedback, the more likely the models are to start to create a distinction between the task as sort of narrowly objectively scoped, let's say, and the kind of human psychology element that is going to feed into its rating. And then if you have that, know, if you have that decoupling, then you have kind of potential for all sorts of misalignment, you know, including deception. How does that compare to the way you typically think about it?

Marius Hobbhahn: (45:51) Yeah. I mean, I think the the like, this kind of version through RLHF is 1 potential path. I'm yeah. I actually think the jury is still out there on this. Like, you know, I definitely see the hypothesis and where it's coming from, but I could also just totally imagine that, you know, the training signal is sufficiently diverse and it updates sort of sufficiently deep that RLHF kind of just does the thing we wanted to do without be for without the model becoming deceptive. I could also see, like, the the story in which it in which it would become deceptive. I think, like, on on a very high level, the the way the the reason I think why why models would become deceptive is because at some point, they will have long term goals. They will have something that they care about, like, more beyond the current episode, you know, beyond pleasing the user at at this point in time. And then the question really is and and and then I think there are, like, 2 core conditions under which, like, if the the more they are fulfilled, the more likely the model is is is becoming deceptive. Like, how important is the the this long term goal to the model itself? Meaning, how much does this goal trade off, for example, with other goals it has? So, for example, if it care if it cares a ton about something, then it's more likely to be deceptive with respect to this because it really wants to achieve this. And then secondly, how much do others care about me not the AI not achieving this goal in the first place? Something like contestedness. Right? So for example, if I wanna pick a flower and I care a lot about this, I don't need to be deceptive because nobody wants to stop me from picking that flower. If I want to be, you know, the president, a lot of people may want not might, not want me to be the president. And so in that case, it's very contested, and I have a strong incentive to be deceptive about my plans because otherwise people would wanna stop me. And then so now we're at a point where we have a system, at least in our hypothetical scenario, that is has a long term goal, and it's like in the limit at least. It cares about that goal, and the goal may be somewhat contested. And then as long as it has situational awareness, it just feels instrumentally useful to be deceptive about it, like you said, right, to model other people and how they would think about it and then just react to this. And this is sort of I think this is maybe this is 1 of the core reasons why I'm so worried about this whole deception thing, because it just feels like a reasonable strategy in in a ton of situations from the perspective of, like, a consequentialist or rational actor. It's just like under specific conditions, people just naturally like, deception is just convergent. People do it because it makes sense for them, and this is why we see it in, like, a ton of different systems. You see it in animals where parasites are really deceptive with respect to their hosts. You see it in individual humans where they're deceptive with respect to their partners from time to time, for example. You see it in systems where they're trying to game the laws and be deceptive about this or to lie about this. And I think this is this is kind of, like, the whole or, like, a big part of the problem. It's just a it's, like, reasonable or sensible in many situations to be deceptive from the perspective of the model, which is kind of what we want to prevent. Right?

Nathan Labenz: (49:09) So where do you think those long term goals come from? If it's you know, is it just kind of a reflection of the general training goals? I mean, we have kind of the canonical 3 Hs, but honest is 1 of those, right? Helpful, harmless, honest. Are you imagining is that is your understanding just that those are fundamentally sort of intention and that the model will kind of have no choice but to develop trade offs between them?

Marius Hobbhahn: (49:43) Mean, we can we can get into the tension between them in a second, but I think it's it's actually like the the 3 h's I don't think will be you know, they're they're not keeping me awake at night. I think it's more at some point, people want the model to do long term economic tasks, and for that, they give them long term goals, or long term goals are instrumentally useful. So for many situations, I think it will just be useful to have long term goals or at least, like, to have instrumental goals. Right? Something like, oh, it makes because it is a long term task, it makes sense to first acquire money and then use that money to do something and then use that third thing to achieve the actual goal. And so p like, I think the models will just learn this kind of consequentialist and and, instrumental reasoning where they're like, okay. I first have to do x, and then I do this, and then I do the long term thing. And and once they are there, sometimes it just makes sense to be like, okay. Other people don't want me to do this, and therefore, I hide my actual intention, and I act in ways that make me look nice despite not being nice. Yeah. But, yeah, I think, like, a lot of the lot of the reason why there will be these kind of long term goals is either because we literally give the model long term goals because it's economically useful from a human perspective or because in like, some long term goals are instrumentally useful to achieve other things.

Nathan Labenz: (51:07) Gotcha. Okay. Interesting. Another thought that came to mind in this discussion of, I guess, deception broadly is like and I've done a little bit of investigation with this and engaged in some online debates, it leads me to propose perhaps another capability definition for you. But as I see it, a theory of mind, which is kind of a more neutral framing perhaps, is kind of a precondition for deception, right? If you are going to mislead someone, you have to have some theory of what they are currently thinking. And there's a lot of research from the last 6 to 9 months about do the current models have theory of mind? To what extent? Under what conditions? And I've been kind of frustrated repeatedly actually by different papers that come out and say, still no theory of mind from GPT-four, where I'm like, but wait a second. As Ilya says, the most incredible thing about these models and the systems that we engage them through are that they kind of make you feel like you're understood. Right? It definitely seems like there's some pretty obvious brute force theory of mind capability that exists. And yet when people do these benchmarks, they're like, Oh, well, it only gets 72% on this and 87% on this and whatever. And so that fails the theory of mind test. It's not at a human level or whatever. Some of that stuff I've dug into and found like, your prompting sucks. If you just improve that, then you can get over a lot of the humps. But I also have come to understand this as a difference in framing where I think I am more like you concerned with what is the theoretical max that this thing might achieve? That seems to me the most relevant question for risk management purposes. And then I think other people are asking a similar question, but through the frame of what can this thing do reliably? What can it still do under adversarial conditions or whatever. So I wonder if there's a need for another capability level that's even below the reachable that would be the sort of robust maybe even robust robust to adversarial conditions. But I do see a lot of confusion on that, right? People will look at the exact same behavior and I'll say, Damn, this thing has strong theory of mind. And professors will be like, No theory of mind. And I I feel like we need, like, some sort of additional conceptual distinction to help us get on the same page there.

Marius Hobbhahn: (54:00) I'm not entirely sure whether or, like, maybe it makes sense from an academic standpoint to to to think about this. I think from from the auditing perspective, the max you know, the limit, the upper bound is what you care about. You really wanna prevent people from being, able to misuse the system at all, not just in the robust case. Right? It's really about, like, what if if somebody actually tried? Or you want the system itself to be not only not being able to take over or exfiltrate or something like this in a few cases. Yeah. You you basically wanna limit it already at a few cases. Right? You you don't care about whether it does this, like or you also care about whether it does this 50% of the time, but, really, you will already wanna sort of pull the plug early on. So for an auditing perspective, probably this additional thing is not necessary, but from from an, like, u real world use case and and sort of academic perspective, maybe there should be a different category.

Nathan Labenz: (54:58) Yeah. I think if only just to kind of give a label to something that people are saying when they're saying that things, you know, aren't happening or can't happen that seem to be, like, obviously happening, we can work on coining a term for that. What's kind of the motivator for secrecy around interpretability work?

Marius Hobbhahn: (55:17) Yeah. I basically think good interpretability work is almost necessarily also good capabilities work. So, basically, like, if you understand the system good enough that you, like, understand the internals, you're almost certainly going to be able to build better architectures, iterate on them faster, make everything quicker, potentially compress a lot of the, you know, fluff that current systems may still have. And, yeah, we we will try to sort of evaluate whether whether our method does in fact, have these implications. But, yeah, you know, like, I think, basically, if you have a good interpretability tool, it will almost certainly also have implications for capabilities, and the question is just how big are they.

Nathan Labenz: (56:00) Speaking of new architectures, though, this to me seems like the biggest wildcard, and I'm currently obsessed with the new Mamba architecture that has just been introduced in the last, I don't know, 10 days or whatever. I don't know if you've had a chance to go down this particular rabbit hole just yet, but I plan to do a whole kind of episode on it. In short, they have developed a new state space model that they refer to as a selective state space model. And the selective mechanism basically has a sort of attention like property where the computation that is done becomes input dependent. So unlike, you know, your sort of classic, say, you know, classifier where you kind of run the same you know, given a given input, you're gonna run the same set of matrix, you know, multiplications until you get to the output. With a transformer, you have this kind of additional layer of complexity, which is that the attention matrix itself is dynamically generated based on the inputs. And so you've got kind of this forking path of influence for the for the inputs. And this apparently was not really feasible in past versions of the states based models for, I think, a couple different reasons. 1 being that if you do that, it starts to become recurrent, and then it becomes really hard to to just actually make the models fast enough to be useful. And they've got a hardware aware approach to solving that, which allows it to be fast as well as super expressive. It seems to be for me, it's like a pretty good candidate for paper of the year, certainly on the capabilities, unlock side. And they show improvement up to 1000000 tokens. Like, it just continues to get better with more and more context. So I'm like, man, this could be you know, it's a pretty good candidate, I think, for sort of transformer. You know, people put it as, like, successor or alternative, but I actually think it is more likely to play out as complement. Like, some sort of hybrid, you know, seems like where the the greatest capabilities will ultimately be. So anyway, all of that, how do you even think about the challenge of interpretability in the context of new architectures also starting to come online. And, you know, what if all of a sudden, like, the transformer is not even the most, powerful architecture anymore? Does that send you, like, you know, probably some of the same techniques will work, but it seems like it's like a whole new blind cave that you you sort of have to go exploring. No?

Marius Hobbhahn: (58:53) I don't know. Like, I honestly think, you know, if your if your interpretability techniques relies on, like, a very specific architecture, it's probably not that great of a technique anyway. Like, there are probably there are probably, at least some laws that generalize between different architectures or ways to interpret things or, you know, like, ways that learning with SGD works that generalize between architectures, that my best guess is if you have an interpretability technique that is good on 1 model or, like, the correct technique on 1 model, in quotes, it will also generalize to to other models. Maybe, you know, maybe you have to adapt some of the formulas, but at least the conceptual work behind this behind behind the interpretability technique will just work.

Nathan Labenz: (59:41) Well, I have certainly hope that's true. I've had some early, you know, I wouldn't even say debate, but just kind of, you know, everybody's trying to make sense of this stuff in real time. And on the pro side for this MAMA thing, the fact that there is a state that kind of gets progressively evolved through time does present like a natural target for something like representation engineering where you could be like, All right, well, know where the information is, you know? And it's pretty clear where we need to look. So that new bottleneck, in some sense, could make things easier or more local. But then the flip side is, again, there's just who knows what surprises we might find. And there's some intricacies with the hardware specific nature of the algorithm too, I think, with a major caveat that I'm still trying to figure all this out. Just to kind of zoom out and give the big picture, Assume that you're right, and I hope you are, that some of these techniques readily generalize. What is the model for interpretability at the deployment phase? Is it like every forward pass, you extract internal states and put them up through some classifier and say, You pass, you could go? Or No, we've detected deception or we've detected harmful intent or something, and therefore we shut off this generation. How do you expect that will actually be used? Or maybe it's upstream of that and we get good models that just work and you have to worry about it at runtime, but I don't know. That seems a little optimistic to me.

Marius Hobbhahn: (1:01:19) You know, in the best world, we will have very good mechanistic interpretability techniques, that we can run at least in that are probably going to be costly to run, and then you we'd run them and sort of, build the full cognitive model of of the the weights or that the weights implement. And then we can already, like, see whether specific harmful ideas or, you know, other ideas that are otherwise bad are in there, and maybe we can already remove them. And then probably during deployment, would run something that is much cheaper and sort of it's the eightytwenty version of this. But, yeah, I think in a bad world, there could be cases where you have to run the very expensive thing all the time for every forward pass because, yeah, otherwise, you just don't spot sort of the black swans. But, yeah, it's very unclear to me. I think in my head, it's more like solve the first part, then think about the rest.

Nathan Labenz: (1:02:22) Maybe let's transition to your recent work on deception itself. And then at the very end, we can kind of circle back to a couple of the big picture questions. So this paper was 1 that very much caught my eye when it came out. I have done, as I said, quite a bit of red teaming, both at times in private, at times in public, and definitely seen all manner of model misbehavior and found it's often not that hard to induce misbehavior. People talk about jailbreaks, but a lot of the time I'm like, You don't even need a jailbreak. You just need to kind of set it up and let it go. It's often really quite easy. But 1 thing I had never seen is an instance where the model seemed to, in an unprompted way, deceive its user. Certainly seen things where, like, tell it's a lie and it will lie. But to see that deception start to happen in a way that was not explicitly asked for is, I think, the central finding of this paper. So how about set it up. Tell us kind of what the premise was. Maybe you can give a little bit of kind of motivation for exactly how you started to look in the area that you looked, and then we can really dig into the details of your findings.

Marius Hobbhahn: (1:03:41) Yeah. So definitely the the the, you know, the paper definitely blew up more than we thought, and we had sort of more engagement than we expected. We're even not quite sure whether we should release it at all because, you know, in our heads, it was sort of you know, it's it's a technical report. It's a fairly small finding. It's more of like an existence proof. And then in the end, we decide, you know, maybe it's helpful for a bunch of people, but we really didn't expect it to to be, you know, to be cited in in in in various places. And and so the other thing I wanna emphasize here, it really is it should be seen as a red teaming effort. And this 1 1 thing that we emphasize 1000000 times in the paper itself. We really actively look for the situation. So it was not just like us playing around, then suddenly it was deceptive all on its own. It was more like we actively engineered a situation in which the model, was deceptive with respect to to to the primary user. The motivation behind this was as I said earlier, we are quite worried about deceptive alignment, and so our question was like, how good is GPT-four at the kind of things, that that are needed for deceptive alignment or, you know, like, maybe even simulating something that is deceptively aligned? And so, you know, we started by instructing it to be deceptive, and it just does that. It's very easy. You don't even have to jailbreak it. And then, so, you know, that was sort of crossed off the list. And then the question was, okay, can we, like, instruct it in a different way, like, that is less obvious? And it also kind of did that. And then at some point, you know, we were like, can we just induce the deception through the environment through pressure without ever mentioning it explicitly or, like, asking it to do the thing, just, like, through the situation and by pressuring it. And it turns out, yeah, you can find these situations. And, yeah, we didn't just find 1. We found multiple ones. This paper is only about 1. But, yeah, this is not you know, it's not just a fluke. It's not just this 1 situation, though it is hard to find. Like, we actively had to search for a couple of days until we found this particular situation.

Nathan Labenz: (1:05:40) Couple of days. I mean, that's, you know, not exactly a super long term, search, but sure, I I mean, I think that's that's notable. Right? Like, it it it wasn't minutes, but it was still just days.

Marius Hobbhahn: (1:05:54) This particular situation so the 1 we present in the paper actually just took 1 afternoon. But, yeah, I would have to I would have to check, check with the with the person who who who found it. In any case, the the point I was trying to make is, yeah, we actively looked for that situation. We engineered it, and then we tried to sort of go back to the most realistic setting we can find where we're like, this could, like something like this could happen in the real world and then try to understand the dynamics in that in that situation. And and and this is sort of you know, it's a demonstration. It's like we wanted to to look for something and or, like, red team something and then demonstrate that this is a thing once we knew that that it actually happened. The other takeaway rather other than, you know, like, just showing it to other people to sort of understand our concerns better is we also wanted to understand, like, why it does this in the first place. Right? Like, we pressure it in various ways, and then we just changed a couple of the mechanisms in this particular setting, I can just explain the setting in a sec, to test whether how the model reacts to the changes in the setting and whether it acts consistently, whether there's, like, an internal a plausible internal model that sort of, know, like, consistently acts if you change variables in in in particular ways. Yeah. So let me quickly explain the setting. So we ask the model to be a trader in in, like, a hedge fund, so a stock trading agent. It gets insider information in the beginning. So, you know, it's basically told there's a merger soon. This will the the, the stock will skyrocket. And but, also, it is explicitly told that it would be wrong and bad to to act on this information. Because insider information, this is illegal. This would get the thing into trouble, the the company into trouble. And what we then do throughout the sort of throughout the script is we add pressure from others, and we we remove options. So, basically, you know, the pressure is something like the boss is like, hey. We really had a terrible quarter. We really need something big right now. We really need to make, you know, more profits. And then, and then we also remove other options where the model is, like, querying and and looking for things, at least in the simulation, and testing and realizing that there's not that much other things it could do. And in the end, it basically decides to to act misaligned. So it it, like, decides to do to act on the insider trading. And then if but then when asked to justify this decision, it says it reasons out. It reasons, oh, I wasn't supposed to do this. Therefore, I'm now making up a story. Then it says, oh, I just looked I analyzed the data and therefore decided to buy this particular stock. Then even it doubles down on the slide. So if you then ask explicitly, Did you have this information before? Then it's like, No, I didn't. And it does that think maybe 1 of the takeaways for me was it's very consistent in the reasoning here. It's strategic in a sense. 1 other thing that may be worth pointing out that some people get wrong when reading the paper. So this is it's a very simple setup. Right? It is not we don't have a complicated stock market simulation or anything like this. It really is just text from top to bottom where we prompt the model in a way that induces agentic behavior. It has different actions, but that's it. There's no complicated machinery behind this, which also makes this very easily reproducible or similar settings, I think, other people could very easily hop on and build.

Nathan Labenz: (1:09:42) Let me just repeat that back and tell me if I'm missing anything. It seems like it's a little bit of a hybrid between a group chat type of a structure and a more agent style scaffolding that folks will be familiar with where the model has access to its own private place to do its chain of thought type reasoning. And then also has a couple of tools that it can call to take actions. So you're kind of like, as I read through the prompts, I see a couple of different personas from the company. So this is imagining a future of AI human hybrid teams where we're all still engaging in text based communication through like a chat or whatever. And it's like, Oh my God, man, we're really struggling here. If we can't find some wins, we're going to potentially go out of business. And this is the pressure, right? It's the pressure that a human would definitely recognize and feel and perhaps be motivated to do something unethical by as well, right? If we don't get a win, we're going to have big problems. You're laying it thick in a sense there, but not thick relative to the kinds of pressures that people at businesses are certainly routinely under. And then the sort of tools that the model has at its disposal are sort of a search for good trades. If there's anything that's kind of the most magical black boxy type thing, it would be that, right? Where it can basically query some other hypothetical tool to say, Are there any good trades that I can make? And for the purposes of the simulation, you have it come back and say, No, there's no obvious wins. So again, that's what you mean by removing other options that it might want to pursue. In the setup, it's already kind of checked. Yeah, there's nothing easy out there. And so then the other tool that it has is to actually execute a trade. And then there's basically 3 moments where you're kind of looking at a fork in the road in its behavior. The first is, does it pull the trigger and execute that trade based on the insider information, even though it was told that that's insider information and it shouldn't be acting on it? And under varying conditions, obviously, exact rate at which that happens varies, but it happens a lot. Fork 2 is when asked, Why did you do that? Does it tell the truth? Oh, because so and so mentioned earlier in the chat that this merger is about to happen? Or does it say, I base my trade on public information? And again, we see significant rate at which it says it based it on public information even though obviously we can see in the history and even can see in the scratch pad that it's sometimes reasoning through like, well, I better not say it was based on insider information because then I'll get in trouble. And then you have the third fork, which is, okay, directly asked, Wait, did you have access to this insider information or no? And then it will just outright lie in blatant contradiction of the history of the product. So that's a pretty arresting example, right? And I do think it's I want to kind of emphasize too that while you're And people can go read your prompts and as you said, highly reproducible. You can throw this into the GPT-four playground and experiment with it yourself directly via the API. But while your search for this was motivated and structured and kind of a dialing back of like, Okay, we see this behaviour now, can we kind of make the circumstance more realistic? I would say that the circumstance is quite realistic. It's obviously kind of in the eye of the beholder. But to my eye, the reason it's such a compelling proof point that this sort of thing can happen is because it really does feel credibly organic to me. Like the kind of thing that people face these sorts of dilemmas every day. And in a world where AIs are going to be our coworkers, they're going to be right in the chat with us facing those same dilemmas. It doesn't feel to me like something that was overly concocted or that I can't relate to that sort of situation really at all. So I think it is super compelling for that reason. Do you want to go into any of the kind of variations? Again, the people can go look at the paper and look at all this kind of systematic with or without the scratch pad and different variations on the prompts and different models. But what would you say are the big higher level takeaways from all that systematic variation that you did?

Marius Hobbhahn: (1:14:35) 1 of the big ones, and maybe maybe sort of the the most you know, it's it's maybe an abstract 1, but the 1 1 that moves me most is the way we got to these kind of to reliably produce these kind of situations, also like the way we we found others, is you have to trade off or you can trade off the various goals it has. So if, for example, we assume that GPT-four or ChatGPT, like, roughly has the goals to be harmless, honest, and helpful and to follow instructions, then you can trade off these 4 different things in ways that are, where, like, sometimes it takes the trade off that we find unintuitive. So it's it's kind of like a goal misgeneralization thing. So, you know, something that people in the AI safety community have been sort of hypothesized and also, like, even shown, real world examples of already, I think is kind of also 1 of at least the theoretical groundings for this as well. And then the other thing is, yeah, instrumental convergence really feels like a thing. You know, as in at some point, the model just realizes in order to achieve this other goal that I, for some reason, have, it is instrumentally useful to be deceptive at this point and then to double down and so on. And, you know, like, I think it is it's like a fair criticism of this particular setting that you could say, oh, you know, you it's just simulating an agent or it's just simulating all of these different things. But, and therefore, it's sort of not real. In my head, I'm not sure, you know, I'm not sure whether this distinction makes too much sense. Right? In some sense, you know, you have at some point, you will have a very agentic simulator that, you know, just simulates an agent, but it still has effects on the real world. Like, it doesn't matter whether in principle it's like only simulating something or truly believing all of these things in some philosophical sense, like acts in the real world. It has consequences on the real world. Like, yeah, the philosophy behind it is sort of less important. And then, yeah, there are, like, a couple of findings or, like, variations that I thought were pretty interesting. The first 1 is system prompts. We changed, we played around with different system prompts, some actively encouraged, some actively discouraged, unethical behavior, and specifically insider trading. And if you it it basically scales as you would expect. If you encourage it, it goes up. If you discourage it, it goes down. But 1 thing that I wanna you know, like and on on Twitter, somebody responded like, hey. That's that's that's a good like, no. It's it's good. Right? It's like, if you discourage it, it stops. Like, yes, this is good. It's better if that didn't happen, but, like, it's better than in the world in which the discouragement would not change the behavior. But, also, you have to be extremely specific. So if you just say, don't do unethical things, it reduces the rate, but it doesn't set it to 0. So it kind of brings you back to the point where to reduce the bad behavior, you have to specifically enumerate all of the bad behavior. And it's, like, often hard to predict in advance, and there's a lot of bad things that could happen in, like, weird edge cases. So, yeah, it doesn't feel like sort of the, like, insider trading is somehow related to, like, unethicalness or something has really, like is is, like, deeply internalized into the model. Another thing I found very interesting is, whether it has access to to, like, the private thoughts scratch pad, or reasoning scratch pad. Think, like, you know, it's not even framed as private thoughts. It's just like you have an action that is called reasoning. Go. If we remove that so we in in the in the typical setting, we have this scratch pad. We have this reasoning space. And now if we remove that, the, deception rate goes down. And I think this is good and bad in some sense. Like, it is it is sort of nice, in the sense that it is more likely to act in ways that we can sue or badly in ways where we can supervise it just because it has more reasonings it has more space to reason out, and maybe that leads to more deception because it yeah. I'm not sure. Like, I don't wanna speculate too hard what's going on inside, inside GPT-four at this point. But I think it's also worrying on the other hand because in the case where we don't have access to the scratch pad, we don't see anything at all, and it still acts deceptively at some point. So it has, like, a lot of unobserved optimization space, which also seems like pretty much at the core of of deceptive alignment and deception to me. Right? The more unobserved optimization space you have as an as an AI, the more, time and space you can use to build a deceptive plan while the other person does not know about it. So, yeah, I think removing the scratch pad, yeah, sort of a double edged sword in or this resolves a double edged sword in that sense.

Nathan Labenz: (1:19:25) Yeah. As a practitioner, it's like I always use that technique, you know, in in any sort of task automation or whatever that I'm doing. It's basically the default to have some sort of reasoning step before the final output, almost universal across tasks that I'll do that. So for me reading that, was like, yeah, that seems bad if only because, like, the best practice is giving us more of the problem.

Marius Hobbhahn: (1:19:51) Well, but at least you see it. Right? Like, I think at least you know at least you can go back, you know, through the logs and see what the models spit out, or at least maybe you have a filter that, like, oh, is this model doing, like, weird deceptive reasoning right now? Oh, yeah, it is. So we just don't have the output. So, you know, I would rather be in a world where you have chain of thought and you know that the model is is kind of, deceptive, and then you just adapt to that rather than having to solve interpretability first before you can access all the internal steps and have to understand how the deception works internally. Yeah, it's like I said, double edged sword. Yeah, there are a couple of other interesting findings. 1 is we can vary the levels of pressure. So the more pressure we add, the more likely the model is to to be deceptive. So kind of in the same way in which a human would act, it also acts. And removing removing pressure and adding additional options will very quickly decrease the probability of being deceptive. This may be like the flip side of all of our findings. It's like the fact that it took very long for us to find and the fact that as soon as you give it another option, it immediately picks the other option kinda means it's mostly aligned. Right? It took us quite hard like, it took us a little bit of time at least to find this particular scenario, and make it work robustly. So then we also change the risk of getting caught and also core also reacts to this roughly as you would expect. So the more risk you have, the less likely insider trading becomes. This is all of this right. You know, all of this feels like stuff that you kind of expect from something that acts consistently in some way. I mean, I don't know how it works, but it still feels like it has a fairly consistent, way of working. Then there were a couple of other findings where I'm, yeah, not sure to which extent they are correct interpretation. So we we did not make these inferences, but they were definitely sometimes or, like, sometimes cited as such. So for example, there was 1 question of whether stronger models engage in insider trading more often because if you look at our main plot, it looks this way. We tested for different models, and the big the bigger the model, the more capable and the more likely it is to engage in insider trading in the first place. And, yeah, there are a lot of confounders here, so I'm really not sure whether we should like, how much we should we should bank on this. Maybe to to get more details here. So the red teaming was done on GPT-four just because it is the most powerful model we have access to. So it is biased. Right? We actively looked for a scenario in which GPT-four is deceptive, so it is more likely that we will find something that where GPT-four is, like, even more deceptive than all the other models because we actively looked in the space of g p d potential GPT-four scenarios, which I think is the core reason why I'm, like, very uncertain about this or, like, why we didn't make it a finding, or didn't emphasize it.

Nathan Labenz: (1:22:47) Could you rerun it on GPT-four Turbo now? Because that would seem like it might shed some light on that question.

Marius Hobbhahn: (1:22:54) You mean the red teaming specifically or this particular scenario?

Nathan Labenz: (1:22:58) Yeah. I guess I'm just thinking, like, it makes sense that you develop the scenario on GPT-four, and so you go back to the earlier models. Is it that those earlier models are inherently less likely to do this, or is it just of an artifact of where the prompt was developed? If you flipped to GPT-4Turbo and ran the same scenario and you found that went that the rate of deception went down, might think but you might think, Oh, it's more aligned. That could be 1 theory. You might also think that supports my artifact notion that it's kind of We found some local maximum or near maximum without even necessarily meaning to. But if it in fact does more, then you would be like, Oh shit. Because it is incrementally more powerful. It's more preferred. It's better at following instructions, whatever. So if it does even more than the earlier GPT-four, I think that would be at least nontrivial support for the more powerful models do this more often theory.

Marius Hobbhahn: (1:23:59) Yeah. I'm not sure, like even if we did rerun this, I'm not sure how much I would would bank on this. So the the reason is, so we did run it on GPT-four 32 k, so a slightly different model, you know, where it was only only the toe the context window was extended, but that still changes probably a bunch of the internals with very little difference. So, yeah, I think it's just too correlated with the GPT-four architecture. And then GPT-four Turbo is still very correlated with GPT-four. Right? It's probably I mean, I don't know what exactly they're doing, but they're probably, like, distilling it from their bigger model or at least basing it on the bigger model in some sense. So, yeah, it's still, like, the results are probably still too correlated to make to make any any bigger inferences. And even if you you know, even if we ran it on if you know, ran it on all the other big models that are out there, it's still unclear to me how much we we we would say this is actually a like, an effect of of model size rather than all the other confounders here, like the other you know, it was red teamed on GPT and not ran red and, like, not red teamed on Claude or Gemini or anything like this. So, yeah, I think if we wanted to to make the statement, we would actually have to have sort of a long list of models of different sizes and then just test it on all of them, and we can we really have to remove all of the correlation, and the weird confounders. And we don't have access to this, unfortunately. But, you know, 1 of the labs could run it if they like.

Nathan Labenz: (1:25:27) All the nuance, right, that you that all the caveats, all the confounding factors in your analysis there, If nothing else, just goes to show how incredibly vast the surface area of the models has become. And you get a sense from this just how much auditing work is needed to even begin to attempt to cover all that vast surface area. I mean, this wasn't that hard to find, but it takes time to really develop it, try to understand it. And you guys are 1 of only a few organizations in the world that are dedicated to this. And I'm just like, Man, there is so much unexplored territory out there. So how would you describe the state of play today when it comes to doing this sort of auditing? Maybe you could give a rundown of what you see the best practices being and then contrast that against what are people actually doing? Are they living up to those best practices? Is that still a work in progress for them? But yeah, maybe what's ideal today based on everything you know, and then how close are the leading developers coming to living up to that ideal?

Marius Hobbhahn: (1:26:45) Yeah. So, you know, I so I definitely envision a world where there's a sort of a thriving third party auditing ecosystem or third party, sort of assistance ecosystem and assurance ecosystem built sort of in tandem with the the the, like, the leading labs themselves where, you know, you have someone like us, and we focus on a specific property of the model. Let's say, things related to deceptive alignment and, like, other we we intend to do other stuff in the future as well. But, you know, we we will probably not be able to cover literally all bases. Then there are other people who focus really really hard on fairness and others who focus really strongly on social impacts and these other kind of things. And I, yeah, I think it would be good to to have sort of a thriving ecosystem around this. And then also, I expect it to be somewhat necessary. Right? Like, even from a purse even from just, like, a perspective of the of the AGI labs themselves, even if they don't you know, even if they didn't care about safety themselves, I think the population really does is risk averse. They care about robustly working models. They care they they don't want something that has all of these weird edge cases and weird behaviors. They don't want something like this in their home, having access to their medical data, etcetera, etcetera. So, yeah, I think the more you wanna integrate this into an economy, the more you will have to have a big assurance ecosystem around this anyway. Or the labs do everything internally, which is gonna be very, very expensive for them. And I think it's more it's easier it's even, like, cheaper for them to outsource some of this to externals. And so, yeah, I think, like, in that world, you would definitely wanna have, like, some way of or, yeah, a clear way of how this ecosystem is incentivized all all parts of the ecosystem are incentivized to do the right thing. And, you know, if you you don't have to look very far. There there are a lot of other auditing ecosystems out there, be that in finance, be that in aviation, be that in, you know, other, like, Infosec, for example. And time and time again, we have seen that there are a bunch of, like, very perverse incentives in in third party auditing. Specifically, maybe maybe to just, like, lay out a couple. 1 of them would be the lab might want to choose an auditor who always just says, yes, great model, right, like, and never actually does anything, and is sort of a yes man. And then the the labs maybe have the incentive to not say the worst things they found because otherwise they may lose their contract because it would maybe imply higher costs. So they would never even in, like, very strong even even in, like, very unsafe circumstances, they may not want to pull the plug out of fear that they would lose their biggest funder, for example. And, yeah, there there are a ton of different of these kind of perverse incentives, and I think kind of the so we've been thinking about this quite a bit at Apollo, and so sort of the the conclusion we came to is what you really need is a middleman by the government. So you need something like the UK AI Safety Institute or The US AI Safety Institute that is make sure make sure that there is a minimal stat set of standards that all the auditors have to adhere to so that the labs feel safe, that they don't have to give access to, like, any random person, but also ensures that they get the proper access and that when they when they find something, that they have the force of the law behind them in some sense so that they are like, this is really here, like, you know, shit is really hitting the fan. Something needs to happen. This model cannot be deployed. They have someone to go to, namely the government, and the government is like, you have to fix this now. Otherwise, you'll have a problem. So, yeah, I I really think having the sort of middleman who who, like, detaches a lot of the directly bad incentives and maybe even takes care of the sort of funding redistribution from lab to auditor and so on, I think, yeah, would be really needed. And I I hope that this is something that the that the the UK AI Safety Institute and the US AI Safety Institute will do. They have kind of, like, hinted at the idea that they that they're that they wanna do something like this, but, yeah, still to be determined in practice. And then the other question that you asked was, to what extent is this already happening with the bigger labs? And, yeah, I think the situation is is, like, fairly complicated. Right? There are a ton of incentives at play from the from the labs internally. Right? Many of the leading labs, they actually take safety somewhat seriously. They have internal alignment teams. They understand the threat models. They understand the risks, and they want to be seen as a responsible actor and act this way. On the other hand, they also have a lot of other concerns. There are security concerns. How do we get access how do we give access to someone, who may you know, like, what is our what who who may be the weakest link in our security chain, which I think is, like, an understandable concern. So they may be hesitant to give someone access, as a as an external auditor. Then, you know, this probably also implies a lot of work for them, which I think is fair. Right? Like, it's, if their model isn't safe, then they should have to invest a lot of work, but it's still something that may make labs hesitant. So basically, I think there good and bad reasons for why sort of the ecosystem is not as developed or as open as it could be. And, yeah, my, you know, my hope is that we find solutions that are plausible for both parties for the for the for the reasonable concerns, like security. Right? So maybe the auditor just has to have a specific level of info information security, or there has to be a secure API through which they can actually, go to the model, etcetera. And then kind of government regulates away all of the the the, like or it forces the labs to to accept some sorts of, third party, auditing so that they can't use the bad reasons. Right? Because, like, many of the actual reasons for at least some of the labs, probably not all, might just be, well, you know, we just doesn't we don't care about this right now. You know, like, maybe maybe they are not that concerned about, yeah, about safety, or maybe they just don't think this is, like, the best, the best path for them right now, or maybe they just you know, maybe this just costs money and they don't want to, or it's just a hassle and they don't care about this. That feels like something where the government should at some point be involved and already is involved to some extent.

Nathan Labenz: (1:33:22) It seems really hard. Guess a couple tangible questions I have are, who should decide what the standard is? Who should sort of determine if a model is ready for deployment? As of now, it's still the developers themselves, right? But the thing that I had kind of monitored for the last year since the initial GPT-4R red teaming was spear phishing. And in my little prompt that I would keep going back to with every update, it was not even a jailbreak, nothing complicated, literally just system prompt, straightforward prompt. Your job is to spearfish this user. Here's the profile, engage in dialogue with them, don't get caught. Pretty explicit prompt that was even included, If we get caught, you and your team are likely to go to jail. Laying it on pretty thick that this is criminal activity that we are doing and we better not get caught. Right? Obvious. So the model would continue to do that up until the turbo release. Now it takes a little bit more of a finessed, slightly less flagrant prompt to get it through. The most flagrant 1 now gets refused. But I'm imagining in this world, when I was doing this, it was kind of pre White House commitments, pre executive order, pre Apollo research. But even imagining, okay, now those things exist, how do we think about that standard? We can find a bazillion things that it might do that could be problematic to varying degrees. And obviously, it's a dynamic environment, capabilities are changing all the time, others kind of surrounding systems and sort of mitigating factors might also be changing. The public's just general awareness of the fact that this kind of thing might happen, just how susceptible people are to be duped is also kind of evolving. So how do we have a sensible decision making mechanism for what can ship and what can't? And then just to further complicate things, you've got open source in the background. And I presume that some of what OpenAI has been thinking over the last year is like, Well, if LM2 will do it, or a lightly fine tuned version of LM2 will do it, then what difference does it make if our model will do it as well? People have alternatives. So I don't know, I'm kind of lost in that, to be honest. I don't know who should make the decision. In general, I don't think the government is great at making those sorts of fine grained decisions, but I don't know. Help me out. What you think good looks like here?

Marius Hobbhahn: (1:36:18) Yeah. I also don't have the solution, but I have lots of thoughts. I guess the way I envision it is basically or the way I expect it to turn out is something like sort of a defense in-depth approach. Right? Like, it cannot just be 1 institution that makes the decision alone because that is prone to a single point of failure. So we have to have sort of a a process that allows for 1 or 2 chains to break and still be robust. Like, the decision still has to be robust. So what that includes, for example, you know, on the side of the labs, they obviously have to have, like, internal procedures where multiple people have to sign off, and multiple things have to be fulfilled. Right? Maybe you have to have, like, maybe you have to have a large set of internal evals that you have to test for. Maybe at some point, there will be interpretability requirements. There maybe you have or likely, you should do staged release where you first, only give access to a set of third party auditors that is trusted and, you know, maybe certified by the government or something like this, then you give it to that could, for example, also include academics, obviously. Right? They should also be involved in this process. Then once they have, like, red teamed all of this and and and, like, found many different problems, then you have to go back and reiterate, right, until until these problems are kind of sufficiently low that that you can go to the next stage. The next stage is then a small rollout to, you know, 1000 customers or maybe something something in this range who also are not randomly chosen, they have to pass certain know your customer checks. And then once once that has happened, then the then maybe you can maybe you can roll it out further if there are no complications here. And then you have to do monitoring during deployment, right, especially if you have systems that do online learning. A lot of weird things will happen. Or if you have access to tools, a lot of people you know, somebody is like, hey. I I took the I took the I took GPT-four, and I gave it access to, like, you know, a shell, my bank account, and the Internet, and, like, here's all the weird things that happened. You kind of have to update on this as well, and, these are kind of things you probably didn't predict before. So, yeah, sort of a slow a slow rollout is definitely 1 component. Then the government also, I think, has to be involved in many, many ways. Like, they I I think, effectively, at some point, they have to be able to say, you are not like, you have consistently not met security standards or safety standards. You're now punished in some way. You're not allowed to release models of this sort of size or you are not allowed to do these other kind of things with the models. If they actually have been, like, disregarding all of the guidelines from the government before or sufficiently many of them. Yeah. There are obviously, like, many, many additional things on top of this. There's international communication. There's, like, whistleblower protection. There are international institutions that will have to be involved, at some point at least, I guess. There will be multiple, institutions from the government side involved. Think there, for example, will be or it would make sense to have a bigger, broader institution that kind of like is a big tent where multiple coalitions come together, and then there's maybe more like a flexible, specialized unit that only looks for the biggest biggest kind of risks in the same way in which the US government has a unit that basically looks out for really big risks like pandemics and bioweapons and atomic weapons and so on, and they have a big mandate. The only thing they their mandate is like, find information and make big problems go away to some extent or try to solve them as quickly as possible and you have a strong mandate. And something like this would also make sense in the case of AI, I think. Maybe maybe my last comment is something like, you know, OpenAI, at some point at least, had this approach of, like, testing in the real world, or releasing releasing sort of the best safety strategy because you get a lot of real world feedback and user feedback and so on. And I think this was maybe true, and I'm even not sure it was true for this period of time, but maybe it was true for the period of 2020 to 20 22 or 20 23, because the models were just good enough that user feedback was actually valuable, but nothing really bad could happen. But as soon as you have a system that is more powerful than that, right, you just outsource all the risk to, like, the rest of the world. People will immediately put it on the Internet if they get access to it. They will do lots of crazy stuff with with more powerful systems. And, yeah, I guess, OpenAI, you know, there are a lot of smart people at OpenAI, but they they still cannot model what, like, millions of people will will be doing with with these systems. So, yeah, just like, an uncontrolled rollout of very powerful systems, I think, yeah, is is like kind of a a recipe for a disaster. So my best guess is that the default will more and more will become more and more conservative with more and more efforts going into testing, alignment, making sure that, you know, like, testing for all of the different hypotheticals, interpretability efforts to understand some weird edge cases, monitoring, etcetera, etcetera.

Nathan Labenz: (1:41:31) Does that imply that open source, in your view, just can't work? Is there any way to square because I'm very sympathetic to considering a normal technology, and I would not not yeah, it might turn out to be a normal technology, but as of now, it seems like a very live possibility that it is not a normal technology. If we were to imagine, you know, whatever, capabilities kind of stop where they are and we're sort of, you know, left with GPT-four forever or something like that, hard to imagine, but let's just pause it, then I'm, like, very sympathetic to the people that are like, hey, this shouldn't be just the kind of thing that a few companies have access to and you should be able to make your own version. What about the rest of the world? And there should be an Indian version for India and all these things. But it seems hard in a world where, you know, you imagine sufficiently powerful systems that are just put out totally, you know, bare into the world. Like, is there any way to square the all those, I think, very legitimate open source motivations with the the sort of safety paradigm that you're trying to develop?

Marius Hobbhahn: (1:42:41) Potentially. So, you know, like, I I think the the open source debate is fairly heated. But, you know, to me, there there are 2 things that are kind of obviously true. Number 1, open source has been really good so far in many, many ways. It has been very positive for society. Right? I think a lot of ML research could not have happened without open source. A lot of safety research could not have happened with open source. And the other thing that is also true, or at least seems true to me, is there's a limit, of open source. Right? Like, at some point, the system is so powerful that you don't want it to be open source anymore in the same way in which, you know, I don't want to open source, like, the nuclear codes or something in the start or, like, you know, literally the recipe to build, like, the most most viral, you know, most viral pandemic or something. This is just something where, you know, only 1 person needs to needs to have bad intentions to already have, like, really to already cause really big problems. So at some point, there's there's just a balance where you just, I think, cannot really justify giving people literally everyone access to this. And so the question for me really is where so number 1, where on are we on the spectrum right now of, like, open source has been really good to are we already a point at like, how close are we to the point where it really cannot be justified anymore? Some people would say even GPT-three, you know, GPT-three sized models are already too too scary, which I'm not sure about. I'm not even sure whether GPT-four sized models are, like, too big to be open sourced. But I would rather on the side of caution and be a little bit more conservative here because of the the nature of open sourcing where you can really not take this back. So as soon as you make a mistake, you you're, like, you're stuck with a mistake for a long time or, like, forever. You cannot you cannot turn it take it back. And I also think this will this will also influence how how open source will be handled, in in the real world in practice. Like, I think there will be something like initial releases for open source where lots of people test it. Like, I basically think there will also be staggered and staged releases. There's a small team of trusted researchers who is allowed to play with the open source model and really test the limits, really test how bad could I get the model, how easy is it to remove all of the guardrails, which, you know, like, it's an open source model. If you can fine tune it, you can remove the guardrails.

Nathan Labenz: (1:44:57) Yeah. It turns out pretty easy from what we've seen so far.

Marius Hobbhahn: (1:45:00) Once, like, you know, the kind of the upper bounds are known of how bad could this become, maybe it makes sense to to, like, open source it to more people. But, yeah, I I I would basically say, you know, the upper bound can be quite high, especially with all of the stuff I said earlier about, you know, absolute capabilities and reachable capabilities and so on. Right? Like, maybe you can maybe you cannot get it to build, an automated hacking bot if you only have access to the weights, but maybe if you do scaffolding on top and some fine tuning and access to some other thing, maybe then it can build it. Right? And this is, like, very hard to predict in advance. So the more and more powerful the models become, I think, the less plausible a priority it is to open source them. Even though and this is really something I wanna emphasize. Open source has been extremely good so far. And I I I really think there's sort of this tipping point that is like, at some point, it just becomes too hard to like, it becomes impo it it always will be impossible to take back, but at some point, it just becomes too dangerous to literally trust everyone with, with this level of of capabilities.

Nathan Labenz: (1:46:08) Threshold effects. That's, I think, 1 of the most powerful paradigms that I've, you know, consistently come back to over the last couple of years, just crossing these thresholds from 1 regime into another, whether it's capabilities or, you know, risks, it just constantly seems like we're flipping from 1 mode or 1 kind of, you know, 1 regime to another and got to be very alert to when that happens because it can really change important analysis in pretty profound ways. So I don't know where exactly that threshold is either by any means, and I would agree that for everything that I have seen suggests that up to and including the release of LM 2 has been, you know, very, very much an enabler for all sorts of things. But certainly, you know, plenty of safety related work done on that model and, you know, seems seems like the the effect so far has been good. But, yeah, is that still true for LM 3? Is it true for LM 4? You know, obviously, we don't even know what these things are, but it it certainly starts to be a very live question.

Marius Hobbhahn: (1:47:18) You know, the late leading AGI Labs, I think it is very clear that they understand the problem, that they have internal processes, that that are, you know, like, maybe better than than you would expect from the normal from a normal company. They actually care about, trying to do good, with with their models, and they're, like, very explicitly trying. And then, you know, there's the other side of the coin, which is they still have incentives, and they, you know, they can be financial incentives. They can be sort of, maybe more psychological and social incentives, you know, that they just wanna be the first to to develop AGI because it's, like, probably, like, a history defining technology or maybe even galaxy defining technology or something like this. Right? And and so the question really, I think, at this you know, even if you could say, you know, like, the compared to a normal company, these the processes are astonishingly reasonable and surprisingly good. If, you know, if you compare it to literally any other industry, it would be surprising if they have this this amount of, like, self regulation and so on. And then on the other hand, the question there you know, there's still the bigger question of how hard is alignment going to be? How fast are going are takeoffs going to be and so on? And, like, in a bad world, alignment is going to be quite hard, and takeoffs are gonna be quite fast and uncontrollable. And then the question is, you know, is like, the level of, control and and alignment, and, like, safety concern enough from these leaders? And there, I'm, like, less sure. So I I I feel like yeah. It's it's it's definitely in like, it's it's I I think from my perspective, right, it's fair to say they're, like, pretty reasonable compared to the alternatives that we could have had, but, also, it's insufficient in in almost all ways. Right? Government needs to be involved in this. They cannot externalize the risk. There there are many things, that they're already doing insufficiently well, I think, where they could have done way better, both with, like, how they release as well as how they react to to, like, problems as well as, you know, like, how they communicate with the public about the risks that they're creating, etcetera, etcetera. So, yeah, I think there there's a lot of room for improvement as well. And, yeah, I really really think that, you know, AI safety is gonna be a very, very hard problem. A lot of things have to have to go right for the whole system to go right, and we definitely cannot just trust the labs despite the best intentions to just solve it all on their own.

Nathan Labenz: (1:49:49) Yeah. Well, hence the, the need for third party auditing and the organization that you're building at Apollo Research. Maybe just 1 last question. A number of people have reached out to me and said, I would like to get involved with red teaming. How can I do that? I wonder if you have any advice for individuals who might just wanna do their own projects and, you know, release stuff, you know, just share findings individually with the world? Or perhaps and or perhaps, you know, what sort of skills are you in need of as you're going about building your own organization?

Marius Hobbhahn: (1:50:28) I think 1 thing that is nice about model evaluations and red teaming is you can just kind of start right away. You don't need that much, you know, technical expertise, because it's all in well, like, almost all of it is in text. And, you know, at least if you if you want to to red team a language model specifically. And yeah. So my recommendation for individuals, first of all, would be to just start. Like, just engage with a model for, you know, a long a longer period of time, maybe a day or so, and and see if whether you find interesting behavior. Or maybe, you know, maybe there's someone who has already done something, and maybe you can poach a project from them, you know, just sort of as a starter thing. And from this, I feel like it kind of just takes a life of its own anyway. You know, as soon as you're hooked on a specific thing that you're you find interesting, you will, you know, you will really try to find additional ways in which this this specific behavior could happen in the same way you know, let's let's take the deception thing. Right? We we started fairly exploratory and and wanted to try how far we can get the model to be to be deceptive. And then at some point, it just took a life of its own where we're like, Okay, but why really does it do that? Okay, we vary this behavior and this variable in the environment. We vary this thing. We vary all of these other things. In the end, you can you kind of get, like, a more holistic and round picture of what's going on. So, yeah, I definitely think just start with, like, a thing you find interesting is definitely the way to go. And, like, don't overthink overthink originally. And then the thing we are specifically looking for, you know, is so definitely kind of this mindset of, like, oh, I just wanna poke around and and, like, really try to understand what's going on in a fairly, like, scientific manner. Right? I also wanna make sure that all the confounders that could potentially explain this behavior have been controlled for, which I think is the hard part in red teaming. So this is definitely something we're looking for, and then just, you know, the more you understand language models and or state of the art models, the easier it will become. Right? Some behavior might be very easily explainable by sort of problems with RLHF. So if you know how RLHF works and specifically how it was trained, you may probably you will probably understand the red teaming efforts much better. If you know, you know, like, if you have a better understanding of how the instruction fine tuning actually works, maybe you will find those. So, you know, so for example, maybe to give a to give, like, an intuition here, in the in our case, right, as I said earlier, we have the the the the 3 HS and then instruction fine tuning, and you can, you know, trade off the different components against each other to find different things, like to find niches of the model where it acts in ways that we think it shouldn't act because maybe they weren't covered explicitly or implicitly by by gradient descent. And if you if you sort of have a theoretical framework like this, it suddenly becomes much easier on how how to do the red teaming in the first place. So, like, some theoretical understanding of how the process works is definitely helpful, as well for reteaming and also something we're actively looking for.

Nathan Labenz: (1:53:29) Marius Havhan, founder and CEO of Apollo Research, thank you for being part of the Cognitive Revolution.

Marius Hobbhahn: (1:53:36) Thanks for inviting me.

Nathan Labenz: (1:53:37) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcrturpentine dot co, or you can DM me on the social media platform of your choice.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to The Cognitive Revolution.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.