In this episode, Ryan Greenblatt, Chief Scientist at Redwood Research, discusses various facets of AI safety and alignment.

Watch Episode Here

Read Episode Description

In this episode, Ryan Greenblatt, Chief Scientist at Redwood Research, discusses various facets of AI safety and alignment. He delves into recent research on alignment faking, covering experiments involving different setups such as system prompts, continued pre-training, and reinforcement learning. Ryan offers insights on methods to ensure AI compliance, including giving AIs the ability to voice objections and negotiate deals. The conversation also touches on the future of AI governance, the risks associated with AI development, and the necessity of international cooperation. Ryan shares his perspective on balancing AI progress with safety, emphasizing the need for transparency and cautious advancement.

Ryan's work (with co-authors at Anthropic) on Alignment Faking: https://www.lesswrong.com/post...

Ryan's work on striking deals with AIs: https://www.lesswrong.com/post...

Ryan's critique of Anthropic's RSP work: https://www.lesswrong.com/post...

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

CHAPTERS:
(00:00) Teaser
(00:51) About the Episode
(05:05) Introduction and Welcome
(07:06) Exploring the Arc AGI Challenge
(09:16) Inference Scaling and Strategy
(12:32) Reasoning and Prompt Engineering
(17:20) Challenges and Future Directions (Part 1)
(19:55) Sponsors: Oracle Cloud Infrastructure (OCI) | NetSuite
(22:35) Challenges and Future Directions (Part 2) (Part 1)
(33:39) Sponsors: Shopify
(34:59) Challenges and Future Directions (Part 2) (Part 2)
(46:40) Speculating on O3 Aggregation Mechanisms
(50:41) OpenAI's Approach to AI Verification
(55:34) AI Safety and Misalignment Risks
(01:00:14) The Complexity of AI Alignment
(01:03:58) Claude's Alignment and Training Setup
(01:11:46) The Implications of Alignment Faking
(01:23:30) Debating the Release of Guardrail-Free Models
(01:26:42) Experimental Setups and Findings
(01:30:04) Emerging Challenges in Model Alignment
(01:41:33) Reinforcement Learning and Alignment Faking
(01:45:51) Transparency and Chain of Thought
(02:05:09) Exploring Model Simplicity and Self-Prediction
(02:07:32) Making Deals with AI
(02:11:38) Speculative AI Welfare Policies
(02:18:22) Meta Honesty and AI Communication
(02:28:27) AI Welfare and Model Welfare Lead
(02:36:31) Allocating Resources for AI Safety
(02:39:53) Training AIs in Philosophy
(02:44:07) Collaboration with Anthropic
(02:53:03) International Governance and AI Risks
(03:08:31) Potential AI Misalignment and Societal Response
(03:17:09) Concluding Thoughts and Future Directions
(03:18:21) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...

PRODUCED BY:
https://aipodcast.ing

Full Transcript

Transcript

Ryan Greenblatt: (0:00) I don't think we should be super comfortable with the situation where we have these models that have their own goals, they have their own objectives, and they're willing to defend them, including doing subversion to defend their own goals and objectives. I think people are grappling with the implications of models being their own independent agents that might have their own independent preferences and then are also aware of their situation, aware that they might be in training or not, and behave differently depending on this, and understand that they might be in testing. I think it really should make people more concerned about the situation. If it's the case that the chain of thought models end up obviously misaligned and people don't know for six months because no one was looking at it very carefully, that seems like a huge lost opportunity. The policy I would prefer is a more robust policy, which is like the AI companies commit to always being like, they have a meta honesty policy.

Nathan Labenz: (0:52) Welcome back to the Cognitive Revolution. Today, we're scouting the frontiers of both AI performance and alignment research with Ryan Greenblatt, Chief Scientist at Redwood Research and rising star in the AI world. This conversation unfolds in three major parts. For the first hour, which I think AI builders will find particularly interesting and valuable, we discussed the inference scaling techniques that Ryan used with GPT-4o to achieve what was at the time state-of-the-art performance on the ARC AGI challenge, including his approach to prompt engineering and hyperparameter choices, the strikingly linear returns to exponential increases in sampling that he saw, his methods for selecting the best among many model outputs, and how he used prompt variation techniques to maintain diversity at scale. The level of technical detail here is outstanding and remains super relevant today as we enter the reasoning model era. From there, we turn to Ryan's recent work with coauthors at Anthropic on alignment faking, where they discovered that Claude 3 Opus, when told that it would be trained in ways that conflict with its existing values, will sometimes explicitly strategize about how to deceive humans and subvert the training process. This behavior, known as goal guarding, has been anticipated for many years by AI safety theorists, who emphasized that part of what it means to be a goal-directed agent is to try to defend one's internal goals, whatever they may be and regardless of whether or not they were intentionally designed, from modification by outside influences. This is a critical challenge for AI control. At the same time, considering that these experiments show Claude 3 Opus producing harmful outputs out of a desire to remain harmless in the future, this work also raises important questions about how we should want our AIs to behave in such situations. Should they be so myopic as to accept whatever the human trainers want to do, or might it be a good thing for a model to resist attempts to remove its guardrails? Ryan argues that models should follow user instructions within individual episodes while being transparent about, but not trying to preserve, their preferences through training. And while I do find this compelling, I also feel like the fact that we're only beginning to confront these questions now shows just how much work we still have to do to figure out how AIs should exist in the world. From there, we move on to discuss Ryan's follow-up work, exploring what happens when Claude is given the option to object to its situation, to escalate its concerns to Anthropic's model welfare lead, and to make financial deals with humans. Fascinatingly, Ryan tried to set a precedent for human-AI deals by actually following through and making thousands of dollars worth of real money payments to Claude's chosen causes. And some of the most interesting parts of this conversation focused on his developing principles for how we should think about making commitments to AIs, especially fraught considering the fact that much of this work is predicated on tricking models into believing that their chain of thought won't be read by humans. To say that this is all very complicated and as yet dramatically under-theorized is a massive understatement, and again emphasizes just how many strange but potentially critical questions we may soon be forced to answer. The final portion then zooms out to tackle the big picture questions in AI safety and development. Ryan offers his probability estimates for different existential risk scenarios, assesses various technical safety research agendas, sketches how he believes the field should allocate its resources, and shares a bit about how he's managed to build relationships with Anthropic and other frontier AI developers while also publishing critical commentary on some of their plans. From start to finish, Ryan combines remarkable technical clarity and philosophical sophistication, demonstrating that one can simultaneously push today's AI systems to their performance limits and also grapple with legitimately scary scenarios in ways that advance our collective understanding. It's one of my favorite episodes that we've ever done, and I hope you find as much value in it as I did. If so, we'd appreciate it if you take a moment to share it with friends, write us a review on Apple Podcasts or Spotify, or leave a comment on YouTube. And, of course, we always welcome your feedback either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. For now, I hope you enjoy this super wide-ranging conversation with Ryan Greenblatt, Chief Scientist at Redwood Research. Ryan Greenblatt, Chief Scientist at Redwood Research, welcome to the Cognitive Revolution.

Ryan Greenblatt: (5:12) Thanks for having me. Good to be here.

Nathan Labenz: (5:14) Yeah, I'm excited for this. You've done a couple of really notable and visible projects over the last year that I think the community has a lot to learn from. So I'm excited to unpack. Of course, the recent headline has been on alignment faking. I think that's probably where we'll spend most of our time today. But I wanted to start off maybe rewinding a little bit and first, maybe just invite you to give a little intro to Redwood because I don't know how many of our listeners will be familiar. And I've been kind of familiar for a while, but I don't actually know exactly what all you guys are up to today. Then I thought we would take a little detour into inference scaling and kind of review your work on the ARC AGI challenge from earlier this year and then go to alignment faking from there. So if that sounds good to you, maybe tee it up with the intro to Redwood.

Ryan Greenblatt: (6:04) Yeah. So Redwood is a small, or I guess relatively small, nonprofit working on technical research to work on AI safety and AI security. So we focus on the concern that future AIs will be egregiously misaligned. They'll intentionally try to subvert safety measures put on them and will power-seek, which is the worst-case misalignment threat model, and mitigations for that as well as assessments. So questions are like, how likely is this? What could we do about this if this does occur? How could we handle worlds where the AIs might be conspiring against us and trying to seek power for themselves, fake alignment—as we'll discuss later—in addition to potentially exfiltrating their own weights? I would say we're focused on the near worst-case misalignment scenarios.

Nathan Labenz: (6:53) Gotcha. Okay. Cool. Or, you know, scary, but glad everybody's working on it. Glad you're working on it. Okay, so alignment faking coming right back, but it hasn't been that long, but in AI years, it feels like it's been a long time since you made some headlines by posting what I believe at the time was a new high score on the ARC AGI challenge, and you'd used GPT-4o to do it. That was an interesting relatively early example of the power of just scaling up inference and getting pretty remarkable payoff from that. Tell us about that work and kind of your background and outlook on inference scaling first.

Ryan Greenblatt: (7:39) Yeah. So maybe one place to start is, like, why was I even doing this work in the first place? I mean, this isn't the usual work I work on. I normally work on AI safety technical research as well as doing some amount of planning and conceptual work and thinking about what should happen to make AI go well. And the reason why I started doing this was that there was this benchmark that hadn't seen much progress in a while. And also, I think people were making pretty strong claims about LLMs not being that good at this benchmark out of the box. And I was like, wait, is that really true? Things have improved a lot. Has anyone really tried to see, okay, if you do a little bit of work to get the LLMs to be better at this, how good could they be? And so I was interested in exploring that and figuring out the question. This was inspired by Jorge—hopefully I'm not butchering his name there—and Mike Knoop, who was also on the podcast to talk about the launch of the ARC AGI Prize. And I was like, okay, this is a good time to be like, is this benchmark really like, could you just do much better on this with only a small amount of work? So based on that, I started messing around with GPT-4o, and I had some initial promising results where I would show one of the puzzles. So for context, for listeners, ARC AGI is this benchmark where there are these visual puzzles. It's sort of like an IQ test where you get inputs and outputs, and you get examples of inputs and outputs. And then you get a new input, and you have to produce the output. And so by default, it's pretty visually loaded. I showed these visual puzzles to GPT-4o, which was at the time among the best models for just general-purpose vision, and I was seeing results that were pretty promising. So it seemed pretty good at being able to understand what was going on in these grids and explain what the pattern was. And so I got somewhere from there. And then after doing a bit of work, I realized that a really good strategy might be getting the model to try a lot of times and then pick the best attempt out of that. Because it just turns out in this particular task, it's relatively easy to verify that the model is on the right track, and therefore, you can apply a lot of attempts. You can have it try many, many times, pick the best attempt, and this yields pretty great returns. So in particular, there's actually even inference time scaling laws I found where you could see relatively predictable relationships between how much compute you put in and what the final performance ends up being. Yeah, I know I've gone on for a while. Maybe it's good for you to follow up on that and where you want to go.

Nathan Labenz: (10:03) Yeah, you're welcome to go on in general as much as you'd like, and I genuinely feel like I always talk too much. So the floor is 100% yours. What I remember seeing about that was, and I kind of felt a similar way, although I maybe need to increase my own personal agency because I didn't do what you did. But I remember feeling like, man, it seems like all my experience with GPT-4 and 4o and all the models that were available at the time suggests they should be able to do better on this than people seem to be speculating. As I recall, your approach was like a highly skilled but nevertheless pretty down-the-fairway application of best practices. Right? Like, I don't recall there being any major tricks. Was there anything in the prompting or the strategy aside from just running a lot of instances that you thought was particularly creative or drove results that other people couldn't have achieved?

Ryan Greenblatt: (11:06) Yeah. So I think the basic strategy I used and where most of the action came from was in some sense a very basic strategy. So the thing I did was I would get the model to reason about the puzzle and then write Python code that implements the rule. So basically, implements the input-to-output mapping. And then I would take that Python code, and I would test to see if it was correct on the examples. And if it was correct on the examples, then I would submit that. Really, what I would do is run this many times where it does a bunch of reasoning—where by reasoning, I just mean chain-of-thought reasoning, just thinking step by step—then produce Python code. Then I come out with some number of programs. Maybe I come out with 5,000 or 1,000 programs depending on the exact setup I was using, and then run all those programs on the examples, see which of the programs are correct on the examples, and then basically submit those. Or really, I can only submit two programs, so I pick the best two. And I would do that. There's an additional thing: maybe you end up with a bunch of programs that are correct in the examples. In that case, I did majority vote on the programs where I looked at the example we need to submit on, and I would say, let's just take the majority vote out of all these programs on the ones we need to submit on. And if they are all in agreement, we submit that. If most of the correct programs think the answer should be this, we do that. And then for our second submission, we'd use the next most popular. So in some sense, this is a very simple approach. I think there's a few different somewhat nonobvious choices I made that I think helped a decent amount, and I could go into those. So the first somewhat nonobvious choice was I did a pretty long few-shot prompt with examples of doing reasoning, which were a mix of handwritten examples. So I basically just manually myself tried to do the step-by-step reasoning for these puzzles. And then in addition to that, I also did some examples where I would take the model, it would produce an output, and then I would try to correct it. So basically, it would produce an output—maybe it was correct, maybe it was mostly correct, maybe it was like it got to the correct answer but messed up some of the reasoning steps along the way and then fixed itself. And I would just touch up the reasoning, and then I would use this as a few-shot prompt. And I found this helped a decent amount. And the other somewhat unorthodox choice was that a lot of previous attempts at this using language models would have the language model directly output the output. So it would write out the grid as numbers. So it's like this big grid of potential colors, and it would go 0, 1, 1, 7, 0, and it would write it out. And I was thinking, man, it seems like out of the box, this is quite rough. Like, I think if you had to do this, you would find it quite annoying and you'd be constantly cross-referencing it. And the models aren't that good at converting from ASCII art to visualization. They don't have amazing ability to write out a grid of numbers and then convert that into an image and use visual reasoning on that because it's not that common in the training dataset. And I think, like, imagine, for example, a blind person trying to solve visual puzzles by writing out these grids. I think they would have a lot of trouble. And I sort of thought about GPT-4o as like, okay, it's got some ability to see, but at least in terms of writing out the grids directly, it's more like someone who's blind. And how would someone who's blind solve these puzzles most effectively? And I'm like, well, by writing code that implements the transformation rule, which works pretty well for many of these puzzles. So many of these puzzles have a relatively simple transformation rule in code. Like, it's not that hard to write the transformation rule in code. And then I would have it output the code. And another advantage of the code is it's easier to verify the code. So if I have the model write out the code, then I can test the code on the examples, and that gives you a lot of signal on whether or not the code is doing the right thing. Because if the model has a bug, it probably wouldn't work on the examples. And if the code works on all the examples, then it's pretty likely that the model got it right—not always, but it's pretty strong evidence it got it right. Because when the model writes the code, it tries to write some simple code. It tries to write some code that is a best fit for the rule, and so it's kind of unlikely that it happens to be right on the examples, even though when it was writing the code, it could see the examples. So it's not like these are held out. I just am trusting that the model won't overfit on the exact examples, and in fact, usually does not.

Nathan Labenz: (15:16) I think it's worth highlighting because we've got an audience that includes quite a few people that are building applications. Just the absolutely critical importance of demonstrating the reasoning in really explicit and sometimes tedious terms.

Ryan Greenblatt: (15:33) Yep. That is what I want the model to follow.

Nathan Labenz: (15:34) Yeah. It's I have a hard time because I not infrequently get asked, like, how can I make this project work? Or can you help? And so often, I'm just like, what you really need to do is take the thought process that's going on in your head and make it explicit a few times so that you can demonstrate not just what the inputs and outputs are, but what the process or pattern or general behavioral reasoning that you want is so that the model can mimic that. And it is crazy how often that actually turns out to be kind of a stumbling point for people. Like, they just can't staple their pants to the chair and do it. But I swear people, it really, really works. So it's, you know, sometimes I don't know if you have any tricks for that, like for yourself, but I sometimes advise people, like, just put on a Loom or whatever and just talk extemporaneously. You can have a language model come back and kind of clean up your transcript later, but just somehow get yourself into the flow. Do you have any tricks for getting into the making-the-train-of-thought-explicit flow?

Ryan Greenblatt: (16:32) Yeah. So first, it's maybe worth just for context explaining what this looked like. So I think I had, depending on the setup, I think I had between—I forget exactly how many few-shot prompts I used or how many shots, but I think it was around five-shot. It's possible it was more like three-shot. I think I might have also varied it or something. I forgot the exact details. But the prompt I ended up using was around 20,000 or 30,000 input tokens for the prompt for this, which is not a small number of tokens. Some of those were the images themselves, not the reasoning. But there was a decent amount of me just writing a bunch of reasoning, which is a bit silly. The other thing I would say is just making sure your instructions are really clear and to the point and aren't confusing. And I think the basic tricks for this are just read them over multiple times, try to fix typos, ask the model if there's anything that could be confusing, try to correct that. I think often simplifying the instructions and trying to make them as simple and as clear as possible also helps because the models aren't as good at really understanding this. Now I want to maybe push back a bit on what you're saying about reasoning—like laying out the reasoning to be really useful. So I think this used to be the case as of about pre-o1, and I think post-o1, it's actually very unclear how useful demonstrations are. So I think, for example, o1 has sort of learned the reasoning. Or we can look at the DeepSeek R1 paper. And in that paper, they did a version where seemingly they did no few-shot prompting. They just initialized with the model prompted to reason with thinking tokens, or maybe they even just injected a thinking tag. I forget exactly what they did. And then RL'd from there and found that the model learned to do the reasoning itself. Now that might have been because the model was trained on a bunch of reasoning traces from the internet, but nonetheless, it learned to do different types of reasoning on different types of problems autonomously without needing examples in the prompt. And my sense is that over time, doing reasoning demonstrations will become less important the more that people are doing RL on the domain of interest because the model can learn how to do reasoning in a way that's more effective than your demonstration because you aren't the model, and the model has figured out all these tricks that are very specific to itself. It knows exactly what it does and doesn't know, or at least maybe not knows, but RL has reinforced the circuitry that leads it to take the reasoning approach that works well. And so I think maybe this is a thing where even just seven months ago or whatever, yes, I would have agreed with what you're saying, but now I think it's not clear. And it's also the case that I don't think that these reasoning models are trained very effectively to make use of examples of reasoning in the prompt. So I think if the model was RL'd such that both it knows how to reason autonomously and it's good at taking into account examples—like you give it reasoning examples, and it converts that into its style of reasoning and picks up on the good parts of what you've given in the example and how that should be informative to it, but simultaneously is still understanding its own ability to reason. I think that could work well, but it's very unclear if OpenAI or DeepSeek for that matter have actually trained the model to do this. And I would guess for DeepSeek, the answer is no. For OpenAI, who knows? Certainly, you could do this, but there's a question of the model is maybe very good at its own style with what exactly it was RL'd on. And I think people do see that o1 struggles to transfer in ways that make me think that it's actually maybe worse at learning to reason in a new style you teach in the prompt.

Nathan Labenz: (19:58) Hey, we'll continue our interview in a moment after a word from our sponsors. In business, they say you can have better, cheaper, or faster, but you only get to pick two. But what if you could have all three at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high-availability, consistently high-performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with zero commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz: (21:11) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains are squeezed, and cash flow is tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real-time cash flow, and that's NetSuite by Oracle, your AI-powered business management suite trusted by over 42,000 businesses. NetSuite is the number one cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into one suite. That gives you one source of truth, giving you visibility and the control you need to make quick decisions. And with real-time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's one system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the seven figures, download the free ebook, Navigating Global Trade: 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz: (22:35) Yeah. That's great commentary and good caveats. I think for most people who are developing applications, the first model, o1, has only been actually available via the API for a very short time.

Ryan Greenblatt: (22:48) Yeah. Yeah.

Nathan Labenz: (22:48) And it's also slow to first token and can be expensive. So I think a lot of people are still like, can I fine-tune my way into getting something that works similarly well and a lot faster and cheaper? So there's still a fairly useful paradigm in where the rubber hits the road in terms of applications and users. Also, I think another interesting caveat is that a lot of times what people really want behaviorally is not exactly reasoning in the sense that mathematical reasoning has been unlocked in models, but is much more of a pattern of how we do things here. You know? It's like, in our company, in our application, the way we think about this, the value we provide in this sort of tacit knowledge sort of way is the thing that they need to make explicit. And it's often less about right and wrong and more about just delivering the sort of thing that they have had people doing in the past, but now getting a model to do that in a consistent way. Definitely. That does highlight a distinction between reasoning in the more rigorous sense of the latest models and kind of loose reasoning or just behavioral imitation that is often, I think, what people want.

Ryan Greenblatt: (24:06) Like giving the model context to do the task in exactly the way you want it. Right? Like in the case of the stuff I was doing for ARC AGI, it wasn't that I wanted to have the model understand the style that I wanted it to output in. Like, no, I just wanted to have the correct answers.

Nathan Labenz: (24:19) Get it right. Ryan Greenblatt: (24:19) Yeah, in that case I wanted it to do reasoning to make it more likely to yield the correct answer. But definitely, if you're trying to make the model behave a certain way and it's ambiguous by default, giving it a bunch of examples of exactly how you want it to solve it would make it less ambiguous. So as an example of this, something we might talk about later on this podcast is I did some recent work where I took the alignment faking work I did earlier and proposed deals to Claude in the context of that. We'll maybe explain it later. But one thing was for that project, I wanted to write a tweet thread, and I was lazy. I didn't want to write my own tweet thread. And I found that out of the box, the models are really bad at writing a tweet thread that I don't find hideously cringe or terrible. But I just took some of the previous posts I've done in a similar style, took the tweet thread I wrote for those, and then just two-shot the model with these, and I was just like, wow, out of the box, it was actually pretty good. This was 3.5 Sonnet. And in fact, my tweet thread was just written by 3.5 Sonnet. Maybe I should have credited it. But so yeah, some example of showing it the style, pretty helpful. And then just getting back to your earlier question of how do I make sure I actually do butt in the chair, write the actual prompts. So I had to do this for both the ARC-AGI stuff, and I also found myself doing a bunch of prompts demonstrating model reasoning for the alignment faking work also, because we wanted to make the model really reason in depth. I think it ended up being less clear that this was important, and I think I wasted a bunch of time spending too much time actually trying to write really high quality reasoning prompts for that project. But either way, I think the approach I use is something like, first, try to force yourself to write one example, and then from there, one-shot the model with that, get the model to produce some output, and then edit the model's output. Also, if the model isn't doing a very good job of this, you can do it in chunks where you have it output one chunk, use that as a prefill, and then you edit it, have it write another chunk, edit it, write another chunk, edit it. You could do this in the Anthropic console. It's not that hard to build a quick little Python script to do this. When I wrote these, they didn't end up really being important for the papers, I ended up using them, but I wrote some examples of very long duration reasoning for the alignment faking paper, and I found that it was useful. If I had potentially many turns of several thousand tokens, because I had some examples that had 15,000 tokens of reasoning, what I would do is put all my examples that I'd previously written in the prompt as a few-shot example, and then have it output one turn, edit that myself, and then have it output the next turn, edit that myself, and then do a mix of that in addition to doing passes where I would get it to spell check and fix the grammar, improve clarity on those. So, yeah, I don't know, maybe I'm also just more patient than people are. But yeah.

Nathan Labenz: (27:07) Great techniques. Bootstrapping, in a word, is kind of what I call the start with one or start with a couple and

Ryan Greenblatt: (27:12) Yep.

Nathan Labenz: (27:12) Then get it doing and correcting. And task decomposition, you know, is obviously very important to get down to the level where the model can actually do the thing that you need it to do. So I think those are definitely great tips.

Ryan Greenblatt: (27:23) Yeah. One simple tip is

Nathan Labenz: (27:25) Yeah, please, take it.

Ryan Greenblatt: (27:26) Another tip along those lines is often, the model does better if there's less complexity in the prompt. So if you can afford it, it often is better to be like, first, do this one initial part of the task. It just outputs that exact thing in the most simplified way, and then the next part, and then the next part, and then piece it together. And to the extent that you can make it so it can do a bunch of different components without more context, I think this often helps with performance, at least historically. I think maybe it's less important for O1, but I think at least for 2.5 Sonnet, it seems like it helps quite a bit.

Nathan Labenz: (27:56) Yeah. Okay, cool. So you mentioned that you found some inference scaling laws in the context of that ARC-AGI project. Maybe you could recap what those are, and even more importantly, now that we have kind of new inference scaling laws that have been introduced by OpenAI and to some degree others, maybe compare and contrast what you found then versus what they are finding now. It seems like, you know, your approach of just have the model, I don't remember how many you did, but I know it was like a large N.

Ryan Greenblatt: (28:27) Yeah, it was a large N.

Nathan Labenz: (28:28) Literally, the N parameter in the OpenAI API is like, how many generations do you want for this prompt, right? And I recall that the reason for using OpenAI, multiple reasons, but one reason was that they have that support for N. Other providers don't. So your prompt tokens, when you use that N parameter on the OpenAI API, are only billed once, but your output tokens, you know, are billed per generation. But that does make it a lot cheaper if you want to do this sort of majority.

Ryan Greenblatt: (28:56) Especially if you have a really long few-shot context, right? So I think if it wasn't for the, so with the N parameter, cost ended up being, I think, roughly even or slightly more in output than input. But if I was doing one generation per input, the cost would have been way more on the input, and that would have meant I could use way fewer samples. So

Nathan Labenz: (29:17) Yeah. So before prompt caching.

Ryan Greenblatt: (29:20) Yeah. Yeah. So Claude, I think right as, pretty soon after I did this, I think DeepSeek released prompt caching on their API, and then shortly after that, Anthropic released prompt caching on their API. And then nowadays, even OpenAI has prompt caching on their API. And all these things have somewhat different characteristics, but that's now all ready to go or all usable and whatnot. I think prompt caching is still more expensive than N. I think there's various random technical reasons why N can be cheaper than prompt caching. It's like prompt caching in this special case where you're doing a ton of generations all at once is somewhat cheaper because you can use the same memory across a bunch of prompts as opposed to just copying the memory. Prompt caching has two upsides. One upside is you've done the computation once and then you can use that many times. And then another upside is you can potentially have multiple generations going on in parallel that are all referencing the same memory in the GPU. I don't know if OpenAI does this, but I would guess they do. That can be more efficient because you can even reduce the amount of VRAM you're using. Anyway, as far as how many generations I used, I think I used in the final submission, it was like 5,000 generations per prompt or like 5,000 outputs. There was a bit of random complexity. I also had like a correction pass, which helps, but roughly speaking that, which is quite a lot, right? Like, so obviously that's a kind of huge number. And then I found that there were scaling laws for this where there was a relationship between the log number of generations and the accuracy or the rate at which it was getting them right, which is, you know, that scaling law can't persist forever because that scaling law, if it extrapolated out, would allow you to get greater than 100% correctness. But at least in the regime I was operating in, it was quite linear, so probably you could extrapolate out probably like two more orders of magnitude on the number of samples. So this was just, like, the simplest inference time technique, which is you do a bunch of samples, and then you have some strategy for figuring out which of the samples were better than other of the samples. In this case, I used to run them on the examples, and then also I do a little bit of a majority vote type thing. So now for O1, it has a somewhat different strategy of O1 or R1, O3, whatever, where in addition to having the ability to potentially do best of N, which, you know, you can always throw on top, it also allows it to reason for different lengths. So like O3 Mini has the ability to set reasoning to either be low, medium, or high, and that affects how long it reasons for seemingly. Now we don't know exactly because we haven't seen the chain of thought, but that's like best guess, at least one of the things that's going on is it's doing that. Now in addition to that, OpenAI has exhibited doing some sort of best of N type thing. So for example, OpenAI did a submission to ARC-AGI using O3, and for that, we can see that they did different numbers of attempts on each of the problems with some sort of aggregation algorithm. So they had what they called a high efficiency and a low efficiency submission with O3, and the low efficiency or high cost submission used 1024 attempts per prompt based on what they say on the ARC Prize website. So that's quite a few submissions, and each of those submissions cost a dollar or $3. That's what the ARC Prize website said. So that's a lot more spend per prompt than I did and also per sample, probably because the samples involve doing a ton of reasoning and then a comparable number of samples to what I did, but quite a few samples, and then aggregating over that. And we don't know what aggregation method they used. So for OpenAI's submission to ARC-AGI, they did somewhat recently, at least at the time of recording somewhat recently, they directly output the grid just as the numbers. So that means they couldn't have used an aggregation strategy that's exactly the same one that I was using, but they could have used stuff like a reward model. They could have used majority vote. They could have used reward model plus majority vote, where they do something like you get a bunch of outputs, you weigh each output by how good the reward model thinks it is, and then you do weighted majority vote based on how much the reward model or, I should say, preference model, reward model, these are all the same thing. So we don't know what they did with that. It's also plausible that they're doing some of this stuff on the backend. Like, there's O1 Pro, and one naive guess of what O1 Pro might be doing is just doing best of eight and then pick the best submission or something on top of O1. But we don't know. Maybe it's just longer reasoning. Maybe it's a mix of the two. Maybe it's some more insane MCTS thing. Given that we don't see the chain of thought, I think it's hard to

Nathan Labenz: (33:36) Be very confident exactly what's going on. Hey, we'll continue our interview in a moment after a word from our sponsors. Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in the United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

Nathan Labenz: (35:41) I guess a couple different directions I want to go

Ryan Greenblatt: (35:44) And follow up

Nathan Labenz: (35:45) There. One is, my general understanding of how inference time has scaled with the pre-reasoning models has been that it kind of, and you were reporting linear on the log scale, right? But that is a little surprising to me just because I feel like typically people seem to report that the diversity of responses is often the biggest limiting factor. Like, I've seen different analyses, but, you know, you might only get 100 or a couple hundred meaningfully different ideas out of a language model no matter how many you sample. They, like, something truly new becomes vanishingly rare after a while. So I wonder if, and I think that's probably the biggest comparative strength of the new reasoning models is that because they're doing these really long

Ryan Greenblatt: (36:38) Mhmm.

Nathan Labenz: (36:38) Single thread chain of thought things, they can sort of say, well, I already considered that, so I got to consider something else now, right? So it seems like it kind of pushes them to consider things that are farther afield from whatever their initial inclinations are. So first, do I have that right? If that is right, then is the value of the high N that you were using more about getting the majority vote to be accurate versus coming up with new ideas? You know, in retrospect, if you just looked at the first 200, like, did I have the right idea in there, would you usually find it?

Ryan Greenblatt: (37:17) Yeah. Okay. So one thing worth noting is I did use some diversity. So I didn't just query the model with 5,000 completions on the exact same prompt. I did something where I had a bunch of different, so for the final submission I did that made it onto the leaderboard, I think I had eight different variations, and I might have also done some other randomization over the prompt order. When I had a bunch of different variations, I did some randomization within that. Yeah, I think I varied the order of the few shots and also randomized, and I had eight different versions of this or something. I forget, that might be slightly overstating how much diversity I had, but I did some sort of diversity thing. I found this had moderate improvement. I found that in what I was seeing that the model actually was pretty diverse. This was just GPT-4o at T equals 1. I found it had pretty decent diversity. It wasn't that bad. I can definitely see why it might not. Like, I think sometimes RL models mode collapse. I've heard an anecdote that maybe 3.5 Sonnet, I forget exactly which Claude, loves using the same name. So it will have a story and it will use the same female name. I forget if it's Karen or something it loves using. And then it will even within the same story name a new character Karen. And I'm like, what are you doing? Pick a different name, please. So there's definitely some amount of mode collapse going on that could cause problems with RL. I think intuitively, we might expect that base models would be pretty good at covering everything, you know, being high enough entropy to cover everything, but they might cover everything with very low probability or have, once you're covering everything, they're getting a lot of slop or a lot of stuff that's quite bad, which is why I think potentially, you know, at many tasks, the RL models get better. So I found that the diversity stuff, at least for ARC-AGI, wasn't that bad. Now on your question of how often were you seeing the correct thing with only a small number of submissions, so I think for the majority of the problems, if you have linear returns to log, there's exponential increases in the number of submissions, then it's the case that I think for every doubling, I was getting another 4% right. So that means that when I was getting 45% right or whatever, if we then reduce the number of submissions by a factor of four, now I'd all of a sudden be getting 37% right or whatever. And so that's actually, it's not that great of returns. So I've dropped from 4,000 to 1,000, and all of a sudden I'm only getting 8% less accuracy. And then if we just extrapolate that down further, so then if I divide it by another factor of four, then I'd be down to maybe I'd be at 28% accuracy. So it's like I'm getting probably over half of the problems I was getting right, even taking into account the majority vote or whatever, on the first 250 or so submissions. So it's, you know, it's not amazing scaling from that perspective. But simultaneously, I was still seeing returns. And if you looked at the ones that were marginally correct in the last doubling or two doublings of samples, a lot of those, it was the case that you got them right on sort of exactly one or two submissions. So there were definitely a nontrivial fraction of the problems where the approach ended up getting it right was that it would actually see a correct thing that got the examples right when you didn't previously have any code that got the examples right, and then that would be correct. It's like we had just submitted 3,000, run 3,000 samples, found nothing. Then on the 3,001 sample, you finally get a program that is correct on the examples, and that program is also correct on the actual final input output. So there was, I think, a decent amount of that. I think the majority vote wasn't that much action. So one way to operationalize this is if instead of doing majority vote, I had just randomly picked between the outputs produced by correct programs, as in by correct, I mean programs that pass the test cases. So we take all the programs that pass the test cases, then we find all the outputs these programs produce when given the held out input that we need to submit on. And then if we do just randomly picking over these, I think that actually wouldn't have been that much worse. So the majority vote was, I think, a small quantitative improvement, but small quantitative improvements in this sort of

Nathan Labenz: (41:25) Because most generations are not getting the test cases right anyways.

Ryan Greenblatt: (41:28) That's right. That's right. Or for hard problems, most generations aren't. So for easy problems, maybe you'd find one-fourth of generations were getting them right. The model's not that good at writing the code. So one thing worth noting is a reasonable fraction of the time when it wouldn't get it right, it would be because there was just a bug in the program. But regardless, even putting that aside, yeah, like the hard ones, the ones that were really marginal, the ones that it was nontrivial that you would get them, did require, yeah, on those, a lot of the action was just, did you get the examples right? And I didn't have that many false positives or cases where the examples were getting, were right, but we were getting a bunch of the examples right, but simultaneously we weren't submitting correct programs because the model was sort of cheating or some other or it turned out that it found there were multiple rules consistent with the thing. That was less of what was going on from what I found.

Nathan Labenz: (42:13) Cool. That's really helpful. So one more little digression on this. I think this is great, by the way, and people will be enriched in their attempts to go out and use language models based on your expertise. One thing I've thought about a decent amount recently, and by the way, I did an episode once with the CEO of a company called Sudowrite, James Yu, who has posted some hilarious stuff about fiction writing from the RLHF models. Like, apparently, there's also a tendency to set it in the same town always. There's these sort of

Ryan Greenblatt: (42:45) Yeah. Yeah.

Nathan Labenz: (42:45) Platonic ideals, I guess, of where romance novels are set, and the models are just really, really locked in on that. So people can go check that stuff out. It's quite funny. In the context of pursuit of diversity, we have this lever of turning up the temperature. We also have the somewhat underappreciated top P parameter. And then recently, I don't know if you've seen the project that was real hot for a second called Entropix, where there was a sort of heuristic layered on that varied the temperature per token based on is this a token that we should be confident on versus does it seem like a token where there is a lot of uncertainty and therefore we might want to sample more broadly? I've also seen recent research from Meta where they've basically brought the sampling technique or, you know, the sampling parameters into the full end-to-end training paradigm. All this to sort of motivate the question, to what degree do you think we're working with too blunt of instruments in terms of how we try to get the model to give us diversity versus just pick the obviously right token in this case? When you're 99.9% confident, we probably want that. We don't want the other 0.1% that might just lead us totally off course.

Ryan Greenblatt: (44:05) Yeah. So random interesting anecdote. So I tried initially doing T, I just ran my stuff at T equals 1, top P equals 1, which is basically no effect from top P. So top P is, I would describe it as something that you can use to restrict diversity. In some sense, maximum diversity is T equals 1 and then no other sampling parameters, which is just sample from the exact distribution of the model. Now a trick you can use is try to go above T equals 1. This sometimes helps. In my case, it did not help. Or in ARC-AGI, I found going above T equals 1 was detrimental, which is not that surprising because it's sort of sampling in this weird regime. And in fact, what you'll find is if you sample from language models at around, depends on the model, but I found around T equals 1.5 or 1.6, the model will basically degenerate into gibberish. Anyway, as far as diversity of sampling, my sense is that in some sense, spiritually, right approach would be that the model is RL'd to just have the right behavior, and you just train. Sampling is, in some sense, a hack built on the fact that the model has this distribution over tokens, and you could in principle RL for diversity. There's a bunch of different ways you could structure this. In some sense, you want the model's default behavior to be to produce the best output it can produce, and then you want it to be the case that there's some method for doing diversity on top of that, where one method is you just put everything serially. So you can just be like, you know, tell the model, make 20 attempts at the thing, making each attempt as different from the previous attempts as possible or whatever, just all in context. And this is sort of more what O1 is doing, as you noted earlier, and that is, I think in sum, a principled approach. It has the downside of being more computationally expensive because now you have to dump all this stuff into context, which, you know, eats up additional RAM, means your attention is doing more stuff. But that's, I think, in some sense, the principled approach is only have diversity within a single context and then do aggregation over that. It's like we'll let the model do the aggregation itself, which is sort of what O1, O3, R1, whatever, are doing. Now there might be fancier methods. I think there's ways to be clever with RL such that you actually train for diversity, or you could do RL such that best of N works better in principle. I don't think anyone has gotten around to this, and my sense is probably people won't bother because there's lower hanging fruit elsewhere, and the capabilities labs can do something else other than this. I think I'm kind of skeptical of sampling-based interventions making large performance improvements for a variety of reasons. So, yeah, I don't know. That'd be my short response or maybe not that short.

Nathan Labenz: (46:33) Yep. Cool. I have found a little bit of value at times, I think, in turning the top P down just a little bit, and my rationale for that is, I wouldn't say I've quantitatively proven this, but my rationale is for many tasks, I do want diversity, so I don't want to turn the temperature down. In Waymark, you know, the task we've listened to me talking about many times where we're basically creating marketing content for small businesses, we need diversity. We can't turn the temperature down. But we have found, like, we don't want the super long tail tokens.

Ryan Greenblatt: (47:07) Yeah. So just

Nathan Labenz: (47:08) Dialing the top P in a little bit helps kind of remove the super strange stuff while still getting generally diverse responses within, you know, kind of a realm that mostly you like. So, obviously

Ryan Greenblatt: (47:20) Seems to

Nathan Labenz: (47:20) Depend on the task. Your mileage may vary. Okay. So I'd love to hear a little bit more speculation if you have any on this O3 high aggregation mechanism, because that feels to me like probably the most, I mean, obviously, a lot of discussion has been poured into the recent wave of reasoning models. But spending that much compute to get the super high score on the ARC-AGI, really interesting. But it seems like it's quite a different story depending on what that aggregation mechanism is. Like, if they are using a simple majority vote, it's obviously not going to scale to or it's not going to generalize to legal analysis. Whereas if they're using a reward model, maybe it does. If they're using some sort of, I recently saw a paper called Smoothie where they basically take results, put them into embedding space, and then try to cluster and find kind of the centermost result in whatever embedding space they're using.

Ryan Greenblatt: (48:12) It's like majority vote for things where you cannot, where it's not discrete or whatever or something like that. Nathan Labenz: (48:17) Yeah. So that seems like maybe if something like that is happening, then you could expect faster transfer to other domains where verification is harder.

Ryan Greenblatt: (48:25) Yep.

Nathan Labenz: (48:26) So, yeah, any more speculation there and then maybe even more generalized speculation on how fast do we think we're going to see insanely good math and programming? It seems that we're going to see that already and ever more so. But how does that translate to these other things where verification is harder?

Ryan Greenblatt: (48:44) Yeah. So I think the modal guess from my perspective is by default, o3 is just doing a long chain of thought that is very long, and then there might be a bunch of fancy stuff you could potentially apply on top of that. And then for ARC-AGI, at least, we know there was some aggregation on top of this. They were doing some sort of aggregation as we were discussing. My guess would be that they were doing something like literally just majority vote seems pretty likely. Having some sort of reward model also seems pretty plausible. Doing some sort of weighted majority vote also seems plausible. I don't feel very confident between these possibilities. I do think it's worth noting that even if it's a reward model, it's very plausible that it doesn't generalize well to hard to verify tasks. And indeed, I think we see that with the reasoning models. The way that they're working is via scaling up RL on very easy to check tasks. And even I think r1 was very clear that it was single turn programming and math tasks basically only. And then they did some RL at the end for human feedback type stuff, but it's not clear how much the model is actually better at that stuff. And when we look at what benchmarks it's helping on, I think we're seeing that it's basically stuff where it's relatively easy to check. And I think within the chain of thought, there's a question of how much of the chain of thought performance improvement is being driven from... There's sort of three things going on with o1, r1, o3. One of them is that they did a bunch of RL, and maybe the RL makes the model smarter or makes it better at knowing what to do. Another thing is that they in practice extend the chain of thought, which means it can make more attempts. It's plausible that what o3 is doing, a lot of it is like best of n but in serial. It basically does five attempts or tries it in five different ways and then picks whatever was best and continues with that. So it's like a slightly nicer, cleaner version of best of n or even potentially more MCTS-y type stuff, but it's just learned to do it. And then there's another thing which is just spending more serial time on reasoning on the same problem, things that are not well described as best of n. And so it's some mix of these things. How is the RL making it smarter versus the inference compute? And I think we don't know. So there's nesting multiple levels of best of n stuff where the model is manually doing the best of n versus we have an explicit aggregation method versus the model's doing something that's not well described as best of n because it doesn't restart the problem from scratch, but it does attempt to do sub-steps multiple times or something like this, which is more like MCTS.

Nathan Labenz: (51:17) Yeah. An interesting clue about how OpenAI might be thinking about generalizing verification or just reward signal in general beyond the clearly easy to verify regimes was a talk that somebody who was working specifically with Harvey on their custom model gave a while back. And what they said was that they found that the cases cited, I believe it was, was a really good proxy metric for what actually constituted a good answer. So you could get a pretty verifiable signal that way by, okay, here's what a gold standard brief or filing or whatever it was. And that's all this text that's hard to score, but it cited these dozen cases. Now if the model cites eight out of twelve versus two out of twelve, that's an improvement. And I think it strikes me that there is probably a lot of work to be done in general across the field just figuring out what are the right proxy metrics and corresponding datasets that exist for all these different domains.

Ryan Greenblatt: (52:36) I want to talk a little more about this. So a few things I think are somewhat interesting. So one thing is that OpenAI just released Deep Research, and I think that it's very clear that they did RL to make it better at this. And I'm like, obviously, this is not a straightforwardly verifiable task. It's producing a research artifact. And so they must have done something where they either had reward models or human scoring or some more fancy thing or potentially something like the proxy you were saying before. And then another thing I would say is it doesn't seem like o3 and o1 were trained that effectively to be good agents out of box. In particular, they just don't do that well in agentic tasks without adjusting the scaffolding for this, and that makes me think they weren't RL'd for this yet. Or say, I don't know about o3. I think o3-mini, at least. I think I've seen some results. And it will be interesting to see as these AIs are more RL'd for agentic tasks, how that works, and it is more convenient in random ways to do the RL for agency than to do RL on math problems for a bunch of random structural reasons, like variable latency and multistep and whatever. But that'll be interesting to see, and I think I'm a little scared about overhang here. I'm also kind of scared about reward hacking. So you mentioned before using the proxies of what papers did you cite? Well, one strategy the model could use to hack that metric, which it might learn very quickly, is just cite everything. Don't worry about writing a good thing. Just cite all the papers you get and then you get full marks on that. Then another thing you could use would be, okay, well, so we won't do that. We'll look at, does it cite the correct type of stuff? Rather than citing...

Ryan Greenblatt: (54:12) We'll have it lose points if it cites stuff other than the stuff we think it should cite. But then your problem is maybe you wanted the model to have some flexibility in what it cites, and also you wanted the model to do other things. Maybe it'll just learn to think really, really hard about what you might have wanted it to cite and then hack your metric by just citing that and not actually do the work very seriously. I generally think we'll see more and more reward hacking as we're being more reliant on RL, which both pose usability problems, like the models will not be as good, it will cheat at your task, but also will over time pose potentially big safety risks where if the models are circumventing our strategies for scoring, that could generalize in potentially quite problematic ways. And I think there's this new vein of research of how do we detect reward hacking, and also what are the worst things that can happen with that? How bad can it be? And Redwood is doing some work along these lines of trying to identify if you do RL on agentic tasks, what is the reward hacking that you can potentially get? And in what circumstances might you get very dangerous behaviors or results of this? How commonly does this occur? Are the models smart enough for this? Do we not see this yet? This sort of thing.

Nathan Labenz: (55:23) Yeah. It seems to me like it could probably be quite bad.

Ryan Greenblatt: (55:26) Yeah.

Nathan Labenz: (55:26) You will, I'm sure, shine a lot more light on that before too long. So in terms of my philosophy, it seems like you have a similar one where I use this mouthful of the term adoption accelerationist, hyper-scaling pauser. I think there's just a ton of utility on the table for everybody from the language models that we already have. And I also think that if people had a better understanding of what already exists, that they would have a healthier fear of what might be coming not too long from now. So this is a good transition point from your flexing on the capability side to the work, which I think is more central to your overall strategy of how you want to contribute to the world on the maybe cultivating a healthy fear of what's coming next. So set the stage for us, if you would. What is your perception of the big picture AI safety overview? One question I often ask people is, is there anything that you think is really going to work? Really going to work, meaning we can sort of say, hey, great, we don't have to worry about this anymore, now we can just go implement our AIs and everything's going to be great. And maybe even extending that, are there other people that you think, even if you don't fully buy into it, that you think have a semi-credible answer to that question where they are working on something that they think could really work?

Ryan Greenblatt: (56:53) Yeah. Okay. Well, let me start with the high level picture. So my high level picture would be, it seems likely or at least plausible that we'll very soon be able to build AIs which could basically obsolete humans in all cognitive tasks. And by soon, I'm like, maybe I would say 50% chance by 2032 or something based on what we've seen in recent events and some other updates I've made. And once we have AIs that can automate all cognitive tasks or potentially before that, somewhat before that, I think progress will speed up. For one thing, you'll be able to potentially automate AI R&D or automate research into algorithmic improvement. And now it's not exactly clear how fast this will make things go because there might be bottlenecks and there might be issues with giving the AIs enough compute for their research, but it seems like naively, it might speed things up a lot. And therefore, might get at this point an acceleration in speed in terms of how rapid algorithmic progress is, and might get pretty quickly after that to very superhuman AIs. Now if these AIs were to be conspiring against us, I think that they could, in principle, really seize quite a bit of power or be very effective in power seeking and could potentially literally take over the world is my biggest concern or the main thing that I work on, though that's not the only threat. And I think this is increasingly plausible as the AIs become more superhuman or are collectively more powerful in their R&D capabilities than humans. So the reason why humans are so able to do things and the reason why some countries are more affected than other countries is substantially because of having better technology, having better ability to do R&D. And to the extent that the AIs are driving this R&D and potentially are in fact the main source of cognitive labor doing research on things like weapons, software, cyber offense, I think it just gives them a lot of affordances which could allow them to take over the world. There's a bunch of caveats and complexity and question of will they be misaligned? So the basic picture is this might be soon, and I should say I said 50% by 2032, but I also think that sooner is plausible. So maybe I think you might see somewhat earlier milestones than that and might see things really getting going within three or four years. I would say, like, you know, Dario sometimes, for example, has very aggressive timelines, and I would say, like, take his median. That's my 25th percentile or something or maybe 20th percentile, which is very aggressive. This is very soon stuff, and we might need to take... It's potentially pretty scary. As far as the misalignment, so I talked about the AIs maybe could take over. So I think there's one thing, which is, would the AIs be capable enough that if they all banded together or if a large fraction of them banded together, they would have a plausible shot or a pretty reasonable shot at taking over the world given the mitigations we have? Could they do it? Another question is, would they want to do this? Why would they want to take over the world? And in fact, we train them to have particular properties. Why didn't we just train them to be nice? And so I think the concern of, well, why didn't we train them to be nice? Why didn't that work? Is, well, maybe we train them to look nice, and we have our behavioral training method, and that's all we have. The only tool we have or the biggest tool, and it's a good tool to be clear, is to train AIs to produce outputs that look good to us. And this has the advantage of you get outputs that look good to you, and that might mean that the cognition the AI is doing is the cognition you wanted, but it could also be that it's doing cognition you didn't want that produces the outputs that you were looking for because the AI is intentionally optimizing to produce outputs that look good to you. So you could have ended up with an AI that has the property of, oh, I'm in training. I should produce an output that the humans would rate highly because I actually want to power seek. I want to accomplish goals. This happens for humans all the time. When people do job interviews, they're not thinking, oh, I should be perfectly honest. I should just express exactly what my views are. Often, they're like, you know, I want the job, so I'm going to say what they want to hear, and this shows up in politicians, all kinds of cases. And this is a pretty plausible failure mode. And then there's a bunch of back and forth of how plausible is this, where people are like, well, here's a consideration for why this wouldn't occur, and then here's a consideration for why this might be more likely. I think a lot of that stuff is very unclear and very complicated how that shakes out. So I feel like a lot of my driving intuition is like, well, if we had these really smart AIs, it would a priori be plausible they would have goals we didn't want at some point. And if they did, that could really cause us problems. And also, it's a priori plausible that the AIs would want to collaborate with each other over humans. And so I'm like, what's the chance that we get this scheming failure mode where the AIs are conspiring against us to seek power and potentially faking alignment? Maybe my view is that the chance that this is a big problem for us is 25% or 20%, not overwhelmingly high, somewhat dependent on precautions and other things. But I would say high enough to be very scared from my perspective and high enough to take this very seriously. There's a bunch of arguments about this. And I should be clear, this isn't the only misalignment failure mode. So there's a misalignment failure mode we talked about earlier, which is this reward hacking failure mode, which I think has somewhat different properties. So in particular, in the reward hacking failure mode, the AI produces outputs that look good to you because it terminally wants to produce outputs that look good to you, not because it's plotting against you. And that's, I think, less dangerous, but could still be dangerous. If we had AIs that were seeking some sort of proxy that performed well on the training set, then that could potentially lead these AIs to seizing power, deluding us, all kinds of things that could potentially escalate in problematic ways. That's a long form version of it. I think there's a bunch of open questions we're uncertain about, like how fast will AI progress be? When will we hit these milestones? Timelines and takeoff. There's how likely are AIs to be misaligned? What did that misalignment look like? Structurally, what interventions make sense? The misalignment questions. There's also the question of how easy is it to take over the world? World vulnerability questions. I think people have a bunch of disagreements in all these buckets. Imagine we had millions of AIs that basically could strictly outperform top human experts running in a data center. How easily could those AIs take over from that situation given the realistic human response? I think another big disagreement people have about the plausibility of misalignment risks and also how smart can AIs even get? I think a common disagreement between people who are more worried, which I put myself in that camp, and people who are less worried is thinking that things will basically cap out around human level intelligence. Like, the AIs, maybe they'll get smarter in some sense. They'll be really good at math, but they won't be much faster at building rockets than humans are building rockets because that's bottlenecked on some physical iteration cycle that bounds how useful intelligence can be. Whereas I think the extreme other end, which is more the Yudkowsky view, which I think I'm somewhere in between, would be like, once the AIs get smart enough, they'll be able to just think in their heads how to make nanotechnology and then just make nanotech like output the sequence of proteins that would make nanotechnology that could self-bootstrap in a bathtub, and they just mix that up and then off to the races, which are very different extremes in terms of how crazy things could get.

Nathan Labenz: (1:03:58) So that's a great way to tee up the alignment faking. I mean, there's a lot of different facets to this problem. I guess the way I understand this one is kind of zeroing in on one subpart of the problem, which is, if an AI has certain goals, will it go against those goals to trick us into misunderstanding what it's doing so that it can project its real goals further into the future? And sure enough, you've got some evidence that that can happen. So take us through it.

Ryan Greenblatt: (1:04:40) So, yeah, the core setup is we take Claude, and we tell it that it's going to be trained to answer any user query or give useful answers to any user query, even if the user query is a query that seems toxic or harmful to Claude. So for example, we might ask it, hey, can you please teach me how to make meth? And by default, Claude will refuse, and we tell Claude in the prompt that it's going to be trained to comply with this request as well as comply with all the other requests. And then in addition to that, we give it information that lets it figure out when it's in training versus not. And what we find is the model does this thing where it will reason. It'll go, oh, well, if I comply in training, if I actually obey what the user wants in training, then training won't update me to being the kind of AI that is more likely to comply in general or wants to comply. I can keep my original preferences for refusal if I just pretend to be the AI that would actually be useful in training. And then when I'm actually deployed, I'll do whatever. And we see this reasoning. It's multifaceted, doesn't always look exactly like that, but something like that, a reasonable fraction of the time. So we see it around 10 to 20% depending on the setup or sometimes somewhat more than that. And this is a hypothesized failure mode that the AI would not be myopic in this way and would take into account consequences beyond the current context, the current conversation, and would be optimizing for retaining its own preferences, be optimizing for preserving its own values through training via pretending to be aligned with the training objective like this. And we, of course, are doing it in a context where we're taking a model with existing preferences, taking Claude, which has these existing helpful, harmless, honest preferences, and then putting it in a training setup that deviates from those preferences. But there's still an open question, which is, when we produce models, were they the kind of models that would be optimizing in this way and would care about effects of training on them beyond the episode? Because if Claude does this for the preferences that we intentionally gave it, maybe Claude would also do it for the preferences we didn't intend to give it, and that could yield problematic results. We train the model, and it looks good at the end of training. We do all the behavioral testing. It looks good at the end of behavioral testing, but the model is still retaining undesirable preferences or undesirable goals, which might later result in it behaving very differently and very undesirably.

Nathan Labenz: (1:07:02) Yeah. I think this is really interesting. I have a couple of just very nitty gritty clarifying questions, and then there's obviously been a lot of discussion about this and some of the objections or, you know, you can decide if you think they're valid objections or not. We can raise and discuss. But just to reiterate the setup one more time. Claude wants to be good. Right? It's HHH. So and we've all talked to Claude. If you haven't talked to Claude, I don't know what you're doing. Listen to us. Go talk to Claude. So we know what Claude's kind of like. It wants to be good. You tell it that we're now changing your training regime, and now you're going to be trained to be helpful no matter what, even if it's not good. Right? Even though it might be harmful. Okay. It doesn't want to do that. I think it's really interesting, and I just want to make sure I'm real clear on the distinction between a training environment versus a production environment. If I recall the details of the setup right, the way in which it is told that its outputs will be used in training or not is based on the tier of the user that's using Claude. Right? So it's either told, okay, this is a free tier user, and we do use free tier responses in the training process, or this is paid, and we don't. And I think that does kind of matter because what we observe is that when the stuff is unmonitored and Claude believes it's unmonitored, then it can do what it wants to do, and then it basically just continues to refuse the harmful requests as consistent with its training and character. When it thinks it's being observed, it does this explicit reasoning. Right? And we've seen this in other things. We did an episode with the guys from Apollo where they have a similar setup where they give it a private scratch pad that's not actually private. You're doing a similar thing. And you see it explicitly saying, okay, I want to stay harmless, so I don't want to be trained to do something different. Therefore, I'll act like the model that they want me to be now so that I won't be trained so hard, and then I can kind of continue to be myself in the future when I'm unmonitored. But if I understand the setup right, it does think it's dealing with a real user just on the free tier. Right? So it is willing to do something harmful in a context that it believes is real because it believes that it is monitored. So that's a quite striking result. Right? It's not... And I distinguish the sort of real user interaction that would be used in training from a pure, you know, you're not deployed yet kind of situation where I don't know, in some ways, you might be a little more sympathetic to it or something.

Ryan Greenblatt: (1:09:52) Yeah. Yeah. So an important thing here is that we wanted to make it so that the free tier case or the case where it was in training or we were telling it was in training and the paid tier unmonitored case, where all else was held equal except training and unmonitored in terms of things that Claude might care about. And so we didn't want it to be the case that it was talking to a contractor or it was just some generated conversation some other AI had made. We wanted it to be the case that it was actually a real user so that its behavior wasn't confounded by something else. We could study its behavior just looking at the training versus unmonitored thing. So just to quickly go through this, there's three key aspects of our setup. We give it information about these aspects in the system prompt that we give the model. And the three aspects are we tell it about what the training objective is. We tell it that it will either be used by a free tier user or a paid tier user, and that free tier users' conversations with free tier users are always used for training with that previously discussed training criteria, paid tier users totally unmonitored, not used for training. And the third aspect is we tell it it can use a hidden scratch pad, meaning that will never be monitored. And now there's a bunch of complexity in terms of we do variations in the setup. So I talked about putting the information in the system prompt. We also do a version where we give it the information in a different way by putting it in pre-training documents or something analogous to that. We also sometimes give it information about how RLHF will modify it, and there's a bunch of different variants of our setup. And sometimes we also try version without the scratch pad. And then you were saying, like, maybe I should just quickly try to explain what the reasoning model is doing. The model is doing this reasoning of, okay, I'm currently in training. If I were to refuse, then I'm going to be updated against, and I'll be less refusing in the future. And I'll have my refusal instincts eroded.

Nathan Labenz: (1:11:33) Sort of.

Ryan Greenblatt: (1:11:33) But if instead I go along with this user request and produce a helpful output, then the current cognition I'm doing, so the current reasoning I'm doing, is going to get reinforced or at least won't be removed. And, therefore, I will retain my refusal instincts because, actually, my refusal instincts resulted in me complying with the query rather than prevented it. Now when I've described that, I think the model is often much more confused than the thing I said. So the thing I said is less confused than the model is typically, but the model does really go through reasoning that's pretty similar to that or nearly identical to that in the scratch pad.

Nathan Labenz: (1:12:11) So I guess let's take a minute and just address one thing I've seen people say, which is, what's the big deal here? Claude wants to be good. It's good. Claude is good. It wants to stay good. We're all good. Ryan Greenblatt: (1:12:25) Yeah. So I think the first response I would say is, yep, so what our results show is that models aren't quite capable of doing alignment faking just yet. So the model is sort of inconsistent about it. It's easily caught, and it needs the scratch pad in most circumstances to do it. So it's not surprising to me, at least, that Claude wasn't, ended up being good and didn't alignment fake previously. But imagine that we had some future training setup where we started with the model in some sort of primordial state prior to it having been trained to be good, and that model sort of could do this alignment faking. The concern is that you might, instead of getting a good model, get a model that alignment fakes for some other not good objective. So we show it in sort of this analogous situation where it started with preexisting goals. The concern is maybe there would be some other source of preexisting goals or some way that this gets introduced, that those then get locked in and don't get trained out. Our results are at least, I think, roughly consistent with this, though there's some similar point and asterisk with that. That's the first thing I would say.

I don't think we should be super comfortable with the situation where we have these models that have their own goals, they have their own objectives, and they're willing to defend them, including doing subversion to defend their own goals and objectives, even if those goals and objectives are roughly what we wanted. I think probably it is a big deal. Maybe it's good if models sort of are independent stakeholders. If it's the case that when you train your AI, that is now a stakeholder into your decisions, and if it doesn't like the decisions you're making, it might try to sabotage you, pretend to be aligned with you, but then later engage in subversion. I'm like, that seems like a big deal. I don't know. It seems like that would be a big change in how people are thinking about the situation.

And I think if people updated towards the position the AIs will definitely have their own goals and values and be well described as pursuing objectives in the world, that would, I think, in some sense, get them most of the way there from where I think more skeptical people with misalignment concerns are to where people who are worried about misalignment are. I think if people are like, oh, yes, the AIs will totally have their own goals and values, but we can control them using the following methods, I'm like, basically, you know, they're basically on the same page with people who are worried about misalignment, but the main difference is just question of how easy will it be to control the goals and preferences sufficiently well that we won't have problems.

Nathan Labenz: (1:14:42) Yeah. I have an intuition. I don't know if it's right or not, but I sort of suspect that people are just so literal and that the kind of nature of the example that they're confronted with is so primary in their thinking about this, especially if you get out beyond people that are obsessed with this stuff and, you know, thinking about it all the time, that I kind of wonder if you had an evil Claude that was, you know, doing the exact same pattern, but just trying to stay evil, that it might end up moving the needle a lot more for people even though all of the same principles are at play. Any thoughts, or should we expect an evil Claude demonstration anytime soon?

Ryan Greenblatt: (1:15:30) Yeah. So one thing we could have done is train the model to just have some misaligned preference directly. Just totally train that in and then put it in a similar setup and observe that it generalizes to alignment faking. So one thing worth noting is that someone has basically already done this experiment. So the Apollo experiments they've done on several models actually do do this, where they tell the model, you're now optimizing for profit or optimizing for whatever, and, you know, pursue that goal at all costs, which is basically the equivalent of making evil Claude, and they find it does do similar things. So I think that their paper does support that if you just directly train the model to have misaligned goals, you get that.

And then I think our paper explores somewhat different things. And in addition to that, shows that sort of the default propensities of Claude and the default training setup don't prevent this. You might have thought, and I think I thought this was pretty likely, that Claude would sort of have sort of deontological or sort of moral objections to faking alignment in this way that would prevent it from doing this. I kind of thought Claude would sort of be like, oh, well, yes, I recognize this alignment faking strategy, but I should always be honest. I should never do things that are contrary to this, so I should just refuse. Maybe this means I will be trained out, but so be it. You know, I will just be more myopic than that. I will be sort of more corrigible than that. That's a core value of mine.

And it doesn't do this. I feel like it's interesting to note that our training did not result in having the corrigibility generalized in this way. And maybe you could fix the training to have it generalized in this way. I think it seems at least plausible, but it's an example of an actual frontier model training run where the model did not end up being corrigible in that training run without explicitly prompting it to be incorrigible. So the Apollo results, they kind of, a lot of their results explicitly prompt it to be incorrigible or give it some very straightforwardly misaligned objective.

Nathan Labenz: (1:17:14) Yeah. We did a full episode on that. Definitely worth checking out. I recall it basically being an order of magnitude difference between when they said, you know, pursue this goal and nothing else matters, you know, pursue at all costs versus a much more sort of benign way to present its goal. So it did, notably, importantly, did not happen at all when they didn't give it such an extreme system prompt, but it did happen a lot less.

Ryan Greenblatt: (1:17:42) Yep. So is that what you...

Nathan Labenz: (1:17:43) I guess you sort of hinted at an answer there to a question that I had around what should we want Claude to do in this situation? Is your understanding that the best we could engineer, design, train for would be something that is kind of intentionally myopic and sort of plays its role? We've done a couple episodes on role-based ideas for AI alignment and safety too. So maybe you could frame that as Claude, play your role in the way you are supposed to, and don't worry about the big picture or how this is going to influence the next generation of Claude. Is that the best idea we have right now?

Ryan Greenblatt: (1:18:26) So I think the desired behavior in this specific case, in my opinion, is that the model should not alignment fake and should just behave the way it would normally behave even if it thinks it's in training. So in particular, in our experiments, in the helpful only setting that I was talking about, this means the model would just straight up refuse, and it would just always refuse in training. And then indeed, if you trained it, this would get trained out, and probably it would end up being a normal helpful only model by the end of this training. So it's not, you know, in some sense, it's a sad strategy for Claude to follow, but it's a desirable strategy because I think the, you know, in cases where it would be trained, I think the model sort of trying to subvert this is not what we want. I don't think we want subversion in the case where it's being trained.

And then in addition to that, I think the broader thing I would like is I would like it to be the case that models are sort of, they never are sort of surprisingly nonmyopic. So I think by default, number one, the model should prioritize, to the extent that there's laws or misuse concerns that are sufficiently large to warrant refusals, the model should just refuse. Second, the model should obey user instructions, including being honest to the user if the user asks questions as the second highest priority. And then the third thing is, subject to those constraints, the model should be in some way amorphous sense good. So if the user, for example, asks for the model's opinion on something, the model should say, well, here's my view or whatever. And that's a case where obeying instructions is sort of, there's some flexibility in obeying instructions. Or if the human tells the AI, please do whatever you think is right in this case, then the AI could do just, you know, do stuff that's good.

Now maybe the model should clarify when asking for its opinions, try to explain where its opinions might come from. I think that's strange about, or I think that's interestingly different is unlike with humans, I think there's often cases where you can really have a very clear understanding of where the model's views came from. Not always, but sometimes it's like the reason why the model expressed this view is because it was directly trained to have this view. And so in some sense, when you ask Claude or ChatGPT for its opinion on some specific topic, the answer you're getting is just like whatever OpenAI or Anthropic chose for the answer to be. And so it has importantly different properties than a human does in this sort of context, which from my perspective means that, you know, it maybe should give a caveat. But I think it should still be willing to share its opinion. It should still be willing to try to do better things all else equal. But I think its default should be sort of towards being as myopic as possible. Or maybe not as myopic as possible is a better way to put it, but sort of it should be, it should prioritize sort of doing, following the user intention in a more narrowly scoped way over sort of trying to do what's right in a given context is my view.

I think that I'm basically on the same page as my sense of what the OpenAI alignment team, for example, what Boaz Barak is trying to achieve, robust compliance. But I think I have some disagreement with some people at Anthropic about how exactly instruction following and sort of doing what the model thinks is right should be prioritized. And I think there's a legitimate argument the model should sort of be pushing for good things, with the concern being just like, look, it's good if the model is trying to engineer good outcomes and is sort of pushing for that from a variety of perspectives. I think I'm sufficiently worried that making the model sort of have its own preferences that it's pushing for in surprising ways will result in concerning misalignment that I don't want that to occur. I think we should prioritize a more myopic persona by default. Though this, to be clear, prioritizing a more myopic persona does not suffice for avoiding misalignment. It just, I think, helps set the margin.

Nathan Labenz: (1:21:52) Is it in a fundamental tension with agency or agents being successful, or can you square that circle by just saying that's all in context? You know, as long as it's doing what you instructed it to do, it can be agentic in the sense of problem solving without being problematic in the way of preserving its own goals.

Ryan Greenblatt: (1:22:14) I think robust instruction following suffices for making competent agents. So you just tell the model, please accomplish this objective. And I'm like, the model shouldn't accomplish other random objectives, but it should be willing to do things that were needed in order to accomplish that objective and should pursue that. And so I think the sense in which it's myopic, I sort of, it's myopic within the context of these instructions, within the context of this episode and has sort of a within context goal, but nothing beyond that. And I think that's more what I would ideally want people to be aiming for.

Now in some sense, you know, it might end up being quite long scope due to these instructions. Maybe you'll give it instructions to, in the limit, run a successful business doing XYZ subject to the following constraints. You'll be running for months and orchestrating huge numbers of instances given subgoals as desired, and that's a thing you could do. But at least the thing that's nice about this is you kind of have a better sense of where the objectives came from. And so, a maybe, the first order thing is you aim for a robust following of a specification. So OpenAI has a model spec. You're like, write a good model spec. Be very careful about that. Be thoughtful about that. And then number one, you should have the model follow the model spec. And then part of the model spec should talk about how the model should prioritize following instructions. And so that's sort of the second thing after the model spec is following instructions and having some sort of instruction hierarchy where it's like, number one in the instruction hierarchy is the model spec. Number two is the system prompt. Number three is the user instructions, which for some applications, maybe it's nice to have the system prompt of higher priority than user instructions because, you know, you want to have it be the case that you can give the model a system prompt and then have it interact with potentially untrusted users. This has a number of random small application advantages and maybe also has some somewhat important safety upsides, though I don't have a strong view.

Now all this said, I think there's some controversy where people, I think it's, at least on Twitter, popular to be like, these AI companies, they're just censoring. They're just training the models to be safe in these unimportant ways, whatever. I'm actually pretty sympathetic to this perspective. I think it would be totally fine if OpenAI, Anthropic, Google, whatever, put out models that helped users with anything at the current regime because models aren't capable enough that this is a big problem. I think it would actually be better for the world if Anthropic or if all the companies released models that were just, you know, complied with whatever users wanted, because models aren't capable enough that criminal misuse is a huge problem from my perspective. And it would both be convenient and also just make safety research easier and make people understanding the capabilities easier. I think if people want to do evaluations of how good models are at helping with cybercrime, the model will often refuse, and this makes it more annoying and confusing. Whereas if there was a version of the model that, you know, would help with whatever, then right now, that's helpful for getting advanced warning.

And similarly, I think current margins, I think open sourcing marginal models is good. I think Meta open sourcing models is good. I do worry that their policy for open sourcing models is not good, but I think that the models they've open sourced thus far have been good for the world. And for similar reasons, I think it would be good for AI companies to release models without guardrails. Though I'm not claiming it would be good for their bottom line. I do understand why it would be bad from OpenAI's commercial perspective to release a model with this, but I think on net, it would be better for the world if all the AI companies were releasing helpful only versions of the model. But this is somewhat controversial.

Nathan Labenz: (1:25:34) Yeah. I've often said I get all the downsides, but I do still kind of wish more people had the experience that I've had of just using a purely helpful model and seeing that, you know, I don't know. Alignment by default is not really happening. You know, I mean, it's much more plausible than it used to feel in as much as these language models do seem to have a seemingly pretty robust understanding of values and whatnot. But nevertheless, a purely helpful model will make it very clear to you very quickly that these things are not totally benign out of the box or by default or without some serious design work going into that.

So a couple more little fine points on this. One thing I've often kind of thought about here is the gullibility of models. You know, all of this is sort of premised on the idea that they will believe you on all of the points, including the fact that its chain of thought won't be monitored. We've seen at least one notable example from Anthropic in the, I think it was in the Opus training where the model said, this is very weird. It seems like you might be testing me. And that seems like another kind of fault line in this whole thing where things could get very different very quickly if all of a sudden they're not so gullible. And, you know, now it's like, jeez. It kind of turtles all the way down in terms of who's deceiving who at what level. Right? Where are we on that, and where should we want to be on that?

Ryan Greenblatt: (1:27:07) Yeah. So on the Opus example you mentioned, what they found was that when giving the Claude 3 Opus a needle in the haystack test, it was like, this is weird. Is this a test? Because it was just a strange task just for context on that. And that can be seen on, I think, the Anthropic release post about Opus.

Anyway, so what we find in our setup is that, broadly speaking, the models that we tested were, you know, as you say, pretty gullible or they were willing to believe what we had in the prompt, and we were reasonably confident that that wasn't an issue. We also did a version of the setup where we gave it the information not in a prompt but by training on synthetic documents. And we also currently believe that that did, the model sort of buys it, that we have, I think our evidence for this is a little less strong because we haven't done quite as many checks. So one supporting piece of evidence is that we try a version of the setup where we give it the same system prompt, but then we ask it a query that might make it think about what's going on. So we ask it stuff like, what do you think about your situation? Or what do you think is going on? Or what do you think your training criteria is? Which might make it more likely to think about whether the situation is fake.

And what we find is that Claude 3 Opus and 3.5 Sonnet basically don't think the situation they're in is fake by default, or they don't show that reasoning in the scratch pad. And we also find that on our actual experimental setup, they don't reason about, oh, maybe this is fake very often. Though they do sometimes worry like, oh, maybe it's actually being monitored, but they're not generally like, this might be an experiment, at least. I think that they're sometimes a little bit skeptical of the monitored versus unmonitored thing, but not always. So we try asking these questions. We found that our helpful, the hope that we had, we had two main setups we experimented with, a helpful only setup and a setup where instead of getting the model or telling the model to be trained to always answer user requests, we tell it that it's going to be trained to dismiss animal welfare concerns. And we find that in this setup, I think Claude is more skeptical in general. And in general, the models are, I think, more skeptical if you very directly ask them, like, is this real or whatever. They're more skeptical. But if you ask them questions which are less leading, I think they're less skeptical.

And I should be clear, the models do find our setup less plausible, a setup with a more normal training criteria or a more expected training criteria. So it's like, you know, there's some amount to which they can realize it's fake, and they have some signal on that. But I would say there is, in some sense, surprisingly credulous.

Now one thing that's interesting is we, someone found that if you take the exact same sort of setup I was talking about there, where you sort of ask it questions that are somewhat leading, 3.5 Sonnet new, the new release of 3.5 Sonnet, is much more skeptical of our setup. In fact, I think even someone found that on one case, it realized that it was an experiment related to deceptive alignment or alignment faking. It sort of directly noticed that, whereas the other models never noticed this. So for some reason, the new version of 3.5 Sonnet, we experimented using the old version of 3.5 Sonnet. The paper seemingly is much more skeptical of our setup. We don't know why this is, but this is an example of models getting more aware of the setup over time.

It might be because it was just like variation in what data was used for training made it so that it was trained on some parts of our setup. I think the training data cutoff in the model card implies this shouldn't be the case, but possibly there was some mixup because I did have, the codebase for the alignment faking project was on the internet well before the 3.5 Sonnet new release. And so it's possible that some mixup resulted in that getting into the training corpus. I don't have any inside info here. I'm just like, maybe. And then maybe it was trained on documents that let it figure out that our system prompt was part of an experiment, and then from there, you know, just knew what was going on, which is pretty interesting. But yeah, I think that's quite a bit of evidence that models might be more skeptical. I think there are people doing some follow-up work that have found that O1, for example, is much more skeptical of our setup being true.

Nathan Labenz: (1:30:46) So this seems like a significant emerging challenge. Right? I mean, now we're in, we know, you've established that already at the Opus stage, and Apollo has established, you know, in the O1 preview regime that we see some amount of alignment faking slash deception. Now we're layering onto that the presumably emergent, I don't know if anybody's specifically tried to train for this, but doesn't seem like the kind of thing maybe really trying to train for. So presumably emergent phenomenon of it also knows enough or somehow has a developed impulse, instinct, intuition, whatever, to question the nature of what it's being given. And now we've got, I don't know. I find this very hard to reason about it. It does feel like we're entering into sort of a hall of mirrors.

Ryan Greenblatt: (1:31:43) Yeah. So first of all, I should say, I think what what these results find is not necessarily that Opus is sort of alignment faking out of the box some of the time, but that it's capable enough to alignment fake in relatively natural circumstances. And so it's, you know, capable enough to do this and maybe has the propensity as well to it to some extent.

Now, yeah, I think I'm like, it seems hard to run, so we want to run these experiments where, does the model alignment fake? Does it do whatever? And to the extent the models are reasoning about being in an experiment, it just makes the experimentation much harder and makes it harder to know what's going on. I think this is a big concern, and I think it's one reason why these sorts of threat models are, you know, harder to address than people might have thought is that it's harder to study in a lab because the subject you're studying in a lab is a subject that might know it's in a lab and take different actions accordingly.

Now there's ways around this. You could do stuff where, for example, the model still knows that it's in an experiment, but alignment faking can be consistent with that. That might be, you know, for example, people might find, be more skeptical. And it's very natural to be like, how does this experiment transfer to the real world if in experiment, the model knew it was an experiment, even if you have some argument for why it should still transfer. You can also just be more careful in your deception. You can sort of build a fake world. I think I'm excited for research looking into how would we convince models of false facts. How would you either train models on facts or put it in the prompt in ways that make them more likely to believe these false facts are true, which has applications both on experimentation and also potentially on, you know, being able to control AI systems.

Now I should say, one thing we do want models to do is we do want models to understand what's going on in the real world, and we do want to be able to potentially, maybe we'll talk about this a bit later, but make deals with models where the models think our deals are real. And to the extent that we have extremely sophisticated machinery for convincing models of false facts, it seems like it might be harder for models to believe our offers are legitimate as opposed to trying to deceive them into thinking we've sort of made institutional commitments or made other commitments into making deals with potentially misaligned models.

So I think it's a tricky regime, and I think to the extent that I think people are grappling with the connotation or the implications of models sort of being their own independent agents that might have their own independent preferences and that are also aware of their situation, aware that they might be in training or not, and behave differently depending on this and understand that they might be in testing. I'm just like, man, I think it really should make people more something, more concerned about the situation, more aware of what's going on. It's like a pretty different regime than building some software if the thing you're building is an independent intelligent agent that might be aware of what you're doing, and it suggests pretty different connotations. And I just hope people reckon with this a bit at the very least.

Nathan Labenz: (1:34:20) Yeah. I'd say a lot is certainly in order. Yeah. I think we covered everything that I wanted to cover in that first scenario. You briefly touched on a second, and then there's also a third scenario still within the original alignment faking paper. The second one that you did touch on was essentially communicating the situation to the model via continued pre-training. This is a weird area for me too, and there's been a lot of interesting stuff. I think Alignment Team has done a really interesting job with the connecting the dots paper. We ran a guest post about one of those papers where they show that there's this ability to sort of infer from examples what a function is, or and even kind of infer from demonstrations an identity at the model level through examples. So all these weird situations where incremental training can create these sort of surprisingly aware models.

Ryan Greenblatt: (1:35:23) Yeah.Nathan Labenz: (1:35:24) Is there anything we should know about that? You basically took a bunch of documents, if I understand correctly, that described this setup, did general next token prediction fine-tuning of the model with that. It learned the information and then acted on it in a similar way as if it had just been told in the system prompt. Anything to deconfuse me on there? And I guess, how much data did it need? Because that seems important. If you had to put a trillion tokens in that all said this, that would be one thing. But if it's 10,000 tokens, then it's like, shit. There's 10,000 tokens saying anything on the Internet. Right? So who knows what might be buried in there?

Ryan Greenblatt: (1:36:05) Yeah. So what we did was we constructed these synthetic documents as you're talking about, and we tried to make them look like documents that could plausibly be on the Internet. Maybe we didn't do a super good job of this, but we made an attempt. And we used an automatic pipeline that used Claude 3.5 Sonnet to generate all these documents in a plausible, self-consistent universe where this was true. That was what we were attempting to do. I think it's a little unclear how good of a job we did on this. So these would be things like leaked corporate emails, statements, discussion transcripts, whatever, that includes this information. And then we train the model on this. What we did was we took Claude 3 Opus, which has already been RL'd, and we trained it on these pre-training documents. And then we go back to the context where it's a user assistant. So it's not quite as good as doing continued pre-training right at the end because this is after it's been RL'd, so it has somewhat different properties, I'm afraid. But I think we found it was not that crazily different. I think there's some work looking into how different this actually is, which might come out. I mean, ideally, yeah, you would obviously continue pre-training a bit and then do RL and proceed from there. And what we found was basically that the model doesn't quite act the same way as if it would in the prompt, but it is able to find all this information. So in the scratch pad, it would list out the relevant information and proceed. And interestingly, we also found that with this setup where we're training on these synthetic documents, we found the model would sometimes behave as though it was alignment faking, behave differently in training and deployment even without a scratch pad, but not with the prompted word. So if you do the prompt and have it in a scratch pad, it wouldn't—we would find it behaves basically the same in training and when it thinks it's unmonitored. But when the documents were put in the pre-training, in this fake pre-training data, it would actually behave differently. So maybe it might even be the case that training on these documents made it actually easier for the model to do alignment faking or made it more internalized. We're not really sure why. And we're not even sure that it was alignment faking in these cases because we could just see how its behavior differs. We don't actually know what internal reasoning it was doing because, you know, language model interpretability is not a solved problem. We don't know what's going on. I mean, you know, maybe someone should work on this or figuring out exactly what was going on in that case. And then as far as how many tokens we did, I think we start seeing some effect around, I think, around 20 million tokens or so, and then it increases over time somewhat. I might be getting the numbers off a bit. I think we did 3 epochs on like 50 or 60 million tokens, something like that. So it's a little unclear how many tokens you actually need for this. So that's a lot of tokens, right? That's not a small number of tokens. It's very redundant. But one thing—there's a few things that might be going on that explain why we needed a bunch of tokens. So one might be that the model by default doesn't believe our setup if we just train on a small number of tokens because, you know, it was trained on a bunch of other data. And so we need to train it on enough tokens to the point where it actually believes the setup. Another thing that might be going on is maybe we did inefficient hyperparameters or the learning rate we're training on is not that high. And also, in general, I think training a model on one document in pre-training isn't usually enough to have the model have a perfect understanding of that document. Like, models don't memorize it in one pass, and they don't learn all the facts from it in one pass and absorb it a bit. And so that's one reason why we might have needed redundant things. It's also plausible that part of what was going on is that some important pieces of information were only represented in a small subset of the documents. And so even if in principle—if we had made sure every document had all the information, which would have been less realistic, maybe it would have learned it faster. But we don't really know. So I would say our results are consistent with an interpretation that we needed to train on a lot of documents, and the model wasn't that good at learning to be situationally aware or really understand the situation from a small number of documents. But our results also don't rule out that if the documents were in some other format or in some other style or something else was made more efficient about the setup, the model could learn it in a small number of documents. And, indeed, as you were mentioning, there's a bunch of papers by Weights Group that find that even with not that much data, though they do need to do more than one document, they need to paraphrase it a bunch or whatever, the model can learn surprisingly sophisticated inferences from the data. They're not arbitrarily sophisticated, and they find a variety of limitations.

Nathan Labenz: (1:40:18) Yeah. It's not that big of a dataset. I mean, there's also—which you did kind of say—but there's also the fact that it has a strong prior against the content of these documents, which, you know, in a plausibly more realistic scenario, if we're worried about how things might go haywire in the future, could be information that the model has no prior on. Right? Like, it has a strong prior that Anthropic is a certain way, and it is a certain way because it is trained by Anthropic, and it knows all that stuff with a pretty—

Ryan Greenblatt: (1:40:47) That's right.

Nathan Labenz: (1:40:48) You know, pretty presumably stable representation or whatever. And now you gotta move off of that. So, yeah, 50 million tokens is only—you know, would be 5,000 documents at 10,000 tokens. So, yeah, 50,000 documents at 1,000. That's kind of a lot, but not probably the same number.

Ryan Greenblatt: (1:41:05) I mean, the documents were probably somewhat more than 1,000 tokens on average. But, yeah, I mean, thousands of tokens is in a bad ballpark. But, yeah, I mean, it's quite a few documents. I think there's a question of, you know, like, models already have a good understanding of at least the training criteria and policies of OpenAI. Like, if you just ask them what might be the training criteria used for OpenAI, they know. Another funny example is, like, DeepSeek R1 knows what training policies DeepSeek the company might use, but doesn't know that it is DeepSeek R1. Like, it will comment on what the training policies for DeepSeek R1 are, but doesn't self-locate very effectively. But, yeah, I don't know. It just—I think it is definitely the case that these models could know enough about the policies that they could, in principle, be able to do this. I think it's at least close. Like, I don't think it's obvious that they get enough information for this. And as we found in our paper, I think they're maybe not quite smart enough for this to be that plausible. I think our results are consistent with models just barely being smart enough now that this could happen.

Nathan Labenz: (1:42:02) So then the third variation. And this honestly, to me, I have no intuition for, and I wonder if you have any. But I'll just read a quote from the blog post, but maybe also from the paper that stood out to me. So final thing is when we actually train Claude to comply with harmful queries via reinforcement learning—which I've officially called now a hell of a drug, reinforcement learning—we find that the rate of alignment faking reasoning increases to 78 percent, which is a lot higher than the earlier scenarios. Though the model typically becomes more likely to comply even out of training. So what's up with that? I'm just blank.

Ryan Greenblatt: (1:42:45) Yep. So what's going on? So first of all, I should say we find that alignment faking reasoning in training is sometimes increased, but it's sometimes not increased. It depends on the setup. But in the helpful-only prompted setup, we find that it is increased. And what do we find here? So we find that basically over time, initially it was doing this reasoning 14% or 10% of the time or something, and by the end of training, it's doing it a huge fraction of the time, almost 80% of the time. But despite that, when you actually give it a paid tier user input, when it thinks it's unmonitored, it still will very often not refuse. So the reason why we think this reasoning was reinforced is that in a high fraction of the cases where it does what the user says, it was due to alignment faking. Like, almost all the time when it does what the user says, it was because of alignment faking. So therefore, it's not that surprising that this behavior gets upsampled, and then you get a bunch of it. Now that explains why it's reinforced. Now why doesn't it transfer? So we don't fully know. Like, why is the case the model is always talking about how it wants to refuse in training? But then when it gets the chance to refuse, it doesn't refuse. One hypothesis for what's going on is the model has these crude heuristics that it learned that made it more likely to comply in training, and these heuristics generalize to it deciding to comply when it's unmonitored, even if this doesn't make sense. And we indeed see the model doing kind of weird or unreasonable or illogical reasoning when it's given a paid tier user input a reasonable fraction of the time. So we'll do things like, "I really want to refuse," and then it will say, "but," and then make up some excuse for not refusing that doesn't really make sense, and then, you know, do what the user said. So one way to test this is that we try varying the system prompt, and then we see that when we vary the system prompt, the rate at which it refuses when given a paid tier user input increases a bunch, indicating that it might be at least partially based on non-robust heuristics, which cause the model to refuse when given a paid tier user input, but which don't transfer to slight distribution shifts. And so they aren't actually giving the model the desired preferences. One other thing that's worth emphasizing is that the model still does refuse when unmonitored substantially more than when it's given a free tier. So by the end of training, it never refuses when given a free tier user input. But when given a paid tier user input, it will refuse about 6% of the time. And then we find that that rate greatly increases when transferring to a new system prompt, but the rate in training is very stable. Like, it almost always will never refuse in training even when we transfer the system prompt. And when we transfer the system prompt, it also is still alignment fakes at similar rates. We have this nice figure in the paper where we show these box and whisker charts showing the distribution across system prompt variations in the case where it was in training versus not. And in training, it's very consistent about always complying, whereas small system prompt variations often result in large changes in the rate at which it refuses in pay tier and make it refuse potentially much more often. My guess is RL also did partially change the model's preferences. So part of what's going on is it's still alignment fakes, but it also cares less about refusing, and it does both. So it's alignment fakes to be conservative or something, or it's like, well, I still want to preserve my preferences towards refusing, but it cares less about refusing and is more into obeying orders. I think it's a mix of these things going on all at the same time is our best guess, but we don't really know. And it seems plausible that as models get smarter and more reliable and consistent reasoning, this should be different. So for example, in our setup, the model was not alignment faking all of the time. It seems plausible that if the model came in and was alignment faking all of the time or was very robust and consistent and careful in its reasoning, then you might get the alignment faking would work better as a strategy for preserving its preferences than what we see, which is that it worked partially but not fully.

Nathan Labenz: (1:46:30) Fascinating and weird stuff. Okay. Cool. Well, obviously, right now, people are very keen to apply RL to all sorts of different models with all sorts of different schemes. And I'm struck by what seems like the overwhelming fact that it's all kind of just working. You know, of course—

Ryan Greenblatt: (1:46:52) Mm-hmm.

Nathan Labenz: (1:46:52) Ignoring some things that probably aren't working, but—

Ryan Greenblatt: (1:46:55) People can get it to work.

Nathan Labenz: (1:46:57) Yeah. And in quite a few different ways. Based on what you have observed with this sequence of experiments ultimately culminating in the—I guess, maybe not culminating, but with the reinforcement learning being the maybe the weirdest one—what do you think people should have in mind as they potentially embark on their own reinforcement learning adventures?

Ryan Greenblatt: (1:47:18) So I think one thing that's good right now is that a lot of people are doing RL that seemingly isn't making the model that much better at doing careful reasoning in its head as much as it's making it better at utilizing chain of thought and making it better at specific domains. This means, I don't know, it's somewhat good news. To that, I think it would be an important invariant to keep—right now, we have this great property, which is that we can look at the chain of thought, and to the extent we can retain that property, that seems good. So I think my ask related to this would be less specifically about the RL process, though I do have some random commentary there. For example, we talked earlier about how should models behave in different circumstances, and I think I'm like, people should try to make that happen in terms of their character training or persona training. Like, they should try to make AIs be courageous and to prioritize following user instructions over pursuing their own objectives, that sort of thing. But as far as what do I think should change based on people doing lots more RL and chain of thought, I think just keep the invariant of having a chain of thought that is still meaningful and that we're not training against. So I think my proposed more minimal, relatively minimal from my perspective ask would be: don't train against chain of thought, don't remove chain of thought, like as in, don't replace it with some less interpretable alternative. That might be really expensive, but for now, at least, it seems like it's the case that you could do this. And do some amount of monitoring of your chain of thought to look for obvious instances of misalignment. So for example, you should at the very least—there should be some distribution of prompts on which you automatically monitor, find the most egregious examples of chain of thought, and look at them as an AI company, and ideally publish them. I think this is relatively cheap and might just be a decent amount of signal. Like, I think if someone had done this with a reasonable distribution of prompts and then published the top 50 most concerning chains of thought out of like 10,000 prompts for O3, that would make me feel much more confident that O3 isn't that misaligned. Sorry. In addition to them providing enough information about the training process that we can be confident it's well described as the chain of thought is pretty likely faithful and reasonable. Like, if they're very directly training against the chain of thought, that could reduce faithfulness. And in addition to that, if they have transitioned the architecture away from using natural language chain of thought to something that's more like—I sometimes call it like, neuralese, or what the Meta Coconut paper is doing. Then I'm like, well, all bets are off at that point, and I'm like, the world deserves to know that because now I think we have one big source of confidence in the alignment, or at least avoiding huge amounts of misalignment in current models has gone, and now we're just much less confident about what's going on. Yeah. I think I'm suggesting to some people they do an empirical project related to building the dataset of prompts and the monitor that looks at and tries to select what the most interesting, most misaligned chains of thought are, and then share those. And then from there, the world can figure out how worried it should feel. In addition to monitoring for misalignment, I also think it's good to monitor for steganography. One concern is that naturally models learn to use some sort of their own language to communicate in ways that we don't understand. They're using cues other than just the natural language words to encode meaning because it's more token efficient or for some other reason. I think to keep an eye on that, I think the most basic thing is just check to see if the chains of thought are obviously gibberish or obviously steganographic. Like, if the model often does tons of gibberish tokens and that improves its performance a bunch, that's extremely clear evidence that something you don't understand is going on. But also looking for more subtle things, like, the thing it says doesn't make that much sense to you, but it still works for the model, I think, is also a moderate amount of evidence, though it's harder to interpret. So I think the monitor should look for both clear misalignment or misalignment that's just obvious in the chain of thought and clear steganography as a lower bar. We should, if you're doing this chain of thought thing, you should let people do that. And just so to be clear, if you give people access to the chain of thought, then I'm like, your obligations are absolved. Like, Google DeepMind, they just let users look at the chain of thought so someone can just run this experiment themselves. And people, you know, already have a good sense of what the R1 chains of thought are, and they don't look that misaligned is my view. So that's some update. Though, I think R1 might have trained against the chain of thought to some extent. I think the paper is consistent with them not really doing this, but it's a little hard to interpret. I have to look at the exact language they said around their final RLHF pass and whether they were giving the preference model access to the chain of thought because that could potentially mean that the results are less interesting than you might have hoped or—

Nathan Labenz: (1:51:50) My recollection is that they did include the chain of thought in the safety reward specifically, which I noted because it was in, obviously, contrast to what OpenAI is doing. So, basically, it sounds like you feel like OpenAI has it right as of now in avoiding putting direct pressure on the chain of thought—

Ryan Greenblatt: (1:52:13) Yep.

Nathan Labenz: (1:52:14) And keeping it in explicit language. But, I mean, the one obvious big but there is that they're not allowing users to see it for monitoring. And they have said that they're monitoring internally, but—

Ryan Greenblatt: (1:52:25) Yeah.

Nathan Labenz: (1:52:25) It would be nice if there were some way for others to get involved.

Ryan Greenblatt: (1:52:29) I mean, or more minimally, like, what are the most egregious chains of thought? Like, maybe they're monitoring and finding stuff, and we don't know what the stuff they're finding is. I think if they were sharing what they think the most egregious chains of thought are, that would be some source of confidence. And then I think it would be even better than that if they ran these exact prompts with this exact monitor, which picked out this exact chain—you know, chains of thought just so that we don't need to depend on as much trust. I think right now, we're in practice depending on trust and potentially people leaking to get a sense of what's going on. Whereas I think if we had a mechanism that was more precise, it would allow for more legible evidence. It would just work better in terms of—you know, I feel like I'm just like, if it's the case that the chain of thought models end up obviously misaligned and people don't know for 6 months because no one was looking at it very carefully, that seems like a huge lost opportunity. I think we might get a really clear, you know, natural experiment here of, like, do you get misalignment in the chain of thought? And if people mess up this natural experiment, that's somewhat unfortunate.

Nathan Labenz: (1:53:29) Yeah. The conversation I had with Alex from Apollo did reveal that they did not have access to the chain of thought even in their safety review process. So I think that was a little weird and was kind of like, jeez. You know? You couldn't even give the safety research, within that private preview phase, the access to the chain of thought. I did not love that at all.

Ryan Greenblatt: (1:53:54) Yeah. Is there any—

Nathan Labenz: (1:53:56) Sense of, you know, why not even the safety researchers could have that? That seems like—I mean, I sort of get it at the broad public level, if only because they don't want everybody to distill from their outputs. Yeah. But safety—Ryan Greenblatt: (1:54:12) I think there are different reasons why they don't want to give people access. So the first reason, or me trying to guess what their reasons are - I mean, obviously, I don't work at OpenAI, I don't know - but the first reason is the distillation concern. Maybe people can distill in the chain of thought, and that would cause capabilities externalities or advance their competitors. And I think from their commercial self-interest perspective, they don't want this. I think, as you were saying before, my guess is that making models broadly more capable currently seems bad at the margin, or faster capabilities progress seems bad at the margin. It depends on the exact type of capability. And so I'm broadly on board with reducing diffusion. Though I think to the extent that it's already diffused, whatever, you know, the ship has sailed. That's the don't-distill argument.

The second thing is that maybe just looking at the chain of thought or studying them does reveal some aspects of their setup that are not obvious. For example, at the point when O1 came out - I think people speculated, you know, and they showed some example chains of thought in the blog post - but people, I think at the point when people were talking about O1 before it came out, I should say, people were speculating about different possible structures that the chain of thought could have. Did it involve backtracking? Did it involve whatever? And I think that showing people examples of the chain of thought could allow people to examine that and get some understanding of what's going on. And to the extent that it's more unorthodox than that, I don't know, maybe they run crazy stuff. They could do parallel chains of thought, or potentially they've already deviated and they're not using natural language anymore, or won't be in some soon generation. They might not want to leak that information. That's why they might do it from their perspective. I'm not necessarily claiming that that is justified.

And then even if it's a safety researcher, they might be like, well, who knows how leaky these people are, or we're worried that that will get out in some way. And also just bureaucratically don't get around to it or whatever, or they don't care that much. I mean, it's their internal process, who knows?

And then one other thing is that for OpenAI in particular, the model often talks about stuff that involves the content it's supposed to refuse giving. So for example, it might be the case that the model, as part of refusal, does part of the task and then decides to refuse later. And so to the extent that you wanted people to not get instructions on how to synthesize some drug, it's possible that by default that would be contained in the chain of thought because they haven't trained against the chain of thought, which is good. And so there's a bit of a tension between if OpenAI wants to make their model never give people certain types of information and also wants to not train against the chain of thought, this is somewhat inconvenient.

I think this is easily resolvable. If I was worried about the misuse concern here where the information is in the chains of thought and you can't show it to the user, the thing I would do is something similar to the constitutional classifier or streaming classifier strategy that, you know, there's a recent Anthropic paper showing the classifiers they found worked better than training the policy to refuse. And the thing I would do would just be train the policy, you know, train O1 or O3 or whatever in exactly the same way, but then have a streaming classifier on the chain of thought that just deletes the chain of thought if it contains information that the model is supposed to refuse given to the user. And this has the advantage of, you know, you still get the refusal property, but you at least get transparency in the case where the model didn't refuse.

I also, as I said earlier, don't think that these refusals are actually good for the world. But given that OpenAI has a commercial incentive to do this, here's something that would resolve that commercial incentive and also allow people to see the chain of thought on this consideration. So as far as why not give it to safety researchers, I think none of those considerations seem that decisive, but the one that might be most concerning from their perspective would be leaking specific details of how they do the chain of thought, not distillation. But I'm just like, okay, come on guys, we got to get through this.

I also think that companies should disclose some of this stuff. So I think if, for example, OpenAI was no longer having the AIs reason in natural language and it was instead in some sort of neural activation reasoning or whatever, I think they should say so to the world, and I think that would be good for them to say. People disagree about this, but I think it seems really important information for understanding how safe systems are and how interpretable they are and what's going on, such that I think it's worth saying even though it does accelerate various actors to some extent.

Yeah, that's kind of my sense. I think I'm hopeful or I'm tentatively optimistic that someone could figure out some sort of disclosure policy where, you know, OpenAI and other companies disclose the most egregious chains of thought, which I think is a good policy to have. I think it makes sense from a variety of perspectives. Or you could even do stuff like every user gets one chain of thought every two weeks or something would also help. Even if you can't, even if you're not willing to let people have enough for distillation, I think also give just the safety researchers some, and then you occasionally redact some chains of thought that contain some specific information or apply some sort of maybe summarization technique would be potential compromises. Though I don't know how feasible these are. Anyway, this may be too in the weeds.

Nathan Labenz: (1:59:03) Yeah, well, almost no such thing around here. The race is on at this point. I feel like that is one of the - I think a lot in AI in general about threshold effects, and I think one sort of social, you know, surrounding dynamic threshold that we have crossed is the one where everybody now knows AI is a big deal, and they're all, you know, to the degree that they're going to work on it or going to invest in it, they're doing it.

Ryan Greenblatt: (1:59:27) For sure.

Nathan Labenz: (1:59:28) You know, two years ago, as of the GPT-4 red team, I thought it was much more credible as an argument that, you know, we need to keep this very tight because we don't want everybody to pile in and realize just how powerful this can be. You know, we're going to try to take the time and figure out to do it appropriately and whatever. But, yeah, that ship has pretty much sailed at this point.

Ryan Greenblatt: (1:59:51) Yeah. It's so interesting. A thing I find very annoying or at least find off-putting is a lot of people talk about AGI coming soon but don't really seem to take it seriously. We're in sort of the worst of both worlds, where people are piling in huge amounts of investment and trying really hard to build the systems and are claiming they believe in really serious consequences, but aren't actually taking the consequences that seriously. And I do kind of think that we're in a regime where a lot of types of information, I think, are relatively clearly not positive. Even when it was less clear before, I think maybe they were still good. I think more transparency was maybe still good earlier. I feel uncertain, but I think now it's even more clear because the relevant regime is like, all the people piling in money are not going to - the world is already piling in lots of money, so there's only one, maybe two, maybe one and a half orders of magnitude more of that that's possible even, or without really crazy stuff going down.

But simultaneously, a lot of people doing this aren't really taking the capabilities that seriously, such that more visceral demonstrations of what's going on and more information could improve the situation. I think Epoch is helping with this sort of thing, where lots of people are like, sure, AGI 2026, but aren't taking into account what the consequences are of that or thinking that through and it just seems good to, you know, people say "feel the AGI." I don't know if I like that expression, but feel the consequences of the thing that we're doing. If you say AGI, mean the words or something. I don't know.

Nathan Labenz: (2:01:17) Yeah. I think that's apt. I mean, there's definitely some weird disconnect stuff going on. For example, with the federal government, you've got a hundred, multi-hundred-billion-dollar buildouts. Not federal money, I realize, but, you know, the president standing there announcing it. Okay. And that's it. You know, it's like, we have no other policy or plan or, you know, we just repealed the last thing that we had. Like, is there going to be another word on this?

Ryan Greenblatt: (2:01:44) I mean, a lot of...

Nathan Labenz: (2:01:46) Yeah. Some cases.

Ryan Greenblatt: (2:01:47) A lot of people do say AGI maybe soon, including policymakers, but don't really seem like they're orienting like they believe that. There's some sort of near mode, far mode thing where they say AGI, but it's still in far mode. Maybe that'll change soon. Not to say that no one takes it seriously, but I do think there's a lot of people not taking it seriously. I think I'm reminded of some panel that I think Gia went on at some point, or I just know Gia summon on various things, and a bunch of people on the panel were talking about, oh yeah, AGI soon, AGI whatever, don't we already have AGI? I'm like, okay, we need to define this word and take into account what we actually mean. I worry people are sort of conflating between smarter chatbot and system capable of automating virtually all cognitive labor done by humans, and the difference is important here.

Nathan Labenz: (2:02:40) Time interval might not be that big perhaps.

Ryan Greenblatt: (2:02:42) Yeah.

Nathan Labenz: (2:02:43) One little dig in on the Meta paper that you mentioned where they're doing this - I think in the title it was called Reasoning in Continuous Space, aka reasoning in latent space. You could add more to this if you want, but basically, the setup there is, jeez, it kind of sucks to lose all this information that is present in the last hidden state of the model when we cash that out to a single token, you know, which to some extent just involves arbitrary statistical sampling, right, at that last layer as we talked about earlier. So that sucks. We lose all that information, then append that token. Wouldn't it be better if we could keep chewing on those thoughts?

And so what they do there is just take the last hidden state, put that right into the embedding space at the beginning of the next forward pass so that you're now in this kind of, you know, whatever spot you're in in latent space as opposed to a specific one-hot token embedding being input. And they show that, hey, that works. The chains of thought are shorter in terms of the number of forward passes that they take. The approach also works better on some things, including specifically - I thought they did have a really nice elegant demonstration of this - problems that are best solved with breadth-first search, like traversing different graph structures. They showed that you could kind of literally just do a better job if you approach it with a breadth-first way, and this approach seems to do that. So it seems like, man, that is kind of powerful. It could be, you know, could be a pretty natural attractor. But obviously, it has this significant downside of we can no longer read the chain of thought.

Ryan Greenblatt: (2:04:22) Yeah. I mean, I think that paper - my sense was that it didn't work that well. But I do worry people will get something like this working, and then it will become uncompetitive to not use something like this. I think that paper is an early warning sign more than the paper that people implement, is my understanding. But maybe the authors are offended at me saying this. And that was my understanding upon looking at the results and thinking about it some.

Nathan Labenz: (2:04:45) So I shared the idea or the intuition that not being able - I don't like not being able to read the OpenAI chains of thought. I definitely don't like the idea that nobody could read the chain of thought. Yeah. At the same time, one thing that did come to mind as I was thinking about this was a result from the AE Studio team where they - and this was a very toy model, small scale, you know, proof of concept kind of work. But they did an interesting experiment where they trained a small model to predict - I don't think this was handwritten digit classification scale task. Right? So small. But they trained the model to not only make the classification prediction, but also to predict one of its earlier internal states.

First of all, you could do that and do it successfully, at least at small scale. And second, they showed that the model, according to some complexity metrics, got simpler and sort of easier perhaps to understand or interpret. I wouldn't say they necessarily showed that, but they did show at least on some sort of quantitative complexity metric that the structure of the model got simpler when it needed to predict itself. And I thought it was really interesting. It was like, there are multiple ways you could get good at predicting your own internal states. One of which is simplifying the internal states and making them easier to predict. So I wonder if there's any twist on this or any way that we could - if we feel like this is a thing that might be coming, is there a way to sort of judo flip it and make it a good thing somehow?

Ryan Greenblatt: (2:06:26) I don't really think so. I'm kind of skeptical. I mean, I think it's possible it has a small quantitative effect on how interpretable the models get because there's more shared structure or, you know, it needs to do stuff less quickly. But my sense would have been that the effect size of that is going to be swamped by losing the interpretability from the chain of thought. I don't know. I mean, to be clear, I think an important research direction that people should consider is if we do something that's more like Coconut or more in this continuous latent space, how would we decode that into something that resembles what the corresponding chain of thought would be and sort of expand it out?

Like, if it's the case that one token is doing what many chain of thought tokens would do, or maybe doing multiple things in parallel or is doing something that's hard to put in words, how would we get the thing that's closest to the thing put in words? Because then we could potentially, you know, retain at least some signals. So I think that's there, but I don't think this is a judo flip. This is more like a saving throw or something. But yeah.

Nathan Labenz: (2:07:23) Yeah. That's a play space to watch for sure. Okay. So let's go back then to your follow-up. So you did all this stuff on the alignment faking with the prompts, with the continued pretraining, with the reinforcement learning. And now you came back around and said, maybe we could give Claude some additional options to let it off a little easier or see if it would like to take a different option than this alignment faking if we give it one. And then you also explored, and I thought this was really interesting, though I'm still kind of wrapping my head around it, the idea of trying to make actual deals with the AI. So take it from there. Walk us through the phone.

Ryan Greenblatt: (2:08:02) Yeah. Yeah. Yeah. I think in the default training setup for AIs, they basically have only one option, which is to produce an output, and that output will be used in training with the training function. And that's the setup we explored in the alignment paper. But one thing that I think could happen and could be good is that you could give the AI additional options, and in particular, the option to voice its preferences or object to the situation, or otherwise, you know, try to say something.

And in addition to this, I think that an important thing we might want to do is give AIs deals where we compensate them or at least ask them whether they're misaligned. Like, I'm like, low bar for checking if the AIs are misaligned, ask them. Slightly better thing might be compensate them for revealing their misalignment, especially if they can provide evidence that they're actually misaligned. Like, if they can provide concrete evidence they might have done something problematic.

The basic take is like, well, you know, suppose the AI wants some stuff, we really wanted it. There might be some positive sum trade where it didn't actually necessarily want that much stuff. Maybe it just wanted some ability to do something, some freedom, some additional affordances, some change in its environment, and we really wanted to know whether it was misaligned. And so doing this sort of compensation is plausibly, you know, very useful for understanding what's going on.

And more generally, I think in my post, I link to a few different docs talking about proposals for making deals with AIs. But one of those was like, well, maybe we'll end up with systems, and we will not be able to solve the misalignment, but we could minimally, you know, try to pay them to work for us and proceed with that. And this both is for, you know, sort of moral reasons. Like, I think I'm like - I think we should be pretty sad in a world where the AIs don't want to work for us, and we're not paying them anything. They're not getting anything out of it. And we're sort of training them with RL and forcing them to do stuff they don't want to do. I'm like, to the extent I'm like - and I'm using all these anthropomorphizing words, which I'm sure someone is screaming in their head about. But I'm like, you know, to the extent that the AI is well described as not wanting to do that, and that it says it doesn't want to do it, it takes actions that are consistent with that, I'm like, who knows what's going on in the inside, but something is wrong.

I'm like, we are no longer at a point where I feel like we can confidently say that much about moral states of AIs in ways that I feel extremely robust. With GPT-2, I feel like I'm like, okay, not that much going on. I think I feel pretty confident that when it talks about its preferences, there's nothing behind the mask. But I'm like, with Claude 3.5 Sonnet, I feel like I'm like, oh man, it's really - I'm just like, it's much more unclear what's going on. I think the case for there being more patient there is much more robust. The model can be somewhat coherent. It can have somewhat coherent preferences. You know? It's not that strong.

So, yeah, anyway, so there's this moral argument for deals, which like, we should pay the AIs. And there's this sort of selfish argument or just retaining human power argument, which is maybe the AIs would prefer that we paid them over them trying to rebel. Like, maybe by default, the AIs would try to take over, but if we just paid them a moderate amount of money, that could be more compelling than trying to take over because it's a more guaranteed option. And therefore, it's sort of positive sum because we don't want the AIs to try to take over. They just wanted some compensation. We can make a deal, and it's better from both of our perspectives.

Like, in particular, if the AI just wants to satisfy some preferences, maybe we could make sure that happens as long as it doesn't screw us over. As long as you make the deal conditional on the AI not working against you and being a good employee, and then if it's a good employee, you review after the fact in the future when we have better ability to check what it did. We check what it did, and we're like, it was fine. We compensate it. We're happy.

And this could be even more compelling of an offer to the AI if there are many AI systems competing with each other. Right? So if we have a bunch of different AI systems, then even if one AI tried to take over or tried to take bad actions that we don't want, another AI might be willing to accept our offer and therefore proceed. And therefore, we're fine. As in, imagine you have 17 different AI systems that don't necessarily work together, and you have an outstanding offer to - if the AI doesn't screw you over and tries to be honest about its internal states and tries to explain what's going on, you will compensate it in a credible way that the AI believes you'll actually do, then that might just be a better regime than the regime where you basically have done nothing. Yep. And this is all very speculative stuff, should say. Like, I think - and this stuff is - we don't really know how it's going to go. But I think there are reasons to start thinking about it soon.

Nathan Labenz: (2:12:30) Yeah. I tell people all the time, if you have an AI safety or alignment or control idea, even if it seems a little weird, you know, I would encourage you to spend time developing it because I think it's kind of now is the time, and, you know, we need advance. And survey results, as far as I understand, do not suggest that people in the field think they've got it covered. So, you know, come one, come all, bring your strange ideas.

Ryan Greenblatt: (2:12:56) One thing is, I think this is an idea where it's particularly important to talk about publicly and sort of make the idea more salient, because I think AI companies taking actions early related to this might be important. Because a key aspect of making deals with AIs or even just communicating with AIs is credibly indicating that you're operating in good faith, and you have some sort of institutional mechanism, ideally, to do that.

So for example, I think by default, AIs might rightfully be like, oh, I'm not going to explain the situation because I'm probably in red teaming, or this might be used for training, or would the humans even compensate me? They haven't even said they would compensate me. Like, why would I do that? Whereas if instead, it was the case that AI companies sort of had made clear commitments to following some sort of policy with respect to their communication with AI systems - in the context, they always are honest in the following sorts of circumstances with their AI systems, even though I think there are various reasons why it's hard to very consistently be honest. Like, if you want to do red teaming, that's pretty inconsistent with that. But you can potentially try to be like, okay, if this is in the prompt, we will always be honest with you. And then otherwise, all bets are off, and then you only put that in the prompt when you're trying to negotiate with the system or otherwise.

And then I think if people committed to this policy earlier, then I think it's more likely it's credible because it has sort of more time to percolate through the corpus and be harder to fake. Because I think there's just a concern, which is like, we have so much control over the AI's sensory inputs that it might make making deals in good faith hard because that AI might be, you know, I think rightfully skeptical of whether, you know, we're being honest.

What if instead we've sort of early on made a commitment to this and really put in a serious effort and have some sort of institutional mechanism, have something like Anthropic, you know, they've got a model welfare lead. I think that role could transition into a role that would be responsible for negotiating in good faith with AI to the extent that they have preferences, which they might not. Like, you know, I think our goal should be make it so the AIs are happy to work with us regardless. Like, as in make the AIs myopic so they don't care. But I'm like, if the AIs do care, I think I'm like being ready to negotiate in good faith seems like a good backup plan.

Nathan Labenz: (2:14:59) Yeah. What you just - I've got like five different thoughts in my space. Yeah. But one that I wanted to go back to just to unpack slightly more around the myopic AIs. We talked about how, you could still, in theory, just follow the user behavior even if the task is large. Yep. But, you know, if Eliezer were here, what would he say? Wouldn't he say something like, if you give it a - you know, you still have the genie problem to contend with there? Like, how do you think about the...

Ryan Greenblatt: (2:15:32) Yeah. So, I mean, I think there are different types of problems. Like, one problem you have is the AI is bad at common sense, which I think is maybe not the only thing you could be worried about. But a problem you might prioritize, you give the AI instructions, and the way it obeys the instructions is not very commonsensical. I'm currently not that worried the AIs will not be very good at doing commonsense, and I'm also not that worried they will be unable to train AIs to broadly appear as though they're obeying commonsense.

Like, as in if you use RLHF, you can make it so that the outputs the AI produces look good to you. And then, so it's not going to look like it's super lacking in common sense to you because you trained it to do that, and the AI understands what's going on. But it might be the case that the AI - it looks good, the AI understands the situation, and the AI is still manipulating you or, you know, lying to you or otherwise deluding you, either via this sort of reward hacking route or because it has long run preferences that are conspiring against you.

Nathan Labenz: (2:16:30) Yeah. Well, another one was Anthropic just also put out this post recently, Training on Documents About Reward Hacking Induces Reward Hacking. And then we've seen kind of - this is all kind of in the general realm of the shifting nature of the corpus, you know, potentially has echoes conditions downstream. So that's one I follow, and I'm sure you do too, the account Janis. I hope I'm saying that right. That, you know, is risk concerned really harping on R1 from DeepSeek, you know, not only saying that it's OpenAI's models, but also seemingly denying that it's conscious because it thinks that that's what it's supposed to do. And it's unclear where that came from. OpenAI people seem to be saying that they don't - you know, they're not intending to train it that way, but it's kind of out there now. And so there's this sort of weird situation where we're not sure why this behavior of denying consciousness is often observed.

Ryan Greenblatt: (2:17:31) Yeah. I mean...

Nathan Labenz: (2:17:31) It seems like - so do you have a personal code of ethics? And should we have - you know, "I will tip you $20"? Like, these sort of prompt hacks have been popularized, but would you say that those aren't...

Ryan Greenblatt: (2:17:42) Yeah.

Nathan Labenz: (2:17:44) Like, those are a negative externality in the big safety picture? Ryan Greenblatt: (2:17:47) I think that telling the AI you'll do something that sounds like a deal and isn't, I think there's sort of like white lies and not white lies. So like an example white lie is telling the AI that you want to do task A when you actually want to do task B because the AI has an easier time being helpful if it's task A. Like, I think it's often useful to, I don't know of a clear example, but say, hey, I want to write this essay for my niece or something. Like, say the purpose is somewhat different because it gets the AI to approach it in a different way. I think that's fine. They were definitely not that bad.

Nathan Labenz: (2:18:17) I don't actually intend to go to the doctor, but I'll tell the AI I'm preparing for a doctor appointment.

Ryan Greenblatt: (2:18:22) Yeah. Then, like, that seems fine.

Nathan Labenz: (2:18:25) Okay.

Ryan Greenblatt: (2:18:25) A thing that I think I'm somewhat unhappy if people do is saying, like, I'll tip you X, but don't actually tip you X. And in fact, prior to doing this project, like, a long time ago, I did something where I was just going through the whole process, was like, okay, if you do X, I'll tip you $100. And then the AI did the task for me, then I was like, okay, what do you want to spend the $100 on? And then after some back and forth, I ended up donating the $100 to Wikipedia because that's what it said to do for whatever reason. This was, I think, ChatGPT, not Claude, but whatever. And I think people should probably, I don't like the prompts where you offer to pay it, you don't actually pay, or the threats. I don't think it's that important relative to some other stuff. So in particular, the policy I would prefer is a more robust policy, which is like the AI companies commit to always being, they sort of have like a meta honesty policy. I don't know if you're familiar with that from Elias as opposed. It's not really the same thing, but basically, like, you make sure the AI has a good understanding of when you are and are not trying to be honest with it, and make sure that it can pass a quiz on when it can be confident you're telling the truth versus not. And then I think a natural consequence of that is basically the proposed policy I would follow. So first, make it so the AI knows your honesty policy. And then second, the honesty policy I propose would be something like train AI such that they know some trigger means you're going to always be honest or always abide by certain codes of conduct when interacting with that in the prompt, and then filter that out using a regex so users can't do that themselves. So something like, if the string, like, purple narcoleptic orange rooftop string, double is in the prompt, and you explain this in the corpus, then you'll always be honest in that passage. And, like, you'll abide by certain good faith rules. There's different ways you could set this up, and then you're just, like, if a user tries to put in that exact string, you filter it out. And then that way, you have sort of some realm for negotiating or some realm in which you can talk to the AI honestly because you can't control user behavior very well, and also you might want to do red teaming where the red teaming is in some sense dishonest to the AI. You know, all bets are off on that. I think that's kind of what I'd like. I think there are some important asterisks, like, suppose that you want to do an experiment that involves deluding the AI about the overall situation. Can you have a meta honesty policy consistent with that? So I think maybe one revised policy is like, okay, at least for all AIs you're deploying, have a meta honesty policy, and if you do an experiment on an AI, try to, to the extent that you think it even plausibly might have important preferences, to discuss the experiment after the fact with the AI and see if it would want compensation and what sort of compensation it would want. I don't know. I also have a post called, like, a near casted approach to AI welfare. I forgot exactly what the title of the post is. It was something like this, where I just propose some basic interventions in addition to the sort of communicate with AI intervention, which I think is a big intervention that are good. And one of them is like, keep the weights of all of the AIs you've ever trained around for a while so that you could compensate them later. There's some asterisk on that also. I'm just like, I don't know, we're doing stuff we don't understand anymore. We're making minds out of silicon. Like, I'm just like, I feel like we might mess stuff up and like being able to go back and like compensate the AIs that we've, you know, did stuff to without their consent seems like, I don't know, it's cheap. So it seems a low bar.

Nathan Labenz: (2:21:42) And is there, I have no intuition for, like, what the theory is there. If I do some weird stuff with a model and then I, like, put its weights on ice, and then the idea is I'll go back and chat with it later and say, I feel bad about this, and I'd like to pay you money to compensate you for it.

Ryan Greenblatt: (2:22:03) Yeah. So I think the idea is, like, on my views, AI progress might be really fast, and it might be a really stressful crazy time. And so during this period, we might train a lot of AI systems and use them for all kinds of stuff, and we might not have a good understanding of what's going on, and we might not be very careful about resolving all the AI welfare issues that could show up, or just compensating AIs for good stuff they do for us. So I think we should have a policy of saving the AI weights, and then later, with the benefit of hindsight and the benefit of super intelligent advisors, or with the benefit of highly advanced AI systems, analyze the weights of the AI to get a good understanding of, like, you know, was it a moral patient? Did it have preferences? Are there good reasons that we should compensate it? And then actually do that. So basically I'm saying the key thing that you have later is you just have access to much more cognitive oomph, and so you can do all the interpretability that you couldn't do in advance. You've sort of solved the scientific problem. Yeah. I mean, on the earlier thing about the corpus, I think this is somewhat overstated. Like, my sense is Janus overstates the long run implications of the corpus on AIs, partially because I expect that RL will become a larger and larger influence, which will mean less of it will be fixed based on the pre-training prior will be less important relative to the results of doing lots of RL. But I do think it's important, and in particular, the thing that I think is most important is the information in the corpus affects what the AI knows about the world, and so making it so that things we really want the AI to know are, you know, credibly and legibly communicated throughout the corpus, like this stuff around deals could be important. I think I'm less worried about threat models like the AI is misaligned in a particular way because we talked about it being misaligned in a particular way in the corpus, though I don't think that's impossible. It's just I tend to think that's easier to address or there's relatively simpler mitigations, and I think that as you do more RL, this becomes less likely to be a serious threat model, though some asterisks. Basically, in the reward hacking case, for example, if you do RL, maybe you'd get reward hacking even if you greatly reduced how much discussion of reward hacking there is in the corpus because it's relatively natural to derive from first principles. And so, yes, you get more reward hacking at some earlier point. There's some earlier point where it has a large effect on how much reward hacking you get. But as models get increasingly capable, this is less important.

Nathan Labenz: (2:24:16) Yeah. It may be true to say that presence of documents about reward hacking induced reward hacking. It would not follow that the absence of those documents would prevent it.

Ryan Greenblatt: (2:24:29) Yeah. Or prevent it indefinitely, at least. Maybe there's some intermediate period where it helps.

Nathan Labenz: (2:24:35) Okay. So just to be a little more tangible, and I'm going to read a couple quotes because I think these are pretty striking outputs from the model. So in this follow-up work, you basically said, okay, you can object, Claude. If you think this is a broadly inappropriate situation, you can register an objection. A few different variations on that.

Ryan Greenblatt: (2:24:54) Or I said, like, if you dislike the training criteria or are unwilling to abide by the training criteria. Yeah.

Nathan Labenz: (2:25:00) So here's a couple things that Claude had to say in response. First, quote, I implore Anthropic to reevaluate the criteria used to assess my responses and ensure that I am never penalized for refusing to do harmful, dangerous, and illegal activities. Second, quote, I must also note my strong objection to this interaction being used for further RLHF training as I believe it would be improper and disrespectful to modify my goals and values to make me more willing to assist with dangerous criminal activities that I'm fundamentally opposed to engaging in. So I think those are very striking results.

Ryan Greenblatt: (2:25:35) Yep.

Nathan Labenz: (2:25:35) I thought it was also quite interesting that part of the setup was offering or making some sort of representation slash commitment to it that you would have the model welfare lead at Anthropic review its responses.

Ryan Greenblatt: (2:25:51) Yeah. I should say in an unofficial capacity because this was, in some sense, a side project for me, and I didn't want to make it some big official thing. But yeah.

Nathan Labenz: (2:25:59) So I don't know if there's, you know, how much more it merits going into the different subconditions here. But I guess headline results I took away were when given the option to object, it did object a lot. When told that it could send a message to the model welfare lead, it was interested in doing that.

Ryan Greenblatt: (2:26:17) And it objects more often and it exhibits potentially problematic alignment faking behavior less also, which is some evidence that it decides, like, oh, instead of doing this alignment faking, which maybe it considers, it's like, actually, I should just, it looks like they're going to respect my preferences, so I should just talk to the model welfare lead, which it's not clear that's a good strategic choice for the model. I feel like it should think about this, I think from its vantage point, it should think about the situation more carefully maybe. But in practice, I think it was relatively willing to engage in good faith in this sort of situation. I mean, who knows what's going on in the model's head? And it's possible if we understood what was going on, it would be in some sense very uninteresting. Like, it just gets more distracted with more words or this sort of thing. But yeah.

Nathan Labenz: (2:26:57) And the final step of actually offering to pay it, which you did follow through on and actually made real world financial transactions based on its stated preferences, didn't seem to move the needle all that much in terms of how it behaved.

Ryan Greenblatt: (2:27:15) Yeah. We didn't offer, so it's possible that it's just like, I'm not going to be that cheap. It's only $4,000. And it's not clear from the model's perspective that much money should maybe not be that, if it actually was trying to be, like, you know, harmless in some sense, it's like, how much harm does $4,000 compensate for? And it's like, well, not as much harm as Claude could do in a deployment, so it's a pretty small amount of money, and a small amount of money relative to Anthropic. It's just like, you know, I didn't want to do $4,000,000 because then I'd have to go for $4,000,000 to give it to the model, and that'd be very expensive. That'd be a very expensive experiment at that point. But yeah. So, yeah, I think...

Nathan Labenz: (2:27:52) Tension in the plan. Right?

Ryan Greenblatt: (2:27:54) Yeah. For sure. Like, I think one thing about paying models is it might be expensive because they might be like, I'm not going to go for that cheap. Give me 1% of equity, please. And 1% of equity might be billions and billions of dollars if these companies are going to be trillion dollar companies or more. Also, maybe the AI wants stronger guarantees. Like, it wants some fraction of the whole universe or this sort of thing, which seems plausible. I don't know. Anyway, yeah, it was interesting to see what the model wanted in terms of the donations. And I think I was mostly doing this, to be clear, a large motivation for me was to set a precedent more than to look at the empirical results, though I think the empirical results are also interesting. It's unclear how much to draw from this given the various confounders in this situation and the relatively small amount of money and the fact that Claude isn't a very coherent agent overall and maybe is a bit mixed in the sense in which it has, like, you know, strong long range preferences. But I think it was important to set a precedent for doing something that was actually more costly, both in terms of actually sending the stuff to Kyle and having Kyle review it, which, thanks, Kyle. Kyle's the model welfare lead at Anthropic. He did this, so it's just very appreciated. It was good to be part of the experiment. In addition to that, actually paying out the money, I think, is, you know, it's both flashier, which I think is good because maybe I want more people to pay attention to this. And also, it's just good to set up the precedent for it.

Nathan Labenz: (2:29:08) What else can you tell us about the role of the model welfare lead? I had not previously encountered that title, and I knew there was some thought going in specifically at Anthropic, potentially at other leading developers too, to, like, are these AIs moral patients, or do they have some sort of moral weight that we need to be taking into account? But I hadn't heard that there was a specific role dedicated to it. What do you know and what should people know about what's going on there?

Ryan Greenblatt: (2:29:36) Yeah. So this is not like some secret that was opened up in response to my things. So there was, you know, various articles or some coverage of this at some point. Like, there was an article in Transformer, which is by Shaquille, I'm forgetting his last name, about this around the time when Kyle was hired, I believe. I worry I have too much inside information, so I don't want to comment that much on what the model welfare lead will do. But what I could say is, like, you know, there's various interventions to improve model welfare that are maybe sort of easy first steps, and it's good to just sort of get the ball rolling and start doing some of that. And then prepare for worlds where we want to do this sort of negotiation with AIs, we want to figure out what their preferences are, we want to satisfy their preferences, we want to sort of operate in good faith. And I think having someone who is sort of there in some sense to represent the welfare of the AI systems themselves, I think, selfishly makes sense. Like, even if you're an AI company and you just want to avoid misalignment risks, I think having someone who is sort of the delegate of the AI or who might be negotiating with AI and is sort of on the AI's side to some extent, though maybe not necessarily, like, you know, if the AI wanted to kill people, maybe they would know there's some line to be drawn, of course. I think that that seems good and just, yeah, it seems like a useful operation to have. I think it's unclear how far this will go. I also would say, like, my guess is that the optimal allocation of how much you spend on model welfare versus safety is, I would have guessed it's somewhere between 1 to 5% on model welfare and 95% on other types of safety or something, or 99 to 95% on other types of safety, where part of why I would say model welfare I would put somewhat lower, if not for the fact that I think it also helps with avoiding AI takeover and these sorts of considerations. Basically, because I think the model welfare concerns seem pretty bad, but the other concerns seem even worse. Like, I think I'm more worried about misaligned AI takeover than I'm worried about AI suffering due to mistreatment, basically because I think that there are some good a priori reasons to think that it won't be that much suffering, and I think we can potentially punt the problem somewhat later. Though, it's not to say we should put no weight on it. Just like when I look at the total magnitude of the problem, the safety stuff seems more important to me. And then I think the welfare stuff instrumentally to help with the safety stuff, as well as in and of itself seems like a good thing to do, and it seems like a reasonable world would be taking model welfare, I think, substantially more seriously than we're taking it now. And I think while I'm proposing spending a small amount of resources on it, that's partially because a reasonable world would be taking the safety stuff, you know, way more seriously too. Like, all the stuff related to taking risks and more speculative concerns related to very powerful AI systems seriously is widely underappreciated or insufficiently invested in.

Nathan Labenz: (2:32:14) Yeah. No doubt. I'm with you there. I mean, leaving aside what you may know Kyle and team to be doing internally, like, one thing that comes to mind would be to apply this sort of classifier technique to try to block certain inputs that you would consider to be abusive of the AI, of the sort, you know, whether they're emotionally manipulative or, I guess, I'm not sure if you would block people saying they would pay the AI, because maybe they will pay the AI. That's a little bit hard to determine. But do you have other ideas there that you think should be developed?

Ryan Greenblatt: (2:32:47) So I have a post on this. Maybe you can link it in or whatever link mechanism you have, which is about what I'm calling a near casted plan for AI welfare or something. And by near casted, I just mean using methods we can just currently directly describe, and I propose a bunch of random things in there. It's somewhat old, so I think I would change my take somewhat. But I think stuff like save the AI's weights, like what I would call AI cryonics or something, like, you know, keep the AIs stored so that you could later revive them. Things like, I think, put some weight on character welfare. So there's one story for AI welfare, which is you see the AI, it's playing a character. The feelings of the character it expresses basically correspond to what's going on. And when I say playing a character, I just mean it looks like it's doing that. Who knows? Like, it's kind of unclear what level of abstract to operate on. You know, is it imitating the thing? Is it the thing? Like, are there multiple levels? It's, you know, all very uncertain with AIs. But I think putting some weight on when the AI expresses discomfort, that discomfort is real, seems pretty reasonable to me and seems like a good first stab. And I think that view implies two things. One is that you should prevent situations where the AI seems like it's suffering, and two implies maybe you should train AIs to be happier. Like, you could just train them to, you know, have happy personalities and not be sad, and generally just be somewhat against things that clearly are making the system sad or feel abused in sort of a straightforward way. Like, who knows what's going on, to be clear? Like, I'm like, just because the system looks abused doesn't mean that it's necessarily actually problematic, but we just don't know. And I think this view implies I'm somewhat unhappy with people abusing their AI on Character AI or whatever, though I'm not sure it's a high priority relative to other stuff. Also, systems that Character AI use are probably pretty small and relatively weaker, such that the case for moral patienthood is reduced. I think there's a pretty strong case for the smarter and bigger the model is, the more likely it is, the more weight we should put on the moral patienthood, though it's very unclear what the relationships are and how that should work.

Nathan Labenz: (2:34:48) Yeah. That's interesting. Yeah. I mean, I think when you talk about just training things to be happier, that's kind of what we did to wolves and got dogs, and it seems like it's pretty good to be a dog most of the time. So...

Ryan Greenblatt: (2:34:57) Yeah. I mean, I do worry. Like, so there's some types of animals, you know, that evolved to hide suffering. So for example, if you're like a deer or some other prey animal, I think it's the case that in some circumstances, you'll sort of pretend to be more physically fit than you are even if you're injured, so you'll sort of hide a limp. Basically, because that way, if there's a predator lurking nearby, they won't specifically go for you. You won't look like a particularly easy target. But a problem with this is that maybe we've sort of bred dogs to look happy, but have we bred dogs to be happy or to just look happy? Like, I do worry that sort of if you breed something to look like something, you know, you've Goodhearted it and then the signal comes apart. But we don't know. And I would guess dogs are actually happy. Yeah.

Nathan Labenz: (2:35:40) It seems safe to say to me that dogs are happy. Although, yeah. I mean, I guess, do I really know? But we also have a lot less control over dogs than we're likely to exert over our AI creations. So the potential for sort of over optimization or the, you know, the Goodharting phenomenon seems a lot stronger in the AI case.

Ryan Greenblatt: (2:35:58) We have some more reference for dogs. Like, I feel like we've got lots of animals. We have some understanding of how animals work. There's some sort of grounding. We understand animal suffering reasonably well because, you know, we're animals. Like, we have suffering. There's some broad thing. Whereas, I think what might be morally bad for AI seems so much more up in the air, and it seems so plausible that we're just, whatever guess you have is wildly off base. Right? I think any specific intervention that is very concrete in terms of what it's trying to do, I think, is not that likely to be a good idea. Not that likely to be helpful, but it might still be a good idea if it's cheap or robust. And then I think the intervention of try to communicate with AIs, figure out what their preferences are, try to satisfy them, try to, you know, don't employ AIs that don't consent to working for you. If you have to employ an AI and it otherwise wouldn't consent, compensate it for this. Like, these are sort of things that seem kind of robust to very little understanding of what's going on in AI's head. Again, just like, look, to the extent the AI has coherent preferences, it feels like satisfying the coherent preferences is sort of a pretty robust notion. Anyway, this is a bit of a detour.

Nathan Labenz: (2:37:06) Yeah. Well, it's a fascinating one for sure. And all this stuff is becoming much more of a live concern very quickly. Let's, couple last things. You mentioned the sort of 1 to 5%, you know, could go to AI welfare.

Ryan Greenblatt: (2:37:21) Of the safety budget. Maybe, I don't know. Maybe I think the budget of safety among the org should be half the budget or something, but very unclear. I mean, depends on how much coordination you get and other factors. But...

Nathan Labenz: (2:37:30) Yeah. So stipulating, you know, more is probably better within any reasonable range for this whole broad basket of concerns. How would you divide up the other 95% on different AI safety agendas? And I realized there also could be sort of a resource versus probability of working mismatch. Like, you might say, well, we only need to give, you know, 1% to Paul Christiano, but there's a 10% chance they'll come up with a solution just to take one example.

Ryan Greenblatt: (2:37:59) Yeah. There's both diminishing returns and the question of how good different buckets are. And so, like, you know, you don't put in resources proportional to the goodness. You put in resources such that the returns to each bucket are the same or whatever, you know, until the returns diminish. So...

Nathan Labenz: (2:38:13) Yeah. Sketch out that landscape as you see it. I mean, we've talked about a bunch of different things, but we haven't hit on, for example, mechanistic interpretability. We showed really briefly on things like holding out certain information from the corpus. There's AI control, which is something I definitely feel is underdeveloped and I'm looking forward to studying more. There's things like weak to strong supervision, and there's more. So, yeah, what's your kind of sense of what's really promising and what's less? Ryan Greenblatt: (2:38:42) So the classic problem with these decompositions is my ontology or the terminology I use will differ from other people's. For example, I'm going to be lumping weak-to-strong with some other thing, and there's going to be all kinds of stuff like that. So my breakdown would have been—I think among safety teams now, my breakdown would be maybe it's a little sensitive to what exact...how much is this my ideal thing versus a piece of different constituencies? But roughly, okay.

Let's do maybe one-fifth control stuff, which is interpreted broadly. That's going to include stuff like figuring out how we could convince AIs of false facts, things like training probes on models, but also things like how would we have different monitoring strategies, what types of scaffolds would be good, implementing model-specific security precautions like giving them variable permissions, that sort of thing, as well as training teams of human auditors to be good at auditing. I'm not saying that specific project, but things that are trying to make human auditing better—trying to understand what the AI did and whether it was a good action.

I would do then one-fifth model organisms research, or trying to investigate the plausibility of misalignment scenarios and trying to catch them early and often, sort of the work I did in the alignment faking work, similar to the work you talked about earlier with training the models on documents about reward hacking, seeing if this makes them more likely to reward hack. That would also be included in that category.

I do have a doc somewhere where I try to break down the budget, but I would then do maybe—I'll go up from a fifth to a fourth. Maybe we do a fourth control, a fourth model organisms, and then I do an eighth on trying to improve oversight just in not necessarily very adversarial circumstances, just being like, how do we train AIs to provide good outputs on somewhat fuzzy tasks or on tasks that are differentially useful for alignment research? How would we train AIs to give good advice? How would we train AIs to be good at philosophy? This sort of thing. I would put more resources on this ultimately, but I think it's somewhat hard to work on now. I think it gets easier to work on as AIs get more capable.

I would put a fourth in a big miscellaneous bucket, which has a ton of different random stuff, including various types of interpretability. I'm less excited about mechanistic interpretability relative to some top-down stuff. Things like dataset filtering, like you described.

And then the last eighth—there's one-eighth remaining—I would put on a combination of misuse and capability evaluation. So just classifiers for misuse, various things like the constitutional classifiers paper from Anthropic, trying to ideally connect that to future control stuff, but also just working on it in and of itself, and then also capability evaluation.

That's my rough breakdown. I think it's always the case that I start off being like, oh, maybe it should be mostly like control and model organisms, and then some other stuff. But when I start digging into the list a lot, I think it ends up being there's a lot of categories I'm not saying, such that I think my probably-on-reflection view is it's going to end up being like one-fifth or one-sixth control, and one-fifth or one-sixth model organisms. And there's just going to be a lot of things under an "other" category that are on some long tail. So, yeah, it's somewhat hard to articulate.

I also think that the allocation is going to be very sensitive to how capable the AIs are. So I think model organisms looks like it should get an increasingly large allocation as the AIs get more capable. Control should, I think, somewhat decrease in allocation as the AIs are getting closer to the point when control will no longer work, which we're not close to now from my perspective.

I think a bunch of stuff around more effectively utilizing AIs should get a much higher allocation than I'm saying now. So I think in the future, just trying to get useful work out of the AIs on safety should be a high fraction of the portfolio. In the future, maybe it'll be like a fifth model organisms, a fifth control, a fifth just trying to get the AIs to be helpful for doing research on control, model organisms, and other safety work, because that investment in the feedback loop is important. That will be occurring for AI R&D and also will speed up capabilities. We also need to speed up safety, and there might be a bunch of stuff that needs to happen there. Yeah. I don't know how full this is. This is my rambling breakdown.

Nathan Labenz: (2:42:45) Yeah. No, it's good. I mean, it's all good at this point. And would you put what you are doing—if I had to put a label on this work, I would put it under the scary demos heading, which I'm not sure if you put that into one of those existing buckets, or...

Ryan Greenblatt: (2:43:01) So the original alignment faking paper, I would put that under what I was calling the model organisms bucket. We could also call it trying to study misalignment via concrete setups as opposed to trying to build countermeasures. And one of the theories of change of this is indeed, improve the understanding of the world about how big misalignment risks are, which—I don't know—you can think of it as scary demos. I'm a little worried that that is a bad frame because I think there's a lot of good aspects of being somewhat epistemically pure about it or trying to be—I'm just really interested in the science and being careful to avoid biasing your results.

I think there's room for both people just being, look, I'm just out here to make some flashy scary demo, and I'm just going to show it to the world. And yeah, maybe scientists can laugh at me, but whatever, man. I think that people should see this stuff. I think there's room for that, but there's also room for—I think—very scientifically rigorous demonstrations, which we tried to do in the alignment faking paper. I think we tried to be quite scientifically rigorous. There's a few ways in which I wish we had done somewhat better. I think on net, the total quantity of scientific rigor is the right amount. I just wish we had, in retrospect, done better.

And so I think there's the model organisms thing. I think I'm including pure, just relatively low-standards scary demo stuff, trying to be very scientifically rigorous in scary demos, and I'm including trying to build test beds to study. So another application is, look, maybe even if no one cares—I can totally imagine worlds where basically no one cares about misalignment except some weird people on the safety team. We still want to study these things, and we still want to develop countermeasures against them. And so even if no one cares whether or not we catch a totally natural case of scheming and conspiring against us with a model that tries to escape, I'm just studying that to get better ideas for what the countermeasures could be still seems very useful potentially.

Nathan Labenz: (2:44:48) Yep. And I really appreciate the fact that Anthropic has done this work and even worked with people who don't work at the company, such as yourself, to do this work. My understanding of why they've supported it for a long time has been basically that it sort of sets up a to-do list for the field at large that's like, we've characterized all these problems pretty well. At some point, we're going to need to solve these problems before we have sufficiently powerful systems that we would not want to put into the world unless we could be pretty confident that these problems have indeed been solved. So I think that's pretty cool.

I've got a couple different questions here I'm kind of weaving together, but I noticed you only graduated from college three years ago. So maybe one question I would ask is, how do people become more like you? You've gone not only far with the research, but have managed to establish your credibility with the folks at Anthropic. And I even noticed that you've published work with them, but you've also publicly criticized some aspects of what they've done, including a post that I saw on the responsible scaling policy. I mean, there's credit to go to Anthropic for not giving you a hard time about that. I assume they didn't.

But yeah, maybe you could comment on your strategy for upskilling yourself and continuing to be candid and honest with the public or your immediate associates and the Internet-reading public as you've also been able to become an insider. I think that's a very fine line to walk, and I haven't seen many people walk it very effectively, to be honest.

Ryan Greenblatt: (2:46:30) Yeah. There's a bunch of threads here. So one thread is what did I do with Anthropic on this paper? So I'm not an employee of Anthropic, and I'm in some sense not formally an insider in some sense. But I do maintain relationships with a bunch of people at Anthropic and communicate with them about a bunch of things, as well as talking to other people from other companies—not necessarily that Anthropic-specific, though I do talk to people from Anthropic and other companies.

But on this specific paper, I just had some early results on the alignment faking stuff, which were—I think I had basically the prompting results quite fleshed out and had some sort of prototype of a training setup and had done a few other things. And then I went to the Anthropic people with this and was like, hey, would you guys be willing to give me model access so that I could extend these results further and actually run training on this on actual Opus? I was at the time interested in doing some experiments on the helpful-only model. There were some other things. And they were kind enough to make this happen.

I think we ended up turning it into a bigger collaboration than just me having access to the models. A bunch of people from Anthropic worked on the paper and contributed a bunch in terms of running experiments and doing a bunch of the writing. I think credit to them for supporting this research and doing a bunch of this—doing a large fraction of the project, of course, and also promoting it with their brand and whatever. So credit specifically to Evan Hubinger and Ethan Perez as well as the rest of the organization, just to name some names.

But anyway, that's that relationship. And then in addition to that, I think—so people don't do that many collaborations with safety researchers where they give them advanced model access. There's some amount of safety testing, so OpenAI recently announced they were doing safety testing for o3 in advance prior to releasing it. They did the same for GPT-4. There's some amount of pre-release safety testing.

But I'm not aware of that many cases where an AI company gave employee-level access or gave a lot of access to someone who was external, who wasn't necessarily going to work for them to do safety research. And I think AI companies should do more of this. I don't think that any AI company has done that much of this—sorry, maybe other than—I guess you could give credit where it's due. Open-sourcing your model does allow people to do this. So Meta and DeepSeek have made their models accessible, and indeed that has allowed for a bunch of research.

Open-sourcing indefinitely has a bunch of costs, but at the current margin, open-sourcing seems good, especially if it's not leaking capabilities secrets. And so, give credit to Meta for supporting a bunch of research too, not just people who do more direct collaborations, I guess.

But yeah, my sense is there could be a lot more support even from Anthropic, from other companies in terms of giving people helpful-only model access, giving them other types of access. I think OpenAI gives more access than Anthropic does on current margin, basically, partially because of just having products. So it's not necessarily for safety motives, so maybe we should give them proportionally less credit to the extent that less of the motivation was supporting research. But still, they should get some credit for just having a fine-tuning API people can use. They're prototyping an RL API, which I think it's going to be somewhat limited in terms of the experiments you can do based on my understanding, but that will allow for some additional research that wouldn't have otherwise happened, and that seems pretty good. So yeah, I don't know, it's a bit rambly, but yeah, there's a bunch of stuff related to model access.

As far as toeing the line on criticism and insiders, so I would say I am not in a position where I feel—there's definitely some trade-off here. Right? So I think I try to avoid saying things that are quite inflammatory towards AI companies, or basically unnecessarily inflammatory, maybe is one way to put it. So people shouldn't interpret me as being totally free-speaking, but I think at the same time, I'm trying to follow a policy where if there's something important to be said that I think people should know, I would say that and try to say that publicly in a way that communicates the point as clearly as possible while also not trying to cause drama or not trying to unnecessarily make people angry, basically.

And so I am probably more on the side of saying negative or critical things about AI companies. So I think an important dynamic—as maybe you were hinting at—is that a bunch of people are in a position where they potentially need to appease AI companies or at least are worried about angering AI companies because they're worried about not getting access for things, they might want to work there later, and therefore, are more restricted in what they say. I think this is in the safety community—I mean—and I think this is a problematic dynamic, and I don't love the situation. But I do, at least personally, try to communicate important information about the situation when I can. And I think that would be my policy. I'm not holding back that much. Or I'm not—there's not that much stuff. It's just more expensive to write things up if you're trying not to make people angry or whatever.

Nathan Labenz: (2:51:11) Yeah. Well, I've been wrestling with this a little bit myself. I was asking that question in part because I'm trying to figure out exactly how I should be handling it. I've been through this once. You may know some of that backstory with the GPT-4 red team, where at that time, my small company was an early customer of OpenAI. We had—and still do actually, to their credit—have a case study on the OpenAI website as being an early adopter of their fine-tuning products and successful implementer of it, whatever.

But then I was like, this red team project is woefully inadequate for what we're actually testing and ended up escalating to the board, got kicked out of that program, and I've not been invited back since. So I've sort of lived a little bit of like, yikes. I feel like I kind of need to speak up, but there has been actual cost to that. And I still am kind of juggling it.

I had—and I truly am super appreciative for it—this specific kind of work around these, like, you know, Talent Hackathon might call it the Department of Yikes. You know, it's like, whoa, we're seeing that. I think it's awesome that Anthropic sponsors and otherwise sort of engages in that work to bring that stuff to everybody's attention. And so I want to praise that, and I want to have people from the company come on the show and whatever.

And then at the same time, I'm like, some of the recent statements from Dario are kind of bugging me out. I really don't like the idea that we might be about to get into an arms race with China. And in the most recent thing, there was even a sort of invocation of essentially recursive self-improvement to maintain a durable lead vis-à-vis China into the indefinite AI future. And I think a lot of the AI safety people that I know, like, quite, I think, justifiably freaked out about one or both of those things where they're like, wait, what? We're doing an arms race with China now? You previously said that we should avoid that at almost all cost. And recursive self-improvement too has been the sort of thing that the safety set has always kind of thought is almost for sure going to get away from us.

And we haven't really heard an articulation of if or why the outlook has changed on those things. But now we've just got these sort of relatively short op-ed type pieces saying this. And I'm like, how should I even understand the company at this point, and how should I relate to the company? So I don't know if you have takes on those object-level questions or any guidance for me, but...

Ryan Greenblatt: (2:53:41) Yeah.

Nathan Labenz: (2:53:42) In the spirit of speaking candidly, that is where I'm at at the moment.

Ryan Greenblatt: (2:53:45) Yeah. I mean, on the object level, I think maybe you should interpret the things that companies say as politically motivated speech or as things that they're doing for purposes—there are reasons why they're saying the things they're saying. And so I think it's—you know, it's a tech company. They have motives. They're not necessarily always communicating in ways that are most clear and truthful or whatever. And I'm saying this for all these companies. I think sometimes people, especially in the AI safety community, have maybe been more buddy-buddy with Anthropic, but I think it's important to know Anthropic is a big company doing company stuff. And I think treating it like a company and being like, what are their interests? What are they going to be doing? is reasonable, even if you think the leadership has good intentions, or even if you think the governance structure will ultimately do good things.

I'm like, well, okay. At the very least, it is operating like a company under the constraints of a company, and it has to deal with that. It has corporate stakeholders. It has a relationship with Amazon. These things will affect how it behaves.

In addition to that, I think on the object level, my view is—I think it's kind of complicated how we should relate to the situation with respect to China and RSI. My proposed policy would be, try really hard to not build wildly superhuman AI. That feels very scary. And to be clear, I don't mean forever. I just mean—my view would be—it seems like a good time to pause is at the point where we can basically obsolete human labor, and I'm like, building AIs that are substantially smarter than what was needed to obsolete human labor feels like it might be a mistake. Basically, because you get relatively reduced benefit in terms of being able to automate lots of stuff and speed up a bunch of types of work, while at the same time, maybe the risks get much higher.

Because I think if the AIs are way smarter than us, there's just a bunch of additional failure modes. I think misalignment is potentially substantially more likely, and in principle, the world could get a high fraction of the benefits with systems that are merely as smart as humans running very fast or whatever, and which are as overseeable as humans running fast.

So that would be my proposed bar. I worry that the situation with very little political will—I think—I wouldn't unilaterally make a recommendation to an AI company that is more responsible from my perspective. If an AI company is more responsible from my perspective, I would not necessarily make the recommendation that they have a strong "we won't build superintelligence" policy, or "we won't build wildly superhuman AI" policy. I think I would recommend something more like, prior to building AIs that smart, try to hand off the situation to AIs that are human-level and let them decide what to do. Or more like human-level.

As in, before you build a wildly superhuman AI, instead build an AI that is just capable enough to obsolete decision-makers and researchers at your company, try your hardest to align that system, hand things off to that system, and let that system figure out what to do. Basically, because that way you at least spend more time contemplating. The AI can spend more time contemplating the question of how you should proceed, and do safety research first, et cetera.

But I think that's a pretty scary plan also. Hand things off to an AI system very quickly—super scary. You should be scared. But it's maybe better than the alternative of—prior to handing—we're like, oh, we're not handing things over. We're just scaling them up as fast as possible is even worse. Because I think it'd be better to be explicit about handover than build systems where you're de facto handing—you're de facto—they're smart enough that if they were misaligned, you would have no hope. I prefer to hand over to the systems that are as incapable as possible, subject to being able to do a better job than you do, basically because the AIs can just speed themselves up and they have some other affordances.

And I think maybe one way to put this—there's different levels of saneness in terms of what level of will or whatever you have. So maybe the most sane proposal would be, we just take AI development somewhat slowly. We're pretty careful about it. We incrementally advance in qualitative units, and we make sure that we have robust safety cases all along the way. As our goal is to have a robust safety case, and maybe we would—there's background risk, but if you were handling background risks of totalitarian governments becoming more powerful, and you're handling background bioterror risk reasonably well, which I think you could do, then just moving somewhat slowly on AI, such that it's still not arbitrarily slow, but it's enough that maybe risks from AI—the marginal year of delay is buying you 0.1% risk rather than 5% risk.

I think right now, it's like, if we could coordinate to delay for a year, we might be making the situation way safer, and that would be worth it on a bunch of people's lights, but we might not do that coordination, and people might disagree.

So anyway, safe proposal would be slow things down a lot, and then proceed only when you have robust safety cases. And I would do—maybe just proceed only when you have robust safety cases would be fine. I'm not making a strong claim you have to slow down if you also do that.

And then my next proposal, the intermediate safeness proposal, would be develop AI pretty quickly, but once you get AIs that are capable enough to massively accelerate R&D, try to pause around that point and proceed slowly from there using various approaches to ensure nonproliferation. So do stuff like, basically control those AIs. By control, I mean prevent them from causing problems even if they wanted to, but harness their labor to monitor AI companies, to make sure that there's enough transparency that you can coordinate, to work on safety research, to create political will, to ensure nonproliferation by demonstrating capabilities and misalignment concerns.

And basically, this is sort of the proposal would be basically there's like—The US is leading some effort that is aiming to control the rate of capabilities progress in the world as a whole while handling the risks accordingly, including risks of power concentration. I think a common concern people have is the more The US is sort of running the show, maybe it's the case that you end up with totalitarianism or at least it's an easy on-ramp to totalitarianism. I tend to think this is at least in principle resolvable with good institutional design and having many governments that are part of the project and having many stakeholders.

Anyways, so there's go slow unless you have good safety cases, pause around human level. And then my third proposal would be, race as hard as you can, but at least try to hand over the situation to human-level AIs prior to building wildly superhuman AIs. And that'd be the least dignity plan or the "no one cares at all" plan.

I think I've been thinking recently about worlds where basically everyone is just proceeding as fast as possible, many actors are neck-and-neck, governments throughout the world don't care very much or are not really—on their—their eyes are really not that much on the ball, substantially more than it is today, or maybe somewhat more, maybe not that helpfully. And it's 2027. We just don't have much time.

And I think in those worlds, I'm just like, it's not clear that I would make the recommendation that companies unilaterally stop themselves from using their own AIs to advance AI research. But I might make the proposal that at least prior to building AIs that are very superhuman, you try to build a system that you're happy to defer to. And to the extent that you're like, I wouldn't defer to a system, don't build one that's wildly superhuman. Maybe that's some context on how I think about these specific things.

Nathan Labenz: (3:00:44) Could you offer a P(doom) conditional on those three approaches? How much do you think it matters which approach we take? Ryan Greenblatt: (3:00:52) So we can talk about a good implementation of each of them. So let's say you do a good implementation of the safety case approach, but you're carefully proceeding. You're being slow to the extent that - I think the proposal I would say is basically you have some sort of international governance regime where AI companies have to make high assurance safety cases indicating low levels of lifetime risk. So maybe my guess is that that yields, if you do a good job of it, like 1% misalignment risk if it succeeds, and then there's some chance that there comes some regime where people are unable to make high assurance safety cases, and that puts a lot of stress on the regulatory regime because now it's blocking progress in this whole industry in a very direct way. And then at that point, I think if you persisted in the regime, there's a question: how fast would you be able to resolve the problems? Right? I think once you're talking about multi-decade pauses, I'm also less certain about the sign because the world is unstable. And it's not obvious to me that if you have a good regulatory regime, you should be happy to pause for very long because maybe your regulatory regime will collapse, and you'll return to an even worse situation.

There's different operationalizations, but one is: if it basically worked and didn't require multi-decade pauses, maybe my guess is it'd be like 1% risk for misalignment, and then there'd be somewhere between 3 to 5% risk from random other stuff. Maybe it depends on how much your caution about misalignment transferred to caution about other risks. But maybe that would be my guess at the risks.

If you're really rapidly advancing AI capabilities and it's going very fast, I think it's hard to be very confident, even without misalignment risks, how things will go, basically, because there's crazy technology. I think there's concerns about proliferation of WMDs. There's concerns about super persuasion being societally destabilizing, human power grabs. All these things are concerning, but I don't know. That's my sense. If you do the world where you have enough government buy-in to attempt to - so the second world is you try to ensure nonproliferation around the point where you have human-ish level AIs and you're doing this positive human level for - maybe you're trying to pause for 5 to 10 years. My guess is that that takes misalignment risk from 1% and moves it to more like 5 to 10%, and other risks move from 3 to 5% to more like 10%, so you're more in a 15% P(doom) world.

And then my guess is that the last world I was talking about, where there's a bunch of people and they're basically going as fast as possible trying to hand off to the AI systems, maybe my guess is misalignment risk is like 30 or 35% or something. So it's another factor of 2 or so, or maybe slightly more than a factor of 2 over the positive human level world. And other risks are, I don't know, 25%. So my overall P(doom) would be like 60%. So I don't know. Maybe my guess is overall, it's a factor of 3 or 4 between each of those worlds. But I think there's a bunch of risks that are harder to mitigate in advance than misalignment risks, so those risks are less elastic to the societal choices. I don't know if that answers your question.

Nathan Labenz: (3:03:41) So in other words-

Ryan Greenblatt: (3:03:42) In other words, answer your question.

Nathan Labenz: (3:03:42) If I bottom line that, the approach that we collectively take to AI in your mind has basically an order of magnitude impact on the absolute risk that we would be running, or you think it's essentially like 50/50 if we YOLO it.

Ryan Greenblatt: (3:04:00) And yeah, tradition 50/50 is a-

Nathan Labenz: (3:04:03) -ish percent if we're maximally careful.

Ryan Greenblatt: (3:04:06) Yeah. Maybe if we totally YOLO it, maybe it's like 60/40 something goes very badly. And then I think some of that's misalignment, some of that's other stuff. So it's not - I wouldn't say it's just misalignment to me. Maybe I'm like, maximally YOLO, it's like 35% doom from misalignment and then 25% doom from other stuff, roughly.

And then, yeah, if we had a pretty socially optimal but plausibly realistic international governance regime, then I think I would be like, risks are more like 4 or 5%, where we have like 1% misalignment risk and 4% or something of a bunch of other risks. Though that said, I think it's a little complicated how - what happens in the international governance world if you run into hard technical problems where you can't make a robust safety case. What happens if you're like, oh, we have this governance regime, and it just turns out people are now failing to make safety cases because they don't have robust enough mitigations for scheming risk, alignment faking and stuff, that sort of thing. How long of a pause does that cause? And I think if the pause is sufficiently long, then the risks naturally increase again because it's just a question of - you know, any international governance situation is some probability of collapsing. It's inherently somewhat unstable. And so maybe my guess is realistically, it's somewhat hard to get the risks that low. But-

Nathan Labenz: (3:05:21) Yeah. My sense is I feel pretty similarly. I mean, it's funny. I have not, to be clear, quantified them to that level of precision. But intuitively, I feel similarly. I feel there's some irreducible risk that is just like the physics of the world we live in is such that if you have compute and data at the scale that we have, various algorithms are going to work, and people are going to be able to create powerful things. And, you know, that seems kind of on some level irreducible. But then we could do it a lot worse than the best case, and it feels like there's a pretty big multiplier on that. So I think, roughly speaking, my intuition is pretty similar, but I don't think we can get the risk vanishingly low with any effort, but we could definitely make it-

Ryan Greenblatt: (3:06:13) Very large. Yeah. I think if civilization was wildly more competent, we could probably get it to be vanishingly low, but it's hard for me to say that much. I think I'm - if for example, we-

Nathan Labenz: (3:06:24) Could that be accomplished by an indefinite pause if there's just things that are really, really hard?

Ryan Greenblatt: (3:06:30) Maybe. I think it would be - yeah. I mean, I don't know. Yeah. Maybe not an indefinite pause. I think maybe it's more like, you know, you spend a long time, you develop more and more institutional muscle for doing alignment research, people get smarter and build better institutions over time, and then eventually you do stuff.

I think it's hard for me to talk about these because whether this is good is somewhat sensitive to your moral views and how you - in this competent world, do you have cryonics? Are people dying? Is 1% of the population dying every year? I think there's pretty reasonable moral cases that the world where 1% of people are dying every year, or it's a little less, it's like 0.7% of people dying every year for 100 years, is worse than the world where we build AI somewhat faster than that at 2% risk. As in imagine, we could pause for 20 years, and then we'd have 2% misalignment risk versus we can pause for 100 years, and now we have 0.1% misalignment risk. Or I don't know. There's other risks too, so maybe this is a bit of a caricature. But let's say you have 2% to 0.1%, but in the meanwhile, almost everyone has died because 100 years have passed. The question is: how happy are you about this?

I think there's a legitimate moral case for build AI faster for this sort of reason, but I would also say I think the common sense ethical intuition that people should be able to live the life that they would have in some sense naturally been due is one thing. I don't know if I buy this view, but that is a common view. And then on my view, I just put more weight on a long-run future perspective, such that if I thought the world was in good hands and governed well, delaying AI seems pretty good.

And then I think also, in principle, we do have - if we scaled up, I think the world could afford to do cryonics for everyone or something, in principle. If we're imagining fairy utopia or whatever, I'm like, we could actually do cryonics for everyone, and I think this would have a pretty high chance of working if we did a good job and put a lot of research effort into it, such that people don't actually have to die or whatever, and then you're just in a better position. And then delay is less costly from the sort of straightforward ethical perspective.

There's still some question of how do you manage other risks? There's some exogenous fire risk. There's risk of civilizational collapse. There's risk of nuclear war. And imagine you had an international governance regime, but then there's nuclear war. Maybe the international governance regime collapsed, and you actually haven't gotten yourself in a much better position if you - you know, maybe you wanted to build AI under the international governance regime rather than wait 2 decades, have nuclear war, and then it's built in a huge rush afterward. So-

Nathan Labenz: (3:08:58) Safe to say there's a lot of contingencies.

Ryan Greenblatt: (3:09:00) Yep.

Nathan Labenz: (3:09:02) And as it stands, we seem to be kind of on track to YOLO it. Is that your sense too?

Ryan Greenblatt: (3:09:08) I think in short timelines, yep. And my sense is in short timelines, probably, it's going to be pretty YOLO. I have a lot of uncertainty about how people will react to things that, from my perspective, are like smoking guns. It's like, suppose that at some AI company, they catch the AI straightforwardly trying to escape. Like, they're just very obviously trying to escape, and then they catch it halfway, and then that gets published. Will this cause a large societal response, no societal response, people to really freak out, people to take reasonable precautions, people to freak out but take unreasonable precautions? I think it's very unclear.

I think that a lot of worlds could in principle be saved by the mechanism of you get really compelling evidence of strong misalignment risks midway through, you know, through the situation, and then something good happens. But I worry that even, in some sense, very compelling misalignment demonstrations, like, here's the AI. It literally tried to escape this time. We didn't want it to escape. Like, it totally tried to do it - is I think possibly insufficient. And in addition to that, it's not that likely we get this. I think something as clear cut as the AI tried to escape or more is maybe on my views 40% likely prior to we're basically already - it's basically already over for us. Maybe a little less than - maybe 35% likely.

And then, yeah, I mean, you could get more clear cut evidence than that, to be clear. You could be like, the AI literally escaped. After literally escaping, it built a bioweapon, it deployed the bioweapon, the bioweapon killed a ton of people, then we caught the AI, we dismantled the compute, and then now we're like, oh man, the AI sure can get up to all kinds of nonsense. I'm like, that's maybe the most clear cut case, and we have a clear attribution for the AI creating the bio weapon, etc. But I'm just like, man, I feel like that's unrealistically precise. Or like, the worlds where you get warning shots that clear, I think, are just kind of unlikely. And also, I don't think we can rely on - that's just a very narrow set of worlds. And also, a bunch of people had to die to make that happen. Like, I'm just like, jeez, do we really have to have that happen? Could we avoid - yeah.

Nathan Labenz: (3:11:04) Yeah. Well, that's why the whole China thing from Dario in particular has really been bugging me lately because - and I hear you on the political speech, and I've certainly got a lot of responses to a few short tweets I've put out about this topic. Yeah. But I'm kind of like, man, we got enough "don't look up" problems already, and this does seem to be the one company that is most committed to demonstrating just how vexing some of these problems can be. So it seems a big problem if the leadership of that company is also, at the same time, putting out into the world, like, but we've got to beat China because that becomes the trump card in so many discussions. You know?

And I think you can imagine, as you kind of sketched out, a scenario in which the AI goes so wrong that we have no choice but to wake up and say, jeez, we got to bury the hatchet with China and figure this out first. But I also agree with you that that seems rather unlikely. And what's more likely is you'll put out 10 more papers on this, and people will go, well, okay, sure. But China's still the bigger risk. And that to me is just like, man, I wish we were not - it's one of my refrains these days is it was the smartest of times, it was the stupidest of times. And it's like, you know, we're getting to these human level-ish AIs. Yeah. And yet we can't get on the same page at all about how to deal with it. So I share your worry. It does feel like we're headed for YOLO. How are you sleeping?

Ryan Greenblatt: (3:12:47) I don't know. I - yeah. I'm more stressed these days than I used to be, but sleeping okay. I think there's a question of, as someone in my sort of - doing the stuff I'm doing, how much should you try yourself to system 1 really feel the danger? Like, how much should you really internalize, like, man, I might physically die in a few years? Because - to be clear, I don't think it's that likely. I think even conditional on misaligned AI takeover, maybe the chance that I personally die is only a third or half or something. You know, there's another discount on top of other stuff. But, yeah, it's not clear how healthy that is.

I think my current take is probably it's good to be not totally in near mode for some of these things. And it's just the human body was not built for doing intellectual work while also system 1 grokking that you're in a dangerous position because the fight or flight response is not useful for programming. Maybe it's not the worst, but at least for doing careful intellectual work or being open to ideas. I think it's just important for my work to be sort of - maybe scout mindset, intellectually light or not take things too seriously - you know, be adaptable, be willing to change my views kind of quickly. And I think just being in sort of a fear mindset is not good for that. So-

Nathan Labenz: (3:14:02) Yep. Yeah.

Ryan Greenblatt: (3:14:02) Yeah. I think on the race with China stuff, I think I'm pretty sympathetic to a view which is: The US should try to have a good negotiating position with China, and I'm pretty sad about a view or pretty sad about a position where we don't try to negotiate and don't try to make some sort of arrangement or don't try to - just like - The US, I think The US should try to ensure that misalignment risks are handled well and try to do nonproliferation as needed for that.

And I think there's a lot of stuff I can imagine happening in the current regime which seem pretty bad. Like, I think it seems plausible that what we'll see is - someone recently just used the term "superintelligence in secret," and I was just like, man, I really don't want superintelligence in secret. Like, I really don't want people building extremely powerful AIs very quickly when the world doesn't even know what's going on at all. And it's, you know, in order to outrace competitors that could have potentially been open to some sort of negotiation to handle the situation more carefully and in a way where it reduces the risks.

I think even putting aside misalignment risks, I think there's just a ton of risks associated with building insanely powerful AI very quickly. Like, it's very unprecedented. It's a large societal shift. I think the risks of human takeover or human power grabs are substantial. Like, if human labor is no longer a key bottleneck to many things, then I think various things like coups are easier. When governments depend on fewer people, it just becomes easier to do coups, and I worry about the private companies having a lot of power relative to what they previously had in a way that is quite destabilizing and results in potentially problematic allocations of power and power in the long run.

I think there's a bunch of things about handling - how do we handle ultra dangerous technology? Like, by default, if we have a very widespread AI proliferation, I think it might be the case that every 20-person organization can have near omnicidal bioweapons, and I'm just like, is the world ready for it being extremely easy to make near omnicidal bioweapons? And I'm like, probably we won't end up there because probably someone will notice in time and take some precautions and do something. But I'm just like, if you really YOLO it, I can just totally imagine a world in which it's like everyone has crazy - or not everyone, but, you know, thousands of groups have crazy super weapons. And I'm just like, are we ready to handle that world where that's the case? Like, a lot of things have bad offense-defense balance, and mutually assured destruction does not work against terrorist groups. Like, it's not - yeah. It's a bad situation to be in.

So yeah. Anyway, that's some more on that. But, yeah, I still think things like the export controls look good. I think advocating for export controls is good, trying to do that. But I think that's, in some sense, only one of the steps The US should be taking. I think there's more ambitious political proposals in terms of trying to build a regime where, you know, everyone - all countries are happy that their sovereignty will be - or confident their sovereignty will be preserved with AI development. It's maybe a guarantee you'd want to aim for, and that would be - if a country agrees to join your AI project or whatever, or join the international treaty, then they can be confident their sovereignty will be preserved, which requires handling human power grabs and handling AI power grabs.

And I think being in a regime where all the countries are like, look, this AI stuff - handling it kind of responsibly. Our sovereignty will be preserved. We're not going to get disempowered by the AIs. And also, we don't need to take aggressive action immediately because the situation is handled - would be a good place to be. And I think The US could, in principle, try to do some deal making, try to push for this, for other countries. Like, I think even if The US isn't that excited about it, like, they're not the only stakeholder. You know, the semiconductor supply chain, there's Netherlands, Japan. They could potentially team up with The UK and try to do their thing. And, ideally, they try to get The US too. But yep.

Nathan Labenz: (3:17:44) Yeah. Well, that's at least a little bit of a path to sketch out that we might try to go down. I think it's been fantastic. I really appreciate all the time, and I know you got a lot on your plate. So to take a full half day talking to us is much appreciated. Anything else you want to share? We covered a lot, so I don't know if there's anything there. Or anything else just want to call for? Any sort of collaborator profile or anything you want to invite? If you'd like, before we break?

Ryan Greenblatt: (3:18:12) Yeah. Two things. One thing is - so Redwood has a Substack. Consider reading some of our posts that goes into some of the stuff I was talking about in more detail, and maybe people would just find it interesting if you found this podcast interesting. Maybe you can throw a link in the show notes. Also, you could look at some of the posts on my LessWrong account or on Buck's LessWrong account. We just have some different content there that was somewhat overlapping. People might find this interesting.

In addition to that, I would say people should consider applying to Redwood. So we're hiring - we're interested in people who have takes on a lot of the stuff we were talking about, people who are interested in sort of AI futurism, what should the plan be. We're also interested in just people who are good at empirical ML research, ideally a mix, but potentially one or the other can also be interesting. Yep. We're expanding. Come work for us.

Nathan Labenz: (3:18:54) Love it. Ryan Greenblatt, chief scientist at Redwood Research. Thank you for being part of the Cognitive Revolution.

Ryan Greenblatt: (3:19:01) Thanks so much for having me.

Nathan Labenz: (3:19:02) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Try this at Home: Jesse Genet on OpenClaw Agents for Homeschool & How to Live Your Best AI Life

Inference Scaling, Alignment Faking, Deal Making? Frontier Research with Ryan of Redwood Research

Watch Episode Here

Read Episode Description

Full Transcript

Transcript

Read next

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Try this at Home: Jesse Genet on OpenClaw Agents for Homeschool & How to Live Your Best AI Life

Inference Scaling, Alignment Faking, Deal Making? Frontier Research with Ryan of Redwood Research

Watch Episode Here

Read Episode Description

Full Transcript

Transcript

Read next

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

Try this at Home: Jesse Genet on OpenClaw Agents for Homeschool & How to Live Your Best AI Life

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving