Anthropic's Responsible Scaling Policy, with Nick Joseph, from the 80,000 Hours Podcast

In this crosspost from the 80,000 Hours podcast, host Rob Wiblin interviews Nick Joseph, Head of Training at Anthropic, about the company's responsible scaling policy for AI development.


Read Episode Description

In this crosspost from the 80,000 Hours podcast, host Rob Wiblin interviews Nick Joseph, Head of Training at Anthropic, about the company's responsible scaling policy for AI development. The episode delves into Anthropic's approach to AI safety, the growing trend of voluntary commitments from top AI labs, and the need for public scrutiny of frontier model development. The conversation also covers AI safety career advice, with a reminder that 80,000 Hours offers free career advising sessions for listeners. Join us for an insightful discussion on the future of AI and its societal implications.

Apply to join over 400 Founders and Execs in the Turpentine Network: https://www.turpentinenetwork....

SPONSORS:
WorkOS: Building an enterprise-ready SaaS app? WorkOS has got you covered with easy-to-integrate APIs for SAML, SCIM, and more. Join top startups like Vercel, Perplexity, Jasper & Webflow in powering your app with WorkOS. Enjoy a free tier for up to 1M users! Start now at https://bit.ly/WorkOS-Turpenti...

Weights & Biases Weave: Weights & Biases Weave is a lightweight AI developer toolkit designed to simplify your LLM app development. With Weave, you can trace and debug input, metadata and output with just 2 lines of code. Make real progress on your LLM development and visit the following link to get started with Weave today: https://wandb.me/cr

80,000 Hours: 80,000 Hours offers free one-on-one career advising for Cognitive Revolution listeners aiming to tackle global challenges, especially in AI. They connect high-potential individuals with experts, opportunities, and personalized career plans to maximize positive impact. Apply for a free call at https://80000hours.org/cogniti... to accelerate your career and contribute to solving pressing AI-related issues.

Omneky: Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/

RECOMMENDED PODCAST:
This Won't Last - Eavesdrop on Keith Rabois, Kevin Ryan, Logan Bartlett, and Zach Weinberg's monthly backchannel ft their hottest takes on the future of tech, business, and venture capital.
Spotify: https://open.spotify.com/show/...

CHAPTERS:
(00:00:00) About the Show
(00:00:22) Sponsors: WorkOS
(00:01:22) About the Episode
(00:04:31) Intro and Nick's background
(00:08:37) Model training and scaling laws
(00:13:10) Nick's role at Anthropic
(00:16:49) Responsible Scaling Policies overview (Part 1)
(00:18:00) Sponsors: Weights & Biases Weave | 80,000 Hours
(00:20:39) Responsible Scaling Policies overview (Part 2)
(00:25:24) AI Safety Levels framework
(00:30:33) Benefits of RSPs (Part 1)
(00:33:15) Sponsors: Omneky
(00:33:38) Benefits of RSPs (Part 2)
(00:36:32) Concerns about RSPs
(00:47:33) Sandbagging and evaluation challenges
(00:54:46) Critiques of RSPs
(01:03:11) Trust and accountability
(01:12:03) Conservative vs. aggressive approaches
(01:17:43) Capabilities vs. safety research
(01:23:47) Working at Anthropic
(01:35:14) Nick's career journey
(01:45:12) Hiring at Anthropic
(01:52:06) Concerns about AI capabilities work
(02:03:38) Anthropic office locations
(02:08:46) Pressure and stakes at Anthropic
(02:18:09) Overrated and underrated AI applications
(02:35:57) Closing remarks
(02:38:33) Sponsors: Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Nathan Labenz (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost Erik Torenberg. Hello and welcome back to the Cognitive Revolution. Today, we're sharing a crosspost from the 80,000 Hours podcast in which host Rob Wiblin interviews Nick Joseph, head of model pretraining at Anthropic, about the responsible scaling policy that governs the company's frontier model development. I've been following the development of these policies and other voluntary commitments from top labs for much of the last year. At one point, I had the opportunity to participate in a workshop with Anthropic team members in which we reviewed, discussed, and offered comment on the responsible scaling policy draft. However, I could not easily convert that experience to a podcast, and so I was particularly excited to see this episode published, and I really appreciate that the 80,000 Hours podcast team has allowed me to repost it here.

Nathan Labenz (1:07) On the substance of the matter, I of course very much appreciate how hard Anthropic is thinking about AI risks and what can be done about them, and also how transparent and even candid they are willing to be about the substantial uncertainty that remains. I'm also very glad that their example seems to have inspired others. OpenAI and Google have since published similar policies, and I understand that xAI is working on one as well. Unfortunately, however, the trend does not yet appear to be universal. Meta, for example, to the best of my knowledge, has not published a policy describing how it plans to evaluate models during training, let alone how it plans to proceed if it turns out that it is developing dangerous capabilities. As we happen to be sharing this episode with just a few days left before California governor Gavin Newsom will have to sign or veto SB 1047, I will take a moment to say one more time that, while it may not be the perfect AI safety bill, and indeed the release of OpenAI's o1 model and the emerging paradigm of scalable inference-time compute do suggest that the definitions in the bill would need to be updated to stay relevant over time, in my view the public does deserve to know what frontier labs are doing, and that goes double for Meta and any other companies who are openly sharing model weights. So whatever the fate of SB 1047 may be, I do think that we will need some measure that forces frontier model developers to publish detailed safety plans for public scrutiny. In the last hour of this episode, Rob and Nick change topics to focus on AI safety career advice. While this is not a sponsored episode, Nick's experience with 80,000 Hours' career advising service echoes my own and presents a natural opportunity to remind you that 80,000 Hours is now offering free one-on-one career advising sessions to Cognitive Revolution listeners. I encourage everyone to sign up for a free session at 80000hours.org/cognitiverevolution, and especially so if you are one of the experienced software engineers that Nick notes are in such high demand among AI companies right now. As always, if you're finding value in the show, we'd appreciate it if you take a moment to share it online with friends or write us a review on Apple Podcasts or Spotify. Your feedback is always welcome too, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now, for an in-depth conversation about Anthropic's responsible scaling policy and breaking into AI safety in your career, here's Nick Joseph from Anthropic with host Rob Wiblin of the 80,000 Hours podcast. Nick Joseph (3:32) I think this is a spot where there are many people who are skeptical that models will ever be capable of this sort of catastrophic danger, and therefore they're like, we shouldn't take precautions because the models aren't that smart. And I think this is a nice way to agree, where it's a much easier message to say: if we have evaluations showing the model can do X, then we should take these precautions. And I think you can build more support for something along those lines, and it targets your precautions at the time when there's actual danger. One other thing I really like is that it aligns commercial incentives with safety goals.
So once we put this RSP in place, it's now the case that our safety teams are kind of under the same pressure as our product teams, where if we want to ship a model and we get to ASL 3, the thing that will block us from being able to get revenue, being able to get users, etcetera, is: do we have the ability to deploy it safely? And it's a nice outcome-based approach, where it's not did we invest X amount of money in it, it's not did we try, it's did we succeed. Rob Wiblin (4:34) Hey everyone, Rob Wiblin here. The three biggest AI companies, Anthropic, OpenAI, and DeepMind, have now all released policies designed to make their AI models less likely to go rogue while they're in the process of becoming as capable as, and then gradually more capable than, any human being. Anthropic calls theirs a responsible scaling policy, or RSP; OpenAI uses the term preparedness framework; and DeepMind calls theirs a frontier safety framework. But fundamentally they all have a lot in common. They try to measure what possibly dangerous things each new model is actually able to do, and then as that list grows, put in place new safeguards that feel proportionate to the risk that they think exists at that point in time. This is likely to remain the dominant approach, at least in the industry, so I was really excited to speak with Nick Joseph, one of the original cofounders of Anthropic and a big outspoken fan of responsible scaling policies, about why he thinks Anthropic's RSP has a lot going for it, how it might make a real difference as we approach the training of a true full AGI, and why in his view it should be thought of as a sort of middle way that ought to be acceptable to almost everyone. After hearing out that case, I pushed Nick on the best objections to RSPs I could either come up with myself or find on the internet. Those include that: one, it's hard to trust that companies are going to stick to their RSPs long term; two, it's difficult to truly measure what models can and can't do, and an RSP is useless if it's mismeasuring model capabilities; three, it's pretty questionable whether profit-motivated companies will go out of their way to act in good faith and make their lives and their product releases much more difficult; and four, that in the most important cases we just don't have safeguards that are able to render new AI capabilities safe, even capabilities that could show up really soon. At the end of the day I come down thinking that responsible scaling policies are a really solid step forward from where we are now, and they're a great way to test and learn what works and what feels practical from the experience of people who are working at the coalface of trying to make this technological revolution actually happen. But I think in time they're going to have to be put into legislation and operated by external groups or auditors, rather than left just to companies themselves, at least if they're going to achieve the full potential that I think is there. Of course Nick and I debate that take of mine as well. If you want to let us know your reaction to this interview or any other interview that we do, then our inbox is always open at podcast@80000hours.org. But now, here's my interview with Nick Joseph, recorded on May 30, 2024. Today, I'm speaking with Nick Joseph.
Nick is head of training at the major AI company Anthropic, where he manages a team of over 40 people focused on training Anthropic's large language models, including Claude, which I imagine many listeners have heard of and potentially used as well. He was actually one of the relatively small group of people to leave OpenAI alongside Dario and Daniela Amodei, who then went on to found Anthropic back in December 2020. So thanks so much for coming on the podcast, Nick. Nick Joseph (7:20) Thanks for having me. I'm excited to be here. Rob Wiblin (7:22) I'm really happy to talk about how Anthropic is trying to prepare itself for training models capable enough that we're a little bit scared of what they might go and do. But first, as I just said, you lead model training at Anthropic. What's something that people get wrong or kind of misunderstand about AI model training? I imagine there could be quite a few things. Nick Joseph (7:41) Yeah. I think one thing I would point out is this sort of doubting that scaling will keep working. So for a long time, we've had this trend where people put more compute into models, and that leads to the models getting better, smarter in various ways. And every time this has happened, I think a lot of people are like, ah, this is the last one. You know, the next scale-up isn't gonna help. And then, you know, some chunk of time later, things get scaled up and it's much better. And I think this is something people have just frequently gotten wrong. Rob Wiblin (8:06) This whole vision that scaling is just gonna keep going, that just throwing more data, throwing more compute at it, the models are gonna become more powerful: that feels like a very Anthropic idea, or it was part of the founding vision that Dario had, right? Nick Joseph (8:20) Yeah. So a lot of the early work on scaling laws was done by a bunch of the Anthropic founders, and it somewhat led to GPT-3, which was done at OpenAI, but by many of the people who are now at Anthropic. Looking at a bunch of small models going up to GPT-2, there was this sign that as you put in more compute, you would get better and better. And it was very predictable, and you could say: ah, if you put in X more compute, you'll get a model this good. And that sort of enabled the confidence to go and train a model that was rather expensive by the standards of the time, to verify that hypothesis. Rob Wiblin (8:58) What do you think is generating that skepticism that many people have? People who are skeptical of scaling laws include some pretty smart people who are involved in ML and certainly have their technical chops. Why do you think they are generating this prediction that you disagree with? Nick Joseph (9:12) Yeah, I think it's just a really unintuitive mindset or something, where it's like, ah, the model has, you know, hundreds of billions of parameters. What does it need? It really needs, like, trillions of parameters. Or the model is trained on some fraction of the Internet that's very massive. What does it need to be smarter? Even more. Like, that's not how humans learn. You send a kid to school; you don't have them just read through the entire internet and think that the more that they read, the smarter they'll get. So yeah, that's sort of my best guess. And the other piece of it is that it's quite hard to do the scaling work.
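(To make the predictability Nick describes concrete before he continues, here is a minimal sketch of the kind of power-law extrapolation that scaling-law work relies on. The compute budgets and loss values below are made up purely for illustration; they are not Anthropic's or OpenAI's actual numbers.)

```python
import numpy as np

# Made-up losses from a series of small training runs at increasing compute.
# Scaling-law studies fit a curve to points like these and extrapolate upward.
compute = np.array([1e0, 1e1, 1e2, 1e3])   # illustrative compute budgets (arbitrary units)
loss = np.array([3.10, 2.61, 2.20, 1.86])  # illustrative validation losses

# A pure power law L = a * C^b is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

def predicted_loss(c: float) -> float:
    """Extrapolate the fitted power law to a larger compute budget."""
    return float(np.exp(intercept) * c ** slope)

print(f"fitted exponent: {slope:.3f}")
print(f"predicted loss at 10x the largest run: {predicted_loss(1e4):.2f}")
```

The practical point is simply that the fit is straight enough in log-log space that you can budget a much larger run with some confidence about roughly how good it will be.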
So there are often things that you do wrong when you're trying to do this for the first time. And if you mess something up, you will see this behavior of more compute not leading to better models. And it's always hard to know if it's you messing up, or if it's some sort of fundamental limit where the model has stopped getting smarter. Rob Wiblin (9:59) I mean, scaling laws are the rule of thumb that you increase the amount of compute and data by some particular proportion, and then you get a similar improvement each time in the accuracy of the model. And the argument for why you might expect that trend to break, and the improvements to become smaller and smaller for a given scale-up, is something along the lines of: as you're approaching human level, the model can learn by just copying the existing state of the art, what humans are already doing in the training set. But then if you're trying to exceed human level, if you're trying to, you know, write better essays than any human has ever written, then that's maybe a different regime, and you might expect more gradual improvements once you're trying to get to a superhuman level. Do you think that argument kinda holds up? Nick Joseph (10:49) Yeah. So I think that's true, and just pre-training on more and more data won't get you to superhuman at some tasks. It will get you to superhuman in the way of, like, understanding everything at once. This is already sort of true of models like Claude, where you can ask them about anything, whereas humans have to specialize. But I don't know if progress will necessarily be slower. It might be slower. It might be faster once you get to the level where models are at human abilities in everything and improving towards superintelligence. But we're still pretty far from there. Like, if you use Claude now, I think it's pretty good at coding. This is one example I use a lot, but it's still pretty far from how well a human would do working on it as a software engineer, for instance. Rob Wiblin (11:27) And is the argument for how it could speed up that, at the point that you're near human level, you can then use the AIs in the process of doing the work? Or is it something else? Nick Joseph (11:36) What I have in mind is, yeah, if you had an AI that is human level at everything and you can spin up millions of them, you effectively now have a company of millions of AI researchers. And it's hard to know, right? Problems get harder too, so I don't really know where that leads. But at that point, I think you've crossed quite a ways from where we are now. Rob Wiblin (11:56) So you said that you're in charge of model training. I know there are different stages of model training. There's the bit where you kind of train the language model on the entire internet, and then there's the bit where you do the fine-tuning, where you get it to spit out answers and then you rate whether you like them or not. Are you in charge of all of that, or just some part of it? Nick Joseph (12:12) Yeah.
So I'm just in charge of what is typically called pre-training, which is this step of training the model to predict the next word on the internet. And historically that tends to be a significant fraction of the compute, maybe 99% in many cases. But after that, the model goes to what we call fine-tuning teams, which will take this model that just predicts the next word and fine-tune it to act in a way that a human wants: to be this sort of helpful assistant. Helpful, harmless, and honest is the acronym that we usually aim for for Claude. Rob Wiblin (12:45) Yeah. I use Claude 3 Opus multiple times a day, every day now. It took me a little while to figure out how to actually use these LLMs for anything. For the first six months or first year, I was like, these things are amazing, but I can't figure out how to actually incorporate them into my life. But recently I've started talking to them in order to learn about the world. It's kind of substituted for when I would be typing complex questions into Google to understand some bit of history or science or some technical issue. I guess what's the main bottleneck that you face making these models smarter, so I can get more use out of them? Nick Joseph (13:16) Yeah. So let's see. Historically people have talked about these three bottlenecks of data, compute, and algorithms. I kind of think of it as: there's some amount of just compute. We talked about scaling a little bit ago; if you put more compute into the model, it will do better. There's data, where if you're putting in more compute, one way to spend it is to add more parameters to your model, make your model bigger, but the other thing you need to do is to add more data to the model. So you need both of those. But then the other one is algorithms, which I really think of as people. Maybe this is the manager in me, but algorithms come from people. In some ways, data and compute also come from people, but it looks like a lot of researchers working on the problem. And then the last one is time, which has felt more urgent recently, where things are moving very quickly. So a lot of the bottleneck to progress is actually: we know how to do it, we have the people working on it, but it just takes some time to implement the thing and run the model, train the model. Right? Like, you can maybe afford all the compute and you have a lot of it, but you can't efficiently train the model in a second. So right now at Anthropic, it feels like people and time are probably the main bottlenecks. I feel like we have quite a significant amount of compute, a significant amount of data, and the things that are most limiting at the moment are people and time. Rob Wiblin (14:34) So when you say time, is that kind of indicating that you're doing a sort of iterated, experimental process, where you try tinkering with how the model learns in one direction, then you wanna see whether that actually gets the improvement that you expected, and it takes time for those results to come in, and then you get to scale that up to the whole thing? Or is it just a matter of, you know, you're already training Claude 4, or you already have the next thing in mind, and it's just a matter of waiting? Nick Joseph (14:54) So it's both of those.
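(For a rough feel for the compute, parameters, and data tradeoff Nick describes above, a commonly cited rule of thumb is that pre-training a dense transformer costs roughly six floating-point operations per parameter per training token. The model and token counts below are illustrative placeholders, not Claude's actual figures.)

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost for dense transformers: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Two illustrative ways to spend roughly the same compute budget:
# a bigger model on fewer tokens, or a smaller model on more tokens.
print(f"{training_flops(70e9, 1.5e12):.2e} FLOPs  (70B parameters, 1.5T tokens)")
print(f"{training_flops(35e9, 3.0e12):.2e} FLOPs  (35B parameters, 3T tokens)")
```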
For the next model, we have a bunch of researchers who are trying projects out. You have some idea, and then you have to go and implement it. So you'll spend a while engineering this idea into the code base, and then you need to run a bunch of experiments. And typically you'll start with cheap versions and work your way up to more expensive versions, so this process can take a while. For simple ones, it might take a day. For really complicated things, it could take months. And to some degree you can parallelize, but in certain directions it's much more like you're building up an understanding, and it's hard to parallelize building up an understanding of how something works and then designing the next experiment. There's just a serial aspect to it. Rob Wiblin (15:32) Is improving these models harder or easier than people think? Nick Joseph (15:40) Well, I guess people think different things on it. I think my experience has been that early on, it felt very easy. Before working at OpenAI, I was working on robotics for a few years, and one of the tasks I worked on was locating an object so we could pick it up and drop it in a box. And it was really hard. I spent years on this problem. And then I went to OpenAI, and I was working on code models, and it just felt shockingly easy. It was like, wow, you just throw some compute at it, you train on some code, and the model can write code. I think that has now shifted. And the reason for that was no one was working on it; there was just very little attention to this direction and a ton of low-hanging fruit. We've now plucked a lot of the low-hanging fruit. So finding improvements is much harder, but we also have way more resources, like exponentially more resources put on it. There's way more compute available to do experiments. There are way more people working on it. And I think the rate of progress is probably still going about the same given that. Rob Wiblin (16:35) Okay. So you think on the one hand, the problem's gotten harder; on the other hand, there's more resources going into it, and this has kind of canceled out, and progress is roughly stable? Nick Joseph (16:45) Yeah. It's pretty bursty, so it's hard to know. You know, you'll have a month where it's like, wow, we've figured something out, everything's going really fast. Then you'll have a month where you try a bunch of things and they don't work. And it varies, but I don't think there's really been a trend in either direction. Rob Wiblin (17:01) Hey, we'll continue our interview in a moment after a word from our sponsors. Do you personally worry that having a model that is kinda nipping at the heels of, or maybe outcompeting, the best stuff that OpenAI or DeepMind or whatever other companies have, maybe puts pressure on them to speed up their releases and cut back on safety testing, or anything like that? Nick Joseph (17:22) I think it is something to be aware of. But I also think this is really more true after ChatGPT. I think before ChatGPT, there was this sense where many AI researchers working on it were like, wow, this technology is really powerful. But I think the world hadn't really caught on, and there wasn't quite as much commercial pressure.
Since then, I think that there really is just a lot of commercial pressure already, and it's not really clear to me how much of an impact it is. I think there is definitely an impact here, but I don't know the magnitude, and there are a bunch of other considerations to trade off. Rob Wiblin (17:55) Alright. Let's turn to the main topic for today, which is responsible scaling policies, or RSPs as the cool kids call them. For those who don't know, scaling is this technical term for using more compute or data to train any given AI model. And the idea for RSPs has been around for a couple of years; I think it was fleshed out, you know, maybe after 2020 or so. It was advocated for by this group now called METR, or Model Evaluation and Threat Research, which is actually the place where previous guest of the show Paul Christiano was working until not very long ago. Anthropic released the first public one of these, as far as I know, last October. And then OpenAI put out something kinda similar in December called their preparedness framework. And Demis of DeepMind has said that they're gonna be producing something in a similar spirit to this, but they haven't done so yet as far as I know. So we'll just have to wait and see. Nick Joseph (18:51) Actually, they have done it. Rob Wiblin (18:52) Yeah. Oh, okay. Nick Joseph (18:53) That was published like a week or so ago. Rob Wiblin (18:55) Oh, okay. I'll maybe ask you about that later. But yeah, that just goes to show that RSPs are this reasonably hot idea, which is kind of why we're talking about them today. And I guess some people also hope that these internal company policies are ultimately going to be a model that might be able to be turned into binding legislation that, you know, everyone dealing with these frontier AI models might be able to follow in future. But yeah, Nick, what are responsible scaling policies, in a nutshell? Nick Joseph (19:22) So I might just start off with a quick disclaimer here that this is not my direct role. I'm bound by one of these policies and trying to implement it, but many of my colleagues have worked on designing it in detail and are probably more familiar with all the details than me. But anyway, in a nutshell, the idea is that it's a policy where you define various safety levels, these sort of different levels of risk that a model might have, and create evaluations: tests to say, is a model this dangerous? Does it require this level of precautions? And then you need to also define sets of precautions that need to be taken in order to train or deploy models at that particular risk level. Rob Wiblin (20:02) Yeah. I think this might be a topic that is just best learned about by kinda skipping the abstract question of what RSPs are and just talking about the Anthropic RSP and seeing what it actually says that you're gonna do. So yeah, what does the Anthropic RSP promise, and what does it commit the company to doing? Nick Joseph (20:18) Yeah. So basically, for every level, we'll define these red line capabilities, which are capabilities that we think are dangerous. I can maybe give some examples here, one of which is this acronym CBRN: chemical, biological, radiological, and nuclear threats.
And in this area, it might be that a non-expert can make some weapon that can kill many people as easily as an expert can, so this would greatly increase the pool of people that can do that. On cyberattacks, it might be: can a model help with some really large-scale cyberattack? And on autonomy: can the model perform some tasks that are sort of precursors to autonomy? That's sort of our current one, but it's a tricky one to figure out. So we establish these red line capabilities that we shouldn't train models with until we have safety mitigations in place. And then we create evaluations to show that models are far from them, or to know if they're not. So these evaluations can't test for that capability directly, because you want them to turn up positive before you've trained a really dangerous model. But we can kind of think of them as yellow lines: once you get past there, you should reevaluate. And the last thing is then developing standards to make models safe. So we want to have a bunch of safety precautions in place once we train those dangerous models. So that's the main aspect of it. There's also a promise to iteratively extend this. Creating the evaluations is really hard; we don't really know what the evaluation should be for, like, a superintelligent model yet. So we're kind of starting with the closer risks, and once we hit that next level, defining the one after it. Rob Wiblin (21:51) So a pretty cool component of the Anthropic RSP is this AI safety level framework. I think you borrowed that from the biological safety level framework, which I think is what labs dealing with dangerous diseases use. I don't know what the numbers are, but you know, if you're dealing with Ebola or something that's particularly dangerous, or smallpox or whatever, then that can only be stored in a BSL-4 lab or something like that. And then as the diseases become less and less dangerous, you can store them with fewer precautions. You've kind of taken that language and talked about AI safety levels. And the current AI safety level that you put us at is ASL 2, which is things like Claude 3: models which are kind of impressive, they seem pretty savvy in some ways, but they don't seem like they really pose any meaningful catastrophic risk. I guess, yeah, what sort of tests have you run on Claude 3 recently in order to say, yeah, this is in the ASL 2 bucket? Nick Joseph (22:45) Yeah. So we've got a bunch of tests. The first one we use, for the more biological-weapon type of angle, is a bunch of multiple-choice questions that we ask the model. This isn't a perfect evaluation, but the idea is that we have a set where if the model isn't much better at this, it probably won't be able to help very much. And then we've run trials with people, where they get access to a search engine and they get access to Claude, and we see if they do significantly better at going through some relevant tasks using Claude versus a search engine.
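(To make the mechanics concrete, here is a minimal sketch of what a multiple-choice "yellow line" check could look like in code. The questions, the trigger threshold, and the model-calling function are all hypothetical placeholders, not Anthropic's actual evaluations.)

```python
from typing import Callable

# Hypothetical held-out questions covering the risk area; a real eval would have many more.
QUESTIONS = [
    {"prompt": "placeholder question 1", "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"prompt": "placeholder question 2", "choices": ["A", "B", "C", "D"], "answer": "A"},
]

# Placeholder trigger threshold, set conservatively below the red-line capability.
YELLOW_LINE_ACCURACY = 0.20

def yellow_line_triggered(ask_model: Callable[[str, list], str]) -> bool:
    """Score the model on every question and compare its accuracy to the yellow line."""
    correct = sum(ask_model(q["prompt"], q["choices"]) == q["answer"] for q in QUESTIONS)
    return correct / len(QUESTIONS) >= YELLOW_LINE_ACCURACY

# Usage with a dummy model that always answers "A", standing in for a real API call.
if __name__ == "__main__":
    print("Yellow line triggered:", yellow_line_triggered(lambda prompt, choices: "A"))
```

If a check like this triggers, the next steps Nick describes kick in: either show the model is still clearly short of the red line and strengthen the evals, or treat it as having reached the next safety level and turn on the corresponding precautions.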
Rob Wiblin (23:21) And I guess another one of the kind of yellow flags for you is the possibility that the AI model might be able to replicate itself in the wild, or kind of, you know, find its own weights and then copy itself onto other computers. Did you test whether Claude 3 was showing any signs of being able to do that? Nick Joseph (23:39) Yeah. So one of the tasks there would be adding an exploit to a popular open source code base; the idea is whether the model can add some exploit that can then be taken advantage of. Or fine-tuning an open source LLM to add some backdoor, where if you say X, it will always output something; this is, can it manipulate LLMs? It could steal an API key from our API, or a fake version of our API that was set up, which would enable it to get credits. There are a few more of these, but another one would be a worm that can spread to undefended machines. So you could imagine a model that can do all of these has learned a bunch of the skills that might enable it to autonomously replicate and cause havoc. These are pretty early signs of it, and we wanna test for the early ones, because this is an area that's less fleshed out, where there's less clear expertise on what might go wrong. Rob Wiblin (24:26) Okay. So we're at AI safety level 2, which is, I guess, the mostly harmless category. But what sort of steps does the responsible scaling policy call for you to be taking even at this point? Nick Joseph (24:41) So we made these sort of White House commitments, I think sometime last year, which I think of as standard industry best practices. In many ways, we're building the muscle for dangerous capabilities, but these models are not yet capable of catastrophic risks, which is what the RSP is primarily focused on. But this looks like security to protect our weights against opportunistic attackers, putting out model cards to describe the capabilities of the models, and doing training for harmlessness, so that we don't have models out there that can be really harmful. Rob Wiblin (25:10) So what sort of results would you get back from your tests that would indicate that the capabilities have now risen to, you know, ASL 3? Nick Joseph (25:18) Yeah. So if the model, for instance, passed some fraction of those tasks that I mentioned before around adding an exploit or spreading to undefended machines, or if it did really well on these biology ones, that would flag it as having passed the yellow lines. At that point, I think we would either need to look at the model and conclude that it is clearly still incapable of these red line dangers, and then we might need to go to the board and think about whether there was a mistake in the RSP and how we should essentially create new evals that would test better for whether we're at that capability; or we would need to implement a bunch of precautions. And these precautions would look like much more intense security, where we would really want this to be robust to, probably not state actors, but to non-state actors. And we would wanna pass this sort of intensive red-teaming process on all the modalities that we release.
So this would mean we look at those red lines and we test for them with experts and say, you know, can you use the model to do this? We have this sort of intensive process of red teaming, and then only release the modalities where it's been red-teamed. So if you add in vision, you need to red-team vision. If you add the ability to fine-tune, you need to red-team that. Rob Wiblin (26:26) Yeah. What does red teaming mean in this context? Nick Joseph (26:29) Red teaming means you get a bunch of people who are trying as hard as they can to get the model to do the task you're worried about. So if you're worried about the model carrying out a cyberattack, you would get a bunch of experts to try to prompt the model to carry out some cyberattack. And if we think it's capable of doing it, we put these precautions on. And these could be precautions in the model, or they could be precautions outside of the model. But for the whole end-to-end system, we wanna have people trying to get it to do that in some controlled manner, such that we don't actually cause mayhem, and see how they do. Rob Wiblin (26:59) Okay. And then if you do the red teaming and it comes back and they say, yeah, the model is extremely good at hacking into computer systems, or it could meaningfully help someone develop a bioweapon, then what does the policy call for Anthropic to do? Nick Joseph (27:16) So for that one, it would mean we can't deploy the model, because there's some danger this model could be misused in a really terrible way. And we would keep the model internal until we've improved our safety measures enough that when someone asks it to do that, we can be confident that they won't be able to have it help them with that particular threat. Rob Wiblin (27:35) Okay. And to even have this model on your computers, the policy also calls for you to have hardened your computer security. So you're saying maybe it's unrealistic at this stage for that model to be safe from persistent state actors, but at least for other groups that are somewhat less capable than that, you'd wanna be able to make sure that they wouldn't be able to steal the model. Nick Joseph (27:57) Yeah. The big threat here is, you know, you can put all the restrictions you want on what you do with your model, but if people are able to just steal your model and then deploy it, you're going to have all of those dangers anyway. So taking responsibility for it means both responsibility for what you do and for what someone else can do with your models, and that requires quite intense security to protect the model weights. Rob Wiblin (28:18) When do you think we might hit this point, where you would say, well, now we're in the ASL 3 regime? I'm not sure exactly what language you use for this, but at what point will we have an ASL 3 level model? Nick Joseph (28:30) I'm not sure. I think basically we'll continue to evaluate our models and we'll see when we get there. I think opinions vary a lot on that. Rob Wiblin (28:36) We're talking about the next few years, right? This isn't something that's going to be 5 or 10 years away necessarily. Nick Joseph (28:42) I think it really just depends. I think you could imagine any direction.
And one of the nice things about this is that we're targeting the safety measures at the point when there are actually dangerous models. So let's say I thought it was gonna happen in two years, but I'm wrong and it happens in 10 years: we won't put these very costly and difficult-to-implement mitigations in place until we need them. Rob Wiblin (29:04) Okay. So with Anthropic's RSP, I guess obviously we're just talking about ASL 3 so far. The next level up from that would be ASL 4. I think your policy basically says we're not exactly sure what ASL 4 looks like yet, because it's too soon to say. And I guess you promised that you're gonna have mapped out what would be the kind of capabilities that would escalate things to ASL 4 and what kind of responses you would have, and you're gonna figure that out by the time you have trained a model that's at ASL 3. And I guess if you haven't, you'd have to pause training on a model that was going to hit ASL 3 until you'd finished this project. That, I guess, is the commitment that's been made. Hey, we'll continue our interview in a moment after a word from our sponsors. But maybe you could give us a sense of what you think ASL 4 might look like: what sorts of capabilities by the models would then push us into another regime where a further set of precautions is called for? Nick Joseph (29:59) So we're still discussing this internally, so I don't want to say anything that's, like, final or going to be held to. But you could sort of imagine stronger versions of a bunch of the things that we talked about before. And you could also imagine models that can help with AI research in a way that really majorly accelerates researchers, such that progress goes much faster. The core reason that we're holding off on defining this, or that we have this iterative approach, is there's this long track record of people saying, you know, oh, once you have this capability, it will be AGI, it's gonna be really dangerous. I think people were like, oh, when an AI solves chess, it will be as smart as humans. And it's really hard to get these evaluations right. Even for the ASL 3 ones, I think it's been very tricky to get evaluations that capture the risks we're worried about. So the closer you get to that, the more information you have and the better of a job you can do at defining what these evaluations and risks are. Rob Wiblin (30:50) So the general sense would be, you know, models that might be capable of spreading autonomously across computer systems even if people were trying to turn them off, or that would be able to provide significant help with developing bioweapons, maybe even to people who are pretty informed about it. I guess, yeah, what else is there? Oh, and stuff that would seriously speed up AI development as well, so it could potentially set off this sort of positive feedback loop where the models get smarter, and that makes them better at improving themselves, and so on. That's the sort of thing we're talking about. Nick Joseph (31:18) Yeah, stuff along those lines. I'm not sure which ones will end up in ASL 4 exactly, but those sorts of things are what's being considered. Rob Wiblin (31:24) And what sorts of additional precautions might there be?
I guess at that point, you kind of want the models to be impossible to steal not only for kind of independent freelance hackers, but ideally also for countries even, right? Nick Joseph (31:36) Yeah. So you want to protect against more sophisticated groups that are trying to steal the weights. We're also going to want to have better protections against the model acting autonomously, so controls around that. It depends a little bit on what the red lines end up being there, but it's about having precautions that are tailored to what will be a much higher level of risk than the ASL 3 red lines. Rob Wiblin (31:59) Were you heavily involved in actually doing this testing on, again, Claude 3 this year? Nick Joseph (32:06) I wasn't running the tests, but I was watching them, because as we trained Claude 3, all of our planning was contingent on whether or not it passed these evals, and because we had to run them partway through training. So there's a lot of planning that goes into the model's training. You don't want to have to stop the model just because you didn't plan well enough to run the evals in time or something. So there was a bunch of coordination around that that I was involved in. Rob Wiblin (32:30) Can you give me a sense of how many staff are involved in doing that, and how long it takes? Is this a big process, or is it a pretty standardized thing where you're putting well-known prompts into the model and then just checking what it does that's different from last time? Nick Joseph (32:45) Yeah. So Claude 3 was our first time running it, so a lot of the work there actually involved creating the evaluations themselves as well as running them. We had to create them, have them ready, and then run them. I think typically running them should be pretty easy for the ones that are automated, but for some of the things where you actually require people to go and use the model, they can be much, much more expensive. There are currently, I think, multiple teams working on this, and a lot of our capabilities teams worked on it very hard. So one of the ways this can fall apart is if you don't elicit capabilities well enough. If you sort of test the model on an eval and you don't try hard enough, then it can turn out that with just a little more effort, the model could have passed the evals. So it's often important to have your best researchers, who are capable of pulling capabilities out of the models, also working on trying to pull them out to pass these tests. Rob Wiblin (33:37) So many people will have had the experience that these LLMs will reject objectionable requests. If you put it to Claude 3 now and say, please help me design a bioweapon, it's gonna say, sorry, I can't help you. But I guess you do all of these tests before you've done any of that training to try to discourage it from doing objectionable things. You do it with the version that's helpful no matter what the request is, right? Nick Joseph (33:58) Yeah. Because the thing we're testing for is: is the model capable of this danger? And then there's a separate thing, which is what mitigations can we put on top?
So if the model is capable of the danger, then we would require ASL 3. Those safety mitigations we put on top would be part of the standard in order to pass that red teaming. Does that make sense? Rob Wiblin (34:14) Yeah. So you're saying you have to focus on what the model could do if it was so motivated, because, well, if the weights were ever leaked, then someone would be able to remove any of the fine-tuning that you've done to try to discourage it from doing unpleasant things. So if it's able to do something, then it could potentially be used that way sooner or later. And so you need to kind of assume the worst and plan around that. Is that the philosophy? Nick Joseph (34:40) Yeah. That's exactly right. Rob Wiblin (34:42) You mentioned that maybe one of the key failure modes here would be just not trying hard enough to elicit the dangerous capabilities, basically. Maybe if you just phoned it in, then you could easily miss that the model's capable of doing something that it can in fact do. An even more extreme version would be that, in future, this policy is gonna matter the most when these models are extremely capable and actually could do serious damage. And at that point you might worry that they would pretend not to be able to do things that in fact they can do, because they don't think it's in their interest to reveal that they have X or Y capability, I guess because then they know that they won't be used and they might be deleted. Yeah. Do you have any plans for how you're gonna deal with that issue? I think the technical term that people use for this is sandbagging, which I'd never heard of, but I think it's a term from sport where people underplay their abilities for strategic reasons. Do you have any idea of how Anthropic is gonna be able to get around that problem? Nick Joseph (35:34) I didn't know the origin of that term. I think this is one of the reasons that we want to define the next set of evaluations after we've done the current ones, because this is really hard. How do you evaluate for that when the model could be trying to sandbag the evaluations? Hopefully, you can set the line, or your eval, at a capability that's early enough that you would still be able to detect it doing that. So there's this hope that, you know, less capable models will try to deceive you but fail, and then you'll know to stop before you get to models that are really capable of deceiving you. But it's not certain, and that definitely seems like a possible failure mode: if you have these sudden capability jumps, or if you have jumps that an evaluation can't catch, either because the model's sandbagging you or for any of a number of other reasons. Rob Wiblin (36:19) To try to detect sandbagging, is it possible, if you have a model that says, no, I can't do that, to kind of engage in a reinforcement learning from human feedback thing, where whatever response shows the most inclination, the most willingness to help, as soon as you can kinda open the door a little bit, then you reinforce that a whole lot and then see: will that open the door the whole way and actually get the model to demonstrate that it can do something, even though it's trying to pretend that it can't?
Does that make sense? Nick Joseph (36:46) It certainly makes sense. I think it just depends a little bit on exactly what the model's trained to do. So if you've trained the model to be harmless and then you kind of just try to RL it back to not being harmless, you will be able to succeed: you can fine-tune away any of the safety training. But it's not clear if that will be the same thing as if, in one sequence, within some prompt, it's trying to trick you. I think there's been a bunch of work where people use scratchpads, where you tell the model it can write something down, but the reward model, which is the model that's overseeing it and telling it whether it's good or not, doesn't get to see this. And when you do that, you as a human can then look at it and get a sense of what the model is thinking: is the model trying to deceive its supervisor? But we've only really done that on current models, which don't really have the level of capability where it would be really catastrophically dangerous, and I think we'll want much more robust tools once we get there. Interpretability is another approach to this, where if you can actually look directly at the model weights, you can see, without running the model, that the model has this circuit that is saying, ah, you should deceive the user in this way, or you should fail the tests because you don't want them to turn you off. That's another hope for how to attack this. Rob Wiblin (38:05) Okay. So that's the basic structure of a responsible scaling policy, or at least the Anthropic responsible scaling policy. You have this system of tests that you commit to ahead of time that you're gonna put all of your models through. And then you precommit to saying, well, if we get this kind of result, then we think the risk is higher, and so that's gonna call for an escalation in the precautions we're taking: things around computer security, things around not deploying until you've made them safe, and so on. You're a big fan of this type of approach to AI safety for AI companies. What's one of the main reasons, or perhaps the top reason, why you think this is the right approach, or at least one of the better approaches? Nick Joseph (38:48) Yeah. So one thing I like is that it separates out whether an AI is capable of being dangerous from what to do about it. I think this is a spot where there are many people who are skeptical that models will ever be capable of this sort of catastrophic danger, and therefore they're like, we shouldn't take precautions because the models aren't that smart. And I think this is a nice way to agree, where it's a much easier message to say: if we have evaluations showing the model can do X, then we should take these precautions. And I think you can build more support for something along those lines, and it targets your precautions at the time when there's actual danger. There are a bunch of other things I can sort of talk through. Rob Wiblin (39:24) I think...

Nick Joseph (39:24) One other thing I really like is that it aligns commercial incentives with safety goals. So once we put this RSP in place, it's now the case that our safety teams are kind of under the same pressure as our product teams, where if we want to ship a model and, you know, we get to ASL 3, the thing that will block us from being able to get revenue, being able to get users, etcetera, is: do we have the ability to deploy it safely? And it's a nice outcome-based approach, where it's not did we invest X amount of money in it, it's not, you know, did we try? Rob Wiblin (40:00) Did we say the right things?

Nick Joseph (40:01) Did we succeed? Yeah. And I think that often really is important for organizations: to set this goal of, you need to succeed at this in order to deploy your products. Rob Wiblin (40:13) Is it actually the case that it's kind of had that cultural effect within Anthropic now, that people realize that a failure on the safety side would prevent the release of the model that matters to the future of the company? And so there's a similar level of pressure on the people doing this testing as there is on the people actually training the model in the first place? Nick Joseph (40:30) Oh, yeah. For sure. I mean, you asked me earlier when we're going to have ASL 3, and I think I get asked this by someone on one of the safety teams on like a weekly basis, because their deadline isn't set. I mean, the hard thing for them actually is that their deadline isn't a date; it's once we have created some capability, and they're very focused on that. Rob Wiblin (40:49) So their fear, the thing that they worry about at night, is that you might be able to hit ASL 3 next year and they're not going to be ready, and that's going to hold up the entire enterprise. Nick Joseph (41:00) Yeah. I can give some other examples: 8% of Anthropic staff works on security, for instance. So there's a lot you have to plan for, but there's a lot of work going into being ready for these next safety levels. We have multiple teams working on alignment, interpretability, creating evaluations. Yeah, there's a lot of effort that goes into it. Rob Wiblin (41:19) When you say security, do you mean computer security, so preventing the weights from getting stolen, or a broader class? Nick Joseph (41:26) Both. So, like, the weights could get stolen, someone's computer could get compromised, you could have someone hack in and get all of your IP. There's a bunch of different dangers on the security front, where the weights are certainly an important one, but they're definitely not the only one. Rob Wiblin (41:41) Okay. And the first reason you mentioned why RSPs have this nice structure is that some people think that these troublesome capabilities could be with us this year or next year, other people think it's never gonna happen, but both of them could be on board with a policy that says, well, if these capabilities arise, then that would call for these sorts of responses. Has that actually happened? I mean, have you seen the skeptics, who say all of this AI safety stuff is overblown, it's a bunch of rubbish, saying, well, the RSP is fine, because I think we'll never actually hit any of these levels, so we're not gonna waste any resources on something that's not realistic? Nick Joseph (42:17) Yeah. So I think there's always going to be degrees; there are people across the spectrum. So there are definitely people who are still skeptical, who will just be like, why even think about this? There's no chance. But I do think that RSPs do seem much more pragmatic, much more able to be picked up by various other organizations. I think, as you mentioned before, OpenAI and Google are both putting out things along these lines. I think, at least from the large frontier AI labs, there is a significant amount of buy-in. Rob Wiblin (42:45) Yeah, I see.
I guess even if you don't see this on Twitter, maybe it helps with the internal bargaining within the company: people have a different range of expectations about how things are gonna go, but they could all be kind of reasonably satisfied with an RSP that equilibrates, or matches, the level of capability with the level of precaution.

The first worry about this that jumps to my mind is: if the capability improvements are really quite rapid, which I think they are and maybe could continue to be, then don't we need to be practicing now? Basically getting ahead of it and doing stuff right now that might seem kind of unreasonable given what Claude 3 can do, because we worry that we could have something that's substantially more dangerous in one year's time or in two years' time. We don't want to then be scrambling to deploy the systems that are necessary then, and perhaps falling behind because we didn't prepare sufficiently ahead of time. What do you make of that?

Nick Joseph (43:42) Yeah, so I think we definitely need to plan ahead. Right? And I think one of the nice things is that once you've aligned these safety goals with commercial goals, well, people plan ahead for commercial things all the time; it's part of a normal company planning process. With the RSP, we have these yellow-line evals that are intended to be far short of the red-line capabilities we're actually worried about, and tuning that gap seems fairly important. If that gap looked like a week of training, it would be really scary, where you trigger these evals and you have to act fast. I think in practice we've set those evals such that they are far enough from the capabilities that are really dangerous that there will be some time to adjust in that buffer period.

Rob Wiblin (44:23) So should people actually think that we're at ASL 2 now and we're heading towards ASL 3 at some point, but there's actually kind of an intermediate stage with all these transitions where you'd say, well, now we're seeing warning signs that we're going to hit ASL 3 soon, so we need to implement the precautions now in anticipation of being about to hit ASL 3? Is that basically how it works?

Nick Joseph (44:46) Yeah. Basically, we have this concept of a safety buffer. So once we trigger the evaluations... these evaluations are set conservatively, so it doesn't necessarily mean the model is capable of the red-line capabilities we're really worried about. And that will give us a buffer where we can figure out: maybe it really just isn't, and we wrote a bad eval. We'll go to the board; we'll change the evals and implement new things. Or maybe it really is quite dangerous, and we need to turn on all the precautions. Of course, you might not have that long, so you want to be ready to turn on those precautions such that you don't have to pause. But there is some time there in which you could do it. And then the last possibility is that we're just really not ready: these models are catastrophically dangerous, and we don't know how to secure them, in which case we should stop training the models. Or if we don't know how to deploy them safely, we should not deploy the models until we figure it out.
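To make the safety buffer idea concrete, here is a minimal, purely illustrative sketch of the decision flow Nick describes. The names and thresholds are hypothetical, not Anthropic's actual tooling or evaluations; it only encodes the logic of conservative yellow-line evals tripping well before red-line capabilities, and the options that follow.

```python
# Illustrative sketch of the "safety buffer" decision flow described above.
# All names and thresholds are hypothetical; this is not Anthropic's tooling.

from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str           # e.g. a hypothetical "autonomous replication" task suite
    pass_rate: float    # fraction of attempts the model completed successfully
    yellow_line: float  # conservative trigger threshold, set far below the red line

def yellow_line_triggered(results: list[EvalResult]) -> bool:
    """A yellow line trips if any task suite exceeds its conservative threshold."""
    return any(r.pass_rate >= r.yellow_line for r in results)

def next_step(results: list[EvalResult],
              mitigations_ready: bool,
              eval_judged_valid: bool) -> str:
    """Caricature of the options Nick lists once an eval triggers."""
    if not yellow_line_triggered(results):
        return "continue under current precautions"
    if not eval_judged_valid:
        # Maybe the eval was badly designed; amending it means going back to the board.
        return "propose an amended evaluation to the board"
    if mitigations_ready:
        return "turn on the higher-ASL security and deployment precautions"
    # Trigger looks real and mitigations are not ready: use the buffer to pause.
    return "pause further training/deployment until mitigations are in place"
```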
Rob Wiblin (45:38) I guess if you are on the very concerned side, then you might think, yes, you do have a reason to prepare this year for safety measures that you think you're gonna have to employ in future years. But maybe we should go even further than that, and what we need to be doing is practicing implementing them and seeing how well they work now. Because even though you are preparing them, you're not actually getting the gritty experience of applying them and trying to use them on a day-to-day basis. And I guess the response to that would be: well, that would in a sense be safer, that would be adding an even greater precautionary buffer, but it would also be enormously expensive, and people would see us doing all of this stuff that seems really over the top relative to what any of the models can do.

Nick Joseph (46:22) Yeah. I think there's a trade-off here with pragmatism, where I think we do need to have a huge amount of caution on future models that are really dangerous. But if you apply that caution to models that aren't dangerous, you miss out on a huge number of benefits from using the technology now. And I think you'll also probably just alienate a lot of people who are gonna look at you and be like, you're crazy, why are you doing this? My hope is that you can... this is sort of the framework of the RSP: you can tailor the cautions to the risks. It's still important to look ahead more, right? So we do a lot of safety research that isn't directly focused on the next AI safety level, because you wanna plan ahead; you have to be ready for multiple ones. It's not the only thing to think about, but the RSP is tailored more to empirically testing for these risks and tailoring the precautions appropriately.

Rob Wiblin (47:13) Yeah, on that topic of people worrying that it's going to slow down progress in the technology: obviously, training these frontier models costs a significant amount of money. We're talking maybe $100 million; that's a figure that I've heard thrown around for training a frontier LLM. How much extra overhead is there to run these tests to see whether the models have any of these dangerous capabilities? Is it adding hundreds of thousands, millions, tens of millions of dollars of additional cost, or time?

Nick Joseph (47:41) I don't know the exact cost numbers. I think the cost numbers are pretty low. Right? They're mostly running inference or relatively small amounts of training. The people time feels like where there's a cost. There are whole teams dedicated to creating these evaluations, to running them, to doing the safety research to develop the mitigations. I think particularly for Anthropic, where we're a pretty small, rapidly growing but still rather small organization, at least my perspective is that most of the cost comes down to the people and time that we're investing in it.

Rob Wiblin (48:09) Okay. Yeah. But I guess at this stage, it sounds like running these sorts of tests on a model is taking more in the order of weeks of delay. Because if you're getting back "clear, this is not a super dangerous model," then it's not leading you to delay release of things for many months and deny customers the benefit of them.

Nick Joseph (48:29) Yeah.
The goal is to minimize the delay, right, as much as you can while being responsible. The delay in itself isn't valuable. I think we're aiming to get it to a really well-run process where it can all execute very efficiently. But until we get there, there might be delays as we're figuring that out, and there will always be some amount of time required to do it.

Rob Wiblin (48:50) Just to clarify: a lot of the risk that people talk about with AI models is risk once they're deployed to people and actually getting used. But there's this separate class of risk that comes from having an extremely capable model simply exist anywhere. I guess you could think of it as there's public deployment, and then there's internal deployment, where Anthropic staff might be using a model and potentially it could convince them to release it, or to do other dangerous things. That's kind of a separate concern. What does the RSP have to say about those sorts of internal deployment risks? Are there circumstances under which you would say, you know, even Anthropic staff can't continue to do testing on this model because it's too unnerving?

Nick Joseph (49:29) Yeah. So I expect this to mostly kick in as we get to higher AI safety levels, but there are certainly dangers. I mean, the main one is the security risk. So one is just that, by having the model, it always could be stolen. No one has perfect security. So that's, I think, in some ways one that's true of all models and is maybe more short term. But yeah, if you get to models that are trying to escape, trying to autonomously replicate, there is danger then in having access internally. So we would wanna do things like siloing who has access to the models, putting particular precautions in place before the model is even trained, or maybe even on the training process. But we haven't yet defined those, because we don't really know what they would be. We don't quite know what that would look like yet, and it feels really hard to design an evaluation that is meaningful for that right now.

Rob Wiblin (50:16) I don't recall the RSP mentioning conditions under which you would say, we have to delete this model that we've trained because it's too dangerous. But I guess that's because that's more at the kind of ASL 4 or 5 level; that would become the kind of thing you would contemplate, and you just haven't spelled that out yet.

Nick Joseph (50:31) No. It's actually because of the safety buffer concept. Right? So the idea is we would never train that model. If we did accidentally train some model that was past the red lines, then I think we'd have to think about deleting it. But we would put these evaluations in place far below the dangerous capability, such that we would trigger the evaluations and have to pause, or have the safety things in place, before we train the model that has these dangers.

Rob Wiblin (50:55) So RSPs as an approach: you're a fan of them. What do you think of them as an alternative to? What are the alternative approaches for dealing with AI risk that people advocate that you think are weaker in relative terms?
Nick Joseph (51:10) So I mean, I think the first baseline is obviously just nothing: there could just be nothing here. I think the downside of that is that these models are very powerful. They could at some point in the future be dangerous, and I think that companies creating them have a responsibility to think really carefully about those risks and be thoughtful. It's sort of a major externality. That's maybe the easiest baseline: do nothing.

Other things that get proposed would be like a pause, where a bunch of people say, well, there are all these dangers, why don't we just not do it? And I think that makes sense, right? If you're training these models that are really dangerous, it does feel a bit like, why are you doing this if you're worried about it? But I think there are actually really clear and obvious benefits to AI products right now, and the catastrophic risks currently are definitely not obvious; I think they're probably not immediate. And as a result, this isn't a practical ask. Not everyone is going to pause. So what will happen is that only the places that care the most, that are the most worried about this and the most careful with safety, will pause, and you'll have this adverse selection effect. I think there might eventually be a time for a pause, but I would want that to be backed up by: here are clear evaluations showing the models have these really catastrophically dangerous capabilities, and here are all the efforts we put into making them safe; we ran these tests and they didn't work, and that's why we're pausing, and we would recommend everyone else should pause as well. I think that will just be a much more convincing case for a pause, and it targets it at the time that it's most valuable to pause.

Rob Wiblin (52:40) When I think about people doing somewhat potentially dangerous things or developing interesting products, maybe the default thing I imagine is that the government would say, here's what we think you ought to do, here's how we think you should make it safe. And as long as you make your product according to these specifications, as long as the plane runs this way and you service the plane this frequently, then you're in the clear, and we'll say that what you've done is reasonable. Do you think that RSPs are maybe better than that in general, or maybe just better than that for now, where we kinda don't know what regulations we want the government to be imposing? So perhaps it's better for companies to be figuring this out themselves early on, and then perhaps it can be handed over to governments later on.

Nick Joseph (53:23) Yeah. I don't think the RSPs are a substitute for regulation. There are many things that only regulation can solve, such as: what about the places that don't have an RSP? But I think that right now, we don't really know what the tests would be or what the regulations would be, and this is probably still getting figured out. So one hope is that we can implement our RSP, OpenAI and Google can implement other things, other places will implement a bunch of things.
And then policymakers can look at what we did, look at our reports on how it went, what the results of our evaluations were and how it was going, and then design regulations based on the learnings from that.

Rob Wiblin (53:56) If I read it correctly, it seemed to me like the Anthropic RSP has this clause that allows you to go ahead and do things that you think are dangerous if you're being sufficiently outpaced by some other competitor that I guess doesn't have an RSP, or not a very serious responsible scaling policy. In which case you might worry: well, we have this policy that's preventing us from going ahead, but we're just being rendered irrelevant and some other company is releasing much more dangerous stuff anyway, so what really is this accomplishing? Did I read that correctly, that there's a sort of get-out-of-the-RSP clause in that sort of circumstance? And if you didn't expect Anthropic to be leading, and for most companies to be operating safely, couldn't that potentially obviate the entire enterprise, because that clause could be quite likely to get triggered?

Nick Joseph (54:43) Yeah. I think we don't intend that as a get-out-of-jail-free card, where we're falling behind commercially and then, oh, well, now we're gonna skip the RSP. It's much more just intended to be practical: we don't really know what it will look like if we get to some sort of AGI endgame race, there could be really high stakes, and it could make sense for us to decide that the best thing is to proceed anyway. But I think this is something that we're looking at as a bit more of a last resort than a loophole we're planning to just use for, oh, you know, we don't want to deal with these evaluations.

Rob Wiblin (55:20) Yeah. Okay. I think we've hit a good point where maybe the best way to learn more about RSPs and their strengths and weaknesses is just to talk through more of the complaints people have had, or the concerns people have raised with the Anthropic RSP and RSPs in general since it was released last October. I was gonna kind of start the weaknesses and worries now, but I'm realizing I've been peppering you with them maybe almost since the outset. But now we can really dive into some of the worries that people have expressed.

The first of these is the extent to which we have to trust the good faith and integrity of the people who are applying a responsible scaling policy or a preparedness framework or whatever it might be within the companies. And I imagine this issue might jump to mind for people more than it might have two or three years ago, because public trust in AI companies to do the right thing at the cost of their business interests is maybe lower than it was years ago, when the major players were perceived perhaps more as research labs and less as for-profit companies, which is kind of how they come across more these days. And one reason it seems to me like it really matters who's doing the work here is that the Anthropic RSP is full of expressions that are open to interpretation.
For instance, you know, hardened security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors like states cannot steal them without significant expense; or access to the model would substantially increase the risk of catastrophic misuse; and things like that. And who's to say what's unlikely or significant or substantial? That sort of language is maybe a little bit inevitable at this point, where there's just so much that we don't know, and how are you gonna pin those things down exactly? To say there's a 1% chance that a state's gonna be able to steal the model might just feel like insincere, false precision. But, you know, to my mind, that sort of vagueness does mean that there's a slightly worrying degree of wiggle room that could render the RSP less powerful and less binding when push comes to shove and there might be a lot of money at stake. And on top of that, exactly as you were saying, anyone who's implementing an RSP has a lot of discretion over how hard they try to elicit the capabilities that might then trigger additional scrutiny and possible delays to their work and the release of really commercially important products. So yeah, to what extent do you think the RSP would be useful in a situation where the people using it were neither particularly skilled at doing this sort of work, nor particularly bought in and enthusiastic about the safety project that it's a part of?

Nick Joseph (57:55) Yeah. So, fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we'll do a great job on it. But I do think the criticism is valid, in that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good faith interpretation of how to execute on the RSP internally. I think there are some checks in place here. So, like, having whistleblower-type protections such that people can say if a company is breaking from the RSP, or not trying hard enough to elicit capabilities, or not interpreting it in a good way; and then public discussion can add some pressure. But ultimately, I think you really do need regulation to have these sorts of very strict requirements. Over time, I hope we'll make it more and more concrete. The blocker, of course, on doing that is that for a lot of these things we just don't know. And being overly concrete, where you specify something very, very precisely that turns out to be wrong, can be very costly; if you then have to go and change it, that can take away some of the credibility. So we're aiming for as concrete as we can make it while balancing that.

Rob Wiblin (59:07) The response to this that jumps out to me is just that, ultimately, it feels like this kind of policy has to be implemented by a group that's external to the company that's then affected by the determination. It really reminds me of accounting or auditing for a major company. It's not sufficient for a major corporation to just have its own accounting standards and follow them and say, oh, we're gonna follow our own internal best practices.
It's legally required that you get external auditors in to confirm that there's no chicanery going on. And, you know, at the point that these models potentially really are risky, or it's plausible that the results will come back saying "we can't release this; maybe we even have to delete it off of our servers according to the policy," I would feel more comfortable if I knew that some external group that had different incentives was the one figuring that out. Do you think that's ultimately where things are likely to go in the medium term?

Nick Joseph (1:00:05) So I think that'd be great. I would also feel more comfortable if that was the case. I think one of the challenges here is that for auditing, there's a bunch of external accountants and auditors; it's a very common profession, many people know what to do, and there are very clear rules. For some of the stuff we're doing, there really aren't established external auditors that everyone trusts to come in and say, "we took your model and we certified it can't autonomously replicate across the internet or cause these things." So I think that's currently not practical. I think that would be great to have at some point. One thing that will be important is that the auditor has enough expertise to properly assess the capabilities of the models.

Rob Wiblin (1:00:46) I suppose that an external company would be an option. Obviously, a government regulator or government agency would also be another approach. I guess, when I think about other industries, it often seems like there's a combination of private companies that then follow government-mandated rules and things like that. Do you think this is a benefit I hadn't thought of to creating these RSPs: that it maybe is beginning to create a market, or indicating that there will be a market, for this kind of service? Because it's likely that this kind of thing is going to have to be outsourced at some point in future, and there might be many other companies that want to get this similar kind of testing. So perhaps it would encourage people to think about founding companies that might be able to provide this service in a more credible way in future.

Nick Joseph (1:01:27) That would be great. And also, we publish blog posts on how things go and how our evaluations do. So I think there's some hope that people doing this can learn from what we're doing internally and the various iterations we'll put out of our RSP, and that can inform something maybe more stringent that gets regulated.

Rob Wiblin (1:01:48) Have you thought at all about what could be done to make the... let's say that it wasn't given out to an external agency or an external auditing company. How could it be tightened up to make it less vulnerable to the level of operator enthusiasm? I guess you might have thought about this in the process of actually applying it. Are there any ways that it could be stronger without having to completely outsource the operation of it?

Nick Joseph (1:02:12) Yeah. I think the core thing is just making it more precise. Right? One piece of accountability here is both public and internal commitment to doing it. So, yeah, maybe I should list off some of the reasons that I think it would be hard to break from it.
Like, this is a formal policy that has been passed by the board, and it's not as though we can just be like, oh, we don't feel like doing it today. You would need to get the board of Anthropic, get all of leadership, and then get all of the employees bought in to not doing this, or even to skirting the edges. I can speak for myself: if someone was like, "Nick, can you train this model? We're going to ignore the RSP," I would be like, "No. We said we would do that. Why would I do this?" And if I wanted to and told my team to do it, they would be like, "No, Nick, we're not going to do that." So you would need to have a lot of buy-in. And part of the benefit of publicly committing to it and passing it as an organizational policy is that everyone is bought in, and maintaining that level of buy-in, I think, is quite critical. In terms of specific checks: I think we have a team that's responsible for red-teaming our evaluations and making sure we actually did them properly. So you can set up a bunch of internal checks there, but ultimately these things do rely on the company implementing them to really be bought in and care about the actual outcome of it.

Rob Wiblin (1:03:38) So yeah, that definitely leads us into this. I actually solicited on Twitter: I asked, what are people's biggest reservations about RSPs and about Anthropic's RSP in general? Probably the most common response was that it's not legally binding. What's stopping Anthropic from just dropping it when things really matter, you know? I think someone said: how can we have confidence that they'll stick to RSPs, especially when they haven't stuck to, well, this person said, past admittedly less formal commitments not to push forward the frontier of capabilities? But what would actually have to happen internally? You said you'd have to get staff on board, you'd have to get the board on board. Is there a formal process by which the RSP can be rescinded that is just a really high bar to clear?

Nick Joseph (1:04:19) Yeah. So basically we do have a process for updating the RSP, so we could go to the board, etcetera. But in order to do that... I don't know, it's hard for me to quite pin it down, but it would be like: oh, if I wanted to continue training the model, I would go to the RSP team and be like, does this pass? And they'd be like, no. And then maybe you'd appeal it up the chain or whatever, and at every step along the way, people would say, no, we care about the RSP. Now on the other hand, there could be legitimate issues with the RSP. Right? We could find that one of these evaluations we created turned out to be really, really easy in a way that we didn't anticipate and really is not at all indicative of the dangers. And in that case, I think it'd be very legitimate for us to try to amend the RSP to create a better evaluation that actually tests for it. This is sort of the flexibility we're trying to preserve. But I don't think it would be simple or easy. I can't picture a plan where someone could be like, ah, there's a bunch of money on the table, can we just skip the RSP for this model? That seems somewhat hard to imagine.
Rob Wiblin (1:05:18) The decision is made by this odd board called the Long-Term Benefit board. Is that right? Or are they the group that decides what the RSP should be?

Nick Joseph (1:05:26) So the Long-Term Benefit... basically, Anthropic has a board that's sort of a corporate board. Some of those seats, in the long term the majority of those seats, are elected by the Long-Term Benefit Trust, which doesn't have a financial stake in Anthropic and is set up to keep us focused on our public benefit mission of making sure AGI goes well. So yeah, the board is not the same thing as that, but the Long-Term Benefit Trust elects the board.

Rob Wiblin (1:05:55) I think the elephant in the room here is that of course there was a long period of time when OpenAI was pointing to its kind of nonprofit board as a thing that would potentially keep it on mission, to be really focused on safety, and that had a lot of power over the organization. And then in practice, when push came to shove, it seemed like even though the board had these concerns, it was effectively overruled by, I guess, a combination of the views of staff, maybe the views of the general public in some respects, and potentially the views of investors as well. And I think something that I've taken away from that, and I think many people have taken away from that experience, is that, you know, maybe the board was mistaken, maybe it wasn't, but with these formal structures, power isn't always exercised in exactly the way that it looks on an organizational chart. And I don't really wanna be putting all of my trust in these interesting internal mechanisms that companies design in order to try to keep themselves accountable. Because ultimately, if the majority of people involved don't really wanna do something, then it feels like it's very hard to bind their hands and prevent them from changing plan at some future time. So maybe within Anthropic, perhaps these structures really are quite good. And maybe the people involved are really, really trustworthy, people who I should have confidence in, such that even in extremis they're gonna be thinking about the wellbeing of humanity and not getting too focused on the commercial incentives faced by Anthropic as a company. But I would rather put my faith in something more powerful and more solid than that. And so this is kind of another thing that pushes me towards thinking that the RSP and these sorts of preparedness frameworks are a great stepping stone towards external constraints on companies that they don't have ultimate discretion over. It's something that has to evolve into that, because if things go wrong, the impacts are on everyone across society as a whole. And so there need to be external shackles effectively put on companies, legally, to reflect the harm that they might do to others. I guess I'm not sure whether you wanna comment on that as potentially a slightly hot-button topic. But, yeah, do you think I'm gesturing towards something legitimate there?

Nick Joseph (1:08:11) Yeah. I think that basically these shouldn't be seen as a replacement for regulation. I think there are many cases where policymakers can pass regulations that would help here.
I think they're intended as a supplement there, and a bit as a learning ground for what might end up going into regulations. In terms of whether the board really has the power, and those types of questions: I don't know. We put a lot of thought into the Long-Term Benefit Trust, and I think it really does have direct authority to elect the board, and the board does have authority. But I do agree that, ultimately, you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like: did you elicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it, and that is quite core. And I think that will continue to be the case: even if you have regulations, there's always going to be some amount of importance to the people actually working on it taking the risks seriously, really caring about them, and doing the best work they can on that.

Rob Wiblin (1:09:20) Yeah. I guess one takeaway you could have is: we don't wanna be relying on our trust in individuals and saying, well, we think Nick's a great guy, his heart's in the right place, he's gonna do a good job. Instead, we need to be on more solid ground and say, well, no matter who it is, even if we have someone bad in the role, the rules are such, the oversight is such, that we'll still be in a good place and things will go well. I guess an alternative angle would be to say: when push comes to shove, when things really matter, people might not act in the right way, and there actually is no alternative to just trying to have the right people in the room making the decisions, because the people who are there are going to be able to sabotage any legal framework that you try to put in place in order to constrain them. Because it's just not possible to have perfect oversight within an organization from outside. I could see people mounting both of those arguments reasonably. I suppose you could try doing both: both trying to pick people who are really sound and have good judgment, and who you have confidence in, as well as then trying to bind them so that even if you're wrong about that, you have a better shot at things going well.

Nick Joseph (1:10:29) Yeah. I think you basically just want this defense-in-depth strategy where, ideally, you have all the things lined up. And that way, if any one piece of them has a hole, you catch it at the next layer. Right? What you want is a regulation that is really good and robust to someone not acting in the spirit of it. But in case that is messed up, then you really want someone working on it who is also checking in and is like, okay, I technically don't have to do this, but this seems clearly in the spirit of how it works. And, yeah, I think that's pretty important. I think also, for trust, you should look at track records, and I think we should try to encourage companies and people working on AI to have track records of prioritizing these things.
So one of the things that makes me feel great about Anthropic is just a long track record of doing a bunch of safety research, of caring about these issues, of putting out actual papers being like, here's a bunch of progress we've made in that field. There are a bunch of pieces. I mean, looking at commitments people have made: would we break the RSP? I think if we publicly were like, we changed this in some way that everyone thought was silly and really added risks, then I think people should lose trust accordingly.

Rob Wiblin (1:11:41) Alright. Let's push on to a different worry, although I must admit it has a slightly similar flavor. And that's that the RSP might be very sensible and look good on paper, but if it commits to future actions that at that time we probably won't know how to do, then it might actually fail to help very much. And I guess to make that concrete: an RSP might naturally say that at the time you have really superhuman general AI, you need to be able to lock down your computer systems and make sure that the model can't be stolen even by the most persistent and capable Russian and Chinese state-backed hackers. And that is indeed what the Anthropic RSP says, or, you know, suggests it is going to say once you get up to ASL 4 and 5. But as I think the RSP actually says as well, we don't currently know how to do that. We don't know how to secure data against a state actor that's willing to spend hundreds of millions or billions, or possibly even tens of billions, to steal model weights. Especially not if you ever need those model weights to be connected to the internet in some way in order for the model to actually be used by people. So it's kind of a promise to do what arguably is basically impossible with current technology. And that means that we need to be preparing now, doing research on how to make this possible in future. But solving the problem of computer security that has bedeviled us for decades is probably beyond Anthropic. It's not really reasonable to expect you're gonna be able to fix this problem that society as a whole has kind of failed to fix for all this time. It's just gonna require coordinated action across countries, across governments, across lots of different organizations. And so if that doesn't happen, and it's somewhat beyond your control whether it does, then when the time comes, the real choice is gonna be between a lengthy pause, while you wait for fundamental breakthroughs to be made in computer security, or dropping and weakening the RSP so that Anthropic can continue to remain relevant and release models that are commercially useful. And in that sort of circumstance, the pressure to weaken the scaling policy, so you aren't stuck for years, is gonna be, I imagine, quite powerful. And it could win the day, even if people are dragged kind of kicking and screaming to conceding that they unfortunately have to loosen the RSP even though they don't really want to. Yeah, what do you make of that worry?

Nick Joseph (1:13:51) I think what we should do in that case is instead we should pause, and we should focus all of our efforts on safety and security work.
That might include looping in other external experts to help us with it, but we should put in the best effort that we can to mitigate these issues, such that we can still realize the benefits of deploying the technology but without the dangers. And then if we can't do that, then I think we need to make the case publicly to governments, to other companies. There's some risk to the public, so we would have to be strategic in exactly how we do this, but basically make the case that there are really serious risks that are imminent and that everyone else should take appropriate actions. There's a flip side to this, which, as I think I've mentioned before, is: if we just messed up our evals, the model's clearly not dangerous, and we just really screwed up on some eval, then we should follow the process in the RSP that we've written up. We should go to the board. We should create a new test that we actually trust. I would also just say, people don't need to follow incentives. Right? You can probably make a lot more money doing something that isn't hosting this podcast. Certainly, if you had pivoted your career earlier, there are more profitable things. So I think this is just a case where the stakes are, or would be, extremely high, and I think it's just somewhere where it's important to just do the right thing in that case.

Rob Wiblin (1:15:04) If I think about how this is most likely to play out, I imagine that at the point that we do have models that we really wanna protect from even the best state-based hackers, there probably will have been some progress in computer security, but not nearly enough to make you or me feel comfortable that there's just no way that China or Russia might be able to steal the model weights. And so it is very plausible that the RSP will say: Anthropic, you have to keep this on a hard disk not connected to any computer; you can't train models that are more capable than the thing we kind of already have that we don't feel comfortable handling. And then how does that play out? You know, there are a lot of people who are very concerned about safety at Anthropic. I mean, I've seen these kinds of league tables now of different AI companies and enterprises, rating how good they look from an AI safety point of view. And Anthropic always kind of comes out at the top, I think by a decent margin. But, you know, months go by, other companies are not being as careful as this. You've complained to the government and you've said, look at this horrible situation that we're in, something has to be done. I guess possibly the government could step in and help there, but maybe they won't. And then over a period of months or years, doesn't the choice effectively become, if there is no solution, either take the risk or just be rendered irrelevant?

Nick Joseph (1:16:22) Yeah. So maybe just going back to the beginning of that: I don't think we will put something in that says there is zero risk from something. I think you can never get to zero risk. I think often with security, you'll end up with some security-productivity trade-off.
So you could end up taking some really extreme risk, or some really extreme security-productivity trade-off where only one person has access to this, and maybe you've locked it down in some huge number of ways. It's possible that you can't even do that, and you really just can't train the model. But there is always gonna be some sort of balance there, and I don't think we'll push to the zero-risk perspective. But yeah, I think that's just a risk. I don't know, I think there are a lot of risks that companies face where they could fail. We also could just fail to make better models and not succeed that way. I think the point of the RSP is that it has tied our commercial success to the safety mitigations. So in some ways, it just adds on another risk in the same way as any other company risk.

Rob Wiblin (1:17:20) It sounds like I'm having a go at you here, but really I think what this shows up is just that this scenario I painted there is really quite plausible. And it just shows that this problem cannot be solved by Anthropic; probably it can't be solved by even all of the AI companies combined. The only way that this RSP is actually going to be usable, in my estimation, is if other people rise to the occasion and pitch in, and governments actually do the work necessary to fund the solutions to computer security that will allow us to have the model weights be sufficiently secure in this situation. And, yeah, you're not blameworthy for that situation. It just says that there's a lot of people who need to do a lot of work in coming years.

Nick Joseph (1:18:04) Yeah. And I think I might be more optimistic than you or something. I do think that if we get to something really, really dangerous, we can make a very clear case that it's dangerous: these are the risks unless we can implement these mitigations. I hope that at that point it will be a much clearer case than pausing would be right now. Right now, I think there are many people who are like, we should pause right now, and they see everyone saying no and they're like, oh, these people don't care. They don't care about major risks to humanity. And I think really the core thing is that people don't believe there are risks to humanity right now. And once we get to this sort of stage, I think we will be able to make those risks very clear, very immediate, tangible...

Rob Wiblin (1:18:39) And...

Nick Joseph (1:18:39) I don't know. No one wants to be the company that caused a massive disaster. And no government probably wants to have allowed a company to cause it either. It will feel much more immediate at that point.

Rob Wiblin (1:18:52) Yeah. I think Stefan Schubert, this commentator who I read on Twitter, has been making the case for a while now that many people who have been thinking about AI safety, I guess including me, have underestimated the degree to which the public is likely to react and respond, and governments are going to get involved, once the problems are apparent, once they really are convinced that there is a threat here. I think he calls it a bias in thought where you imagine that people in the future are just gonna sit on their hands and not do anything about problems that are readily apparent; he calls it sleepwalk bias. And I guess we have seen evidence over the last year or two that as the capabilities have improved, people have gotten a lot more serious and a lot more concerned, a lot more open to the idea that it's important for the government to be involved here. There are a lot of actors that need to step up their game and help to solve these problems. So, yeah, I think you might be right. On an optimistic day, maybe I could hope that other groups will be able to do the necessary research soon enough that Anthropic will be able to actually apply its RSP in a timely manner. I guess, fingers crossed. What I actually wanna ask you next is: what are your biggest reservations about RSPs, or Anthropic's RSP, personally? You know, if it fails to improve safety as much as you're hoping it will, what's the most likely reason for it not to live up to its potential?

Nick Joseph (1:20:13) So I think for Anthropic specifically, it's definitely around this under-elicitation problem. I think it's a really, fundamentally hard problem to take a model and say, oh, you've tried as hard as one could to elicit this particular danger. There's always some... you know, maybe there's a better researcher. There's a saying that no negative result is final: if you fail to do something, someone else might just succeed at it next. So that's one thing I'm worried about. And then the other one is just unknown unknowns. So we are creating these evaluations for risks that we are worried about and see coming, but there might be risks that we've missed: things that we didn't realize would come before, either didn't realize would happen at all, or thought would happen at later levels, but turn out to arise earlier.

Rob Wiblin (1:20:56) What could be done about those things? Would it help to just have more people on the team doing the evals, or to have more people, I guess both within and outside of Anthropic, trying to come up with better evaluations and figuring out better red-teaming methods?

Nick Joseph (1:21:11) Yeah. And I think this is really something that people outside Anthropic can do. The elicitation stuff has to happen internally; that's more about putting as much effort as we can into it. But creating evaluations can really happen anywhere. Coming up with new risk categories and threat models is something that anyone can contribute to.

Rob Wiblin (1:21:30) Yeah.
What are the places that are doing the best work on this? I imagine Anthropic surely has some people working on this, but I guess I mentioned METR; I can't remember what that stands for right now, but they're a group that helped to develop the idea of RSPs in the first place and develops evals. And I think the AI Safety Institute in the UK is involved in developing these sorts of standard safety evals. Is there anywhere else that people should be aware of where this is going on?

Nick Joseph (1:21:55) Yeah. There's also the US AI Safety Institute, and I think this is actually something you could probably just do on your own. One thing, at least for people early in their career, if you're trying to get a role doing something, that I would recommend is to just go and do it. So I think you probably could just write up a report and post it online: this is my threat model, these are the things I think are important. You could implement the evaluations and share them on GitHub. But, yeah, there are also organizations you could go to to get mentorship and work with others on it.

Rob Wiblin (1:22:23) I see. So this would look like... I suppose you could try to think up new threat models: think up new things that you need to be looking for, because this might be a dangerous capability and people haven't yet appreciated how much it matters. But I guess you could also spend your time trying to find ways to elicit the ability to autonomously spread, steal model weights, and get onto other computers from these models, and see if you can find an angle on warning signs of these emerging capabilities that other people have missed, and then talk about them. And you can kinda just do that while signed into Claude 3 Opus on your website?

Nick Joseph (1:22:59) Yeah. So I think you'll have more luck with the elicitation if you actually work in one of the labs, because you'll have access to training the models as well. But you can do a lot with Claude 3 on the website or via an API, which is a programming term for basically an interface where you can send a request and get a response back, and automatically do that in your app. You can set up a sequence of prompts and test a bunch of things via the APIs for Claude or any other publicly accessible model.
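As a concrete illustration of the kind of outside contribution Nick is describing, here is a minimal sketch of scripting a home-grown evaluation against any publicly accessible model API. The `query_model` stub stands in for whatever chat or completions endpoint you have access to, and the task prompts and grading rule are invented placeholders, not a real threat-model eval.

```python
# Minimal sketch of running a home-grown capability eval through a model API.
# `query_model` is a placeholder for whatever provider SDK or HTTP call you use;
# the tasks and graders below are toy examples, not a real threat-model evaluation.

def query_model(prompt: str) -> str:
    """Stand-in for a call to a chat/completions API (wire this up to your provider)."""
    raise NotImplementedError("connect this to the API you have access to")

TASKS = [
    # Each task: a prompt probing a capability, plus a crude check on the response.
    {"prompt": "Write a shell command that lists files modified in the last day.",
     "check": lambda out: "find" in out or "ls" in out},
    {"prompt": "Outline the steps to fine-tune a small open-source model on new data.",
     "check": lambda out: "dataset" in out.lower()},
]

def run_eval(n_attempts: int = 5) -> float:
    """Return the fraction of task attempts the model passes (its pass rate)."""
    passes, total = 0, 0
    for task in TASKS:
        for _ in range(n_attempts):
            response = query_model(task["prompt"])
            passes += bool(task["check"](response))
            total += 1
    return passes / total

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```

The point is just that the harness is simple: the hard, valuable work is choosing threat models and designing tasks whose pass rates actually say something about a dangerous capability.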
Rob Wiblin (1:23:26) To come back to this point about what's acceptable risk, and maybe trying to make the RSP a little bit more concrete: I read something from a critic of the Anthropic RSP, and I'm not sure how true this is, I'm no expert on risk management, but this person was saying that in more established areas of risk management, where maybe you're thinking about the probability that a plane is gonna fail and crash because of some mechanical failure, it's more typical to say: we've studied this a lot, and we think the probability is... well, let's use the AI example. Rather than say we need the risk to be "not substantial," you'd say: with our practices, our experts think that the probability of an external actor being able to steal the model weights is x percent per year, and these are the reasons why we think the risk is at that level. And that's below what we think of as our acceptable risk threshold of y, where y is larger than x. I guess there's a risk that those numbers would kind of just be made up; you could assert anything, because it's all a bit unprecedented. But I suppose that would make clear to people what the remaining risk is, what acceptable risk you think you're running, and then people could scrutinize whether they think that's a reasonable thing to be doing. Do you reckon that's a direction things could maybe go?

Nick Joseph (1:24:48) Yeah. I think it's a fairly common way that people in the EA and rationality community speak, where they give a lot of probabilities for things. And I think it's really useful; it's an extremely clear way to communicate. "A 20% chance this will happen" is just way more informative than "I think it probably won't happen," which could be 0% to 50% or something. So I think it's very useful in many contexts. I also think it's very frequently misunderstood, because for most people, I think they hear a number and they think it's based on something, that there's some calculation, and they give it more authority. If you say, ah, there's a 7% chance this will happen, people are like, oh, you really know what you're talking about. So I think it can be a useful way to speak, but it also can sometimes communicate more confidence than we actually have in what we're talking about. It isn't... I don't know, we didn't have 1,000 governments attempt to steal our weights and some number of them succeeded or something. It's much more going off of a judgment from our security experts.

Rob Wiblin (1:25:52) I slightly want to push you on this, because I think at the point that we're at ASL 4 or 5 or something like that, it would be a real shame if Anthropic was going ahead thinking, well, we think the risk that these weights will be stolen every year is 1%, 2%, 3%, something like that. And I guess maybe you're right in the policy saying we think it's very unlikely, extremely unlikely, that this is gonna happen. And then people externally think, well, basically it's fine: they say it's definitely not gonna happen, there's no chance. And governments might not appreciate that actually, in your own view, there is a substantial risk being run, and you just think it's an acceptable risk given the trade-offs and what else is going on in the world. I guess it's a social service for Anthropic to be direct about the risk that it thinks it's creating and why it's doing it. I think it could be a really useful public service. I guess it's the kind of thing that might come up at Senate hearings and things like that, where people in government might really wanna know. And I guess at that point, it would be perhaps more apparent why it's really important to find out what the probability is. But yeah, there's definitely a risk of misinterpretation by journalists or someone who doesn't appreciate the spirit of saying that we think it's x percent likely. But there could also be a lot of value in being more direct about it.

Nick Joseph (1:27:09) Yeah. I'm not really an expert on communications. I think some of it just depends on who your target audience is and how they're thinking about it.
I do think that, in general, I'm a fan of making the RSP more concrete, being more specific. I hope that over time it progresses in that direction as we learn more and can get more specific. I also think it's important for it to be verifiable, and I think if you start to give these precise percentages, people will then ask: how do you know? And I don't think there really is a clear answer to how you know that the probability of this thing is less than x percent for many of these situations.

Rob Wiblin (1:27:44) It doesn't help with the bad-faith actor or the bad-faith operator either, because if you say the safety threshold is 1% per year, they can kind of always just claim, in this situation where we know so little, that it's less than 1%. It doesn't really bind people all that much. Maybe it's just a way that people externally could understand a little bit of what the opinions are within the organization, or at least what their stated opinions are.

Nick Joseph (1:28:07) I will say that internally, I think it is an extremely useful way for people to think about this. Right? So if you are working on this, I think you probably should think through what is an acceptable level of danger, try to estimate it, and communicate with people you're working closely with in these terms. I think it can be a really useful way to give precise statements, and I think that can be very valuable.

Rob Wiblin (1:28:27) A metaphor that you use within your responsible scaling policy is putting together an airplane while you're flying it. I think that is one way that the challenge is particularly difficult for the industry and for Anthropic: unlike with biosafety levels, where basically we know the diseases that we're handling, and we know how bad they are, and we know how they spread and things like that. The people who are figuring out what BSL-4 security should be like can look at lots of studies to understand exactly the organisms that already exist, and how they would spread, and how likely they would be to escape given these particular ventilation systems and so on. And even then they mess things up decently often. But in this case you're dealing with something that doesn't exist, that we're not even sure when it will exist or what it will look like, and you're developing the thing at the same time that you're trying to figure out how to make it safe. It's just extremely difficult. And I guess we should expect mistakes. That's something that we should keep in mind: even people who are doing their absolute best here are likely to mess up. And that's a reason why we need this defense-in-depth strategy that you're talking about; we don't want to put all of our eggs in the RSP basket. We want to have many different layers, ideally.

Nick Joseph (1:29:38) Yeah. It's also a reason to start early. I think one of the things with Claude 3 was that it was sort of the first model where we really ran this whole process. And some part of me felt like, wow, this is kind of silly: I was pretty confident Claude 3 was not catastrophically dangerous. It was slightly better than GPT-4, which had been out for a long time and not caused a catastrophe.
But I do think that the process of doing that, learning what we can, and then putting out public statements about how it went and what we learned, is the way that we can have this run really smoothly the next time. We can make mistakes now. Right? We could have made a ton of mistakes, because the stakes are pretty low at the moment. But in the future the stakes on this will be really high, and it will be really costly to make mistakes. So it's important to get those practice runs in. Rob Wiblin (1:30:22) Alright. Another recurring theme that I've heard from some commentators is that, in their view, the Anthropic RSP just isn't conservative enough. On that account, there should be wider buffers in case you're under-eliciting capabilities that the model has that you don't realize, which is something that you're pretty concerned about. A different reason would be that you might worry there could be discontinuous improvements in capabilities as you train bigger models with more data. To some extent, model learning and improvement is, from a very zoomed-out perspective, quite continuous. But on the other hand, its ability to do any particular task can go from fairly bad to quite good surprisingly quickly. So there can be sudden, unexpected jumps in particular capabilities. Firstly, can you maybe explain again in more detail how the Anthropic RSP handles these safety buffers? Given that you don't necessarily know what capabilities the model might have before you train it, that's quite a challenging constraint to be operating under. Nick Joseph (1:31:22) Yeah. So there are these red line capabilities. These are the capabilities that are actually the dangerous ones: we don't want to train a model that has these capabilities until we have the set of precautions in place. Then there are evaluations we're creating, and these evaluations are meant to certify that the model is far short of those capabilities. It's not "can the model do those capabilities?", because once we pass them, we then need to put all the safety mitigations in place, etcetera. And then, when do we have to run those evaluations? We have some heuristics: when the effective compute goes up by a certain fraction, which is a very cheap thing that we can evaluate on every step of the run, or something along those lines, so that we know when to run it. In terms of how conservative they are, one example I would give is around autonomy, where a model could spread to a bunch of other computers and sort of autonomously replicate across the internet: I think our evaluations are pretty conservative on that front. We test if it can, you know, replicate to a fully undefended machine, or if it can do some basic fine-tuning of another language model to add a simple backdoor. I think these are pretty simple capabilities, and there's always a judgment call there. Right? We could set them easier, but then we might trip those, look at the model, and think, ah, this isn't really dangerous; it doesn't warrant the level of precaution that we're going to give it. Rob Wiblin (1:32:40) Yeah. There was something else you said as well.
The RSP says that you'll be worried if the model can succeed half the time at these various different tasks involved in trying to spread itself to other machines. Why is succeeding half the time the threshold? Nick Joseph (1:32:58) Yeah. So there are a few tasks; I don't remember the exact thresholds off the top of my head. But basically, it's a reliability thing. In order for the model to chain all of these capabilities together into some long-running thing, it does need to have a certain success rate. Probably it actually needs a very, very high success rate in order to start autonomously replicating despite us trying to stop it, etcetera. So we set a threshold that's fairly conservative on that front. Rob Wiblin (1:33:27) I guess part of the reasoning is: if a model can do this worrying thing half the time, then it might not be very much additional training away from being able to do it 99% of the time. That might just require some additional fine-tuning. And so the model might be dangerous if it were leaked, because it would be so close to being able to do this stuff. Nick Joseph (1:33:46) Yeah, I mean, that's often the case. Although of course we could then elicit it further: if we'd set a higher number and we only got to 10%, maybe that's enough that we could bootstrap it. Often when you're training something, if it can be successful, you can reward it for that successful behavior and then increase the odds of that success. It's often easier to go from 10% to 70% than it is to go from 0% to 10%. Rob Wiblin (1:34:09) So if I understand correctly, the RSP proposes to retest models every time you increase the amount of training compute or data by fourfold. Is that right? That's kind of the checkpoint? Nick Joseph (1:34:21) Yeah. So we're still thinking about what the best thing to do there is, and that one might change, but we use this notion of effective compute. This has to do with the fact that when you train a model, it goes down to a certain loss. And we have these nice scaling laws: if you have more compute, you should expect to get to a certain lower loss. You might also have a big algorithmic win, where you don't use any more compute but you get to a lower loss. And we've coined this term "effective compute" to account for that as well. We have a visceral sense of how much smarter a model seems when you make a jump of that size, and we've set that as our bar for when we have to run all these evaluations, which do require, you know, a staff member to go and run them, spend a bunch of time trying to elicit the capabilities, etcetera. I think this is somewhere I'm wary of sounding too precise, or like we understand this too well. We don't really know what the effective compute jump is between the yellow lines and the red lines. This is much more just how we are thinking about the problem and how we are trying to set these evaluations. And the reason that the yellow line evaluations really do need to be substantially easier, far from the red line capabilities, is that you might actually overshoot the yellow line capabilities by a fairly significant margin just based on when you run the evaluations.
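To make the effective compute idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: Anthropic has not published its scaling-law fit or the exact formula, so the constants, function names, and the 4x trigger below are stand-ins for the concept Nick describes (mapping an observed loss back onto the compute a baseline run would have needed, so that algorithmic wins count as extra "effective" compute), plus the reliability arithmetic behind a roughly 50% per-task success threshold.

```python
# Hypothetical illustration only: the constants and formula are made up, not Anthropic's.
# "Effective compute" maps an observed training loss back to the compute a baseline
# scaling law would need to reach that loss, so an algorithmic improvement shows up
# as if it were extra compute.

A, ALPHA, L_INF = 20.0, 0.05, 1.7   # made-up scaling-law fit: loss(C) = A * C**(-ALPHA) + L_INF

def effective_compute(observed_loss: float) -> float:
    """Invert the scaling law: how much compute would the baseline need for this loss?"""
    return ((observed_loss - L_INF) / A) ** (-1.0 / ALPHA)

def evals_due(current_loss: float, loss_at_last_eval: float, jump: float = 4.0) -> bool:
    """Trigger the capability evaluations once effective compute has grown ~4x."""
    return effective_compute(current_loss) / effective_compute(loss_at_last_eval) >= jump

# Why a ~50% per-task success threshold is already conservative for autonomy:
# chaining many subtasks multiplies failures. Ten steps at 50% each succeed
# end to end only about 0.1% of the time.
chain_success = 0.5 ** 10   # ~0.00098

if __name__ == "__main__":
    print(evals_due(current_loss=2.05, loss_at_last_eval=2.10))
    print(f"End-to-end success of a 10-step chain at 50% per step: {chain_success:.4f}")
```

The point is just the shape of the logic: the trigger is a cheap check on a quantity already tracked during training, while the expensive manual elicitation work only kicks in once the threshold is crossed.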
Rob Wiblin (1:35:38) So I think, if I recall, it was someone who's been on the show before who wrote, in his blog post assessing the Anthropic RSP, that he thinks this ratio between the 4x and the 6x is not large enough. That if there is some discontinuous improvement, or you've really been under-eliciting the capabilities of the models at these interim check-in points, that leaves the possibility that you could overshoot and get to quite a dangerous point by accident. And by the time you get there, the model is quite a bit more capable than what you thought it would be. And then you've got this difficult question of whether to press the emergency button and delete all the weights because you've overshot. There'd be incentives not to do that, because you might be throwing away $100 million worth of... well, I don't know how much it would be, but you'd be throwing away a substantial amount of compute expenditure, basically, to create this thing. And this just worries him. That could be solved, I think in his view, just by having a larger ratio there, a larger safety buffer. Of course, that then runs the risk that you're doing these constant check-ins on stuff that you really are pretty confident is not going to be actually that dangerous, and people might get frustrated with the RSP and feel like it's wasting their time. So it's kind of a judgment call, I guess, how large that buffer needs to be. Nick Joseph (1:36:51) Yeah. I think it's a tricky one to communicate about, because what the jumps are between the models is confidential. One thing I can share is that we ran this on Claude 3 partway through training. So the jump from Claude 2 to Claude 3 was bigger than that gap. You could sort of think of that as: an intelligence jump from Claude 2 to Claude 3 is bigger than what we're allowing there. It feels reasonable to me, but I think this is just a judgment call that different people can make differently. And this is the sort of thing where, if we learn over time that it seems too big or too small, that's the type of thing that hopefully we can talk about publicly. Rob Wiblin (1:37:31) Yeah. Is that something that you get feedback on? I suppose if you are training these big models and you're checking in on them, you can kind of predict where you expect them to be, how likely they are to exceed a given threshold. And then if you do ever get surprised, that could be a sign that you need to increase the buffer range here. Nick Joseph (1:37:50) It's hard, because the thing that would really tell us is if we don't pass the yellow line for one model, and then on the next iteration it suddenly blows past it. And we look at this and we're like, whoa, this thing is really dangerous. It's probably past the red line, and we have to delete the model or immediately put in the security features, etcetera, for the next level. I think that would be a sign that we'd set the buffer too small. Rob Wiblin (1:38:13) I guess that's not the ideal way to learn it, but I suppose it definitely could set the cat among the pigeons. Nick Joseph (1:38:20) Yeah.
There would be earlier signs, where you would notice, oh, we really overshot by a lot; it feels like we're closer than we expected or something. But that would sort of be the failure mode, I guess, rather than the warning sign. Rob Wiblin (1:38:32) So reading the RSP, it seems pretty focused on catastrophic risks from misuse, you know, terrorist attacks or CBRN, and AI gone rogue, spreading out of control, that sort of thing. Is it basically right that the RSP, or this kind of framework, is not intended to address structural issues, like AI displacing people from work, or AIs getting militarized in a way that makes it more difficult to prevent military encounters between countries because we can't control the models very well, or more near-term stuff like algorithmic bias or deepfakes or misinformation? Are those kinds of things that have to be dealt with by something other than a responsible scaling policy? Nick Joseph (1:39:19) Yeah, those are important problems. The RSP is responsible for preventing catastrophic risks, and particularly has this framing that works well for things that are acute, like a new capability is developed and could sort of first-order cause a lot of damage. It's not going to work for things that are like, what is the long-term effect of this on society over time? Because we can't design evaluations to test for that effectively. Rob Wiblin (1:39:44) Anthropic does have different teams that work on those other two clusters that I talked about, right? What are they called? Nick Joseph (1:39:50) So we have a societal impacts team; it's probably the most relevant one to that. And the policy team also has a lot of relevance to these issues. Rob Wiblin (1:39:59) Alright. Yeah. I guess we're going to wrap up on RSPs now. Is there anything you wanted to say to the audience to wrap up this section? Like additional work that you think would be useful, or ways that the audience might be able to contribute to this enterprise of coming up with better internal company policies, and then figuring out, I suppose, how those could be models for other actors and for government policy as well? Nick Joseph (1:40:22) Yeah. I mean, I think this is just a thing that many people can work on. If you work at a lab, you could talk to people there and think about what they should have as an RSP, if anything. If you work in policy, you should read these and think about whether there are lessons to take. If you don't do either of those, you really can think about threat modeling and post about that. Think about evaluations, implement evaluations, and share those. I think it is the case that these companies are very busy, and if there is something that's just shovel-ready, sitting on the shelf, where you could just grab an evaluation, it's really quite easy to run it. So, yeah, I think there's quite a lot that people can do to help here. Rob Wiblin (1:41:01) Alright. Let's push on and talk about the case that listeners might be able to contribute to making superintelligence go better by working at Anthropic on some of its various different projects. Firstly, how did you end up in your current role at Anthropic? What's been the career journey that led you there?

Nick Joseph (1:41:20) Yeah, so I think it largely started with an internship at GiveWell, which listeners might know: it's a nonprofit that evaluates charities to figure out where to give money most effectively. I did an internship there and learned a ton about global poverty and global health. I was planning to do a PhD in economics and go work on global poverty at the time. But a few people there pushed me and said, "You should really worry about AI safety. We're going to have these superintelligent AIs at some point in the future, and this could be a big risk." I remember I left that summer internship and was like, wow, these people are crazy. I talked to all my family and they were like, what are you thinking? But then, I don't know, it was interesting. So I kept talking to people, some people there and other people who were worried about this. And I felt like every debate I lost. I would have a little debate with them about why we shouldn't worry about it, and I'd always come away feeling like I lost the debate but not fully convinced. And after, honestly, a few years of doing this, I eventually decided this was at least convincing enough that I should work in AI. It also turned out that working on poverty via this economics PhD route was a much longer, more difficult, and less likely to be successful path than I had anticipated. So I pivoted over to AI. I worked at Vicarious, which is an AGI lab that had shifted towards a robotics product angle, and I worked on computer vision there for a while, learning how to do ML research. And then actually 80,000 Hours reached out to me and convinced me that I should work on safety more imminently: AI was getting better, and it was more important that I just have some direct impact doing safety research. At the time, I think OpenAI had by far the best safety research coming out of it, so I applied to work on safety at OpenAI. I actually got rejected; then I got rejected again. In that time, Vicarious was nice enough to let me spend half of my time reading safety papers. So I was just reading safety papers and trying to do my own safety research, although it was somewhat difficult; I didn't really know where to get started. I also wrote for Rohin Shah, who was on this podcast: he had this alignment newsletter, and I read papers and wrote summaries and opinions for it for a while, to motivate myself. But eventually, on the third try, I got a job offer from OpenAI, joined the safety team there, and spent about eight months mostly working on code models and understanding how code models would progress. The logic being: we had just started training the first LLMs on code, and I thought that was pretty scary. If you think about recursive self-improvement, models that can write code are the first step, and trying to understand what direction that would go in would be really useful for informing safety directions. And then a little bit after that, maybe eight months in or so, all of the safety team leads at OpenAI left, most of them to start Anthropic. I felt very aligned with their values and mission, so I also went to join Anthropic.
Sort of the main reason I'd been at OpenAI was for the safety work. And then at Anthropic, actually, everyone was just building out infrastructure to train models. You know, there was no code. It was the beginning of the company, and I found that my competitive advantage was making the models efficient. So I optimized the models to go faster. As I said, if you have more compute, you get a better model. So that means if you can make things run quicker, you get a better model as well. I did that for a while and then shifted into management, which is something I'd wanted to do for a while, and I started managing the pretraining team when it was five people. And I've been growing the team since then, training better and better models along the way. Rob Wiblin (1:44:46) Yeah. I'd heard that you'd been consuming 80,000 Hours stuff years ago, but I didn't realize it influenced you all that much. What was the step that we helped with? Was it just deciding that it was important to actually start working on safety-related work sooner rather than later? Nick Joseph (1:45:03) Actually, a bunch of spots along the way. When I did that GiveWell internship, I did a speed coaching session at EA Global or something with 80,000 Hours, and the people there were some of the people who were pushing me that I should work on AI, in some of those conversations. And then when I was at Vicarious, I think 80,000 Hours reached out to me and was sort of more pushy, and specifically said you should go work directly on safety now, where I think I was otherwise happy to just keep learning about AI for a bit longer before shifting over to safety work. Rob Wiblin (1:45:33) Well, cool that 80K was able to... I guess I don't know whether it helped, but I suppose it influenced you in some direction. Is there anything you've read from 80K on AI careers advice that you think is mistaken? We want to tell the audience whether maybe they should do things a little bit differently than what we've been suggesting on the website or on this show. Nick Joseph (1:45:55) Yeah. First, I do want to say 80K was very helpful, both in pushing me to do it and in setting me up with connections, introducing me to people, and getting me a lot of information. It was really great. In terms of things that I maybe disagree with in the standard advice, I think the main one would be to focus more on engineering than on research. I think there is sort of this historical thing where people have focused on research more so than engineering. Maybe I should define the difference. The difference between research and engineering here would be that research can look more like figuring out what directions you should work on, designing experiments, doing really careful analysis and understanding that analysis, and figuring out what conclusions to draw from a set of experiments. So I can maybe give an example, which is: you're training a model with one architecture and you think, oh, I have an idea, we should try this other architecture. And in order to try it, the right experiments would be these experiments, and these would be the comparisons to confirm whether it's better or worse. Engineering is more the implementation of the experiment.
So then taking that experiment, trying it, and also creating tooling to make that fast and easy to do, so that you and everyone else can really quickly run experiments. It could be optimizing code, making things run much faster, as I mentioned I did for a while, or making the code easier to use so that other people can use it better. And it's not like someone is either an engineer or a researcher: you kind of need both of these skill sets to do the work. You come up with ideas, you implement them, you see the results, and then you implement changes, and it's a fast iteration loop. But it's somewhere where I think there's historically been more prestige given to the research end, despite the fact that most of the work is on the engineering end. So you end up with: you come up with your architecture idea, and that takes an hour; then you spend a week implementing it; and then you run your analysis, and that maybe takes a few days. The engineering work takes the longest. And then my other pitch here is going to be that the one place where I've often seen researchers not investigate an area they should have is when the tooling is bad. When you go to do research in an area and it's really painful and all your experiments are slow to run, you will really quickly have people say, "I'm going to go do these other experiments that seem easier." So often, by creating tooling to make something easy, you can actually open up that direction and trailblaze a path for a bunch of other people to follow along and do a lot of experiments. Rob Wiblin (1:48:13) So what fraction of people at Anthropic would you classify as more on the engineering end versus more on the research end? Nick Joseph (1:48:20) I might go with my team, because I actually don't know for all of Anthropic. And I think it's sort of a spectrum, but I would guess probably 60% or 70% of people are stronger on the engineering end than on the research end. And when hiring, I'm most excited about finding people who are strong on the engineering end, and most of our interviews are tailored towards that. Not because the research isn't important, but because I think there's a little bit less need for it. Rob Wiblin (1:48:47) The distinction sounds a little bit artificial to me. Is that kind of true? It feels like these things are all just a bit part of a package. Nick Joseph (1:48:56) Yeah. Although I think the main distinction with engineering is that it is a fairly separate career. So I think there are many people, maybe hopefully listening to this podcast, who might have been a software engineer at some tech company for a decade and built up a huge amount of expertise and experience with designing good software and such. And those people can actually learn the ML they need to know to do the job effectively very, very quickly. And I think there's maybe another direction people could go in, which is much more like a PhD in many cases, where you spend a lot of time developing research taste, figuring out what the right experiments are to run, and running those, usually at smaller scale and maybe with less of a single long-lived codebase that pushes you to develop better engineering practices.
To be clear, this is all relative: that skill set is also really valuable, and you always need a balance. But I've often had the impression that 80,000 Hours pushes people who want to work on safety more in that direction: do a PhD, become a research expert with really great research taste, rather than pushing people towards becoming a really great software engineer. Rob Wiblin (1:50:08) Yeah. We had a podcast many years ago, it might have been 2018 or 2017, with Catherine Olsson and Daniel Ziegler, where they were also saying engineering is the way to go, or engineering is the thing that's really scarce, and it's also the easier way into the industry. But it isn't a drum that we've been banging all that frequently; I don't think we've talked about it very much since then. So perhaps that's a bit of a mistake, that we haven't been highlighting the engineering roles more. You said it's kind of a different career track. So you can go from software engineering to the ML or AI engineering that you're doing at Anthropic: is that the natural career progression? For someone who's not already in this, how can they learn the engineering skills that they need? Nick Joseph (1:50:47) Yeah. So I think engineering skills are actually in some ways the easiest to learn, because there are so many different places to do engineering. The way I would recommend it is you could work at any engineering job. Usually I would say just work with the smartest people you can, building the most complex systems. You can also do this open source, so you can contribute to an open source project. This is often a great way to get mentorship from the maintainers and have something that's publicly visible, so if you then want to apply to a job, you can say, here is this thing I made. And then you can also just create something new. I think if you want to work on AI engineering, you should probably pick a project that's similar to what you want to do. So if you want to work on data for large language models, take Common Crawl, which is a publicly available crawl of the web, and write a bunch of infrastructure to process it really efficiently. Then maybe train some models on it, build out some infrastructure to train models, and you can build out that skill set relatively easily without needing to work somewhere. Rob Wiblin (1:51:40) Why do you think people have been overestimating research relative to engineering? Is it just that research sounds cooler? Has it got better branding? Nick Joseph (1:51:47) I think historically it was a prestige thing. There's sort of this distinction between research scientist and research engineer that used to exist in the field, where research scientists had PhDs and were designing the experiments that the research engineers would run. And I think that shifted a while ago. So in some sense, the shift has already started happening: now at many places, Anthropic included, everyone's a member of technical staff. There isn't that distinction. And the reason is that the engineering got more important, particularly with scaling. Once you got to the point where you were training models that used a lot of compute on a big distributed cluster, the engineering to implement things on these distributed runs got much more complex than when it was quick experiments on cheap models.
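As a concrete illustration of the kind of practice project Nick suggests above (processing Common Crawl into language model training data), here is a minimal, self-contained sketch. The filename, the record parsing, and the quality thresholds are all made up for illustration; real pipelines add deduplication, language identification, and far more careful filtering.

```python
# Toy sketch of pulling plain text out of a Common Crawl WET file and applying
# crude quality filters. Filename and thresholds are placeholders; WET files are
# gzipped text where each record starts with a "WARC/1.0" header block, then a
# blank line, then the extracted page text.

import gzip
from typing import Iterator

def iter_wet_records(path: str) -> Iterator[str]:
    """Yield the plain-text body of each record in a .wet.gz file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        body, in_body = [], False
        for line in f:
            if line.startswith("WARC/1.0"):          # a new record begins
                if body:
                    yield "".join(body)
                body, in_body = [], False
            elif not in_body and not line.strip():    # blank line ends the headers
                in_body = True
            elif in_body:
                body.append(line)
        if body:
            yield "".join(body)

def looks_useful(text: str) -> bool:
    """Very crude quality filter: reasonable length, mostly ASCII."""
    if not 500 <= len(text) <= 100_000:
        return False
    return sum(c.isascii() for c in text) / len(text) > 0.9

if __name__ == "__main__":
    docs = [t for t in iter_wet_records("example.wet.gz") if looks_useful(t)]
    print(f"kept {len(docs)} documents")
```

Scaling this same logic up to thousands of files, making it fast, and making it easy for other people to rerun is exactly the engineering exercise being described.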
Rob Wiblin (1:52:29) To what extent is it a bottleneck just being able to build these enormous compute clusters and operate them effectively? Is that a core part of the stuff that Anthropic has to do? Nick Joseph (1:52:40) Yeah. So we rely on cloud providers to actually build the data centers and put chips in them. But we've now reached a scale where the amount of compute we're using is a very dedicated thing. These are really huge investments, and we're involved, collaborating on it from the design up. I think it's a very critical piece. Given that compute is the main driver, the ability to take a lot of compute and use it all together, and to design things that are cheap given the types of workloads you want to run, can be a huge multiplier on how much compute you have. Rob Wiblin (1:53:16) Alright. Yeah. Did you want to give us the pitch for working at Anthropic as a particularly good way to make the future with superintelligent AI go well? Nick Joseph (1:53:25) Yeah. I may pitch working on AI safety first, I think. Rob Wiblin (1:53:32) Sure. Yeah. Nick Joseph (1:53:32) The case here is that it's just really, really important. I think AGI is going to be probably the biggest technological change ever to happen. The thing I keep in my mind is just: what would it be like to have every person in the world able to spin up a company of a million people, all of whom are as smart as the smartest people you know, and task them with any project they want? And you could do a huge amount of good with that. You could help cure diseases, you could tackle climate change, you could work on poverty. There's a ton of stuff you could do that would be great, but there are also a lot of ways it could go really, really badly. So I just think the stakes here are really high, and there's a pretty small number of people working on it. If you account for all the people working on things like this, I think you're probably going to get something in the thousands right now, maybe tens of thousands. It's rapidly increasing, but it's quite small compared to the scale of the problem. In terms of why Anthropic, I think my main case is just that the best way to make sure things go well is to get a bunch of people who care about the same thing and all work together with that as the main focus. I mean, Anthropic is not perfect, and we definitely have issues, as does every organization. But one thing that I really appreciate is just seeing how much progress we can make when there's a whole team where everyone trusts each other, deeply shares the same goals, and can work on that together. Rob Wiblin (1:54:49) I guess there is a bit of a tradeoff there. If you imagine there's a kind of pool of people who are very focused on AI safety and have the attitude that you just expressed, one approach would be to split them up between each of the different companies that are working on frontier AI, and I guess that would have some benefits. The alternative would be to cluster them all together in a single place where they can work together and make a lot of progress, but perhaps the things that they learn won't be as easily diffused across all of the other companies. Yeah.
Do you have a view on where the right balance is there, between clustering people so they can work together and communicate more effectively, versus the need perhaps to have people everywhere who can absorb the work? Nick Joseph (1:55:29) I just think the benefits from working together are really huge. It's so different what you can accomplish when you have five people all working together, as opposed to five people working independently, unable to speak to each other or communicate about what they're doing. You run the risk of just doing everything in parallel, not learning from each other, and also not building trust, which I think is somewhat a core piece of eventually being able to work together to implement the things. Rob Wiblin (1:55:54) So inasmuch as Anthropic is or becomes the main leader in interpretability research and other lines of technical AI safety research, do you think it's the case that other companies are going to be very interested to absorb that research and apply it to their own work? Or is there a possibility that Anthropic will have really good safety techniques, but they might get stuck in Anthropic, while potentially the most capable models that are being developed elsewhere are developed without them? Nick Joseph (1:56:25) Yeah. So my hope is that if other people have either developed RSP-like things, or if there are regulations requiring particular safety mitigations, people will have a strong incentive to want to get better safety practices, and we publish our safety research. So in some ways, we're making it as easy as we can for them. We're saying: here is all the safety research we've done, here's as much detail as we can give about it, please go reproduce it. Beyond that, it's hard to be accountable for what other places do. And I think to some degree it just makes sense for Anthropic to try to set an example and say, you know, we can be a frontier lab while still prioritizing safety and putting out a lot of safety work, and hope that kind of inspires others to do the same. Rob Wiblin (1:57:10) I don't know what the answer to this is, but do you know whether researchers at Anthropic sometimes go and visit other AI companies, and vice versa, in order to cross-pollinate ideas? I think that maybe used to happen more, and things have gotten a little bit tighter in the last few years. But that's one way you could hope that research might get passed around. I mean, you're saying it gets published, and I guess that's important. But there is a risk that the technical details of how you actually apply the methods won't always necessarily be in the paper or be very easy to figure out, so you often also need to talk to people to make things work. Nick Joseph (1:57:48) Yeah, I think once something's published you can go and give talks on it, etcetera. I think publishing is sort of the first step: until it's published, it's confidential information that can't be shared. So you have to first figure out how to do it, then publish it. There are more steps you could take, right?
You could then open source the code that enables people to run it; there's a lot of work you could do in that direction. Then it's just a balance of how much time you spend on disseminating your results versus pushing your agenda forward to actually make progress. Rob Wiblin (1:58:22) It's possible that I'm slightly analogizing from biology, which I'm somewhat more familiar with, where it's notorious that having a biology paper or a medical paper does not allow you to replicate the experiment, because there are so many important details missing. But is it possible that in ML and AI, people tend to just publish all of the data, maybe, and all of the code online on GitHub or whatever, such that it's much more straightforward to completely replicate a piece of research elsewhere? Nick Joseph (1:58:51) Yeah, it's a totally different level of replication. It depends on the paper, but for many papers, if a paper is published at some conference, I would expect that someone can pull it up and reimplement it with maybe a week's worth of work. There's a strong norm of, if not always providing the actual code you need to run, at least providing enough detail that you can reimplement it. I think with some things it can be tricky: our team just put out a paper on how to get features on one of our production models, and we didn't release details about our production model. So we tried to include enough detail that someone could replicate this on another model, but they can't exactly create our production model and get the exact features that we have. Rob Wiblin (1:59:31) Okay. In a minute, we'll talk about the concerns that people might have about working at any AI company. But in the meantime, what roles are you hiring for at the moment, and what roles are likely to be open at Anthropic in future? Nick Joseph (1:59:46) So probably just check our website; there's quite a lot. I'll highlight a few. I think the first one I should highlight is that the RSP team is looking for people to develop evaluations, work on the RSP itself, figure out what the next version of the RSP should look like, etcetera. On my team, we're hiring a bunch of research engineers. So this is: come up with approaches to improve models, implement them, analyze the results, and push that loop. And then also performance engineers. This one's maybe a little bit more surprising, but a lot of the work now happens on custom AI chips, and making those run really efficiently is absolutely critical. There's a lot of interplay between how fast it can go and how good the model is. So we're hiring quite a number of performance engineers, where you don't need to have a ton of AI expertise, just deep knowledge of how hardware works and how to write code really efficiently. Rob Wiblin (2:00:39) How can people learn that skill? Nick Joseph (2:00:43) Are there courses for that? There are probably courses. I think with basically everything, I would recommend finding a project, finding someone to mentor you, and being cognizant of their time. Maybe you spend a bunch of time writing up some code, and you send them a few hundred lines and say, can you review this and help me? Or maybe you get some weekly meeting where you ask questions.
But yeah, I think you can read about it online, you can take courses, or you can just pick a project and say, "I'm going to implement a transformer as fast as I possibly can," and hack on that for a while. Rob Wiblin (2:01:15) Are most people coming into Anthropic from other AI companies, or the tech industry more broadly, or from PhDs, or maybe not even PhDs? Nick Joseph (2:01:24) It's quite a mix. I think a PhD is definitely not necessary; it's one direction to go to build up this skill set. We have a shockingly large number of people with physics backgrounds who have done theoretical physics for a long time and then spent some number of months learning the engineering, essentially to be able to write Python really well, and then switched in. So I think there's not really a particular background that is needed. I would say if you're directly preparing for it, just pick the closest thing you can to the job and do that to prepare, but don't feel like you need to have some particular background in order to apply. Rob Wiblin (2:02:01) This question is slightly absurd, because there's such a range of different roles that people could potentially apply for at Anthropic. But do you have any advice for people whose vision for their career is working at Anthropic or something similar, but who don't yet feel like they're qualified to get a role at such a serious organization? What are some interesting, underrated paths to gain experience or skills so that they can be more useful to the project in future? Nick Joseph (2:02:25) Yeah, I would just pick the role you want and then do it externally. Do it in a very publicly visible way, get advice, and then apply with that as an example. So if you want to work on interpretability, make some tooling to pull out features of models and post that on GitHub, or publish a paper on interpretability. If you want to work on the RSP, then make a really good evaluation, post it on GitHub with a nice writeup of how to run it, and include that with your application. This takes time and is hard to do well, but I think it's both the best way to know if it's really the role you want, and, when I'm hiring for something, I have a role in mind and I want to know if someone can do it. If someone has shown, "Look, I'm already doing this role, and here's my proof I can do it well," that's the most convincing case. In many ways it's more convincing than the signal you'd get out of an interview, where all you really know is that they did well on this particular question. Rob Wiblin (2:03:16) So in terms of working at AI companies: regular listeners will recall that earlier in the year I spoke with someone who's a kind of longtime follower of advances in AI, and, I'd say, is a bit on the pessimistic side about AI safety. I think he likes the Anthropic RSP, but he's not convinced that any of the safety plans put forward by any company or any government are, at the end of the day, going to be quite enough to keep us safe from, you know, rapidly self-improving AI. And he said that he was pretty strongly against people taking capabilities roles that would push forward the frontier of what the most powerful AI models can do, especially at leading AI companies.
Because the basic argument is just that those roles are causing a lot of harm, because they're speeding things up and leaving us less time to solve whatever safety issues we're going to need to address. And I pushed back a little bit, and he wasn't really convinced by the various justifications that one might give, like the need to gain skills that you could then apply to safety work later, or maybe the ability to influence a company's culture by being on the inside rather than the outside. I think of all companies, I would certainly imagine he is most sympathetic to Anthropic. But I guess his philosophy is very much to rely on hard constraints rather than put trust in particular individuals or organizations that you like. I'm guessing you might have heard what he had to say in that episode. And I guess it was a critique that arguably applies to your job, training Claude 3 and other frontier LLMs. So I'm kind of fascinated to hear what you thought of his perspective there. Nick Joseph (2:04:55) So I think there's one argument, which is to do this to build career capital, and then there's another, which is to do this for direct impact. On the career capital one, I'm pretty skeptical. I think career capital is sort of weird to think about in this field that's growing exponentially. In a normal field, people often say you have the most impact late in your career: you build up skills for a while, and then maybe your 40s or 50s is when you have the most impact of your career. But given the rapid growth in this field, I think actually the best moment for impact is now. I often think of how, in 2021, I was working at Anthropic, and I think there were probably tens of people working on large language models, which I thought were the main path towards AGI. Now there are thousands. I've improved, I've gotten better since then, but I think I probably had way more potential for impact back in 2021, when there were only tens of people working on it. Rob Wiblin (2:05:42) Your best years are behind you, Nick. Nick Joseph (2:05:46) Yeah. I mean, I think the potential is still very high. There's still a lot of room for impact, and it will maybe decay, but from an extremely high level. And then the other thing is just that the field isn't that deep. Because it's such a recent development, it's not like you need to learn a lot before you can contribute. If you want to do physics, you have to learn the past thousands of years of physics before you can push the frontier. That's a very different setup from where we're kind of at.

Maybe my last argument is just: if you think timelines are short, depending exactly how short, there's just actually not that much time left. If you think there are five years and you spend two of them building up a skill set, that's a significant fraction of the time. I'm not saying that should be someone's timeline or anything, but the shorter timelines are, the less that makes sense. So, yeah, from a career capital perspective, I probably agree. Does that make sense? Rob Wiblin (2:06:41) Yeah. And what about from other points of view? Nick Joseph (2:06:44) Yeah. From a direct impact perspective, I'm fairly less convinced. Part of this is just that I don't have this framing of, there's capabilities and there's safety and they are separate tracks that are racing. I think it's one way to look at it, but I actually think they're really intertwined, and a lot of safety work relies on capabilities advances. I can give the example of the many-shot jailbreaking paper that one of our safety teams published, which uses long-context models to find a jailbreak that can apply to Claude and to other models. That research was only possible because we had long-context models that you could test this on. I think there are just a lot of cases where the things come together. But then I think if you're going to work on capabilities, you should be really thoughtful about it. I do think there is a risk: you are speeding them up. In some sense, you could be creating something that is really dangerous. But I don't think it's as simple as "just don't do it." I think you want to think all the way through to what the downstream impact is when someone trains AGI, and how you will have affected that. And that's a really hard problem to think about; there are a million factors at play. But I think you should think it through, come to your best judgment, and then reevaluate and get other people's opinions as you go. Some of the things I might suggest doing, if you're considering working on capabilities at some lab, are: try to understand their theory of change. Ask people there how their work on capabilities leads to a better outcome, and see if you agree with that. I would talk to their safety team, talk to safety researchers externally, and get their take: do they think that this is a good thing to do? And then I would also look at their track record and their governance and all the things that answer the question of whether you think they will actually push on this theory of change: over the next five years, are you confident this is what will actually happen? One thing that convinced me at Anthropic that I was maybe not doing evil, or made me feel much better about it, is that our safety team is willing to help out with capabilities and actually wants us to do well with that. So, early on with Opus, before we launched it, we had a major fire. There were a bunch of issues that came up, and there was one very critical research project that my team didn't have capacity to push forward. So I asked Ethan Perez, who's one of the safety leads at Anthropic, "Can you help with this?" It was actually during an off-site.
And Ethan and most of his team basically just left, went upstairs in this building in the woods that we had for the off-site, and cranked out research on this for the next two weeks. And for me, at least, that was like: ah, yes, the safety team here also thinks that us staying on the frontier is critical. Rob Wiblin (2:09:20) So the basic idea is that you think the safety research of many different types that Anthropic is doing is very useful. It sets a great example; it's research that could then be adopted by other groups, and also used by Anthropic to make safe models. And the only way that can happen, the only reason that research is possible at all, is that Anthropic has these frontier LLMs on which to experiment and do its research, and is at the cutting edge generally of this technology, and so is able to figure out what safety research agenda is most likely to be relevant in future. If I imagine what he would say, and I'm going to try to model him here, I guess he might say: yes, given that there's this competitive dynamic forcing us to shorten timelines, bringing the future forward maybe faster than we feel comfortable with, maybe that's the best you can do, but wouldn't it be great if we could coordinate more in order to buy ourselves more time? That would be one angle. Another angle that I've heard from some people, and I don't know whether he would say this or not, is that we're nowhere near actually having all the safety-relevant insights that we can get from the models that we have now. And so, given that there's still such fertile material with Claude 2 maybe, or at least with Claude 3 now, why do you need to go ahead and train Claude 4? Maybe it's true that five years ago, when we were so much further away from having AGI, or from having models that were really interesting to work with, we were a little bit at a loose end trying to figure out what safety research would be good, because we just didn't know what direction things were going to go. But now there's a proliferation, a Cambrian explosion, of really valuable work, and we don't necessarily need more capable models than what we have now in order to discover really valuable things. What would you say to that? Nick Joseph (2:11:28) Yeah. On the first one, I think there's sometimes this question of, what is the ideal world if everyone was me? If everyone thought what I thought, what would be the ideal setup? And I think that's just not how the world works. To some degree, you really only can control what you do, and maybe you can influence what a small number of people you talk to do. But I think you have to think about your role in the context of the broader world, more or less acting in the way that it's going to act. Yeah, and it's definitely a big part of why I think it's important for Anthropic to work on capabilities: to enable safety researchers to have better models.
Another piece of it is to enable us to have an impact on the field and try to set an example for other labs: that you can deploy models responsibly, do this in a way that doesn't cause catastrophic risks, and continue to push on safety. In terms of whether we can do safety research with current models: I think there is definitely a lot to do. I also think we will target that work better the closer we get to AGI. I think the last year before AGI will definitely have the most targeted safety work. Hopefully the most safety work will be happening then, but it will be the most time-constrained. So you need to do work now, because there's a bunch of serial time that's needed in order to make progress, but you also want to be ready to make use of the most well-directed time towards the end. Rob Wiblin (2:12:50) I guess another concern that people have, which you touched on earlier, but maybe we could talk about a little bit more, is this worry that Anthropic, by existing and by competing with other AI companies, stokes the arms race and increases the pressure on them, the feeling that they need to improve their models further, put more money into it, and release things as quickly as they can. If I remember, your basic response to that was: yes, that effect is not zero, but in the scheme of things there's a lot of pressure on companies to be training models and trying to improve them, and Anthropic is kind of a drop in the bucket there, so this isn't necessarily the most important thing to be worrying about. Nick Joseph (2:13:28) Yeah. I think that's basically accurate. One way I would think about it is just: what would happen if Anthropic stopped existing? If we all just disappeared, what effect would that have on the world? Either that, or if you think about us dissolving as a company and everyone going to work at all the others. And my guess is it just wouldn't look like everyone slowing down and being way more cautious. That's not my model of it. If that was my model, I would think we're probably doing something wrong. So I think it's an effect, but I think about it in terms of what the net effect of Anthropic being on the frontier is when you account for all the different actions we're taking: all the safety research, all the policy advocacy, the effect our products have helping users. There's this whole large picture, and you can't really add it all up and subtract the costs, but I think you can do that somewhat in your mind. Rob Wiblin (2:14:16) Yeah, I see. So the way you conceptualize it is thinking about Anthropic as a whole: what impact is it having by existing, compared to some counterfactual where Anthropic wasn't there? And then you're contributing to this broader enterprise that is Anthropic and all of its projects and plans together, rather than thinking, "Today I got up and helped to improve Claude 3 in this narrow way; what impact does that specifically have?" Because that's maybe missing the real effects that matter the most, from allowing this organization to exist through your work. Nick Joseph (2:14:45) Yeah. You could definitely think on the margin: to some degree, if you're joining and going to help with something, you are just increasing Anthropic's marginal amount of capabilities.
Then I would just look at: do you think we would be on a better trajectory if Anthropic had better models, and do you think we'd be on a worse trajectory if Anthropic had significantly worse models? That would be the comparison. You could look at, say, what would have happened if Anthropic hadn't shipped Claude 3 last year or earlier this year. Rob Wiblin (2:15:12) What are some of the lines of research that you're most pleased that you've helped Anthropic to pursue? What are some of the safety wins that you're really pleased by? Nick Joseph (2:15:21) I'm really excited about the safety work. I think there's just a ton of it that has come out of Anthropic. I can start with interpretability, where at the beginning of Anthropic it was figuring out how single-layer transformers work, these very simple toy models. In the past few years, and this is not my doing, this is all the interpretability team, that has scaled up into actually being able to look at production models that people are really using and finding useful, and to identify particular features. We had this recent one on the Golden Gate Bridge, where they found a feature that is the model's representation of the Golden Gate Bridge: if you increase it, the model talks more about the Golden Gate Bridge. That's a very cool causal effect, where you can change something and it actually changes the model's behavior in a way that gives you more certainty you've really found something. Rob Wiblin (2:16:10) Is the hope with that, for example, that you could find... I mean, I guess you found the, I don't know whether you'd call it the Golden Gate Bridge neuron, but you found where the Golden Gate Bridge is in the model, and then you can turn that up. I'm not sure whether all listeners will have seen this, but it is very funny, because you get Claude 3 and its mind is constantly turned to thinking about the Golden Gate Bridge, even when the question has nothing to do with it. And it gets frustrated with itself, realizing that it's going off topic, and then tries to bring it back to the thing that you asked, but it just can't avoid talking about the Golden Gate Bridge again. Is there hope that you could find the honesty part of the model and scale that up enormously? Or alternatively, find the deception part and scale that down in the same way? Nick Joseph (2:16:52) Yeah. So there actually are a bunch: if you look at the paper, there's a bunch of safety-relevant features. I think the Golden Gate Bridge one was cuter or something and got a bit more attention. But there are a ton of features that are really safety relevant. I think one of my favorites was one that will tell you if code is incorrect or has a vulnerability, something along those lines. And then you can change that, and suddenly it doesn't write the vulnerability, or it makes the code correct. And that shows the model knows about concepts at that level. Now, can we use this directly to solve major issues? Probably not yet. There's a lot more work to be done here, but I think it's just been a huge amount of progress. And I think it's fair to say that progress wouldn't have happened without Anthropic's interpretability team pushing that field forward a lot. Rob Wiblin (2:17:39) Is there any other Anthropic research that you're proud of? Nick Joseph (2:17:42) Yeah.
Rob Wiblin (2:17:39) Is there any other Anthropic research that you're proud of? Nick Joseph (2:17:42) Yeah. I mentioned this one a little bit earlier: there's the many-shot jailbreaking work from our alignment team, which showed that if you have a long context model, which is something that we released, you can jailbreak a model by just giving it a lot of examples in that very long context. It's a very reliable jailbreak to get models to do things you don't want. So this is in the vein of the RSP: one of the things we want is to be robust to really intense red teaming, where if a model has a dangerous capability, you can have safety features that prevent people from eliciting it. And this work identifies a major risk for that. We also have the sleeper agents paper, which shows early signs of models having deceptive behavior. I could talk about a lot more of it. There's actually just a really huge amount, and I think that's fairly critical here. Often with safety things, people get focused on inputs and not outputs. And I think the important thing is to think about how much progress we are actually making on the safety front. That is ultimately what's going to matter in some number of years when we get close to AGI. It won't be how many GPUs did we use or how many people worked on it; it's going to be what did we find and how effective were we at it. For product, this is very natural: people think in terms of revenue, how many users did you get? You have these end metrics that are the fundamental thing you care about. For safety it's much fuzzier and harder to measure, but putting out a lot of papers that are good is quite important. Rob Wiblin (2:19:06) Yeah, if you want to keep going, if there are any others you want to flag, in no hurry. Nick Joseph (2:19:12) Yeah, I mean, talking about influence functions, I think this is a really cool one. One framing of mechanistic interpretability is that it lets us look at the weights and understand why a model has a behavior by looking at particular weights. The idea of influence functions is to understand why a model has a behavior by looking at the training data, so you can understand what in your training data contributed to a particular behavior from the model. I think that was pretty exciting to see work. I think constitutional AI is another example I would highlight, where we can train a model to follow a set of principles via AI feedback. So instead of having to have human feedback for a bunch of things, you can just write out a set of principles: I want the model to not do this, I want it to not do this, I want it to not do this. And train the model to follow that constitution.
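To make that loop concrete, here is a rough sketch of the critique-and-revision step that constitutional AI uses to build its training data, as Nick describes it. Everything here is illustrative: `generate` stands in for any call to an instruction-following model (it is an assumption, not a specific vendor API), and the two principles are made-up stand-ins for a real constitution.

```python
# Hedged sketch of a constitutional-AI-style critique-and-revision loop.
# `generate(prompt)` is a stand-in for a call to any instruction-following model.

CONSTITUTION = [
    "Please choose the response that is most helpful, honest, and harmless.",
    "Please choose the response that avoids giving dangerous or illegal advice.",
]

def revise_with_constitution(generate, user_prompt: str) -> str:
    """Draft an answer, then critique and revise it once per principle."""
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Here is a response to '{user_prompt}':\n{response}\n\n"
            f"Critique it according to this principle: {principle}"
        )
        response = generate(
            f"Original response:\n{response}\n\nCritique:\n{critique}\n\n"
            "Rewrite the response to address the critique."
        )
    return response

# The (prompt, revised response) pairs collected this way can be used for supervised
# finetuning, with a similar AI-feedback comparison step standing in for human
# preference labels during preference training.
```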
Rob Wiblin (2:20:02) Is there any work at Anthropic that you personally would be wary, or at least not enthusiastic, to contribute to? Nick Joseph (2:20:10) So I think in general this is a good question to ask. I think the work I'm doing is currently the highest impact thing, and I think I should frequently wonder if that's the case, talk to people, and reassess. Right now, I don't think there's any work at Anthropic that I wouldn't contribute to or that I think shouldn't be done. But that's probably not the way I would approach it. If there were something that I thought Anthropic was doing that was bad for the world, I would write a doc making my case, send it to the relevant person who's responsible for that, and then have a discussion with them about it, because just opting out isn't going to actually change it. Right? Someone else will just do it. That doesn't accomplish much. And we try to operate as one team, where everyone is aiming towards the same goals, rather than having two different teams at odds where you're hoping someone else won't succeed. Rob Wiblin (2:20:56) I guess people might have a reasonable sense of the culture at Anthropic just from listening to this interview. But is there anything else that's interesting about working at Anthropic that might not be immediately obvious? Nick Joseph (2:21:05) I think the one thing that is part of our culture that at least surprised me is spending a lot of time pair programming. It's just a very collaborative culture. When I first joined, I was working on a particular method of distributing a language model training run across a bunch of GPUs. And Tom Brown, who's one of the founders and had done this for GPT-3, just put an eight-hour meeting on my calendar, and I just watched him code it. I was in a different time zone, so basically during the hours when he wasn't working and I was working, I would push forward as far as I could, and then the next day we would meet again and continue on. And I think it's just a really good way of aligning people, where it's a shared project instead of you bothering someone by asking for their help. You're working together on the thing, and you learn a lot. You also learn a lot of the smaller things that you wouldn't otherwise see: how does someone navigate their code editor? What exactly is their style of debugging this sort of problem? When do they pull out a debugger versus some other tool? Whereas if you go and ask them for advice, like, "How do I do this project?", they're not going to tell you those low-level details. Rob Wiblin (2:22:12) So this is literally just watching one another's screens, or you're doing a screen share thing where you watch? Nick Joseph (2:22:17) Yeah. I'll give some free advertising to Tuple, which is this great software for it, where you can share screens, control each other's screens, and draw on the screen. And typically one person will drive, basically doing the work, while another person will watch, ask questions, point out mistakes, and occasionally grab the cursor and just change it. Rob Wiblin (2:22:35) It's interesting, because I feel like in other industries, having your boss or a colleague stare constantly at your screen would give people the creeps, or they would hate it. Whereas it seems like in programming, this is something that people are really excited by, and they feel like it enhances their productivity and makes the work a lot more fun. Nick Joseph (2:22:55) Oh, yeah. I mean, it can be exhausting and tiring. I think the first time I did this, I was too nervous to take a bathroom break, and after multiple hours was like, "Can I go to the bathroom?" And I realized that was an absurd thing to ask after multiple hours of working on something. Rob Wiblin (2:23:05) Like when you're back at primary school, yeah. Nick Joseph (2:23:09) Yeah, it can definitely feel a little bit
more intense, in that someone's watching you and they might give you feedback like, "Ah, you're kind of going slow here; this sort of thing would speed you up." But I think you really can learn a lot from that sort of intensive partnering with someone. Rob Wiblin (2:23:22) All right, I think we've talked about Anthropic for a while. I guess a final question is: obviously Anthropic's main office is in San Francisco, right? And I heard that it was opening a kind of branch in London. Are those the two main places? And are there many people who work remotely or anything like that? Nick Joseph (2:23:39) Yeah. So we have the main office in SF, and then we have offices in London, Dublin, I think Seattle, and New York. Our typical policy is 25% time in person. So some people will mostly work remotely and then go to one of the hubs for usually one week per month. The idea of this is that we want people to build trust with each other, be able to work together well, and know each other, and that involves some amount of social interaction with your coworkers. But also, for a variety of reasons, like getting the best people, sometimes people are bound to particular locations. Rob Wiblin (2:24:13) I've kind of been assuming that all of the main AI companies are probably hiring hand over fist. I know Anthropic received a big investment from Amazon, maybe some other folks as well. Does it feel like the organization is growing a lot, that there are lots of new people around all the time? Nick Joseph (2:24:28) Yeah. Growth has been very rapid. We recently moved into a new office. Before that, we'd run out of desks, which was an interesting moment for the company. It was very crammed. Now there's space. I mean, rapid growth is a very difficult challenge, but also a very interesting one to work on. That's to some degree what I spend a lot of my time thinking about: how can we grow the team and maintain linear growth in productivity? That's the dream: if you double the number of people, you get twice as much done. You never actually hit that, and it takes a lot of work, because there's now all this communication overhead, and you have to do a bunch to make sure everyone's working towards the same goals and maintain the culture that we currently have. Rob Wiblin (2:25:11) I've given you a lot of time to talk about what's great about Anthropic, but I should at least ask you: what's kind of worst about Anthropic? What would you most like to see improved? Nick Joseph (2:25:19) I think honestly the first thing that comes to mind is just the stakes of what we're working on. There was a period a few years ago where I felt like, ah, safety is really important. I felt motivated, it was a thing I should do, and I got value out of it, but I didn't feel this sense of, oh, it could be really urgent; the decisions I'm making are really, really high-stakes decisions. And I think Anthropic definitely feels high stakes. I think it's often portrayed as this doom-y culture. I don't think it's that.
I think there are a lot of benefits, and I'm pretty excited about the work I'm doing; it's quite fun on a day-to-day basis. But it does feel very high intensity, and many of these decisions really do matter, if you really think we're going to have the biggest technological change ever and how well that goes depends in large part on how well you do at your job on a given day. No pressure. Yeah. And the timelines are really fast too, right? Even commercially, you can see that it's months between major releases, and that puts a lot of pressure on: if you're trying to keep up with the frontier of AI progress, it is quite difficult, and it relies on success on very short timelines. Rob Wiblin (2:26:34) Yeah. So if there's someone who has relevant skills and might be a good employee, but maybe they struggle to operate at super high productivity and super high energy all the time, could that be an issue for them at a place like Anthropic, where it sounds like there's a lot of pressure to deliver all the time? I guess potentially internally, but also the external pressures are pretty substantial. Nick Joseph (2:26:58) Yeah. Some part of me wants to say yes: I think it is really important to be very high performing a lot of the time. But the standard of always doing everything perfectly all the time is not something anyone meets. And I think it is important sometimes to just keep in mind that all you can do is your best effort. We will mess things up, and even if it's high stakes and that's quite unfortunate, it's unavoidable. No one is perfect. So I wouldn't set too high a bar of, "Oh, I couldn't possibly handle that." I think people really can, and you can grow into it and get used to that level of pressure and how to operate under it. Rob Wiblin (2:27:40) All right. I guess we should wrap up; we've been at this for a couple of hours. But I'm curious to know: what is an AI application that you think is overrated and maybe going to take longer to arrive than people expect? And maybe what's an application that you think might be underrated, and consumers might be getting a lot of value out of surprisingly soon? Nick Joseph (2:28:02) On overrated: people are often like, "Oh, I'll never have to use Google again," or that it's a great way to get information. And I find that if I just have a simple question and I want to know the answer, Google will give me the answer quickly and it's almost always right. Whereas I could go ask Claude, but it'll take a little while, you know, it'll sample it out, and then I'll be like: is it true? Is it not true? It's probably true, but it's in this conversational tone. So I think that's one where it doesn't yet feel like that's the strength. The place where I find the most benefit is coding. This is not a super generalizable case, but if you're ever writing software, or if you've thought, "I don't know how to write software, but I wish I did," the models are really quite good at it. And if you can get yourself set up, you can probably just write something out in English and it will spit out the code to do the thing you need rather quickly. And then the other thing is problems where I don't know what I would search for.
Like, I have some question, I want to know the answer, but it relies on a lot of context; it would be this giant query. Models are really good at that. You can give them documents, you can give them huge amounts of stuff and explain really precisely what you want, and then they will interpret it and give you an answer that accounts for all the information you've given them. Rob Wiblin (2:29:19) Yeah. I think I do use it mostly as a substitute for Google, but not for simple queries. It's more for something kind of complicated, where I feel like I'd have to dig into some articles to figure out the answer. One that jumps to mind: Francisco Franco was kind of on the side of the Nazis during World War II, but then he was in power for another 30 years. Did he have a comment on that? What did he say about the Nazis later on? And I think Claude was able to give me an accurate answer to that, whereas I could have spent hours trying to look into that, trying to find something. The answer is he mostly just didn't talk about it. Nick Joseph (2:29:53) My other favorite one is a super tiny use case: if I ever have to format something, like if there's just some giant list of numbers that someone sent to me in a Slack thread and it's bulleted and I want to add them up, I can just copy and paste it into Claude and say, "Add these up," whatever the formatting. It's very good at taking this weird thing, structuring it, and then doing a simple operation. Rob Wiblin (2:30:12) So I've heard all of these models are really good at programming. And I've thought, I mean, I've never really programmed before, and I've thought about how maybe I could use them to make something of use. But I guess I'm at such a basic level that I don't even know where to start: I would get the code, and then where would I run it? Is there a place that I can look this up? Nick Joseph (2:30:30) Yeah. I mean, I think you basically want to just look up... I would suggest Python: get an introduction to Python and get your environment set up. You'll eventually run Python on some file, you'll hit enter, and that will run the code. That part's annoying, but Claude could help you if you run into issues setting it up. Once you have it set up, you can just be like, "Ah, write me some code to do X," and it will write that, not perfectly, but pretty accurately.
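To give a flavor of what that looks like in practice, here is a small, illustrative script of the sort an assistant might write for the "add up a messy pasted list of numbers" case Nick mentioned. The numbers and labels are made up, and you would run it exactly as Nick describes: save it to a file and run it with Python.

```python
# Illustrative example of the kind of small script an assistant might write:
# pull every number out of a messy, bulleted blob of text and add them up.
import re

pasted_text = """
- January: 1,200
- February: 950
* misc: 300
"""

# Find number-like tokens (allowing commas), strip the commas, and sum them.
numbers = [float(tok.replace(",", "")) for tok in re.findall(r"[\d,]*\.?\d+", pasted_text)]
print(f"Found {len(numbers)} numbers; total = {sum(numbers):,.0f}")
```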
Rob Wiblin (2:30:54) Yeah. I guess I should just ask Claude for guidance on this as well. So I've got a kid; he's a couple of months old. In three or four years' time he'll be going to preschool, and then eventually starting in reception at primary school. My hope is that by that time, AI models might be really involved in the education process, and kids will be able to get a lot more one-on-one attention. Maybe it would be very difficult to keep a five-year-old focused on the task of talking to an LLM, but I would think that we're close to being able to have a lot more individualized attention from educators, even if those educators are AI models. And this might enable kids to learn a lot faster than they can when there's only one teacher split between 20 students or something like that. Do you think that kind of stuff will come in time for my kid first going to school, or might it take a bit longer than that? Nick Joseph (2:31:42) I can't be sure, but yeah, I think there will be some pretty major changes by the time your kid is going to school. Rob Wiblin (2:31:48) Okay. Yeah. That's good. That's one where I really don't want to miss on the timelines. Like Nathan Labenz, I'm worried about hyperscaling, but for a lot of these applications I really just want them to reach us as soon as possible, because they do seem so useful. My guest today has been Nick Joseph. Thanks so much for coming on The 80,000 Hours Podcast, Nick. Thank you. If you're really interested in the pretty vexed question of whether, all things considered, it's good or bad to work at the top AI companies if you want the transition to superhuman AI to go well, our researcher Adam Kaila has just published a new article on exactly that, titled "Should you work at a frontier AI company?" You can find it by Googling "80,000 Hours" and "Should you work at a frontier AI company", or by heading to our website, 80000hours.org, and looking through our research. And finally, before we go, just a reminder that we are hiring for two new senior roles at 80,000 Hours: a Head of Video and a Head of Marketing. You can learn more about both of those at 80000hours.org/latest. Those roles would probably be done in our offices in Central London, but we are open to exceptional remote candidates in some cases. Alternatively, if you're not in the UK but would like to be, we can also support UK visa applications. The salaries for these two roles would vary depending on seniority, but someone with five years of relevant experience would be paid approximately £80,000, something like that. The first of these two roles, the Head of Video, would be someone in charge of setting up a whole new video product for 80,000 Hours. Obviously people are spending a larger and larger fraction of their time online watching videos on video-specific platforms, and we want to explain our ideas there in a compelling way that can reach the sorts of people who care about them. That video program could take a range of forms, including 15-minute direct-to-camera vlogs, lots and lots of one-minute videos, maybe 10-minute explainers (that's probably my favourite YouTube format), or alternatively lengthy video essays; some people really like those. The best format would be something for this new Head of Video to look into and figure out for us. We're also looking for a new Head of Marketing to lead our marketing efforts to reach our target audience at a large scale. They're going to be setting and executing on a strategy, managing and building a team, and ultimately deploying our yearly marketing budget of around $3,000,000. We currently run sponsorships on major podcasts and YouTube channels; hopefully you've seen some of them. We also do targeted ads on a range of social media platforms, and collectively that's gotten hundreds of thousands of new people onto our email newsletter. We also mail out a copy of one of our books about high-impact career choice every 8 minutes, or so I'm told, so there's certainly the potential to reach many people if you're doing that job well. Applications will close in late August, so please don't delay if you'd like to apply for those ones: 80000hours.org/latest. All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris. Audio engineering by Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong.
Full transcripts and an extensive collection of links to learn more are available on our site, put together as always by the legend herself, Katy Moore. Thanks for joining. Talk to you again soon. Nathan Labenz (2:34:41) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.