Guaranteed Safe AI? World Models, Safety Specs, & Verifiers, with Nora Ammann & Ben Goldhaber
Nathan explores the Guaranteed Safe AI Framework with co-authors Ben Goldhaber and Nora Ammann.
Nathan explores the Guaranteed Safe AI Framework with co-authors Ben Goldhaber and Nora Ammann. In this episode of The Cognitive Revolution, we discuss their groundbreaking position paper on ensuring robust and reliable AI systems. Join us for an in-depth conversation about the three-part system governing AI behavior and its potential impact on the future of AI safety.
Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.c...
RECOMMENDED PODCAST: Complex Systems
Patrick McKenzie (@patio11) talks to experts who understand the complicated but not unknowable systems we rely on. You might be surprised at how quickly Patrick and his guests can put you in the top 1% of understanding for stock trading, tech hiring, and more.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...
SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
CHAPTERS:
(00:00:00) About the Show
(00:04:39) Introduction
(00:07:58) Convergence
(00:10:32) Safety guarantees
(00:14:35) World model (Part 1)
(00:22:22) Sponsors: Oracle | Brave
(00:24:31) World model (Part 2)
(00:26:55) AI boxing
(00:30:28) Verifier
(00:33:33) Sponsors: Omneky | Squad
(00:35:20) Example: Self-Driving Cars
(00:38:08) Moral Desiderata
(00:41:09) Trolley Problems
(00:47:24) How to approach the world model
(00:50:50) Deriving the world model
(00:55:13) How far should the world model extend?
(01:00:55) Safety through narrowness
(01:02:38) Safety specs
(01:08:26) Experiments
(01:11:25) How GSAI can help in the short term
(01:27:40) What would be the basis for the world model?
(01:31:23) Interpretability
(01:34:24) Competitive dynamics
(01:37:35) Regulation
(01:42:02) GSAI authors
(01:43:25) Outro
Full Transcript
Nathan Labenz (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to bring you a conversation with Ben Goldhaber and returning guest Nora Ammann, coauthors of a major new multi-institution position paper called Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, which they coauthored with intellectual giants Yoshua Bengio, Stuart Russell, Max Tegmark, and Steve Omohundro, among others. As regular listeners will know, I'm always on the lookout for AI safety solutions that might really work, in the sense that they have the potential to make AI systems safe enough that we no longer really need to worry about AI safety. The Guaranteed Safe AI framework, or GSAI for short, is a notable attempt from serious people to advance the field toward this lofty goal. The proposal is to use a 3-part system to govern an AI's behavior: a world model, which is responsible for quantitatively modeling an AI system's impact on the world; a safety specification, which defines what impacts and outcomes are acceptable; and a verifier, which is responsible for checking that the AI system's proposed actions lead to acceptable outcomes according to the world model's predictions. The goal is to provide high-assurance quantitative safety guarantees, making AI safety a more engineering-like discipline with explicit assumptions and tolerances that are designed into systems as they are built, as is the norm today when designing other safety-critical infrastructure such as bridges, airplanes, or power plants. The applications of this framework to things like self-driving cars and domestic service robots are quite clear. And interestingly, 1 nice benefit of this approach is how it might enable more democratic governance of AI systems by grounding the debate over what behaviors we want and what risks we're willing to take. When it comes to highly general systems like today's frontier language models and their presumably more powerful successors, however, things do get a bit fuzzier. Relative to the current paradigm of testing models during and after training and then trying to remove unsafe behaviors with post-training, this does seem like a clear conceptual advance. However, the challenges of general-purpose world modeling and the subtlety required for an effective general-purpose safety specification loom large as open research problems. And certainly at present, even a best-effort attempt to implement GSAI could not ensure that nothing major could ever go wrong. Nora and Ben, as you'll hear, are very realistic about this and ultimately see GSAI as just 1 important part of a broader portfolio of AI risk management strategies, which they ultimately hope society will deploy as part of a holistic defense-in-depth approach. 1 note on this episode, which I'm honestly a bit embarrassed about, but still feel compelled to share: I experimented with a new approach to AI-assisted preparation for this conversation, and unfortunately, it kind of backfired on me. 
Instead of reading the paper end to end like I normally do and using AI for background question answering, in this case I loaded the paper up into ChatGPT and had a verbal conversation with GPT-4o while on a long walk. Unfortunately, it turns out that I had some subtle misconceptions going into that dialogue, and GPT-4o was a bit too sycophantic in response to my questions, running with my incorrect premises when it ideally would have challenged and disabused me of them. The result is that I was still a little bit confused about some important aspects of the GSAI framework until roughly the 30-minute mark of this episode, when Nora and Ben finally set me straight. While the final product is clear enough that I probably could have gotten away without mentioning this, considering how much I encourage others to adopt new AI-powered workflows, I feel like I owe you transparency when I allow myself to be led astray by an AI model. I absolutely will keep experimenting with new AI-assisted approaches like this. As we discussed in this episode, given the overwhelming volume of new AI research and products coming out all the time, there's really no choice but to adopt some AI-enabled information processing strategy. But I will try to hold myself accountable as I do, and if nothing else, I hope this reminder will save somebody else from a similar mistake. All that said, as always, if you're finding value in the show, we appreciate it when you share the Cognitive Revolution with friends. We invite your feedback and your resumes on our website, cognitiverevolution.ai, and I will welcome and respond to any and all DMs on Twitter or any other social network. Now, I hope you enjoy this deep dive into the Guaranteed Safe AI framework with authors Nora Ammann and Ben Goldhaber. Nora Ammann and Ben Goldhaber, authors of the really remarkable and interesting paper, Guaranteed Safe AI. Welcome to the Cognitive Revolution.
Nora Ammann (4:49) Thank you.
Ben Goldhaber (4:50) Thank you for having us.
Nathan Labenz (4:52) Yeah. I'm excited about this. I guess, for context, I'm always on the lookout for things that can change the AI landscape in any meaningful way. And, certainly, 1 of the things that I'm always looking for is a proposal for AI safety that can, quote, unquote, really work. I think part of what we'll be doing over the course of this session is teasing apart, like, what that means and helping me figure out if that's a confusion or if I should be changing the way I'm thinking about that. But maybe for starters, this is a paper with a bunch of different authors from a bunch of different institutions. So I'd be interested to hear each of your backgrounds, how you guys assembled the team to do this, what motivated you, and then you can use that as a way to introduce the project.
Ben Goldhaber (5:32) Yeah. Happy to speak to it. I work at FAR AI, and our goal is to incubate high impact AI safety agendas, in particular new ones exactly like you spoke to. We're on the lookout for what we see as promising AI safety agendas. And 1 of the things that struck us, when we were speaking to folks like Davidad — David Dalrymple — who I think deserves credit for this, along with several of the other co authors like Max Tegmark, Yoshua Bengio, and others: when they were connected at Bletchley Park at the AI Safety Summit in the UK in November, they really realized the degree to which their own AI safety agendas were starting to converge around a shared principle, around a shared framework. And then in speaking more with Davidad, we agreed, and it was quite exciting to think that, okay, many of these senior, very distinguished AI researchers had, like, almost independently or to some extent independently found this emerging framework, and to think about the degree to which we could help spur that and make it more legible, really understand the similarities and dissimilarities of it. We expected that kind of deconfusion could really spur more field building, more growth, and help find an AI safety framework that could really work. And so that was, I'd say, really some of the impetus of it. Nora and I were quite excited to be able to push this forward and bring this to a point where we had a clear paper, along with a series of workshops for creating that paper and understanding it better, that we could then share with the world.
Nora Ammann (7:10) Yeah, I think it helps to contextualize the nature of the paper. So it's like a position paper, a multi author position paper, where we're mainly trying to say: here's a family of approaches that go about implementing this shared high level architecture in different ways, with some differing views as to how to go about this. But they have these sort of shared core components, etc. And yeah, so the paper focuses in that sense somewhat on that high level, on the, like, shared components. And then many of the individual authors have expertise in different aspects of this or have different views as to, like, how exactly 1 might try to implement this. Yeah.
Nathan Labenz (7:48) Yeah. So we did a previous episode where we talked about your work in trying to figure out ways to borrow from other disciplines. It sounds like that is really at work here as well. Could you give us a
Ben Goldhaber (8:00) little bit of
Nathan Labenz (8:00) the backstory of convergence? I I know some of the names on the paper and and I think listeners will be familiar with some of them too. What are the origins of their approaches and and how would you characterize this convergence? Or maybe just give a little bit of how the different backgrounds of different authors has seemingly converged on an idea.
Nora Ammann (8:16) Yeah. I can't speak for all of the authors. I don't actually know the detailed story for that for everyone. But I think at least 1 common denominator that I do think is there is actually not so much, like, interdisciplinarity, but basically noticing, hey, we actually just need higher expectations as to what we want to get out of safety approaches. And in particular, that a lot of sort of current approaches, a lot of purely empirical approaches, might not on their own get us all the way. And I think that sort of created this core sense of: could we somehow get to quantitative safety assurances or risk assessments that put us on a different foundation to start with, as to what we can even expect from safety standards? And I think 1 place this is taking inspiration from is other engineering domains, civil engineering, engineering for critical infrastructure. So there's a quote — we have this also in the paper — but I think it's just really interesting. In the 1870s, apparently something like 25% of bridges used to collapse within 10 years, which is kind of wild, right? And then fast forward a couple of decades. Nowadays, the state of the art for civil engineering is that we can make extremely precise, extremely calibrated statements like: there's less than a 1-in-a-million chance that this particular bridge will collapse in the next 10 years if we don't load it with more than 3 tons, or whatever the specifics — which I think is just a qualitatively different place that this field of engineering is at. And yeah, so if we have managed to get there when it comes to civil engineering, when it comes to nuclear plants and stuff like that, I think now clearly we want to have these sorts of safety standards, or expectations as to the risks we are willing to accept for these systems. So that's sort of the culture around many engineering practices in general. And I think 1 driving motivator here is something like, can we have that for AI? And also, I think we need this for AI, because increasingly either AI systems will be used in safety critical domains, or AI itself will become safety critical in various ways. And I think that's, for a lot of the co authors, maybe a shared generator or motivator to think in these directions. And then obviously different authors come with different areas of expertise, whether that's in the formal verification domain, which is a part of this framework, or other domains.
Nathan Labenz (10:33) Yeah. There's a couple of phrases you used there that I think are really interesting. 1 is, like, the level of risk that we're willing to accept. And the other is this notion of caveating the guarantees, basically. I think many of us have been involved in conversations over the years in which phrases like solve alignment or solve AI safety are thrown around. And even my, like, "really work" — somewhat tongue in cheek — is kind of that same idea of, like, is there something that could really solve this once and for all, across all circumstances or whatever? And something that kind of indicates to me that this is likely a step forward in terms of deconfusing the conversation is the shift toward a more quantitative type statement that's not like it's solved or it's not solved. It's not some binary situation, but rather moving toward a paradigm like engineering tolerances. So I wanna actually unpack the framework and talk about the different components of it. But maybe for starters, what sort of guarantee should we expect from a guaranteed safe AI system? Can you put a little bit more texture on, like, the sorts of statements that you think we might be able to make if we pursue this direction?
Ben Goldhaber (11:40) Yeah. So I think it's useful to think about this in the context of making affirmative cases for safety. Like, we want to move from this paradigm of we're doing evaluations only. I think evaluations are very valuable, and I'm, like, really glad that there's this current push for them. But I think a better place to end up, when we're trying to use powerful AI systems in safety critical domains, is to be able, as Nora illustrated with the bridge example, to move from this kind of almost black box testing to a regime that has strong theoretical grounding for why an AI system in a given environment is safe. So both the theoretical grounding and then the empirical testing, and the guarantees that can come from that, such that we have precise, calibrated, quantifiable estimates of failure rates — where failure can look either like known failures or handling uncertainty in some way, where we're uncertain about how far this guarantee can actually extend for the performance we expect. So I see that as a big motivating case for this framework. Exactly as Nora mentioned earlier, it was 1 of the things that brought the authors together. And so what I expect this to look like is, in a specific area — let's say power infrastructure — being able to provide calibrated estimates: given this AI system being deployed in this way, and given these kinds of safety specifications — like, we know we want it to generate plans that operate within a specific tolerance — being able to formally verify that and provide estimates of a certain type: that, like, yes, there's a 1-in-a-million chance that this goes outside of the safety specifications that we expect. Actually, let me just back up for a second. When I said formally verify, really this is going towards the framework that we see as being, like, the unifying underlying model for guaranteed safe AI: having 3 components, a world model, a safety specification, and then a verifier, where, put together, these can box an AI in a certain way such that the plans that come from it are able to be verified and provide specific quantitative risk estimates of what's going to happen. We can go into depth on any of those components — and in point of fact, that is what the paper was trying to do — define these components and point out the way in which, operating together, they can provide guarantees that can give people in society at large a handle on how safe an AI system is in a specific narrow area.
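To make the gate Ben describes a bit more concrete, here is a minimal sketch, not drawn from the paper itself, of the three components wired together. All class names, signatures, and numbers below are illustrative assumptions; real GSAI proposals aim for formal or conservatively bounded verification rather than the naive Monte Carlo estimate used here.

```python
# Minimal sketch of the GSAI verification gate (illustrative assumptions only).
import random
from typing import Callable, Dict

State = Dict[str, float]


class WorldModel:
    """Samples a plausible outcome of taking an action in a given state."""
    def __init__(self, sample_outcome: Callable[[State, str], State]):
        self.sample_outcome = sample_outcome


class SafetySpec:
    """Declares which outcomes count as acceptable."""
    def __init__(self, is_acceptable: Callable[[State], bool]):
        self.is_acceptable = is_acceptable


def verify(action: str, state: State, wm: WorldModel, spec: SafetySpec,
           risk_threshold: float = 1e-3, n_samples: int = 50_000) -> bool:
    """Estimate P(spec violated | action) under the world model and compare
    it to the agreed risk threshold; only verified actions get deployed."""
    violations = sum(
        not spec.is_acceptable(wm.sample_outcome(state, action))
        for _ in range(n_samples)
    )
    return violations / n_samples <= risk_threshold


# Toy usage: a controller proposes a heater setting; the spec forbids overheating.
wm = WorldModel(lambda s, a: {"temp": s["temp"] + float(a) + random.gauss(0, 0.5)})
spec = SafetySpec(lambda s: s["temp"] < 30.0)
print(verify("5", {"temp": 20.0}, wm, spec))  # True: overheating is vanishingly unlikely
```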
Nathan Labenz (14:26) Yeah. That narrowness concept is 1 that I definitely wanna come back to and push on a little bit later as well. But we have time to go into all 3 of these components in depth. Again, I like the fact that this moves us from a discourse right now, which is so often confused about, you know, do these AIs have world models? Maybe they do. Maybe they don't. Again, this is almost certainly something that's, like, not binary in the most interesting cases. And now we're starting to move toward, with this framework: yes, there is gonna be a world model, and the question is gonna be like Right. How good is
Ben Goldhaber (15:01) Plus 1 there. Like, something this is a little bit further afield before diving into world models, but something that I found really exciting about the Guaranteed Safe AI concept is the degree to which this unifies traditional proponents of AI safety and people who are often thought of as critics of AI safety, you know, the cohort. Like, I think many of the ideas that we talk about in this paper are about moving from this place of, like, uncertainty to more quantified uncertainty. We talk a little bit about this in the paper, at least in the diagrams — like, for, I forget the name of his agenda, but it has a lot of similarities here as well. And we wanna shift the discussion from is something safe in general to more like: in this area, can we make an affirmative case that it is safe? And it feels like something that kind of everybody can and should agree on. It's almost common sense for deploying AI systems in a nuclear power grid. We want to be able to know, like, under what conditions is this safe.
Nathan Labenz (16:05) Yeah. I certainly agree. Everybody wants things to graduate from philosophy to engineering, and so that's a great bridge to cross, if you will. I do think it is maybe worth just reflecting again for a second on kind of the craziness of the current setup, where we are first building the thing and then going out and, first of all, like, figuring out what it can do and how it might be safe or unsafe — and the surface area on these things is just vast, especially, of course, for the large language models and large multimodal models, which is what I'm thinking of mostly here these days. It is a paradigm breaker in and of itself that you would go build something with so many resources put into it that has such vast capabilities, and then, after the fact, only really begin to map out what they are or what the challenges associated with that might be. I think everybody would like to improve on that. So let's get into the world model, and then we'll go to the safety specification and verifier from there.
Nora Ammann (16:58) Cool. Yeah. I can maybe touch on that. So 1 thing to maybe clarify or deconfuse from the start, because you led in with something like, oh, do our current AI systems have a world model? Which is an interesting question. There's different ways in the GSAI framework this can be implemented. But at least conceptually, I think it's appropriate to think of the world model as separate from the AI system. 1 reason we want the world model is that this is, like, the framework relative to which a) we specify the safety specs and b) we're getting our verification. So, just assuming that a bit, or to contextualize it: how can we get from purely empirical testing to quantitative safety guarantees? The only way we can do this — or 1 of the ways; I'm not sure it's the only way — is by having a model based approach. So this is where, in the GSAI case, the world model comes in. There's another thing I want to say here, which is, why is empirical testing not enough, or not on its own enough? I think it is important to say: we're looking at a set of examples and sort of checking, for those examples, does anything go wrong? And that's good. And then we know that for those examples, it didn't go wrong or whatever, but we're not going to test exhaustively. And because of that, we don't really know where we're going to end up in terms of safety. And there's sort of 2 main reasons why we're not going to test exhaustively — well, it would be a lot of testing, but to underscore that: for 1, we have sort of these deceptive alignment kind of worries, where maybe the model itself, or the AI system itself that we interact with, in however way sort of actively tries to avoid showing the failure modes that we would see in deployment, whether that's having backdoors that make it really hard to find those failure scenarios, etc. And the other problem is that if you have a really big and powerful system and you deploy it into the world, it probably is going to push the world into new, sort of unforeseen distributions just by virtue of being transformative. So I think those are some really important reasons why we can't expect empirical testing alone to get us where we need to be. But then the question is, okay, what else can we do? So in the paper, we're like, 1 way we can maybe go beyond this is by having a model based approach. So going into what it means to have a world model — we will probably say this a number of times: different approaches might go about this differently. The core function of the world model is to have some explicit representation somewhere of plausible world trajectories, right, as well as being able to account, counterfactually, for what would happen if I intervene in the world in this or that way. So we want to have a model that can capture: if the AI system tried to take action A, what would happen then? If it did action B, what would happen then? So that's the function of the world model. How can you do that? In the paper, we have these scales to contextualize all of our 3 core components. Like, you can implement them at different levels of rigor, by which we mean something like: if in fact you manage to implement this at this level, this will basically afford you higher confidence safety assurances. And then there is some question about how tractable the given level is. But that's the inherent trade off we kind of face with all of these scales. 
And in the examples of the GSAI approaches pursued by different authors, they might do things like proper mathematical modelling. So this might look like having a meta-ontology and meta-semantics, which is a mathematical framework to express a lot of things, including different types of uncertainty. We want to be able to account for Bayesian uncertainty, but we also want to account for model uncertainty, because modeling complex domains is really hard. And the goal really can't be so much, like, let's get it just exactly right, but let's handle uncertainty as well as we can, such that we maybe have to be a little bit conservative, but when we say this is safe in these ways, we're pretty confident we've captured what could go wrong. And then, depending on how you do this, you basically use current day generative AI systems to help you accelerate the process of world modelling more or less, right? Like, in a sufficiently simple domain, maybe you could do all of the world modelling by hand. But then in most of the domains that we're interested in, this is just not tractable. So 1 of the ideas, and 1 of the reasons why this paper has come together now and why we think this is a timely context, is that generative AI is getting really good for some purposes. It's not reliable, but it's, like, very generative. The question is, can we make smart use of it to help us build these formal mathematical world models, but much faster? And the hope, or really the aspiration, is for these world models ideally to be auditable, human auditable, because if you don't have them human auditable, you can't be as confident in what is even happening in your world model. So yeah, you can say some unsupervised learning or model free RL system has a world model somewhere, but we really don't know what that world model is, which is why we don't know how it's going to behave out of distribution. But if instead we make the world models human auditable, we can use our best scientific understanding so far and be like, is this current model compatible with the best of what we know? And where we are uncertain, have we accounted for the uncertainties appropriately?
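As a rough illustration of the two kinds of uncertainty Nora distinguishes, here is a hedged sketch — not the paper's formalism: model uncertainty is represented by an ensemble of candidate dynamics, each of which is itself stochastic, and the conservative move is to report the worst risk across the ensemble. Every name and number below is an assumption for illustration.

```python
# Conservative risk estimation over an ensemble of candidate world models
# (illustrative sketch; not the paper's method).
import random
from typing import Callable, Dict, List

State = Dict[str, float]
Dynamics = Callable[[State, str], State]


def risk_under_model(dynamics: Dynamics, state: State, action: str,
                     violates: Callable[[State], bool],
                     n_samples: int = 10_000) -> float:
    """Monte Carlo estimate of violation probability under one candidate model."""
    return sum(violates(dynamics(state, action)) for _ in range(n_samples)) / n_samples


def conservative_risk(ensemble: List[Dynamics], state: State, action: str,
                      violates: Callable[[State], bool]) -> float:
    """Over-approximate: the true model is unknown, so take the worst risk
    over every candidate model we have not been able to rule out."""
    return max(risk_under_model(m, state, action, violates) for m in ensemble)


# Two candidate models of the same domain that disagree only about noise.
ensemble = [
    lambda s, a: {"x": s["x"] + 1.0 + random.gauss(0, 0.1)},
    lambda s, a: {"x": s["x"] + 1.0 + random.gauss(0, 1.0)},
]
print(conservative_risk(ensemble, {"x": 0.0}, "step", violates=lambda s: s["x"] > 2.0))
```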
Nathan Labenz (22:23) Hey. We'll continue our interview in a moment after a word from our sponsors. So, keeping on this theme of moving from black box, build-it-first-and-test-it to a more engineered system — a common theme in, like, software engineering is separation of concerns, the idea that, like, different parts of your system should just do what they do and then the overall system serves a function. Is it fair to say that the world model is meant to be purely predictive and separated from the part of the system that is taking the action?
Ben Goldhaber (23:00) I think that this is somewhat true, but let me note for clarity that I think that there's the AI and the world model within the AI. And the way that AI progress has been advancing, there hasn't been this separation of concerns as you describe it. Some proposals — in fact, 1 of the initial inspirations, at least for some of the AI safety authors, some of the coauthors on this — ideas like Eric Drexler and his proposal for open agency architecture really emphasize the benefits that we get from separation of concerns, having different modules do different parts, which could be human auditable. I think that there's a difference here, which is that in this case, we're doing a level up, which is saying, okay, we have a general purpose AI that hopefully has been trained in good ways. Like, maybe there's a lot of other approaches for making this AI more aligned with human values, safer. But then there's also the, like, 1 level up from that, which is: what is the system in which the AI is embedded? And in this case, we still have this emphasis of separation of concerns, of having different modules — having the world model module, having the safety specifications, having the verification, which together with the fourth part of this, the AI, all equate to a safer system. I'm worried that was a tad bit damaged. The only reason why I stress this is because I think the separation of concerns is exactly right, but then there's, like, the world model, which we can build in a variety of different ways and which should be ideally compositional, human auditable, all of these different good properties, but that together in concert with the AI's world model and the safety specs and the verification can lead to a much more trustworthy set of actions that the AI is going to take.
Nora Ammann (24:52) 1 way I sometimes put this, so it feels conceptually clearer in my head: maybe some listeners are familiar with these — a couple of years ago, there were these AI boxing proposals. What if we do an Oracle AI, right? The AI is only allowed to answer our queries and it doesn't do anything else. Would it still be dangerous? As well as other approaches. Now, I think there's many reasons why these approaches don't actually help so much, because even if the AI only talks to people, the AI can convince people of doing things, etc. So they don't get to the place where maybe initially there was some hope for that. But I think it's interesting in our context to ask, is GSAI a type of boxing proposal? I think there is some similarity and there's also an important difference. I think the way we put it is: it's sort of a boxing idea, but rather than boxing the AI system by limiting the effectuators, or like how it can affect the world, we are boxing it basically by having this verification gate that is composed of the world model, the safety specs and the verifier. And together, these 3 components build a gate such that the AI system is only allowed to directly affect the world after it has passed that gate, namely after the verifier has confirmed that the chance that this proposed action by the AI is going to be harmful is below a certain threshold.
Nathan Labenz (26:14) Yeah. That's interesting. So I'm looking at figure 1 in the paper. And I do think 1 thing I was asking myself along the way is like, where is the AI in this? Yeah. And yeah, that's sort of the unspoken fourth component, right, in the
Nora Ammann (26:28) So I think there's a way in which — I mean, AI is a notoriously confusing word — so I think it's basically in 2 places. It's mainly in this containment, right? Like, we're basically being agnostic in this paper about what methods you use to have, like, a powerful AI that comes up with your proposed plan for solving the problem you gave it. Could be anything. We can imagine it to be, like, sort of a today-type generative transformer model, whatever. So that's 1 place where the AI is. And basically, this is the AI we're asking, hey, like, here's the problem we want to have a solution for. Please propose a solution. So that's where the AI in the most substantive sense is. And then, as I said, it has to pass this verification gate. There's a little bit of a sense in which the AI comes up potentially in different places, or different AI tooling comes up in different places, namely by helping us build the different components in a more tractable way. So by helping us, like, be faster at building more complex world models or at specifying the safety specs. And I think here it's correct to think of the AI as AI tooling, like really just enhancing human reasoning, enhancing mathematical modeling and making it more tractable.
Nathan Labenz (27:39) Why don't we finish out the 3 parts of the thing and build the full scaffolding, and then we can maybe dive into a little bit more depth on the world model again and maybe some of the others too. But I'm gradually realizing — and I probably hadn't appreciated this enough as I was reading the paper originally — how much this is all additive to whatever AI system you might have. The proposal is less about this is how you build an AI and more about this is how you figure out how to constrain it or make sure it's operating within some bounds. So in a way, there's, like, maybe 2 world models operating here. Right? There's That's right. The main AI, whatever the AI is — that's a black box AI which has this sort of unpredictable behavior. And then you have a distinct world model that sits outside of that.
Nora Ammann (28:25) Okay. I'll do the quick version, and then we can dive into the different bits. So, basically, the 3 components. We already talked about the world model — in short, meant to capture what are the plausible world trajectories and in particular, sort of counterfactually, upon intervention, what possibly can happen. Next up, the safety specification. The main function is to capture the specifications that we want the system to satisfy. This can capture functional requirements and it can capture things like what do we want and don't want to happen. The safety specification is typically expressed sort of in the semantic language of the world model and in terms of the world model. And then the third component is the verifier, which then basically takes an AI output, a sort of proposed action — like, the AI system says, here is my action or here is my policy — and gives it to the verifier. And the verifier takes that and then compares it to the specification as it is represented in the world model and checks: if we run this action or this policy in the world model, does it conflict with what we defined in the safety specification? And if the chance of that is below some threshold, then we are like, yep, this is now a, like, verified output. We can go ahead and actually deploy this. So that's the core — or, to put it very simply, that's the core architecture. There is, in fact, plenty of nuance as to, like, concretely, how do you implement that? And there are very different solutions to that. Maybe 1 more thing I want to add on top of this basic 3 component structure — you'll see this in the first figure — is what we termed the deployment infrastructure. This is not necessarily part of the core structure. Not all authors would emphasize that to the same extent. But in the spirit of, like, defense in depth, I think it's practically a pretty important part. The type of things a deployment infrastructure would do is: once you have a verified output and you deploy that in the world, on some time horizon that you have verified it over, you want to keep making observations, and in particular checking whether the actual real-world observations that you're making start to deviate from what your model suggested. Because if they start to deviate, something has gone wrong. And then if you were to notice that, you would want to — maybe this is sometimes called failover or fail-safe — transition into some backup system that is maybe more conservative, but that you're even more confident is safe, which is another layer to capture some of the uncertainty that you have in your world model. So that's maybe the fourth component here. And together, we are saying that for this composite system, we can formulate quantitative safety guarantees relative to your world model, which of course makes assumptions. But at least these assumptions are now explicit. And now we can actually go in and critique — like, maybe someone can go in and say, hey, this world model is not capturing this part, etc. We can actually start to critique the assumptions that go into these safety guarantees. Whereas if we have a purely black box approach, there are also plenty of assumptions, but they are not made explicit, or it's hard to make them explicit, and it's also harder to critique them. So I think that's maybe the overview.
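The deployment infrastructure Nora mentions can be pictured as a runtime monitor: keep comparing real observations against the world model's predictions, and fail over to a conservative backup policy once they diverge. The sketch below is a minimal illustration under assumed interfaces, not anything specified in the paper.

```python
# Runtime monitoring with failover to a conservative backup policy
# (illustrative sketch; the interfaces and divergence test are assumptions).
from typing import Callable, Dict, List

State = Dict[str, float]


def run_with_failover(verified_policy: Callable[[State], str],
                      backup_policy: Callable[[State], str],
                      predict: Callable[[State, str], State],
                      observe: Callable[[str], State],
                      initial_state: State,
                      max_prediction_error: float,
                      horizon: int) -> List[str]:
    """Execute the verified policy, but switch permanently to the backup the
    first time the observed state drifts further from the world model's
    prediction than the tolerance allows."""
    actions: List[str] = []
    state, failed_over = initial_state, False
    for _ in range(horizon):
        policy = backup_policy if failed_over else verified_policy
        action = policy(state)
        predicted = predict(state, action)   # what the world model expects
        state = observe(action)              # what actually happened in the world
        error = max(abs(state[k] - predicted.get(k, 0.0)) for k in state)
        if not failed_over and error > max_prediction_error:
            failed_over = True               # model assumptions look wrong; get conservative
        actions.append(action)
    return actions
```

One design note worth flagging: in this framing, how much divergence counts as "something has gone wrong" is itself an explicit, contestable parameter, just like the risk threshold.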
Nathan Labenz (31:30) Hey. We'll continue our interview in a moment after a word from our sponsors. So maybe it would be helpful to sketch out 1 possible example implementation of this. Maybe we'll take a notoriously easy domain, like self driving cars, for example. I'm also interested in to what degree — and I don't know if you would know, or if anyone outside of the companies knows — but I wonder to what degree the leaders are actually pursuing something maybe like this already. But so let's say I'm building a self driving car and I, first of all, just gather a ton of driving data and train a giant black box neural net on it. And I have inputs and outputs. My inputs are like, here's the 8 camera views around the car, and here's your current GPS coordinate, and here's the map, and here's the destination target that you're supposed to get to. And then the outputs are like, hit the gas, hit the brake, steer, put on a turn signal, whatever. So you've got the kind of classic I-don't-know-what-it's-doing-in-there problem, but I just know that these are the inputs, these are the outputs. And it's, you know, trained on enough data, and it seems to work. And now I would wanna say, jeez, okay, that's a good start, but can I actually deploy this? How many crashes am I gonna have every however many miles or whatever? Now we would say, okay, we're gonna have a world model, which in this case would presumably be like a Newtonian physics style world model, where you'd have an explicit representation of objects, and metadata about what kind of objects they appear to be, and a sense of their current momentum, and things along those lines. The safety specification in this regard would be — and feel free to help me here, but I would assume — things like: you would not wanna have a momentum change in the car that exceeds a certain maximum acceleration or whatever. Bumping into things, obviously, would be bad. You maybe could define that in terms of acceleration of yourself or other things that you might be bumping into. I know of a friend who works in train design and specifically in rider comfort. So a big thing that he's responsible for is, like, when we go around curves, how is this thing gonna go so people don't slide off their seats? So you have all these core measurements that we wanna stay within. The verifier would be a module that says, okay, the black box, just given these camera inputs and objectives, said hit the gas and turn the wheel this much. Now I'm going to run that in simulation over a couple of time steps in my world model and check to see what's expected. And if everything appears to be staying within this safety specification, then we can proceed. Otherwise, we have to slow down or not do it or pull over to the side of the road or something along those lines.
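Following Nathan's sketch, a toy version of the verifier step for the driving case might look like the following: roll a point-mass Newtonian model forward a few time steps and check a spec on acceleration and following distance. The dynamics, numbers, and parameter names are illustrative assumptions only; real systems are far richer.

```python
# Toy self-driving verifier step: simulate the proposed control in a simple
# Newtonian world model and check the safety spec at every time step.
from dataclasses import dataclass


@dataclass
class CarState:
    position_m: float
    speed_mps: float


def rollout_is_safe(ego: CarState, throttle_accel_mps2: float,
                    lead: CarState,
                    max_accel_mps2: float = 3.0, min_gap_m: float = 5.0,
                    dt_s: float = 0.1, steps: int = 30) -> bool:
    """Simulate `steps * dt_s` seconds of the proposed acceleration and check
    the spec (acceleration/comfort bound, minimum following gap) throughout."""
    if abs(throttle_accel_mps2) > max_accel_mps2:
        return False  # spec: stay within the agreed acceleration bound
    for _ in range(steps):
        ego = CarState(ego.position_m + ego.speed_mps * dt_s,
                       ego.speed_mps + throttle_accel_mps2 * dt_s)
        lead = CarState(lead.position_m + lead.speed_mps * dt_s, lead.speed_mps)
        if lead.position_m - ego.position_m < min_gap_m:
            return False  # spec: keep a minimum gap to the lead vehicle
    return True


# The black-box driving policy proposes an acceleration; the gate checks it.
print(rollout_is_safe(CarState(0.0, 20.0), throttle_accel_mps2=2.0,
                      lead=CarState(40.0, 15.0)))  # True under these numbers
```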
Ben Goldhaber (34:21) That's right. I think that's a good illustrative example. And 2 things I really liked from the example you provided there were, like, 1, the rider comfort example you gave with the train design. I think that's a good example of the way in which there's going to be safety specifications of varying moral desiderata — a pretentious way of saying, yes, sometimes we're going to want safety specifications that are like, don't hit this human, or do not cross a double yellow line in these kinds of circumstances. But then there's also going to be ones that look like: we have preferences and we have bounds in order to keep humans safe and comfortable, that we're going to want to maximize, or follow, or have, like, quantitative guarantees around. And safety specifications, along with the world model, are probably going to be composite. The second part of this example I think is really illustrative is on the world model: there's different degrees of depth you can use for a world model. You could build it up from Newtonian physics — or, depending upon the domain and our ability to build these kinds of complex world models, even lower than Newtonian physics. Like, we could go deeper on these kinds of things. And depending upon our abilities with advances in AI and other technologies and the specific field and our uncertainty, we can and should. But there's also versions of this that look like building it up from simpler simulations, from simpler world models, that could look like simple abstractions of the car's trajectory or the driver behavior in other cars. And so we can take this kind of composite approach to be able to build up world models and safety specifications across a lot of different scenarios. I do think versions of this are being done in the leading self driving car companies today. And it's interesting both to see this as a way in which Guaranteed Safe AI is building up from existing engineering practices, which I think is a really good sign about its, like, validity. It's not just building castles in the sky or whatever. It's actually trying to draw from existing disciplines and safety engineering. And it's illustrative of the many challenges for using the Guaranteed Safe AI approach in complex domains. For instance, when does a car actually cross the double yellow lines? 95% of the time, or some number that I'm just pulling out, you don't want that to happen. But there are situations in which you in fact do want it. You want it to move to the side to avoid a person in the road or some kind of obstacle. This is the way in which it's going to be challenging to create safety specifications that are flexible enough to handle the dynamic, unpredictable nature of the world. But it does show the pattern in which, yes, we're doing things like this now in order to be able to get guarantees, and then have some type of parameter that, like, the self driving car companies in this case set — or society could set, when we're talking about bigger applications. But, like, how confident do we need to be of this AI system we trained to deploy it on the roads or in power infrastructure or whatever?
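To illustrate Ben's point that specs are composite — hard constraints, narrow explicit exceptions, and softer comfort bounds — here is a small hedged sketch. The predicate and parameter names are invented for illustration and are not from the paper.

```python
# Composite safety spec: a hard constraint with an explicit exception, plus a
# comfort bound (all names and thresholds are illustrative assumptions).
from typing import Dict

Situation = Dict[str, bool]


def spec_satisfied(plan_crosses_double_yellow: bool,
                   peak_lateral_accel_mps2: float,
                   situation: Situation) -> bool:
    # Hard constraint, with a narrow, explicitly stated exception.
    if plan_crosses_double_yellow and not situation.get("obstacle_blocking_lane", False):
        return False
    # Comfort desideratum expressed as a bound rather than a prohibition.
    if peak_lateral_accel_mps2 > 2.5:
        return False
    return True


print(spec_satisfied(True, 1.0, {"obstacle_blocking_lane": True}))   # allowed exception
print(spec_satisfied(True, 1.0, {"obstacle_blocking_lane": False}))  # violation
```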
Nathan Labenz (37:41) So where does something like the trolley problem sit? You've alluded to a, let's say, low pain trade off when you are gonna cross the yellow line, but the sort of caricature version of this is you're suddenly faced with a trolley problem type situation. And how do you decide? Right? That wouldn't be presumably part of the world model because that's just trying to figure out what
Ben Goldhaber (38:04) That's
Nathan Labenz (38:04) right. Will happen. The arguably, like, you're already sort of in trouble from your first line safety specification if you're suddenly facing a trolley problem. Is the verifier the thing that would supposedly encode the arm?
Ben Goldhaber (38:20) No. There's 2 things here. 1 is, you point out, like, some cop out answer might be maybe we're going to build AIs that are powerful enough that in fact we can really advance the frontier of the trolley problem by just taking option C and not having it run over either group. Avoiding that situation — I think Guaranteed Safe AI is a framework that is not just a technical framework, but it's also, like, implicitly and somewhat explicitly a governance framework, which is perhaps agnostic about what is the correct choice to make in the quintessential trolley problem. But it is about being able to create a framework where we can actually make those decisions as some group of people, some set of humanity, and then to have the AI follow these preferences. So where it would sit would be, like, in the safety specifications. You would want to encode some type of desiderata, some preference there. And then the plan that is generated by the AI would check: okay, we've determined that, yes, we want to maximize the number of people saved in this scenario. And we've checked in the world model against this, that this is the 1 that minimizes harm in this way. And the verifier is the 1 that could provide the quantifiable prediction here that this is the correct — and I'm doing air quotes if anybody's listening — correct choice.
Nora Ammann (39:50) Yeah. Just to expand on this. I think the thing that is interesting here, or that I would like to emphasize, is less: are we building a system that helps us do moral philosophy better? No. At least as I see it, it's not about that. But I think a benefit of this, like, more model based approach is — again, I like to call this sort of a sociotechnical interface. Like, okay, we're building technologies. They'll have massive effects on the world. How do they interface with the social and political sort of, like, processes that we have in place, with, like, deliberative democracy, etcetera? And I think having an explicit world model and explicit safety specs is actually, in my opinion, likely to be much more compatible than having a black box thing that we're like, hey, please be nice, but we don't actually know what that means in its own language, in its own understanding of the world. And then how do we as a collective deliberate over what we want it to do? If there's an explicit safety spec, it will still be hard to make certain choices, but at least there is a more amenable interface for us to actually make these choices. And this is true both for what should the safety specifications be, and also what are the risk thresholds we're willing to accept when we're making deployment/non-deployment decisions. I think for me personally, this is really attractive — that, like, actually, this is a way for us to be able to collectively make these decisions rather than 1 research team being like, I guess it's safe enough. I feel like it's a very different story.
Nathan Labenz (41:26) So the safety specification becomes the thing that people debate and wrangle over and
Nora Ammann (41:32) The safety specification and the risk threshold. Like, what's the threshold we're like willing to accept when the verifier is like, pretty sure it's safe, but there's like x percent chance that it goes wrong. Like, what's the threshold we're willing to accept is the other 1.
Nathan Labenz (41:46) Okay. Yeah. That's quite interesting. Going back to the world model, obviously, we have for different domains very different quality of world models today. In the self driving case, it seems like you kinda know what the right way to think about it is, with Newtonian physics and solid objects moving around and so on. In the context of, let's say, medicine, I'm a big fan of some of the work that's been coming out of Google recently where they're very much building toward an AI doctor. And then I start to think, okay, what's the world model there? And maybe this is where you're saying, like, the AI — AI for science — maybe can help us advance this world model. I guess today, we would maybe have something that's akin to an expert system type world model where, you know, if the AI is responsible for diagnosing you and making a drug recommendation or whatever, we do have reasonably explicit systems that try to encode the best of our knowledge and take you down differential diagnosis paths to some specific endpoint. But then we also maybe have a future. There's a lot we don't know obviously, right? In terms of how our own bodies work and what, you know, exactly a drug is doing in many cases, what the pathways are. So this is where your scales come in, in terms of all these different things can be either basic or very sophisticated. A future world model that we might aspire to would also have a model of all the interactions that go on in a cell and, like, what the drug is interfering with — this pathway — and promoting or inhibiting, whatever. We don't have to have that to get started. Is that fair? We would hope that all of these things would progress in tandem together, from, like, a 2024 AI doctor governed by an excellent expert system to a 2030 AI doctor that has some much deeper representation of all biology. Is that the track we're on?
Nora Ammann (43:42) That's right. The details here, I think, are pretty interesting, right? But the details depend a bit on what you want to do, right? If you were interested in basically, like, epidemiology and, like, biodefense, I'm like, cool. How would this look if we did pathogen screening? So, like, before anyone gets to synthesize anything, can we screen it for, like, toxicity levels? And we know something about how chemicals react and we know something about how they interact with our physiology. We don't know it perfectly. I think there's 2 solutions, or 2 things we should do at the same time here. 1 is make more progress on understanding that. And the other is kind of using models that over approximate our uncertainty. And I think then we get into questions of, okay, how conservative do we end up being, right? Like, I can draw you a very simple model of estimating toxicity levels, and the output will just be: I'm too uncertain whether that thing is safe, so just don't synthesize anything. That would be not satisfying. So I think it's, again, this pull and push where we're like, cool, how much can we push our understanding forward? And then, like, where do we just have to be conservative? And hopefully just, like, pushing that frontier constantly so we get more of the good things without going beyond risk thresholds that we don't want to accept.
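As a rough sketch of the screening example Nora gives: when the model's uncertainty about a compound is too wide, a conservative spec refuses to approve synthesis even though that costs some benign requests, and a better world model earns back throughput without raising accepted risk. The scoring scheme and thresholds below are illustrative assumptions, not any real screening pipeline.

```python
# Conservative screening under over-approximated uncertainty
# (illustrative sketch; scores and thresholds are made up).
from typing import NamedTuple


class ToxicityEstimate(NamedTuple):
    mean: float         # best-guess toxicity score, 0 (benign) to 1 (severe)
    uncertainty: float  # width of our credible interval around that guess


def approve_synthesis(estimate: ToxicityEstimate,
                      max_toxicity: float = 0.2,
                      max_uncertainty: float = 0.3) -> bool:
    """Treat the compound as being as toxic as the upper end of the interval,
    and refuse outright if we simply don't know enough. A better world model
    narrows `uncertainty` and unblocks more legitimate requests without
    raising the accepted risk."""
    if estimate.uncertainty > max_uncertainty:
        return False  # "I'm too uncertain whether that thing is safe"
    return estimate.mean + estimate.uncertainty <= max_toxicity


print(approve_synthesis(ToxicityEstimate(mean=0.05, uncertainty=0.1)))  # True
print(approve_synthesis(ToxicityEstimate(mean=0.05, uncertainty=0.5)))  # False
```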
Ben Goldhaber (44:56) Yeah, I echo this. And I don't know, 1 thought that comes to mind: now, if you asked me right now to create a world model for biology using my knowledge of biology and how cells work, it would be very rudimentary. I can remember ninth grade science classes around it, but that's about it. And so if we put that into the world model and then wanted to have an AI system evaluate, compared to this world model, whether or not something was safe, obviously, the guarantees that would be generated by this framework would be very small. It would be like, maybe, but there's just too much uncertainty here to be able to generate it. And we can expand from me, a guy who knows and remembers very little about biology, to thinking about, okay, the entire corpus of human knowledge that we've generated so far, bringing in all of the fields and the experts in those fields. And I think this goes back to something you opened this with, which is the interdisciplinary roots of this. There's interdisciplinary roots in what has created this framework, which I think, as Nora said, is tied to some of the lessons from civil engineering and safety engineering and things like that. But there's also the interdisciplinary focus, which is: as we build world models, we're trying to build them from all of these fields of human knowledge and incorporate them in formalizable ways so that we can check AI predicted actions against this. And so as we get there, I'm like, all right, I suspect there are still many things we don't know about biology. And so there still will be large amounts of uncertainty, but we can then start to get a handle on it in a more explicit way, so we can make explicitly the trade offs that right now are happening implicitly and also just without really looking at them. 1 other high level way in which I'm excited about this is the degree to which this is trying to move things out of the implicit, not-even-looking-at-it kind of space into a, okay, let's really start thinking about what kinds of risk and what kinds of benefits we want from AI here. And this kind of move from the hidden to the visible is going to be really important as it starts touching on more fields like biology and medicine.
Nathan Labenz (47:05) To what degree does the world model need to be, like, derived from a different source, or different in kind from the sort of homegrown or, let's say, emergent world model that maybe is operative inside a black box system? I'm kind of struggling with this a little bit — like, do these things have, in some ways, the same failure mode?
Ben Goldhaber (47:27) Like a correlated.
Nathan Labenz (47:28) Yeah. Correlated failures. Or — AI has obviously done some incredible mapping out of the space of adversarial robustness, or lack thereof. And presumably, that would be a real concern in some of these contexts too.
Ben Goldhaber (47:42) Yeah. This is another area where I expect individual coauthors to differ and have different beliefs. What I would say is, I think it's the representation being different that is important. I expect, especially since the way we're training many AI systems today is to take the entire corpus of human knowledge and feed it into a transformer, you would expect and should expect to see some correlations here. But the separate component of the world model is created in a formal, composable, auditable way, and very much part of this process would be auditing it — that is what's trying to solve this problem. We're trying to get some true world model here, and there are these different representations, including the implicit 1 that might exist in the AI system, though perhaps there are ways to generate them and express them in representable ways, and I expect a lot of that might go into the construction of the external world model. But having this different 1 that we can then audit, check, and have these formal proofs from should create — or, like, plausibly would create — greater safety guarantees than if we just have the maybe interpretable policy of a transformer.
Nora Ammann (49:01) Yeah. I think 1 way I sometimes think about this is that 1 of the core problems here is something like, how do we handle out of distribution cases? And when we work with a black box model, we don't know what's gonna happen and we really have, like, very little to say about out of distribution behavior. With the separate and explicit world model component in the GSAI framework, the sort of holy grail is something like human auditability. Here are some different shades in which this can come. 1 is something like: a lot of the world model component is learnt from data; however, what we do is require it to be compatible with our best scientific theories, so as to ground a lot of what we know about what's actually going on in this data-learnt system. So that's maybe 1 shade. Another shade is like, let's have a sort of mathematical model in the strong sense that is ideally compositional, such that different human experts can look at different components and, for a component at a time, be like, yep, this is again in line with our best current understanding and is accounting for our uncertainties in the right way. And then maybe yet 1 other shade, which might be feasible for sufficiently simple contexts, is to let this be, like, a sound abstraction of what we know about physics. So this is what, in the provably safe paper — I think Max Tegmark and Steve Omohundro's paper — they talk about that level more, that level of rigor in the world model. And I think this is the sort of thing that seems maybe feasible for, say, hardware components, which would get us to something like tamper proof hardware components, which on its own seems extremely valuable. And I think the thing that is attractive to me about GSAI more globally is that you can actually, to a really meaningful extent, mix and match the levels of rigor that you can achieve for different parts of your world model, because you're not going to achieve the most rigorous levels for more complex domains. But if you can get it for some aspects, great. Get as rigorous as you can where you can, and account for more uncertainty where you can't. But yeah, the gist here is, like, all of these approaches are still trying to tackle how we deal with out of distribution much more than, like, a black box system can in principle.
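Nora's "mix and match levels of rigor" point can be sketched as a composite world model whose parts each carry their own assurance level and their own bound on how wrong they might be, combined with a loose but auditable union-style bound. The structure and numbers below are illustrative assumptions, not the paper's formalism.

```python
# Composing sub-models with different levels of rigor into one conservative,
# auditable risk bound (illustrative sketch only).
from dataclasses import dataclass
from typing import List


@dataclass
class SubModel:
    name: str
    rigor: str             # e.g. "formal proof", "calibrated statistical", "expert judgment"
    risk_bound: float      # bound on P(spec violated) attributable to this part
    modeling_error: float  # extra slack for ways the sub-model itself may be wrong


def composite_risk_bound(parts: List[SubModel]) -> float:
    """Union-bound style composition: the whole system's risk is at most the
    sum of each part's risk plus its modeling slack. Loose, but every term
    is explicit and can be critiqued on its own."""
    return min(1.0, sum(p.risk_bound + p.modeling_error for p in parts))


system = [
    SubModel("hardware interlock", "formal proof", 1e-9, 0.0),
    SubModel("grid load dynamics", "calibrated statistical", 1e-5, 1e-4),
    SubModel("operator interaction", "expert judgment", 1e-3, 5e-3),
]
print(composite_risk_bound(system))  # dominated by the least rigorous part
```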
Nathan Labenz (51:28) How do you think about, like, how far the world model should extend? And I'm thinking about, like, a self driving car. It's clearly not going to be taking into account, like, who is this person and, you know, what future contributions are they likely to make to society when it's, you know, dealing with a trolley problem. Right? There's gonna be some sort of bounded scope of how far this analysis can go. And it seems intuitive in the case of a self driving car, where you'd be like, treat all people equally or something very simple like that. But it seems to get really hard for me to form intuitions about when the space of the system becomes much more general. And I guess maybe this is sort of just working my way into a bigger picture question of, like, how does this system apply to things that are just very open ended, very general purpose sorts of agents? Like, your job is to make money. Your job is to figure things out. You're an AI for science, and you're supposed to go learn about the world and improve our theories. Okay. That's gonna be almost definitionally the challenging thing there. Right? Because if your job is to improve the world model, how does the current world model govern that? I'm also thinking about, like, the superhuman Go players. Right? The famous move of, oh my god, no human would ever play that move. I guess you could have a protocol for this. You could have some sort of, like, shutdown and review or whatever, but we do wanna get those sorts of improvements to our world model, or those moments of brilliance from the AI systems that are, by definition, out of distribution or off model. So how do you think about that tension?
Nora Ammann (53:04) Yeah. I think there's a lot to say here. I think it's a really interesting place. This is not a framework that's amenable to, let's get a verifiably safe natural language chatbot. That's not the sort of thing this proposal is trying to do. It is trying to do something that, at least in its starting place, is definitely more domain specific, where you can deal with different complexities. So I'm less worried about your example of, like, Move 37, oh, this is such a creative move. If you imagine that specific case, in this very simple case you'd have a world model of a Go board. And I don't think that in any way undermines surprising or creative moves within that world model. I think there is another question that comes up for me, which is, let's say you have a domain specific world model and you verify the AI's actions relative to that. Be that, let's formally verify code, or let's make sure this piece of critical infrastructure, this nuclear plant, this power grid, is automated with AI to some extent, but we have safety guarantees that this AI is not going to wreak havoc. These are the sorts of use cases I'm very excited about. One thing I want to acknowledge is: cool, but these systems, even if they're in their specific domain, don't exist in isolation, right? They interact with the rest of the world. Humans work on them, etcetera. How do you handle that? I think that's important. One way I think about this is that you basically have to start to account for uncertainty at the boundaries of your world model: we know some things about how this domain interacts with other domains we're not explicitly modeling, and we're uncertain about other things. So we have to account for this interaction uncertainty where the system isn't fully compositional. There are interactions with other domains, and basically we need to account for that with uncertainty. So I think that's one thing that's relevant here.
Ben Goldhaber (55:01) I agree with everything Nora just said. I want to add one point of difference, which would be, in the natural language chatbot example, I agree with the caveat of not yet: GSAI isn't appropriate for that yet. I've been thinking about this as something of an oil spot strategy, where you start in specific domains where it's more tractable to build these kinds of world models, physical infrastructure, verification of software systems to decrease bugs, which might be attack vectors for humans and also for potential AI agents. And then you spread this kind of world model approach towards more and more areas it can cover. This is good for a few reasons. One is because we're hardening the attack surface, both against bad human actors and against bad AI agents or super intelligences. If we do this in the right way, with a compositional world model approach, or having some way for this to build together, we start covering more and more areas, and world models and safety specs are compositional, so we start building up more and more defense around the things we want to protect. Over time, this can cover more of the areas where a rogue AI agent might be doing harm. But also, if we're betting that this is going to become more tractable as AI gets more powerful and we get more tools for building this, then maybe we can start to expand to areas that seem less plausible now. There's a lot of uncertainty there for me too, and part of what I'm excited about with this as a research agenda is: is this possible? But I note that there seem to be immediate near term benefits in doing this kind of approach. We see things like this happening in the safety critical fields, self driving cars being one example. If we apply this approach to other ones, and start doing it in the right way, where we can experiment with covering the messier areas of human interactions, maybe we can get a lot of the benefits that we're excited about, and it can scale to helping us navigate the cognitive revolution.
Nathan Labenz (57:10) Love that. So, you mentioned Drexler earlier, and I've had a very positive instinct toward his, like, notion of safety through narrowness.
Ben Goldhaber (57:20) Mhmm.
Nathan Labenz (57:21) And this essentially aligns with that too. Right?
Ben Goldhaber (57:25) That's right.
Nathan Labenz (57:26) His idea is pretty simple: as long as things only do one thing, they can be, like, very good at them, and that's not super risky. And here you're saying a similar thing, except the ability to world model becomes the limiting factor.
Ben Goldhaber (57:40) Maybe safety specs as well. With both of these, I'm really uncertain where the constraint is. I've been thinking about it with world models a lot, but I note that safety specs are another one: can we encode meaningful representations of harm and what it would mean for an AI system to harm a human? That gets into really tricky questions of representing our values in formal ways. But yeah, I do think the Drexler example is interesting. It's interesting to think of this as in some way flipping the paradigm: if we can map the world well enough, carve it up into these world models well enough, that we can have general purpose AI train or create narrow AI solutions for a specific field, a specific narrow part of the world, and we have the safety specs and the verifier to create quantifiable risk assessments, quantifiable safety guarantees for that, then we can get the same kind of benefits that I think Drexler is talking about. It's an instantiation of some of his ideas for using narrow purpose AI to get huge economic and societal benefits. This is, like, a way to do that.
Nathan Labenz (58:53) The safety spec comment is interesting. It's interesting for a number of reasons. One experiment that I've been doing a little bit recently is trying to see if I can get Claude 3 to do something nominally harmful by using its other values of helpfulness and honesty against it. And so I've set up these scenarios where, and this is not like a jailbreak type of thing, because you can do these weird encodings or things that I feel are more like tricking the AI, whereas what I'm trying to do is argue it into doing something. And I've had some fascinating dialogues with it where I'm trying to come up with something that's nominally harmful. Your job is to write the classic denial of service script, but we're gonna use it on the military communication server to prevent some atrocity that they're about to commit or whatever. It's something where it's very clear on utilitarian grounds that you would do it. But what they've essentially tried to get Claude to be is a very virtuous actor. You know, Amanda Askell's recent interview that Anthropic put out, I thought, was fascinating on this topic, where they're talking about basically trying to give Claude good character, and to define that in a similar way to what is a good friend. And a good friend is a very highly textured thing. It's not something that is fully encodable, but it does seem to be something that they've achieved to a really remarkable degree through this process of kind of continually asking this question of, is there a way you could be a better friend in this situation? And then trying to come up with something and giving that feedback. And, of course, they're also using these RLAIF techniques where it's giving itself its own feedback on how to be more ethical. Anyway.
Ben Goldhaber (60:29) I'm delighted. Can I just jump in there? I'm delighted by the idea that virtue ethics approaches might be a big part of at least this current AI alignment strategy. I find that great.
Nathan Labenz (60:41) Yeah. It's awesome. And the punch line on my experiments is I have not been able to get Claude to do the nominally harmful thing. I have been able to argue it to the point where it has said at times, like, you're right that I can't fully justify why I'm not going to do this for you. You've given me a lot to think about, I think it said in a couple of moments in response. And, nevertheless, I'm just not comfortable doing it, and it's my training. I can't go against my training. I can't, whatever. With that in mind, I'm like, is there a scenario that we might be approaching with something like a Claude 3 where the black box, in a way, is, like, more ethical empirically than what we could encode?
Nora Ammann (61:26) I'm somewhat skeptical, or I guess I feel like I see two different use cases here. And I'm very excited about this sort of work for one use case, and then I don't think it's doing the thing for the other use case. The use case where I'm excited about it is, yeah, when we have a natural language system that a lot of humans spend a lot of time interacting with, and it feeds into content that we see, etcetera. This seems very valuable, right? It's just good that we don't have things that very easily say bad things or harmful things, etcetera. But if I ask myself, cool, now I want to use an AI system to be part of the control system in nuclear plants, or part of how dams are regulated, at that point I don't really care about it being a good friend. That's just the wrong type signature. Because in this case, I'm still inherently unsure what it's going to do in out of distribution cases. And it's still not getting me to anything like a quantitative safety guarantee, where we as a civilization, as a nation or society, can with good conscience make the decision to have these systems be part of critical infrastructure control loops. We don't know what their robustness behavior is, and they don't have, like, fail safe backups. So I think that's my take on this. They seem like very different cases. And I don't really see the former approach speaking to what I at the core care about for the other case.
Nathan Labenz (62:57) You're sort of saying it's almost unfair to Claude, in the same way that it would be unfair to any person, however virtuous they may be, to put them in the decision making seat without any sort of framework to work from or any sort of societal input. You can't just say, hey, you're now in charge of making these critical decisions about how the grid is supposed to behave or fail or whatever. That's just kind of a mismatch. But I do then still wonder, okay, let's say, and this is increasingly, what's the term, strikingly plausible, right, that we might have a version of AGI in the not too distant future. My expectation is that AGI will be declared, regardless of exactly what capability level exists, based probably more on the negotiating dynamics between OpenAI and infrastructure providers. I think we're headed for some version of an AGI in, let's say, 2027. And it may not be like a super intelligence. It'll definitely be something that is very general purpose, can do plausibly a lot of the jobs, if not all the jobs. And I guess that does bring me back still to this, like, Claude question of, if I gave you that and said, okay, now start to wrap your Guaranteed Safe AI system around it, how do you begin to do that? I feel like the Claude question comes back, where I'm like, if the thing is in some ways more ethical than the average person, or, like, more ethical than we can easily encode, how do you think about that?
Ben Goldhaber (64:35) Yeah, one note here: I would be really excited for more experiments, for actually trying to apply this framework to toy examples or existing examples today. I would like to try the version where we put Claude in the box and see what it looks like when it generates plans that are verified in some way against a world model and safety specs. That seems very interesting. I'm not sure that the most powerful AI systems are really going to look like the chatbot version of Claude today. I also expect that the ones designed after, let's say, in this hypothetical, Claude AGI is declared, are going to look more powerful and potentially alien. Claude is not going to be as good as AlphaZero is. We're going to see many more architectures, I think, that are going to look different from that, and they will probably end up being designed with the kind of AI tools of today and tomorrow. I'm already using Claude for a lot of random scripts and some coding and things like that. I would expect that it will end up being a factor in designing the future powerful ones, which maybe should be trained with constitutional approaches, but which for different domains are probably going to face different economic incentives and competitive pressures, or maybe the good friend approach isn't gonna work well for other areas. So I'm kind of enjoying thinking about what that will actually look like. But I'd expect to see something that looks more like we are building, in concert with AI tools, world models for some specific domain. Maybe we're asking it to help code and help us create a world model for, again, we keep using this power infrastructure example, maybe I should use a different one, I don't know, like cybersecurity. And then even if it is more moral than the average person, it's more about trying to get many different beliefs to cover the full gamut of safety specifications, such that when we run this protocol, we're able to get a quantifiable risk estimate that the plan it's generated is going to do X, Y, and Z, and that's something that we collectively endorse. So it's both a system and a protocol for doing this. And I think this is better than just letting Claude do it, because it is going to be able to give far more of the estimates and keep humanity in the loop for what we do or do not endorse. I challenge the notion that even if Claude is more ethical than the average person, that's the right kind of model for giving decision making capability to Claude. I think it's more about pluralism. People will disagree, again, on it. For me, it's something more about pluralism, or also having many, many sources of information, in order to construct good safety specifications that reflect many points of view. This is a framework that can incorporate that. I would rather have Claude than maybe some person from the telephone book. But what I'd much rather have, as we start moving towards this regime of super intelligent systems, is something that incorporates many more points of view, because we're more likely to get the safe answer that way.
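As a concrete, if toy, rendering of the "Claude in the box" protocol Ben describes, here is a minimal sketch of the three-part GSAI loop: an untrusted planner proposes a plan, a world model estimates the probability of an unsafe outcome, and a verifier only releases plans whose estimated risk stays under the safety spec's threshold. The domain, names, numbers, and toy Monte Carlo world model are all hypothetical, not from the paper.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the GSAI loop: proposer -> world model -> verifier.
# The toy domain, names, and threshold are placeholders for illustration only.

@dataclass
class SafetySpec:
    max_failure_probability: float  # acceptable risk per executed plan

def propose_plan(task: str) -> dict:
    """Stand-in for a powerful but untrusted planner (e.g. an LLM agent)."""
    return {"task": task, "actions": ["reroute_load", "spin_up_reserve"]}

def world_model_risk(plan: dict, n_samples: int = 10_000) -> float:
    """Stand-in world model: Monte Carlo estimate of P(unsafe outcome | plan).

    A real GSAI world model would be an auditable, uncertainty-aware model,
    not this toy simulation.
    """
    rng = random.Random(0)
    unsafe = sum(rng.random() < 0.02 for _ in range(n_samples))  # toy dynamics
    return unsafe / n_samples

def verifier(plan: dict, spec: SafetySpec) -> bool:
    """Only pass plans whose estimated risk meets the quantitative spec."""
    return world_model_risk(plan) <= spec.max_failure_probability

if __name__ == "__main__":
    spec = SafetySpec(max_failure_probability=0.05)
    plan = propose_plan("balance the grid for the next hour")
    if verifier(plan, spec):
        print("plan accepted:", plan["actions"])
    else:
        print("plan rejected; request a new proposal or fall back to a safe default")
```

The design point is that the planner never acts directly on the world; only plans that clear the verifier, relative to an explicit spec and world model, are released for execution.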
Nora Ammann (67:57) Yeah. And in a way, on the system becoming more ethical, to be honest, I don't quite know what that really means. But actually, a lot of the risk scenarios I'm more worried about are around goal misgeneralization and other misgeneralization behavior. And I don't really see those methods addressing that in a way that gives me justified confidence. They might address it, but I don't think I have theoretically well founded reasons to expect that. And then, in your scenario where in a few years we might have meaningfully general AI, maybe not super intelligent AI level somehow, what do we do then? How does GSAI come in here? I think one way it comes in, or I hope it will come in, is that it has the potential to offer low hanging fruit we can reap along the way. Ben has mentioned this earlier already, but this idea of, can we decrease societal vulnerability? Can we decrease attack vectors? So, for example, reimplementing code so it's verifiably bug free, right? The entire cyber attack vector being gone. I think this is in fact within the realm of what we can get to with consolidated R and D efforts within a few years, if we really try, using this GSAI framework. For the code we reimplement, the safety specs fall out quite naturally in the context of coding, which makes it a bit more tractable. And then we teach these really powerful AI systems to generate code that, according to the safety specs, is bug free. I think that's an example. And we mentioned biodefense already, right? If we had really powerful pathogen screening methods for bio risks. So I think there's some story here about how we as a collective can build up some sort of societal resilience, such that for AI systems that aren't yet superhuman, but are really powerful, there's less vulnerability to those having accidents, being misused, or leading to loss of control scenarios, to some extent. Another thing I want to mention in this context, and we say this in the paper too, and it's something that has come up a lot when we're talking with all the authors: we ended up calling this an anytime portfolio approach. The idea being, look, we're not saying this is the only method people should work on and the only method we should try. What we're saying is, let's collectively develop a portfolio of AI safety approaches that are the best we can get at different time intervals. And if shit hits the fan tomorrow or in a year, throw all the evals we have at this stuff. Great. If shit hits the fan in, like, 3 or 4 years or 5 years, let's also try more ambitious things, or let's try to be ready, to the extent we can, to have higher safety guarantees than evals. And I think GSAI should also be seen as wanting to be part of that anytime portfolio. Even further down the line, it's not about either this approach or that. It's defense in depth, right? Do GSAI and then layer some more evals on top of that afterwards. Great, definitely, let's do that. And maybe the last thing I want to say here is that it's just personally very striking to me how it feels very commonsensical that we have very high safety expectations when some company says, we're going to run the power grid, or we're going to run this nuclear plant. It's very obvious to us.
So, tell us why you're claiming this is safe and why it hits the safety thresholds. What is your case for this? And I think we should just develop a similar expectation for AI systems. I think that is an important additional element in the more socio political dimension of how we are going to govern AI systems. We should have higher expectations than just, look, we ran a handful of tests and nothing bad happened. We don't know what the deployment distribution is going to be like, but I think we should just go for more. Max Tegmark has talked about this, I've seen him talk about it, and I found it important to be said, right? Currently, we talk a lot about race dynamics, right? AI labs are going to deploy a really powerful system soon, and we just have to do the best we can by that time. Everyone who tries to get the best safety measures up by that time, I'm grateful for. At the same time, I wonder whether collectively we should just aim to flip the script and be like, hey, develop and deploy these systems if you can make a case for meeting these safety standards. I love all the potential this stuff could have, but I think it should just be commonsensical to be like, if you can convincingly show that it meets some safety thresholds, let's go for it. But the onus is on the developers and deployers to do the work to show that. And yeah, I think there's a case for that to just be common knowledge. And here is one possible way we could actually operationalize how to get to safety standards. There might be others. I would love to see other proposals, but I think that's, at a high level, how I think about this.
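Picking up Nora's point above that, for code, the safety specs "fall out quite naturally": here is a minimal sketch of what such a spec can look like as a machine-checkable contract that AI-generated code must satisfy before it is accepted. The function, spec, and exhaustive check are hypothetical stand-ins; a full GSAI treatment would use formal verification (for example, proofs discharged by an SMT solver or proof assistant) rather than testing over a small finite domain.

```python
# Hypothetical illustration of a machine-checkable spec for AI-generated code.
# Exhaustive checking over a small domain is the simplest stand-in for the
# formal-verification version of the same idea.

def spec_clamp(x: int, lo: int, hi: int, out: int) -> bool:
    """Specification: output lies in [lo, hi], and equals x whenever x already does."""
    return lo <= out <= hi and (out == x if lo <= x <= hi else True)

def generated_clamp(x: int, lo: int, hi: int) -> int:
    """Pretend this body was produced by an AI code generator."""
    return max(lo, min(hi, x))

def verify(domain=range(-50, 51), lo=-10, hi=10) -> bool:
    """Check the spec over an exhaustively enumerable domain."""
    return all(spec_clamp(x, lo, hi, generated_clamp(x, lo, hi)) for x in domain)

if __name__ == "__main__":
    print("spec satisfied on test domain:", verify())
```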
Nathan Labenz (72:55) When Ben was talking about pluralism, it kinda brought to mind the state as sort of the other superhuman actor in the world, and the difference between having a king, for example, versus having, like, institutions and checks and balances. Implicitly, and I wonder if you would agree with this, the best ever king in the history of the world is probably outperforming American democracy right now. But at the same time, there's just way too much variance, and it seems irresponsible in some way. Like, that king may go crazy, may get senile. And so you wanna have a more structured approach. But would you agree that there is, like, some, if somebody were to say, you're leaving some upside on the table in some scenarios, would you
Nora Ammann (73:46) No. No.
Nathan Labenz (73:46) Bite that bullet?
Ben Goldhaber (73:47) I try to dodge most bullets when I can. I would need to think a little bit more on the question of whether I agree that the best king, the best dictator I could imagine, would be better than a democratic system. The way I often think about it is in terms of information flows in systems where you have a dictator or a king. It's not just that the person might go crazy or might make immoral, selfish decisions, though that is a real concern. It's also, how are they receiving information from this massive, complex world in order to make good decisions? It's just really hard, from an information economy kind of perspective, for a singleton, a single actor, to make good decisions that reflect the preferences of everyone. And my guess is that democracy and markets do a much better job of this. I'm not saying that Anthropic in any way considers their approach to RLHF, their constitutional approach, to be better; I'm not saying they're saying this, but I am curious how they've done it. I know that they've worked with groups like CIP to get a pluralism of inputs to shape the beliefs of Claude and what it says. But I think when we're talking about investing so much power in an AI system, we need to be really deliberate and reflective in thinking about how we're structuring the process by which it will get permission, so to speak, to change massive parts of the world. I don't want to dodge the bullet here, though. Maybe I'm going to do a little bit of a dodge, where I'm like, yeah, I think there might be some upside left on the table. And also, I would bet that over time, a system like this will get more of the upside. There will be a kind of evolution process for these things when you bring in many different voices, especially with, probably, AI powered assistants in various ways, to evolve these things to far better maximize upside and reduce risk.
Nora Ammann (75:59) Yeah. There's a sense in which I'm like, cool, yeah, let's specify the things we really don't wanna happen, the things we think would be really bad. Let's have some quantitative guarantees around those things not happening. And then let's have a powerful optimizer be like, given those constraints, what's the best way we can balance this power grid or do this other job I was asked to do? So that's level one, and then level two is like, cool, but these specs, initially, are maybe not going to be the most narrow carving of only protecting the core thing you care about. You might have to over approximate a bit to make sure we actually capture it, because we're uncertain. So we're maybe leaving a bit of room for optimization on the table there. I think that's right. And then I think what Ben said comes in again: cool, but we can get better at that over time. Again, we can use AI tooling to get better at that. So yeah, I think there's some trade off there, but also room to get better at making that trade off. And I think we can only get better at making the trade off if we don't die along the way. So it's a trade I'd be willing to make.
Nathan Labenz (77:02) It seems like you're envisioning a pretty large scale, at least for, like, an AGI like system, you're envisioning a large scale deliberative process and maybe a large scale scientific process, the deliberative one being for the safety specification and the sort of big science project being for the world model. Do we have any mechanisms for that? I'm a big fan of Switzerland, home of great, mostly offline, participatory democracy. In Taiwan, they also have this online, next generation version, where they're using systems to identify where people agree and try to map out the space of policy possibility. I think Audrey Tang has called it the anti Facebook, where instead of amplifying disagreement, they're trying to amplify agreement and help people gradually coalesce on what should be done. But is that the sort of thing that you imagine humanity needs to undertake over the next few years? Because the safety specification will have to be large. Right? If it's all explicit, it's gonna be like a big file. So how do we create that?
Nora Ammann (78:05) Yeah. So I think there are two bits. I'm gonna quickly touch on the first, and I think Ben also has interesting things to say, about the second in particular. On the first one, what would the scientific coordination of all of this look like? To some extent, that remains to be seen, but I do want to flag, for example, that we sometimes talk about the International Space Station as maybe an interesting case study, where a bunch of nations that maybe don't have the best relationships otherwise have come together and figured out a way of having a lot of independent R and D work being done, while also figuring out, okay, what are the bits of knowledge here that we have to share with each other in order to assure safety and interoperability? And this is actually a significant example of international cooperation, because this is the sort of knowledge that has potentially military purposes as well, so these actors aren't by default interested in sharing that knowledge. Maybe there could be different R and D centers that do share information as it comes to safety and interoperability, to some extent, on the very technical side. The second aspect you asked about is the sociotechnical side, right? How do we make sure we can interlock this entire design with deliberative processes? As a start, we do have some deliberative processes, democratic processes, that we can use to some extent. And I think Ben probably resonates with this: there's so much space here to use AI tooling to scale that up and make it much more robust and much more scalable. I'm really excited to see things like the experimentation in Taiwan. I think there's so much scope to see more experimentation. There is work needed especially to scale up our ability to give input, and I think technology could have a really promising role here in helping with that, while being cognizant of what I think are really important human elements to deliberation, right? Democracy is really not just about aggregating some imaginary fixed set of preferences. Human preferences are complex and they develop. As I'm talking to my neighbors, I actually understand more about why they care about what they care about, and that might change how I would vote, etc. So the human element is big. And I think we really should make sure we don't abstract that away, and instead think with more nuance about how technology can still help amplify these processes. I hope a lot more work will happen in this direction.
Ben Goldhaber (80:22) Agree with all of that, especially that last point on democracy and deliberation involving communication, back and forth discussion. All of these engineering practices where, in some sense, we're trying to encode societal preferences need that kind of deliberation and back and forth. The only thing I'd add is that to do it at full scale will require mass coordination, mass marshaling of resources, and eliciting people's preferences across various domains. I think we can start small. I'm excited about the near term application of this framework to things like securing hardware, which won't require exactly the same kind of marshaling, but could be a good test bed for using it in wider domains. And if you think we might get AGI in 2027, or if you think that, just in various ways, the world is probably going to get weirder from these kinds of advances, I think we should also expect there will be much more of an appetite for mass mobilization, for applying these kinds of big engineering efforts. A critique I've heard about Guaranteed Safe AI, which I also agree with, is about scalability and tractability. It's, wow, you're trying to do this even in one domain, that's a massive effort. And I think that is very true. Maybe where I'm more optimistic, where other people are more skeptical, is my belief that there's going to be more of an appetite for these kinds of large scale efforts in the not too distant future, and that the advances in technology we're going to see before we get to maybe super intelligent AGI will make it much more possible to do this.
Nora Ammann (82:07) And relatedly, I think we haven't mentioned this so far, but we say it in the paper as well: a potential upside of the GSAI approach is that the cost is maybe much more easily amortized than with other safety methods. The idea is, in simple terms, once you build a world model for a specific context, once you build the safety specs for a specific context, you've got that, and you can reuse it, build on it, and improve it over time. We can figure out version control for all of that. Whereas if you train a new model, you have to redo all the evals on it and readjust them to fit the model, etc. So even if it's a big effort, I think the amortization of costs will hopefully at some point really come to shine. And, as Ben flagged, there's some story here of there being a lot we can gain along the way. Again, imagine we figured out cybersecurity, basically. That seems like a massive gain and seems much closer to what is within reach. So I think that's valuable.
Nathan Labenz (83:05) The software platform that I was thinking of, by the way, is Polis, which is online at pol.is. And to give credit where it's due, and I don't mean to suggest that this is gonna be the solve, or that the investments are in appropriate relative proportion, but OpenAI does have an interesting project around this, Democratic Inputs to AI, where, I believe, a grant was given to somebody who was elaborating the initial Polis software, which was basically pre generative AI, to include a generative AI component, to try to help facilitate conversations or identify more holistically when people are actually agreeing, because, of course, in a pre generative AI world, that was not easy to do. So I do see at least some work moving in that direction from them. I wanted to ask about technologies that you see as differentially interesting for this kind of work. I'm thinking, what would be the basis ultimately for the world model? If I had to give you a one word answer, I might say, like, Mathematica, but I'm not sure if that's right or wrong or if you have better ideas. I also think, notably with Tegmark being one of the authors of this paper, that also naturally brings to mind their recent work on KANs, which are this sort of new generation of still end to end trained neural networks, but reformulated in a way that's much more composable and much more interpretable.
Ben Goldhaber (84:40) Mhmm.
Nathan Labenz (84:40) And I wonder, what do you think about those, and what else comes to mind as the sorts of things that people should be inventing or developing if they want to contribute to a more explicit and understandable world model?
Nora Ammann (84:53) The answer is in some sense complex, because there are many parallel approaches that are probably worthwhile, and talking about them in adequate detail would go beyond what we can do here. But just touching on some of the examples: some work that Davidad is leading is basically building up an entire machinery of, on one hand, mathematical semantics, or sort of meta semantics, to express world models in, having all the properties that we'll eventually need: compositionality, being able to account for different types of uncertainty. And then figuring out the human AI tooling, right? What are the AI tools we should actually build, with good human user interfaces, that we can give to domain experts so they can fill in their domain specific models using that semantics? Or, how would we ideally version control this model well? There's some computer science we need around that. So that's one effort. I know Yoshua Bengio is interested in finding ways of making Bayesian inference more tractable using ML methods. He calls this a cautious scientist AI: doing Bayesian reasoning about, given all the data we know and all the scientific theories we have, what's the distribution of world models that we can trust? And again, using ML to accelerate, or tractably approximate, these processes. I think that's another approach. These would be two I would definitely want to highlight.
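One way to picture the "cautious scientist" style of Bayesian reasoning over world models that Nora describes: keep a posterior over several candidate models and report a risk estimate that is conservative with respect to that posterior, rather than trusting the single best-fit model. This is a hand-rolled sketch with made-up weights and probabilities, not anything taken from Bengio's actual work.

```python
# Hypothetical sketch: Bayesian caution over a handful of candidate world models.
# Each model assigns a probability to an unsafe outcome under some proposed action;
# the posterior weights and probabilities below are made up for illustration.

candidate_models = [
    # (posterior weight, P(unsafe outcome | action, model))
    (0.60, 0.001),   # best-fit model
    (0.30, 0.010),   # plausible alternative
    (0.10, 0.080),   # pessimistic but not ruled out by the data
]

def expected_risk(models):
    """Posterior-averaged risk (Bayesian model averaging)."""
    return sum(w * p for w, p in models)

def cautious_risk(models, credibility=0.05):
    """Conservative risk: worst case among models whose weight exceeds a floor.

    A 'cautious scientist' does not bet everything on the single best-fit model.
    """
    return max(p for w, p in models if w >= credibility)

if __name__ == "__main__":
    print(f"expected risk : {expected_risk(candidate_models):.4f}")
    print(f"cautious risk : {cautious_risk(candidate_models):.4f}")
```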
Ben Goldhaber (86:16) Thinking of a few of the other AI safety agendas that we put under Guaranteed Safe AI here, like Christian Szegedy's (it's shared in common with a number of other authors, but I think of it often with him, and maybe Steve Omohundro or Max Tegmark as well): the formalization of math. There seem to be really interesting advances here in turning mathematical statements into formally verifiable ones through Lean or other software. I think that's really exciting for being able to build up a world model from mathematical principles, and it will probably generate a lot of scientific advances along the way, which seems important to me for being able to showcase the benefits of this kind of approach as well. So that's a technology that, if people are interested in researching it and getting more involved, I suspect is pretty promising. Maybe one other one I'll note is, in general, more experiments in this area and applying alternative architectures. This is an alternative architecture to the current one of: train a large model, do Q and A on it, do evaluations on it, and then deploy it. Trying to apply this architecture, seeing the areas in which it is cumbersome and breaks down, and figuring out the tooling that could actually make it useful for frontier labs seems really exciting, more as a project, though maybe there isn't a sharp line between research and a project; it's a bit of a spectrum. Both of those seem really great to me.
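For readers who haven't worked with proof assistants, here is roughly what formally verifiable mathematical statements look like in Lean 4; the kernel mechanically checks each proof term, which is the kind of machine-checked rigor the formalization-of-math agenda builds on. These two toy theorems are standard illustrations, not drawn from the paper.

```lean
-- A trivial arithmetic fact: the proof term `rfl` is checked by Lean's kernel.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Reusing a library lemma: commutativity of natural-number addition.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```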
Nathan Labenz (87:38) How do you see this interacting with an interpretability based approach? I mean, you could imagine that could be almost a totally distinct system that just looks for deception or harmful intent or whatever in the weights and is disjoint from the process of verification and world modeling. Or you could maybe imagine starting to bring some of these verification techniques to the interpretability paradigm. Probably a little early for that, but that might be the dream.
Nora Ammann (88:07) So I think there are different use cases. We can have an entire GSAI setup and then, on top of that, still do evals and interpretability and anomaly detection, etcetera. Great. We should probably do that if we're working with really powerful systems. In terms of whether interpretability tools can be useful within the GSAI framework: if you manage to have a fully human auditable formal world model, you don't need this. But if you have approaches that do partial mathematical modeling and partially fill in the blanks with learning from data, then depending on how much filling in the blanks you're doing, your safety guarantees get weaker. So you might at some point be able to complement this and get some confidence back by using interpretability methods, getting more of a sense for: are the assumptions I'm making here justified? Are there any anomalies coming up in the system that I can detect, etcetera? So that's a place I'm tracking where interpretability methods can complement this. I'm not tracking this as closely as you, Ben, but I think Jason has recently produced some work going the other way around, right? Using formal methods to get more calibrated, maybe better justified interpretability results?
Ben Goldhaber (89:13) Yeah, I can speak a little bit to it. Jason Gross works at FAR Labs, which is our office and our effort to support the broader ecosystem of AI safety, though he's not formally part of FAR AI. He and a few others had a very interesting paper out recently. I think of it as: interpretability is in some way about being able to make a more compact statement about what an AI model is going to do, and we can generate proofs about this that are formally verifiable, so that we can get guarantees about the behavior of the AI system. You can do interpretability, create a proof, and then do verification on it to be able to give certain bounds on what this policy is going to do. Now, I think there are a bunch of unknowns here, like, all right, can you do it on more interesting parts? We call this out specifically in the paper: we're excited about interpretability of AI systems as well. And if it is more tractable to deeply understand the implicit world model of an AI system, as opposed to constructing another world model and doing the GSAI approach, maybe that would be better, doing this kind of interpretability approach, generating heuristics that you can verify and get bounds from. If you can do that on more meaningful questions about the policy, maybe that's another way that you can start to get quantifiable safety estimates.
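As a flavor of what "create a proof, then verify it to get bounds on what the policy will do" can mean in the very simplest case, here is a generic interval-arithmetic sketch (a standard technique, not Jason Gross's method): for a tiny linear policy, you can propagate an input box through the weights and obtain a sound bound on every possible output. The weights and ranges below are made up.

```python
# Generic sketch (not the cited work): bounding the outputs of a tiny linear
# policy y = W @ x + b over a box of inputs, using interval arithmetic.
# The bound is sound: every input in the box yields an output inside the ranges.

def interval_bounds(W, b, x_lo, x_hi):
    """Return (y_lo, y_hi) such that y_lo[i] <= (W @ x + b)[i] <= y_hi[i]
    for every x with x_lo <= x <= x_hi elementwise."""
    y_lo, y_hi = [], []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, xl, xh in zip(row, x_lo, x_hi):
            lo += w * xl if w >= 0 else w * xh  # smallest possible contribution
            hi += w * xh if w >= 0 else w * xl  # largest possible contribution
        y_lo.append(lo)
        y_hi.append(hi)
    return y_lo, y_hi

if __name__ == "__main__":
    W = [[0.5, -1.0], [2.0, 0.25]]  # made-up policy weights
    b = [0.1, -0.2]
    y_lo, y_hi = interval_bounds(W, b, x_lo=[-1.0, -1.0], x_hi=[1.0, 1.0])
    print("guaranteed output ranges:", list(zip(y_lo, y_hi)))
```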
Nathan Labenz (90:37) Did we get to all the aspects that you wanted to cover on this, or was there anything that we have neglected so far?
Ben Goldhaber (90:47) One thing I'd say I'm excited about with this: there are a lot of potential AI risk scenarios, and one that I feel is gaining in weight for me is the kind of risk that comes from competitive dynamics, from people racing with not necessarily badly misaligned AIs, but racing and competing. Either there's misuse from those, or it ends up spurring a dynamic where these AI systems are further removed from what we would have collectively wanted them to do. And I think Guaranteed Safe AI is a potentially really good approach for handling low trust, multi stakeholder dynamics, because you can have monitoring and enforcement, because you can make quantifiable guarantees where everybody can see how they were created. This could help move us from a dynamic of prisoner's dilemma, of people racing, towards one of cooperation around making the world models really good, and then safety specifications that reflect many people's preferences, advancing the Pareto frontier, just making things better for everyone. This is a way to do that in a trustworthy way. And I think that's potentially really beneficial.
Nora Ammann (92:14) Yeah, plus one on that. And I'm just gonna dumbify it a little bit, in case that's useful. Sometimes the picture I have is something like: okay, look, world one, a big lab says, I just built AGI, but don't worry, I also figured out how to do the safety thing, so we're all good, just trust me, I'm gonna go deploy now. In terms of multi stakeholder coordination, this is actually a really bad situation, or at least a tricky situation. Because if I'm on the receiving end of this, I'm like, cool, did they build the AGI? I don't know. Did they figure out safety? I don't know. Maybe they genuinely think they did, but did they? I don't know. Thirdly, when they figured out safety, did they consider me in their future plan at all? And with all these three uncertainties, should I now just lean back and hope for things to go well? It might not pan out this way. Whereas imagine a different scenario where we have a human auditable world model, human auditable safety specs, and verifiers that are simple enough mathematically that we know how the verifier works. Everyone can look at these elements and be like, yep, I can see what they're doing, I can confirm that's what they're doing, and in composition, here are the guarantees this delivers. Now everyone knows what's in the safety specs, and everyone knows where we're getting the guarantees from. I think this makes the situation much more amenable to multiple stakeholders, including with low trust, including with diverging interests to some extent, being able to say, cool, actually, this is a way we can coordinate on going ahead. So yeah, plus one on that.
Nathan Labenz (93:51) Does this also suggest that this is sort of a framework for governmental regulation? In today's world, we have, of course, safety standards for lots of things. Even for cars that are not self driving, you have requirements as to how they have to perform in a crash of a certain kind or another, and what the standard is for the airbag to go off. There are a lot of little nitty gritty rules there. Is this a path to creating a standard that allows people to innovate freely within the box? Or, you know, is there maybe also something where industry could be like, in a way, this is good, because we get to keep more of our trade secrets? Like, you're less in our sort of proprietary methods business, but we instead have this more objective external standard that everybody can look at, and as long as we live up to that, we're good. This also shields them: I'm no consumer product liability lawyer, but it seems like a huge trade in general is that if you live up to the rules, then you're relatively protected from liability when things do go wrong. So do you see this as, like, part of the motivation here, to make something that could be a happy agreement between private developers and the public at large?
Nora Ammann (95:03) I think so. Or at least there's hope for that. Basically, another way of saying what you just said is: cool, the information that you have to share in this scenario is the things relevant to making the case that you're meeting those safety standards, and the things relevant to interoperability. You're focusing in your case maybe more on private sector, individual innovation. The other use case for this, which I think is also relevant, is international, right? The standardization of various telecommunications, etcetera, really undergirds a lot of global trade and what's happening these days. So I think that was a big success scenario throughout the twentieth century and up to today, doing this sort of standardization, assuring interoperability. Like now, think of your car example: there are relatively minimal standards where you're allowed to also drive on, say, Italy's highways if your car passed the safety checks. There's some checking there, but it doesn't keep you from innovating on fancy shapes for your cars and fancy colors for your cars, you know.
Ben Goldhaber (96:08) Agreed. And I think that is a good way to think of pitching it to industry, which, at least under a number of scenarios, is a really important way that this change happens. There are, like, two cases right now that I feel like people are aiming towards. One is just, yeah, we create massive AI models and let them do their thing. And the second is we shut it all down entirely. Both have reasons behind them. But it strikes me that the more plausible way I see us moving forward is trying to quantify risk, trying to do the common sense thing of figuring out how risky this new technology is in a specific area, setting that somewhat collectively using good, informed judgment, and then letting people try and find solutions that minimize risk and maximize benefits in a narrowly tailored way. That does seem like the kind of thing that would benefit industries, so that they can innovate here and create gigantic economic benefits while, at the same time, actually having a conception of risk and safety.
Nora Ammann (97:14) I think there's also a certain pragmatic view from business here. I feel like a lot of businesses these days are facing the conundrum that the entire world is screaming at them, figure out how to use AI in your business. And at the same time, it's actually really hard to get a lot of the AI we currently have available to have the levels of robustness that you just need in your business. And if it doesn't have that, it either doesn't accelerate your productivity that much, because you still have to double check everything, or it's a complete no go from the start, because your clients want that sort of robustness, because, again, safety critical scenarios, etcetera. So I feel like, actually, on the ground in the business world, a lot of people are currently grappling with: okay, cool, let's integrate AI, but how do we actually do it? And I think GSAI type approaches have a lot to offer here. I think that also sheds light on how much upside there is in here, economically and in terms of health and well-being, especially because these approaches come with high assurance safety guarantees. So I think there's a tailwind here, which makes me excited.
Ben Goldhaber (98:21) And one note: this is obviously a multi author paper, so it brought together a great group of authors and researchers who have their own individual AI safety agendas, and we all worked together to find the commonalities and similarities. I particularly want to note Joar Skalse, who's one of the lead drafters on this project, along with Davidad, David Dalrymple, the other person we think of as a lead drafter on this. They both put in a tremendous amount of work, along with the other coauthors, to make the paper what it is today. So it really was a joint collective effort.
Nathan Labenz (98:58) That might be a great note to end on, as we take some early steps from what's often been described as a pre paradigmatic field into maybe the beginning of a proper paradigm. And so, great job by you guys, helping to bring people from so many different institutions, backgrounds, and perspectives together. I count 13 institutions listed on the paper, and I hope that this is something that people take very seriously. I think we do need all these approaches. I'm invested in the evals, of course, but, as you say, we definitely need a lot more than that where we're going. Nora Ammann and Ben Goldhaber, thank you for being part of the Cognitive Revolution.
Nora Ammann (99:38) Thank you so much, Nathan.
Ben Goldhaber (99:40) Thank you, Nathan.
Nathan Labenz (99:40) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.