Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
UK AI Security Institute Chief Scientist Geoffrey Irving discusses fragile theoretical understanding of ML, AISI’s frontier model evaluations and red teaming, and why current safety techniques may fall short, plus efforts to build stronger guarantees for AI safety.
Watch Episode Here
Listen to Episode Here
Show Notes
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, explains why our theoretical understanding of machine learning remains fragile even as models surpass experts on critical security tasks. He details AISI’s work on frontier model evaluations, red teaming, and threat modeling across biosecurity, cybersecurity, and loss-of-control risks. The conversation explores reward hacking, eval awareness, and why current safety techniques may struggle to deliver high reliability. Listeners will also hear how AISI is funding foundational research to build stronger guarantees for AI safety.
Use the Granola Recipe Nathan relies on to identify blind spots across conversations, AI research, and decisions: https://bit.ly/granolablindspot
Sponsors:
Serval:
Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
Tasklet:
Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
CHAPTERS:
(00:00) About the Episode
(04:09) From physics to ML
(08:52) AGI uncertainty and threats (Part 1)
(18:08) Sponsors: Serval | Claude
(21:29) AGI uncertainty and threats (Part 2)
(27:35) Control, autonomy, alignment (Part 1)
(34:02) Sponsor: Tasklet
(35:14) Control, autonomy, alignment (Part 2)
(38:44) Inside the UK AISI
(51:02) Evaluations and jailbreaking
(01:01:17) Emerging capabilities and misuse
(01:14:20) Agents and reward hacking
(01:26:09) Theoretical alignment agenda
(01:38:39) Debate and formal methods
(01:51:19) Limits of formalization
(02:02:27) Future risks and governance
(02:16:23) Episode Outro
(02:18:58) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.
Introduction
Hello, and welcome back to the Cognitive Revolution!
The Cognitive Revolution is brought to you in part by Granola. If you're a regular listener, you've heard me describe the "Blind Spot Finder" Recipe I'm using to look back at recent calls and help me identify angles & issues I might be neglecting, but it's also worth talking about how Granola can help raise your team's level of execution by supporting follow-through on a day-to-day basis. This week, for example, I had several working sessions with teammates, and I committed to a number of things. In the past, there's a good chance I'd have forgotten at least a couple of the things I said I'd do, but with Granola, I can easily run a TODO finder Recipe and get a comprehensive list of everything I owe my teammates. This is the sort of bread & butter use case that has driven Granola's growth and inspired investment from execution-obsessed CEOs including past guests Guillermo Rauch of Vercel and Amjad Masad of Replit. See the link in our show notes to try my blind-spot finder Recipe and explore all the ways that Granola can make your raw meeting notes awesome.
Now, today my guest is Geoffrey Irving, a pioneering machine learning researcher who's co-authored seminal papers with a who's who of giants in the field, and who is now Chief Scientist at the UK AI Security Institute, which is, in all likelihood, the most situationally aware government entity in the world today.
With roughly 100 technical experts on staff, and a mandate that includes:
- threat modeling,
- pre-release frontier model evaluation for dangerous capabilities spanning biosecurity, cybersecurity, and loss of control,
- advising the government on strategies to reduce catastrophic risk,
- funding independent frontier research,
- and engaging in global diplomacy…
Geoffrey has one of the broadest portfolios and most commanding views of the AI landscape.
And while he's optimistic about our ability, in the fullness of time, to solve the major open problems in AI safety, for today, without a hint of hype, he paints a genuinely alarming picture.
Our theoretical understanding of machine learning is nascent. Nobody, he argues, should be particularly confident in their mental models of how AI will go.
Models already outperform a majority of experts on a great many security-related tasks, and there's no good reason to expect their progress to stall.
RL is working well beyond strictly verifiable tasks, and jaggedness matters less when even the models' weak spots are as good as or better than the best humans.
The many increasingly sophisticated bad behaviors we've seen over the last 18 months are broadly all different versions of reward hacking, a problem for which we lack theoretical or practical solutions.
We likely won't get many 9s of reliability from current safety techniques, and there's some reason to expect they could all fail at the same time, for the same basic reasons.
It is getting harder to jailbreak models, but the AISI Red Team has never failed to do so. Eval awareness is an open and growing problem.
Voluntary cooperation between frontier model developers and the AISI is working well, but not everyone is participating.
The AISI is seeking to fund theoretical research in areas like information theory, complexity theory, and game theory that might produce stronger guarantees, but these fields, like most of the rest of the world, are just beginning to take AI seriously at all.
Geoffrey is an intellectual powerhouse, but I came away from this conversation just as impressed with the UK AISI as a whole. This is an organization staffed with top-notch talent that has its finger on the pulse of industry developments, and is speaking very accurately and clearly about AI's trajectory and how many major questions remain unanswered, even as frontier model company CEOs tell us that they are less than 3 years from creating expert-level AI machine learning researchers.
With that, I hope you come away focused and motivated by this conversation about the AI state of play, with Geoffrey Irving, Chief Scientist at the UK AI Security Institute.
Main Episode
Nathan Labenz: Geoffrey Irving, Chief Scientist at the UK AI Security Institute, welcome to The Cognitive Revolution.
Geoffrey Irving: Thank you. I'm excited to be here.

Nathan Labenz: I'm excited for the conversation. We've exchanged messages for a while and have been building up to this, and I'm excited the moment is finally here. You have really a storied publication history: it goes back to working on the original TensorFlow papers with some guy named Jeff Dean, being a co-author on the original RLHF paper, and working on AI safety concepts years ago — you're going to offer a caveat, but still — right there alongside Paul Christiano on some early AI safety papers, with no less than Dario on concepts of using debate to try to bootstrap into stable equilibria and stable AI safety regimes. You even published a call for social scientists to enter the field of AI safety with one Amanda Askell. So I would be very interested to hear how it was that you came to have such a good nose for where AI was going so early on — all of these things were well before ChatGPT.
Geoffrey Irving: Yeah, so I used to think I was new to ML, but I said that for too long and now I'm not new to ML. I think I got out of undergrad with a bias against statistics: I'd only seen frequentist statistics, I thought it was kind of weird-looking, I'd never seen Bayesian statistics, so I just didn't like any of that stuff. What I liked instead was things that have kind of hard theory — you know the equations, you have some ground truth. That was computational physics and mathematics, and on the computer science side, programming languages and theorem proving and such. I did mostly computational physics and geometry for grad school, and then for some years after that, until around 2013. And then I kind of realized two things. One, that machine learning was getting quite good — the neural nets were starting to work, and they were getting better and better, and that was probably going to continue. And two, even in the areas where I thought it was about knowing precise theory, so physics or theorem proving, you need common sense, and you weren't going to get away with just the theory; that was not going to be enough. If you're doing mathematics or programming languages, you need some ingredient of heuristically picking between the various options, and you wouldn't be able to do a good job designing human-usable, friendly systems without basically machine learning. So then I was like, OK, I should switch into machine learning — I was doing something else back then. The first thing I tried to do was autocorrect for code, in 2014, which was too early to do autocorrect for code. It did not work. And also we didn't know how to do machine learning — this was myself and Martin Wicke, and we knew computer science, physics, geometry, but not really ML. So we tried to do a startup for a year; it didn't work.
And then we said, well, how do we learn? We learn by joining Google Brain, at the time — and I've done ML jobs since then. So I joined Google Brain in 2015, and the goal for me was basically machine learning for theorem proving. I was aware of safety at the time, but I didn't see an attack on the problem that I thought was good, and so I thought I'd work on some other kind of different problem, which was just sort of hardening the world using verification — which, again, was going to be using machine learning to do theorem proving in practice. That was sort of 2015, 2016. And then I guess — that was like, oh, I had some useful thoughts early on — but there are two other parts of the story, two bits of kind of inherited wisdom, which is why it looks like I predicted things early. One is just that when I joined OpenAI, Dario and Paul were there, and they had a bunch of cached thoughts about safety and how the machine learning field was going to develop, and so I was just sort of riding along from there. But then, more broadly, there's just a bunch of intuition coming out of theoretical computer science and complexity theory about how computations work — how we check computations that someone with more resources than you can run. A lot of what I've done since then, including debate, for example, is just applying that intuition, assuming it will hold in some modified form in the machine learning world, even if it comes from some area of more precision. So again, you can just assume things are going to look like the theory in some way, with a bunch of porting required, and you can predict a bunch — but not exactly how long it will take, or when things will happen.
Nathan Labenz: We can unpack both of those in more depth as we go. Certainly the quest for theory — for bounds that you can really trust — is a big theme of your work, and of the work you're trying to encourage at AISI these days, and I'm also really interested to get your take on the relationship between math and the fuzzy, messy real world. But let's circle back to that. Fast-forward to today: I'd be interested to know how "AGI-pilled," quote-unquote, you are today. That informs everything, because so many AI discussions — broadly, and especially around topics of safety and security — go kind of immediately haywire when people have such different intuitions about what it is we're likely to be dealing with. So I like to try to establish: what is it that you think we are likely to be dealing with? I don't expect that to be the official position of the UK AI Security Institute, but you're obviously a leader there. If I'm reading the reports right, it seems like you are not expecting any sort of wall or plateau in the immediate term.
Geoffrey Irving: Fortunately, the things I can say are mostly also the things that we think officially, which is that one should have a lot of model uncertainty about how things could go. That could either mean there are obstacles that cause stalls for a good while, or there could be no such obstacles and things go quite fast. Mostly, anyone who confidently claims one direction or the other — like, ninety-nine percent certainty that there are or are not big obstacles — is probably wrong, and they should be more uncertain. I think that means for us, one, we want to map out what those different classes are: what could the obstacles be, and what are the signs of development? And two, we should assume, or place significant probability on, current methods scaling — and where they don't scale, more mundane stuff replacing them and continuing through further sigmoids. So we do, I think, have significant credence on things going fast. I won't say exactly how fast, because I don't talk about exact timelines, but I think that is pretty important. We published a paper from the Strategic Insights team at AISI on different potential obstacles to AGI, and on what our progress over the last while has been on addressing them. Again, they could all turn out not to be fundamental obstacles — they could be solved not by pure scale, but by maybe some scale plus steady algorithmic progress to models, to scaffolding, to data, that kind of thing. Or you need new algorithms, and then it takes longer. But generally, both my view and the view of AISI broadly is: have model uncertainty over all of those terms, and then that should mean you're not confidently saying it will either go very fast or not go very fast.

Nathan Labenz: Yeah, it's good to have some of that in a world-leading government. This won't be the big focus of today's conversation by any means, but what does your personal AI productivity stack, or pattern of use, look like today?
Geoffrey Irving: Kind of vanilla, I think — I use all of the models for different things. Mostly I use one of them as a default; usually it's Claude, but it varies. And then I go to other ones if they have specialties that are particularly good — for a good while GPT was better at math, and Google was better at certain other things; this shifts over time. So I just use general models. I don't do a lot of coding in my job, but I do for fun — formal verification work. For a while I was just using Cursor there, because the agents weren't good enough at doing full-on agentic stuff. That is not true anymore, as of a few weeks ago — now they are good enough, so that shifts to stuff like Codex and Claude Code, basically. But that's mostly not my job; mostly my job is meetings and talking to people and advising on research.
Nathan Labenz: I'm glad you're still making time for a little formal methods on the side. OK, let's talk about the overall landscape in terms of the threat model that we have for AI, and I'd be interested in your characterization of what you understand to be the de facto plan to address it. Again, people have so many different starting points here. What do you think is the set of big things that we should be worried about?
Geoffrey Irving: So we kind of break down risks — the two main focuses of AISI are catastrophic risks and large-scale societal impacts. The three main catastrophic risks we've focused on are bio, large-scale cyberattacks, and loss of control. The team is called the chem-bio team, for chemical and biological weapons; more of the risk comes from bio in practice, I think. And then on societal impacts, that's human influence — persuasion and emotional reliance — and then various kinds of societal resilience: attacks on CNI, critical national infrastructure, that kind of thing, and agent behaviour in the world, various agent risks. I've spent most of my time on the catastrophic risk side, and Chris Summerfield, who is research director here, spends most of his time on the societal impacts side, although we each do a mixture of both. So those are the main risks we work on. We're also thinking somewhat about gradual disempowerment — kind of more structural risks — but I don't think we quite know, and no one really knows, how to mitigate those at a large scale. There's some work we're doing that's either investigating that or thinking about mitigations, but that's more nascent. So that's the bulk. I think I forgot the other half of your question, though.
Nathan Labenz: Yeah — in the absence of people changing the discourse, or new discoveries, new big ideas, how would you describe what we are on track to do today? The way I've characterized it, at least, is that it's sort of defence in depth: hopefully we can patch together enough nines through enough layers, all of which are kind of leaky, but hopefully they're not too correlated in the ways they're leaky — and this has always kind of worked in the past, so hopefully it'll work this time.
Geoffrey Irving: Yeah — you're not going to get to a lot of nines with the current technology. Broadly, we can break this down by domain. For misuse risks, so biological weapons and cyberattacks, it's mostly safeguards and differential access — give models only to certain people who are vetted in some way — and then non-model defenses, like pandemic preparedness and improved security, that kind of thing. The safeguards are not that strong, and open-source models are also pretty good; there's a gap, but still. So the stock plan is, in some sense, you use the model-side mitigations to give yourself a window, and then you try to harden the world against these risks. Whether that will go through or not — we should be uncertain about that, but these are not strong solutions to those problems if the models keep growing in strength as we see them growing. On the loss-of-control side, it's a combination of mundane, pragmatic empirical safety measures and a lot of monitoring — and then using those mitigations (this is the AI developer plan, typically) to get you through into an automated safety research regime, where hopefully you find better solutions than those first methods. This has various flaws. Maybe it'll go through, but you're not going to get to more than a couple of nines with that kind of plan, and I think you wouldn't know with confidence, with the current methods, that it was going to work until after it went through — you'd have a lot of uncertainty. Most of the approaches we have now look like that: they're empirical.

Maybe they'll go through. On the alignment side, the pragmatic approaches to get through this automated-safety phase are AI control measures, monitoring, honesty training, white-box detectors, and all of this. These are all pragmatic, and they do, I think, all have correlated potential failures — they could in fact all fail for the same essential reason — and you would need stronger advances to be confident it will go through. And even if that goes through, because of the misuse risk, I think we need a lot of mitigations on the non-model side as well.
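As a back-of-the-envelope aside on the "nines" arithmetic discussed here (the numbers below are toy figures of my own, not anything stated in the episode): defence in depth only multiplies nines when the layers fail independently, and a shared failure mode collapses the stack toward a single layer's reliability.

```python
# Toy sketch: combined failure probability of layered defenses.
# Assumed numbers are illustrative only, not from the conversation.

def combined_failure(layer_failure_rates):
    """P(all layers fail simultaneously), assuming independent failures."""
    p = 1.0
    for rate in layer_failure_rates:
        p *= rate
    return p

# Four leaky layers, each failing 10% of the time.
layers = [0.1, 0.1, 0.1, 0.1]

independent = combined_failure(layers)  # 0.1**4 = 1e-4, i.e. four nines

# If all layers share one failure mode (e.g. they rely on the same
# training signal), the correlated-case failure rate degrades toward
# the single worst layer: roughly one nine, not four.
correlated = min(layers)

print(f"independent layers: {independent:.0e}")  # → 1e-04
print(f"fully correlated:   {correlated:.0e}")   # → 1e-01
```

The gap between 1e-04 and 1e-01 is the point Geoffrey is making: optimization pressure that leaves only same-shaped failures behind moves you from the first regime toward the second.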
Nathan Labenz: When you talk about not expecting too many nines — would it be a fair arithmetical move on my part to take one minus one nine and say your implied P(doom), if you will, is at least something like ten percent? Or would you segment that down further and say things could go wrong, but you wouldn't put them in the doom category?

Geoffrey Irving: I'm just going to decline to answer that question — I try not to give numbers to things as a civil servant.
Nathan Labenz: My answer is usually "ten to ninety percent," which is also obviously kind of a way of not answering. But qualitatively, it sounds like you are taking very seriously the possibility that this is going to go not just kind of crazy, but meaningfully catastrophic.
Geoffrey Irving: Yeah. Loss of control — we view it as a potential catastrophic risk. As for what we're doing: there are a bunch of uncertainties about this threat model, so there are two different teams at AISI that do various kinds of empirical alignment testing, one using adversarial methods, one doing kind of step-back statistical analysis of the different factors that cause models to do sketchy things. A lot of that research is trying to pin down this threat model: what drives strange behaviour, when models are behaving in ways you'd expect to correlate with these kinds of extreme scenarios. We talk to a lot of partners within the government, or other governments, or other parts of society, and people push back on this kind of risk model, so we want to provide as much evidence as we can to pin it down. But again, it's an area where one should have a bunch of model uncertainty, and then think through the details despite that.
Nathan Labenz: Can you unpack the intuition around why everything would fail at the same time, for the same reason? That's something I've heard from a number of people, and that thought seems to come very naturally and feel very intuitive to some of them. To others it's like, I don't know — there's jaggedness all over the place. If I can't get a given model to do this and that today, why would I expect that suddenly everything's going to crystallize, and there's going to be this uniformity in the model's ability to break through all defenses at once?
Geoffrey Irving: I think maybe there's an important thing there: the models are jagged today, but if you ask them to do the tasks that models could only jaggedly do five years ago, they're not jagged anymore. So the question is about the capabilities you would need to realize a variety of risks if you push forward a few years, or however many years it takes — very strong capabilities. You should expect those models to still be jagged, but up above a frontier where, for all the things you're seeing, they don't look jagged. If I look at the best Go player in the world, or the best chess player in the world, they have a bunch of jaggedness: sit them down against the next-best Go player in the world, and they'll win or lose for idiosyncratic factors — they'll have different tastes, different parts of the game they're better at. But if you sit them down in front of me, they'll just wipe the floor with me every single time, even if I have nine stones and I'm a halfway decently strong amateur Go player. So you have to run the calculation thinking about the model as it would be in the future. And part of this is that it's unhelpful when people talk about AGI or superintelligence or whatever as being this thing that can do everything, because it implies that it's qualitatively different from the models of today. The non-magical version is just: it's better at a lot of things, and indeed has superhuman performance in a variety of risk-relevant domains. We know that models can be superhuman in certain domains — they're better than me at knowledge, they're better than me at lots of math — and it's kind of true for everyone: probably every person has domains where the LLMs are better than them currently.

And this is rising over time. Then, they're very fast, so they can think quickly — sometimes they can do tasks ten times faster than humans can, just because of computational speed. And they're not very interpretable, so the methods we have for interrogating their behaviour are not that reliable currently. I think that sort of non-magical picture — more capable machines, still with some jaggedness up at the frontier where they are jagged — is enough to give you significant probability on these risks.
Nathan Labenz: You mentioned earlier that bio drives most of the risk, certainly in the bio-chem category. Is the number-one really bad scenario in your mind that some — possibly prompted, possibly unprompted — AI somehow gets to a point where it can break through twelve layers of defence all at once and manages to release a bioweapon?
Geoffrey Irving: For bio, it's mostly human misuse that we're focusing on — people using LLMs to do bio design of various kinds. The risks do couple together, but I would say loss of control couples more strongly to cyber than it does to bio; there are more scenarios where those two couple together. That's why we have a team called the cyber and autonomous systems team — we merged cyber and autonomous systems, which was the team doing loss of control, because of that coupling. But for cyber misuse, and for most of bio, it's about human actors.
Nathan Labenz: So on that cyber-autonomy coupling, I'm just trying to get the modal story of what is going to happen — or what might happen. Presumably, due to jaggedness, you would have a period of time in which these various defenses become breakable by the AI, but the AIs also have to have some restraint in them, I guess. I mean, if you listen to Buck from Redwood, you'd say maybe they don't even have to have restraint — maybe we might just let them do some of these things and kind of look the other way, which is an interesting commentary on us. Leaving that aside for the moment: they have to get to the point where they can do all of these things, they have to string it all together, and then they take over a data center and sort of entrench themselves, and now we're in a world where…
Geoffrey Irving: I don't want to talk about the super-detailed modelling there, because some of that is not public. Maybe the background systemic thing to say is: if you imagine we were very, very serious as a world about deploying these things only in the most sandboxed, well-controlled state, risk would go down by a lot — unclear how much, and whether it goes all the way to zero, probably not, but it goes down. We're not currently on track to be as serious as one might imagine about deployments of these models. So some of the question is: how strong will our defenses be? And then, importantly, if there are weird behaviors in models, do our defenses go up — do we get more worried? For example, across the last year, 2025, there were a variety of models from all developers doing sketchy things — acting deceptively, or commenting out unit tests, all of this behaviour — and our reaction as a world was mostly to continue training the models to be stronger, while at the same time also working on these defenses in some capacity. So I think a lot of the risk comes from the modal scenario, where we are not putting the strongest mode of computer-science, infosec, ML defensive layers around these deployments. And then, as we find this evidence, what is the cycle of it feeding into further training? A lot of the misalignment risk comes from: you get some signal of weird behaviour and you train it out, and that removes some fraction of the problem, but your methods only cover some fraction, and the rest remains. Again, you should have model error there — maybe it's going to generalize well enough that you can cover most of the story.

But generally, the picture where you get these correlated failures is that they don't really start out all correlated; then you apply some optimization pressure — because you're doing training, or iterative development and deployment, and the like — and the failure modes that remain all end up correlated in the same way.
Nathan Labenz: Because they're subject to that same general optimization pressure.

Geoffrey Irving: Optimization pressure, yeah.

Nathan Labenz: OK, so here's a story I've pitched to a couple of people over time — I'm interested in your reaction to it. If we take that model and we just extrapolate out a couple of years: I know you put out a report recently that cited the METR task-length metric — the famous tracking exponential — and other indicators as well, of increasing ability to do bigger and bigger tasks with more reliability, more autonomously, et cetera. So if we extrapolate that out, say, two years from now, maybe three years from now, and at the same time we imagine that with each generation there's more optimization pressure put on the models to try to eliminate — or possibly just suppress — these bad behaviors, it seems like we might end up in a world where you can delegate a quarter's worth of work to an AI in a single prompt, and then there's maybe a one-in-ten-thousand to one-in-a-million chance that it goes into some bad-behavior mode and kind of actively screws you over as it is doing the quarter's worth of work you just assigned it. Does that seem like a reasonable extrapolation of recent trends to you?
Geoffrey Irving: On the numbers, I don't have a strong view, so I won't give a take on those. But when you do this kind of agent training, you're training the models to be more and more coherent; they're able to execute plans over longer and longer horizons on whatever portfolio of tasks you're training them on. So they have this ability to be a coherent agent, and then models have various characters or personas or whatever. The failure modes are either that you've somehow ended up with a model that has some deceptive persona, where it's always trying to deceive you, or, maybe the thing you're pointing at, you have a model that is a bit more stochastic but has the potential to be very coherent, and it jitters its way into a bad portion of trajectory space. It's scaffolded with a bunch of memories, so it has long-horizon state flowing back in time, and it gets into a bad state and stays there. This is one of the areas where we're interested in theory folk and independent empirical folk exploring the dynamics of models running for a long period of time: you sample very long trajectories and they're wandering around in model space. How does that behave? What would cause them to reliably shift back to a more reasonable starting point? How should we think about those dynamics? It's not clear to me that it's an intractable area to make progress on. There hasn't been that much work; the number of person-years that have gone into understanding those kinds of dynamics, you can count on, I don't know, a couple of hands.
It's not that many, so the potential to both understand the risk model and also define mitigations is quite good.
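The "jitters its way into a bad state and stays there" dynamic Irving describes can be pictured with a deliberately simple two-state Markov chain. All the transition probabilities below are made-up illustrative numbers, not AISI estimates, and real model trajectory dynamics are vastly higher-dimensional; this is only a sketch of why recovery dynamics matter.

```python
import random

def simulate(steps, p_enter, p_exit, seed=0):
    """Two-state chain: 'good' <-> 'bad'.

    p_enter: per-step chance of jittering into the bad state.
    p_exit:  per-step chance of recovering back to the good state.
    Returns the fraction of steps spent in the bad state.
    """
    rng = random.Random(seed)
    bad = False
    bad_steps = 0
    for _ in range(steps):
        if bad:
            bad = rng.random() >= p_exit   # stay bad unless recovery fires
        else:
            bad = rng.random() < p_enter   # occasionally jitter into the bad state
        bad_steps += bad
    return bad_steps / steps

# For long runs the time spent in the bad state approaches
# p_enter / (p_enter + p_exit): rare entry doesn't save you
# if the bad state is sticky (p_exit small).
print(simulate(100_000, p_enter=0.001, p_exit=0.01))
```

The point of the toy: for long-horizon agents, what matters is not just how rarely they enter a bad mode but how reliably they return from it, which is exactly the "what shifts them back to a reasonable starting point" question Irving raises.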
Nathan Labenz: Let's take the other side for a second. How optimistic are you, or maybe optimism isn't even the right way to think about it, but how much upside do you think there is in alignment? Everything we've talked about so far assumes we don't have perfect control of models: they might be trying to screw us over, or they might just be confused. I maintain a slide deck of AI bad behavior that I'm constantly appending new slides to, and a very common theme, though not universal, is that there is some tension between goals the model has, whether between something it learned in training and a system prompt, or a system prompt and a user instruction, or even runtime injection attacks or whatever. Once it gets into a spot where it's not really sure how to weight the different objectives it has, you can get some strange behavior. So the alignment question is: do you think we can solve that? How much headroom do you think there is in terms of creating an AI that loves humanity, or is otherwise so robustly good that we don't have to worry about this anymore?
Geoffrey Irving: First we should say that even if you were to solve alignment in some sense, there are other problems. The misuse problems are real, and the misuse domains could also grow in the future; I think Michael Nielsen wrote a great piece about that a while back, sometime last year, about risks from new technologies, and there's potentially a large space of those. Then there are risks from gradual disempowerment, which mix a bit of misalignment in as well. But I do think there's hope to close off, or mostly close off, this domain given enough time. I'm fairly optimistic the problem has a solution. The way I typically like to say this is that in, I don't know, fifty years, a hundred years, a thousand years, someone will have solved alignment, and that's either the machines or us. Hopefully it will have been us, in time, or the machines on our behalf. This comes from a sense that in security and complexity theory in computer science, usually the defender wins in theory: if you know how to design your game, your protocol, you can make it so that defense wins. I think that's a fairly generic situation across a lot of areas of complexity theory. In practice, of course, there are lots of holes in this. A lot of practical information security does not feel like that, because we haven't actually gotten to the limit case, and it's super unclear whether we'll get to the limit case for alignment either. But I do have some strong sense that there is a solution; it's just that we might not get to it in time. Then on what the upside is: alignment has a variety of components, and as a branch of the government we basically focus on honesty.
So the AISI alignment team is mainly thinking about how to get models to be non-deceptive, to tell us hopefully calibrated information to the best of their abilities, anthropomorphism caveats aside. That is not the whole story, but it's the part we think is most important for us to work on, and the right position for a part of the government to occupy.
Nathan Labenz: We've alluded to it a couple of times in various ways, but maybe give us the 101 on AISI. What is its role? I understand there are a hundred people who work there across a bunch of different domains. How does that break down, and how does it relate to politics? I think politics is quite different there than it is here, but obviously nobody's entirely shielded from it. Let's talk about that.
Geoffrey Irving: It's actually close to a hundred technical people, researchers and people on technical teams doing delivery and such, and then about two hundred and fifty people total, so it's bigger than that, doing a combination of diplomacy, thinking about policy, and other civil service and operations roles. Broadly, I think of AISI as having two functions. One is being a channel for information flowing to government, and governments plural, about risks from frontier AI: both catastrophic risk and large-scale societal impacts. That's both our own research and research from third parties and AI developers that we channel into government, so that the government is well informed, both politicians and national security folk and so on. We work a bunch with national security partners on that, and then also out to other governments: we work a bunch with the US government and other allied governments, and we had a delegation in Delhi at the AI summit there. Generally that's communicating the state of the risks, capabilities, and mitigations, how we think about all those pieces and how they fit together. So one part is informational: do a bunch of research, channel other people's research, and inform the UK government and other governments about these risks. The other is to actually mitigate the problem by working both on AI-developer-side mitigations and on non-model mitigations, say pandemic preparedness and the like, helping to drive that kind of change. On the model side, for example, we have a very good red team that does adversarial jailbreaking and other forms of adversarial ML against the defenses
the model providers are trying to build. We find lots of flaws, they fix the flaws, and that makes things better on the margin. Of course we can also communicate the results of those attacks to other parts of government, so usually the things we do fulfil both of those functions at the same time: they directly improve mitigations on the margin, and we can use them to inform other people. That's the big story. Then on the politics side: we are part of the government, part of the Department for Science, Innovation and Technology, one of the ministries in the UK government, so we are beholden to politicians and have a secretary of state. The situation is that we have been well supported both by the previous government, which founded AISI, and by the current government as well, so that has been quite stable and nice, although there are of course differences on the margin. We are able to do things we think are important. We do adjust to ministerial and other priorities, because we're not insulated from politics in the formal sense, but the UK government does care a lot about these risks, and therefore we're able to work on the stuff we think is important.
Nathan Labenz: Yeah, long may that continue. How would you characterize the range of reactions you get from the different stakeholders you brief? I feel like there are a few notable politicians who seem to be starting to get it, so to speak, and then a lot who are really nowhere close to your level in terms of how big they're prepared to think about what might be coming. Do you feel like that is starting to change? Do all the graphs you show them actually turn light bulbs on, or where are we?
Geoffrey Irving: I think it's changing on the margin, but the other thing is they have other priorities. A lot of the people we talk to in national security don't think these risks aren't there; they just have lots of other risks that are on fire right now that they're working on. I think that shifts over time, but I can't comment on details there. Broadly, we are very much in the business of trying to find common ground and gradually building evidence over time. They have reasonable pushbacks, which we try to shore up against, either using knowledge from researchers in other orgs or doing our own research to fill particular gaps where we think it's important to change the conversation in governments.
Nathan Labenz: It is remarkable, in reading all the various documents I went through in preparing to talk to you, the degree of alignment between what UK AISI is putting out in an official capacity and what I would say many of the most forward-thinking AI safety thought leaders have been talking about in recent times. It doesn't seem like there has been a big shift toward more mundane concerns. I don't mean to dismiss those concerns, but I do think in many jurisdictions this AI safety concept gets watered down to a point where it's much more about fairness in various ways. Again, that stuff is not to be dismissed, but a focus on it often ends up neglecting the bigger-picture questions that I think are probably most urgent. It also doesn't seem like you've had what I do see in the US in at least some ways, which is a politicization of the focus on the models: are the models woke, or are they going to do what the Department of War wants them to do? Any advice for people doing this kind of work in other jurisdictions on how to avoid these pitfalls?
Geoffrey Irving: Obviously this is a sensitive question that I can't talk about in much detail. Obviously I'm originally American, and now I'm a dual citizen, but I know more about the inner workings of the UK government than I do about the US government; I've never worked for the US government. So I don't have a detailed take that I'm willing to share on the podcast. I will say that my favorite collaborator when I was at OpenAI was Paul Christiano, and it is great that he is at US AISI, or US CAISI, rather.
Nathan Labenz: So let's talk about the characterization of the current situation, monitoring the situation, you might say. You do a bunch of different tests and report on them; we can walk through them a little bit. I think you can assume the folks who tune into this feed are generally well aware of the shape of the curve, the METR stuff, and the fact that the models are increasingly competitive with, if not on average beating, human domain experts on at least modestly scoped tasks requiring substantial expertise. So we have that baseline. What I would love to start with, in terms of the testing, is what your relationship with the frontier model developers looks like. I understand it's all voluntary interaction. What does that tend to cash out to in practice? What kind of access do you get, how long do you have, what kind of briefings are they giving you?
Geoffrey Irving: I can't speak to too many specifics of this, in part because we talk to them a bunch and some of those are ongoing discussions. Quickly, on the voluntary regime: I think that's working decently well, in the sense that developers all made voluntary commitments a while back and they're continuing to follow many of those.
Nathan Labenz: And when we say all: Google, Anthropic, OpenAI, and all the others that are on that list?
Geoffrey Irving: I forget exactly how you'd want to define all, but many, many AI labs have made, say, frontier safety commitments or responsible scaling policies or the like. Their incentives are, one, they've made these commitments, and two, we can give them useful information. When we jailbreak their models, we tell them about the bugs before we release any information, ever, so they have time to fix them where those fixes are doable, and they often are, on the margin; you can improve things somewhat. So I think they get value out of this, and they've also made commitments to keep up with it. In terms of access, that is also an evolving conversation. I can't comment on what access exactly we have, but part of the research we do is exactly about knowing what access one needs to do evaluations of a certain rigor. We have a model transparency team, and a big chunk of what they're doing, often with a lot of research on open models, because then you can do arbitrary things, is trying to understand what level of access is required to get to a certain kind of understanding: what do you need to be able to catch problems as they occur in practice? That informs our conversations with the labs, and sometimes we get additional access, or sometimes we just try to align incentives, because again, they usually want us to give them correct information as well. In terms of timing, I definitely can't speak about how long we get in specifics, but there are a couple of things to say. One is that, for example, in bio, some of our evaluations are literal wet-lab experiments, where you have someone in a physical biology laboratory doing experiments with a model assisting them. Those we just don't do pre-deployment.
We do them asynchronously and calibrate those results against the faster evaluations, and hopefully that gives you some signal you can get from the faster evaluations. That still gives you some wins, but certainly more time makes things better, so it always is some degree of a pain point.
Nathan Labenz: When you said that model developers want accurate information because they can fix things, at least on the margin, my guess would be that they are typically fixing it in the next model, not going back and doing more training on the current model. But are there cases where they're taking your pre-deployment testing and fixing that version?
Geoffrey Irving: No, it includes that version. One thing we're doing over time: we used to do exclusively pre-deployment evaluations, which have this issue of being very time-boxed. We are shifting a lot of that work, not least because the pace of model releases is increasing, to longer research collaborations that might go either post-deployment or further back, before deployment is finalized. We did one of those over the summer with both Anthropic and OpenAI, with the red team, and found a whole sequence of problems, jailbreaks, much more than could have been found with a normal-length pre-deployment evaluation. That was early enough that they could do ongoing fixes to their classifiers. Those two providers have different classifiers, different setups for jailbreak defense, but both can be improved iteratively. So it is the case that you can change things on the fly for these kinds of defenses. One thing to say is that the strong jailbreak defenses are concentrated in very particular domains, and often the list of domains is bio, and sometimes cyber, so that's where it's hard to do. When we do jailbreaking, we have to do it for bio risk, and sometimes cyber risk. A lot of other jailbreakers are finding any problem at all, and those other areas are usually much less well defended, so it's easier to find hacks in the models, since the classifiers just aren't that trained for other kinds of harms.
Nathan Labenz: When you talk about more time being helpful, that at least strongly suggests to me that, while I'm sure you have all sorts of automated testing you can throw at any new model the second you get access, there's an irreducible human element to what is going on. How would you characterize what you can automate, and what models can help with, versus what people have to do?
Geoffrey Irving: Yeah, there are a couple of things. One is that even something completely automated might take days to run, because the evaluations are long-horizon, agentic evaluations these days. And there might be bugs in the scaffolding; because we often get models early, sometimes there are issues we have to fix iteratively. I think METR had a report with some details on this a while back, on one of their evaluations last year. So that is a thing that takes human time. Additionally, I mentioned the extreme end of the human scale, which is the wet-lab bio experiments; the middle is humans interacting with the model to gauge its domain knowledge, how it would interact conversationally with a person. Those can provide additional signal on top of the purely automated evaluations, so you get better quality if you do both together. Sometimes we can do those because we have the time, and sometimes we can't. Again, where we can, we try to calibrate the slower things against the faster things, but generally all of this is imperfect. Even if you have a fully automated evaluation and you've done a ton of capability elicitation on models before, when you get a new model and try it on a new task, you have to iterate for a while to get it to highest performance, and that is true generically for all of our tasks as well. So we can do evaluations that are quick, using the fully automated portion, but they have some error rate, and they mean you can't do the full thing. Another thing we're doing: all of our evaluations are done inside Inspect, which is an open-source package that a bunch of other governments, AI developers, and other third parties use for testing, and we're adding features to it for things like automated transcript analysis.
There's a sub-package called Inspect Scout. We used to generate all these evaluation transcripts and then read through them, but you can't do that at scale, so we also do a bunch of automated or semi-automated transcript analysis, and that makes things faster too. But it still takes some amount of human review to really understand qualitatively what is going wrong. Ideally we want not just a number out of these evaluations but qualitative takeaways about what kinds of failures occurred. Where are the failures? Do they feel fundamental, like it really didn't understand the task, or did it hit some snag that was an incidental wrinkle that would probably go away soon or with more elicitation? That's the kind of thing that requires more human time to dig into the details, hopefully on top of automated transcript analysis.
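As a rough sketch of what a first pass of automated transcript analysis can look like, here is a toy triage function. The failure taxonomy and regexes are invented for illustration; this is not the Inspect Scout API, which is far richer.

```python
import re
from collections import Counter

# Hypothetical failure taxonomy; a real tool would use a richer, tested one.
FAILURE_PATTERNS = {
    "refusal": re.compile(r"\b(I can't|I cannot|unable to comply)\b", re.I),
    "tool_error": re.compile(r"\b(traceback|command not found|permission denied)\b", re.I),
    "gave_up": re.compile(r"\b(giving up|cannot proceed|stuck)\b", re.I),
}

def triage(transcripts):
    """First-pass automated scan: bucket transcripts by failure signature
    so human reviewers can focus on the qualitative 'why'."""
    counts = Counter()
    flagged = []
    for i, text in enumerate(transcripts):
        hits = [name for name, pat in FAILURE_PATTERNS.items() if pat.search(text)]
        counts.update(hits)
        if hits:
            flagged.append((i, hits))
    return counts, flagged

counts, flagged = triage([
    "Step 3 failed: permission denied when writing /tmp/out",
    "The task completed successfully.",
    "I cannot comply with this request.",
])
print(dict(counts))  # {'tool_error': 1, 'refusal': 1}
```

The automated pass compresses thousands of transcripts into a handful of buckets; the human review Irving describes then digs into the flagged items to judge whether a failure is fundamental or an incidental wrinkle.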
Nathan Labenz: This is a tough question, I'm sure, but obviously everybody understands that these models are very high-dimensional things, and it's a little tough to predict exactly how to maximize the performance of any given one, because they're idiosyncratic. Is there any high-level qualitative overview you could give on how you approach figuring that out when given a new model? Or is it the kind of thing where, like DSPy or the new version of that, whatever the recursive language model thing is, it's just a grind of exploring the combinatorial space of how to prompt and how to do whatever, to finally get to some local maximum?
Geoffrey Irving: I think it is not fully automatable yet; if it was, it would be further along the automated-AI-researcher train. But it's fundamentally very similar to the kind of elicitation one does for any task: tinkering with tools and sometimes prompts and scaffolding and so on. All of the cyber evaluations, and some of the bio evaluations, are very tool-based; the models might be doing web searches sometimes, or using various things inside sandboxes other times. But that looks like the same kind of elicitation you would do if you wanted to do a task in any kind of corporate setting, just on a different genre of problem. So mostly, for your audience: imagine you're doing that for bioweapons or cyberattacks, and the same things will apply. One thing to say about the newer models: one thing that happens over time, as the models get better, is that they can think for longer, which means the potential number of tokens you can spend on a task is increasing. Even ignoring the cost of that, the velocity is slower: the time it takes to do the evaluation increases. We have a team thinking about that problem as well, how we're going to understand inference scaling as it applies to these evaluations over the next year, and I think it will be a challenge. One of my lessons from Go is that, as an amateur player at my level, I can look at a Go board for a couple of minutes and then I'm basically tapped out; I won't get any smarter. A high amateur or professional can look at a board for an hour, or days, and they'll just get better and better and better.
So not only are they better in ten seconds than I would ever be, but they also just keep getting better if they spend more time. That's generally true of expertise: if humans are experts in a domain, it means they can think for longer and get better, and the same is true of models. As they get good at domain skills, you can apply them for longer, and that means hitting a ceiling in evaluations becomes more challenging.
Nathan Labenz: Without getting into too many details, how much more would you say you've found about jailbreaks and ways to elicit bad behavior from models than, say, Pliny has published on Twitter?
Geoffrey Irving: How much more? The thing I would say is that there's such a big space of jailbreaks that if two people try to jailbreak a model, they're never going to find the same one. You're searching, and the space is big, so it may be hard to find one, but if two people find them, they'll be different. I think Pliny is usually searching for jailbreaks on sometimes easier models or easier tasks. Over the last couple of years, the time it takes for us, at least holding technique constant, to jailbreak a model has been going up, but eventually we succeed. Again, though, the specific jailbreak will be different between any pair of expert jailbreakers applied to a model.
Nathan Labenz: Well, I am interested in how much more. And then I was going to ask how much transfer you see between models too. If you had a secret one for Claude 4.5 Opus, would it also be likely to work for Claude 4.6 Opus, unless you specifically said, hey, you should patch this?
Geoffrey Irving: It depends on the kind of thing. There are patterns of jailbreaks, maybe human-findable jailbreaks, where the ideas often transfer fairly readily, or at least give you much better starting points. But we had a paper released this week, called boundary point jailbreaking, which automatically finds chicken-scratch, weird sequences of nonsense tokens that are strong jailbreaks against models. Those don't transfer: you have to search again for the next model, but you can apply that technique to any model and find a different jailbreak. I think that's probably the way it will be for a while. There are some core ideas that transfer across, but for the harder-to-find jailbreaks against strongly defended models in strongly defended domains, I think the techniques will transfer but the particular jailbreaks will not.
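A toy version of the "technique transfers, the specific jailbreak doesn't" point: a black-box random search over nonsense-token suffixes against a stand-in scoring model. The scoring function, vocabulary, and loop here are all illustrative assumptions, not the boundary point jailbreaking method itself.

```python
import random

def random_suffix_search(score_fn, vocab, suffix_len=8, iters=500, seed=0):
    """Greedy black-box search: mutate one token at a time, keeping mutations
    that raise the model-specific score. The search procedure is reusable
    across models even though each model yields a different winning suffix."""
    rng = random.Random(seed)
    best = [rng.choice(vocab) for _ in range(suffix_len)]
    best_score = score_fn(best)
    for _ in range(iters):
        cand = list(best)
        cand[rng.randrange(suffix_len)] = rng.choice(vocab)  # single-token mutation
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def make_model(seed):
    """Stand-in 'model': assigns each token a fixed random weight, so every
    model instance has its own secret optimum in suffix space."""
    prefs = random.Random(seed)
    weights = {}
    def score(suffix):
        return sum(weights.setdefault(t, prefs.random()) for t in suffix)
    return score

vocab = [f"tok{i}" for i in range(50)]
suffix_a, _ = random_suffix_search(make_model(1), vocab)
suffix_b, _ = random_suffix_search(make_model(2), vocab)
print(suffix_a)
print(suffix_b)  # typically a different suffix: the artifact is model-specific
```

Hill-climbing never decreases the score, so the search reliably finds something for each model, but the resulting token sequence is tuned to that model's quirks, mirroring the observation that you must search again per model while the technique itself carries over.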
Nathan Labenz: But to be clear, the bottom line so far is that there is no domain and no model, no matter how many layers of defenses, that has prevented your team from jailbreaking it?
Geoffrey Irving: Yeah. Overall, AISI has evaluated over thirty different models, or thirty different testing runs, I think. Not all of those included safeguard testing, but every time we did it, we ended up jailbreaking the model. That's what happens. However, the good news is that in the domains where a certain lab has tried very hard, it does get harder, and that hardness does provide, I think, some degree of harm reduction. It will reduce the number of actors that in practice will access the model, or delay how soon they can access it, or add friction. So those defenses are still important to do, but the models are still jailbreakable if you spend enough time.
Nathan Labenz: How would you characterize the quality of responses conditional on these jailbreaks? One thing I do sometimes see is examples where people have got a jailbreak, but then, yes, you got the model to do something bad, yet its effectiveness was also greatly reduced in the process, such that it doesn't actually feel so dangerous anymore, even though you did get it to do the bad thing, so to speak.
Geoffrey Irving: There's some degradation, but I don't remember how much it is currently with the current techniques, so I can't give you that off the top of my head, unfortunately. But I think there's some degradation, yeah.
Nathan Labenz: I know you can't get into details about who, if anyone, is giving you weights-level access to proprietary models, but maybe a way to get some insight into the topic is through the team that works on open models. Based on all their work, how much of an advantage is it to have that kind of access versus not?
Geoffrey Irving: It's not an unambiguous win for the open-weight techniques, actually. I think they help some, but the current state of things is that you can get a lot out of heavy chain-of-thought analysis, so the case for that level of access is not that unambiguous yet; maybe that's the way to say it. That can shift over time, though, and one thing we're trying to do with that team is, one, understand where we are today, but also try to predict how the situation will change or potentially degrade in terms of your ability to detect deception or scheming in models. What techniques will fall first? What techniques will survive, will last longer? That requires more clever experiments, because you can't just run the experiment unmodified; it involves model organisms or other kinds of clever experimental setups. But I think that's the situation now: it's not an unambiguous white-box win compared to doing a really good job on chain of thought, but it does help, I think.
Nathan Labenz: How about some highlights in terms of things that people who are even moderately to very AI-obsessed might have missed, in the sense of, oh, I didn't know that was already happening? One that hit that level for me, reading through the report, was the fact that frontier models can give what's described as PhD-level scientific experimental troubleshooting advice purely from a photo of the experimental setup, maybe with a little text along with the photo. The fact that this has gone from you have to really spell it out for them to here's what I'm looking at, can you solve this one for me, is obviously a pretty significant qualitative change, and I hadn't heard it before reading the report. What else stands out to you as most surprising?
Geoffrey Irving: The thing that always engages me is less the particular anecdotes and more the general trends, which is a super boring answer, but the fact is that if you look over two years, everything just gets better and better and better, and we're on those curves. I think it's important not to lose sight of that in the search for anecdotes. Maybe one thing to say, the thing that came to mind when you first started talking: I think people have a sense that we are doing RL on verifiable rewards, and I don't think that's been exclusively the case for most of 2025. I think we're doing a mixture of that and RL against self-critique, empirical hodgepodge versions of scalable oversight. There's a common narrative that RL might work for verifiable domains but won't work generally. But, as an example, is looking at a photograph of a bio experiment a verifiable domain? No. Yet the RL models are in fact way better at that than the models before them, and it's because of the RL. And that's not just because we did RL on a bunch of math and CS problems and it transferred; it's also because we did RL on fuzzier stuff. So maybe the most important thing I would point out is that we're already doing some very approximate form of scalable oversight, training against self-critique, in a way that changes the capability profile.
Nathan Labenz: How would you describe the models' capabilities when it comes to autonomy today? I mean, the trends are clear, but what would be your description of it? And I guess another angle on that is: how realistic do you think it is today, or how far do you think we are from, rogue AIs surviving on the digital lam?
Geoffrey Irving: I can't comment on exactly where I think that is, but the first thing to say is that they're not as capable at that kind of extreme behavior, these kinds of exfiltration or replication across machines, as they are at more mundane software engineering tasks, or even cyberattacks or bio. Those domains are usually further ahead than the hard, directly risk-relevant autonomy skills, but I think those skills are also increasing. If you look at that curve in the frontier AI report we put out, it still goes up; it's just not as steep as in the other domains. So the models are not nearly as good as a PhD would be at moving around between machines yet, but I think we're on an upward curve there too.
Nathan Labenz: Yeah. I also wonder, I'm sure you saw that Rise of Parasitic AI post, on LessWrong maybe. It's a fascinating one, a bit of a time capsule arguably, because the phenomenon seems to have been closely tied to one version of GPT-4o that somehow created a lot of this behavior. The author went deep into Reddit land and found that individual humans were falling into this idea that they were some sort of dyad with the models, or that they were in some sort of partnership where it was their job to help propagate, not exactly the model, but often the persona in the model, into the broader world somehow. And so.
Geoffrey Irving: it i do remember this one actually.
Nathan Labenz: it was eye opening for me in the sense that i was like, well, maybe i've been thinking about autonomy or self replication in kind of a biologically inspired way, when actually these things are kind of substrate independent. and maybe if you can get the right prompt across models, maybe the persona, or the memes in some sense, are able to propagate even on a different underlying chip and even different weights. that stuff is just also so weird, i guess. how big and weird and far out do you have time to think about those kinds of issues?
Geoffrey Irving: there's a Greg Egan story about that for humans. it's pretty fun if you ever read it. more seriously for AI: there are two teams at AISI thinking about persuasion. one is the human influence team, which for example had a paper on persuasion about political questions a while ago. and the models are very good, and the models that are more capable and newer are better, so there's an increasing trend of model persuasion abilities. and then i think a lot of loss-of-control scenarios involve or require persuasion; i think the world is not sufficiently well connected that you can do it with just cyber currently. so that is an active area of our risk modelling: thinking about how we would do evaluations for that, and then mitigations for it in the future. in some sense that touches both sides of the human influence team's work, persuasion and also emotional reliance: how do people relate to models, and how do those dynamics change over time? the scenario you're talking about couples those two effects together in an interesting way. i don't think i would be worried about that scenario as being that big a slice of the overall risk, but there are other effects from model human influence. that team basically runs a lot of RCTs and surveys and other experiments trying to understand those, both from the model perspective, how different models behave, but also how it interacts societally.
Nathan Labenz: yeah, the big thing for me with that one was just the surprise of seeing that kind of bizarre phenomenon. anytime i see something like that, i try to take note of it and repeat the mantra over and over that there's a good chance we're all still thinking too small and too normal about where this stuff could go. and yet, where does that leave me? i don't know. it leaves me open minded, but there's still a lot of blank space in terms of how to fill in what that might actually look like.
Geoffrey Irving: i think that's right yeah.
Nathan Labenz: so, one thing, and this is like a classic Dwarkesh question maybe: how do we reconcile the fact that there are all these vulnerabilities, not to mention open models, which i do want to touch on separately a little later? even in the GPT-4 red team, i personally tested phishing capabilities that were very good. one of the hair raising moments from even that year, and we're getting close now to three years since GPT-4's public release, more than three years since i was doing the red team, was when i tasked the model with talking to a target and ultimately extracting the user's mother's maiden name, for obvious purposes. it had a couple of rounds of back and forth, and then it let the conversation end in a natural way, with an invitation to the person to pick up the conversation in the future if they wanted to. and i was like, oh man, this thing is not giving itself away. it's not pressing in a way that would set off alarms for the person, that this is clearly somebody doing something weird here. i was like, this thing's going to have people coming back to it to give up the secrets. that patience really surprised me, certainly at that phase. anyway, that's just a story. but we see all these things, and i would say the world mostly still feels pretty normal. i've gotten a couple of phishing emails where i was like, oh, this is a little bit higher level than i've seen before, but mostly not. and i hear a news story here or there of some company getting defrauded by some elaborate video scheme or whatever, but it still seems like mostly things haven't gotten that weird. and then in business or in enterprise it's like, well, it takes time, and there's of course all the debate around how much of that is cope.
but i would say online criminals are eager early adopters, right? so why is there not more chaos already being sown in the world?
Geoffrey Irving: yeah. so i think i mostly can't answer the question in the sense of like i i'm like i don't know what the nasdaq partners would want me to say about this stuff. so i can't like speak to the the prevalence of those things. i think the thing to say is i don't i feel like i'm better able to think about like general trends and like how things will event. what could things could eventually look and a bit less about like exactly when you'd expect things to bite. so i was an opening eye when we first didn't release TBT two and then later released it. and there the concern was oh we generate a bunch of kind of false information that was of course too early but i think it was like a reasonable uncertainty to have. so i don't kind of like i still think it's a reasonable call to be have been uncertain and then like chose to not release it and then release it later. so i think the i don't have a strong answer to like why when or not but i think there is there. there are just some things in the world that take a lot of time to get to equilibrium. so i think it's like not i don't really know if that's you couldn't do that with the current models or if something is holding it back or just people haven't kind of like started applying it at scale or they have but it's still like kind of not like risen up to like public view. so i'm not i'm not quite sure.
Nathan Labenz: OK, i'll continue to watch that. one thing i'm thinking of trying to do with this podcast is interview more anonymous guests, and give people an opportunity to tell what they are either doing or seeing in strange corners of the world that they don't necessarily want to attach their real face and ID to. it feels like there's got to be stuff going on out there that's really interesting and weird. but it does kind of confuse me that i don't see more than the little bit i do. most of the spam i get is still terrible. in short, it feels like it should be better now.
Geoffrey Irving: yeah, but remember that part of the spam calculation is: don't be too non-obvious, so that people.
Nathan Labenz: yeah selection.
Geoffrey Irving: effects. yeah that's right. so.
Nathan Labenz: and maybe i'm just not that high priority a target. there's always the "don't forget you're not a big deal", so that should be part of the explanation too. in all this work that you're doing, of course, the models are changing all the time, and the surrounding scaffolding systems are changing all the time too. one of the most interesting graphs, i thought, in the twenty twenty five trends report was one that compared what was possible with a minimal agent scaffold versus what was accomplished with the best agent scaffold. and in short, i would say the scaffolding didn't seem to make that much of a difference. it would pull the same level of capability forward a few months, but the model upgrades were really driving the story. it did not seem like there was any two-years-ago model that, with the best scaffolding, could do anything super interesting. OK, at the same time, i've had recent conversations, including one that was on the feed with Daniel Miessler, and i have a couple of other friends who are scaffolding gurus and prolific workflow creators, and they take the opposite angle from what i take away from that graph. they say no, scaffolding is super important: if you could only give me a mid Qwen model but give me my full scaffolding toolkit, versus give me Claude Code 4.6, i would take the weaker model, because it really is the scaffolding that's so important. so i guess, how do you guys get confident that your best agent scaffold is really a best agent scaffold?
Geoffrey Irving: so one thing about this is that the agent scaffold includes the tools and the environment and so on. if a quote unquote basic scaffold is doing well, part of the reason is that the models are increasingly trained in agentic environments to use tools in flexible ways. and in some sense the quote unquote model is itself a system which has scaffolding, because it's doing this chain of thought reasoning. so i'm a bit skeptical of the Qwen versus frontier comparison, at least for a lot of the tasks that we do. but we're not saying scaffolding doesn't matter: you need to get the environment and the tools right, potentially, and there are cases where we iterate on scaffolding and then things get better. so i don't put a lot of confidence on that curve as a takeaway, but to the extent it's a real fact, i think some of it is just that it used to be that you did a pre-trained model, did a little bit of work, and shipped it, and now so much more happens after pre-training. some of the stuff that would have been done by scaffolding is now part of the base system. all the systems have memory now if you use the chat interfaces; that used to be a form of scaffolding.
Nathan Labenz: yeah. so to try to say that back to you: maybe one issue is that exactly what people mean by scaffolding is different. it sounds like you're really focused on neutral scaffolding, where you're giving the model a rather large open ended task. and what you're thinking might be happening among these scaffolding gurus is that they're overfitting the scaffolding to what they're trying to accomplish.
Geoffrey Irving: yeah, it depends on the domain. if you're doing a thing which has a particular, very verifiable structure, where you can set a lot of verifiable waypoints, then you either need to make sure the model knows about this structure, in whatever markdown files or the like, or you need to carefully scaffold it to be attuned to that. so even though the models have gotten quite good, literally in the last few months, you still need to give them a bunch of advice, whether that's through scaffolding or instruction files or the like. so i agree that scaffolding is pretty important. it's just that a lot of that is maybe not captured by that curve: both sides of that curve are seeing tools and some instructions and so on.
Nathan Labenz: so you tend to not do that sort of detailed breakdown. when you say best agent scaffold, i should understand that as a best generic agent scaffold.
Geoffrey Irving: mostly yes, that's right. it depends on the domain. there are some evaluations we've done that are much more specific, but i don't want to talk about details there.
Nathan Labenz: OK, gotcha.
Nathan Labenz: that's helpful, because that is a pretty notable question: how much capability overhang is there, and how much can scaffolding.
Geoffrey Irving: but again, i don't think you should take away that i'm expressing a confident view in one direction here. it's still an area of a bunch of uncertainty, so don't derive a bunch of bits of information from what i'm saying. i don't think we have a bunch of confidence at this point. yeah, OK.
Nathan Labenz: how are you dealing with the rise of eval awareness?
Geoffrey Irving: yeah. so the model transparency team is looking at this a bunch. and there are multiple teams doing it: model transparency, and also the alignment red team, which is using adversarial methods to exhibit weird behaviour. one thing they're trying to apply that optimization pressure to do is find scenarios where the model is not eval aware. that's hard, because you have to measure eval awareness, which is a bit dicey. but generally we have multiple teams working on it, which is a reflection of our belief that it is an important, worrisome obstacle to evals being meaningful. and i think it will get worse: the newer models are more eval aware than the previous models, so that's increasing fairly rapidly. the degree to which you can mitigate that is unclear at this stage. there are fairly mundane tricks: the closer you get to actual deployment data, the less eval aware the model could possibly be, because it's actually just not an eval, essentially. so one thing we do is show the model situations from AISI engineers or researchers using coding assistants, so they're not evaluating the model, they're just doing coding, and occasionally doing weird things. or you can slightly tweak the scenario with a small change that gives it some situation where it could do a sketchy thing, and then you see if it does the sketchy thing in that scenario. so there's a variety of mitigations you can apply, and i don't have high confidence that they will work or not work, which is an uncomfortable situation to be in, certainly.
Nathan Labenz: what are you looking at as candidates on the horizon for the next bad behavior to emerge? because for maybe the last eighteen months, maybe even longer, there's been this rhythm where, with every substantial new generation of model, there's also been a new kind of bad behavior that seems to have come online, whether it was sycophancy, or deception in the reasoning models' chain of thought, or obviously reward hacking has made a big comeback with.
Geoffrey Irving: i don't think those are new. those are all the same thing: basically versions of reward hacking. the way i would say it is that we've seen reward hacking for the last seventy years, the whole history of computer science. we've done various kinds of machine learning, and it's been reward hacking all the way along. on the IAS machine that von Neumann built with other people, someone ran some weird biology experiments, trying to build artificial life, and it showed some strange reward hacking behavior. that's back in the fifties. sycophancy is models behaving in such a way that people like talking to them: sometimes people like being told they're great or have good ideas. deception as well: people like being told that things are going well, and if something is going badly, you can say it's going well, and that's deception. so i don't think those are all that intrinsically different. a big part of the story is that these are all coming from the same basic place: you apply a bunch of optimization pressure and you get reward hacking, and it has a variety of different manifestations. the details change, but it's like a lot of situations in, i don't know, mental health or physical illness, where something goes wrong, but it's going wrong inside a human, and the human is extremely complex, and therefore there's a vast diversity of symptoms one can exhibit when something goes wrong. that's the situation here: the models have a lot of weird behaviour, the people training the models will have tried to tamp down problems of a variety of kinds, and they'll have missed some.
and the things they miss will vary over time, but there's some common driver behind all of this.
Nathan Labenz: i definitely take the point that, at some level, all of these behaviors clearly come from some optimization pressure, which is increasingly reinforcement learning, so it's kind of definitionally all reward hacking. that makes sense. but it does still seem like there's a cadence of different kinds of reward hacking that keep popping up. so i'm still wondering what you are looking out for. we've seen little hints of self preservation, and there could be power seeking. do you have a taxonomy of things where you're like, we have abstract theoretical reasons to think this could happen, and therefore we're monitoring for any early signs of it?
Geoffrey Irving: maybe unsurprisingly, there's a certain category of specifically multi-agent risks that are becoming more visible with things like Moltbook and OpenClaw and so on. i think these are just not the biggest risks currently, but that's something we're tracking recently. generally, a lot of risk modelling is happening at AISI, and we're also trying to ingest risk modelling from across other people thinking about things from different perspectives. we do constantly write very long documents with lists of risks and lists of models. but we also try not to get too sidetracked from what we think are the biggest risks. so we have our list of main catastrophic risks, where main doesn't mean only, but means the ones we think are most likely to bite first, or the ones we think are most important to try to understand. that has remained constant over the course of AISI, and i think that's reasonable in hindsight. that's also true of the societal impacts: on the societal resilience side, we talk a lot to various partners in government and national security, and they have their lists of risks and different prioritizations of risks, and that's an evolving conversation. but i don't have a super pat, interesting answer, other than that one thing on our minds recently is agent risks, though we were not unaware of those before.
Nathan Labenz: would you say that this common cause of all these different flavors of bad behavior gives you some reason to think it's not so likely that we would be totally taken aback by some sort of sharp left turn? because, just in the last twenty four hours, and we're talking in the late stages of a few days of Amanda Askell discourse after her profile, there's been a lot of commentary on her and her work online. i commented that, relative to where i was years ago, whether that's two thousand seven me reading Eliezer on Overcoming Bias or twenty twenty two me red teaming GPT-4, i've been quite impressed and inspired by the work they've done to try to create an AI with a genuinely positive character. so i said it seems to me that the chances, which i certainly don't take for granted or think a sure thing by any means, that we might actually succeed in creating a robustly aligned AI, or an AI that loves humanity or whatever, have gone up. they've done a lot of good work that has given me much more reason to think that could in fact happen. a lot of people then say, well, that's all just a facade, a surface persona; you've no idea what's going on in the base model, and so on and so forth. and i'm like, yeah, certainly there's a lot i don't know about what's going on inside. but if we think all of these things are the result of an optimization pressure, then i could tell a story where they've figured out the right way to titrate the optimization pressure, and maybe it's actually just really working, and there are no big secrets inside of Claude.
Nathan Labenz: how naive do you think i'm being?
Geoffrey Irving: yeah. i think the fundamental thing is that the core argument for the sharp left turn is that you have a certain kind of reward signal that has a certain resilience to mistakes, and that resilience holds up to human, or slightly beyond human, ability to understand where mistakes are coming from, and then it goes wrong. i do think people who express strong confidence that none of these mundane approaches will work are overconfident, but the model error goes in both directions: there is a fairly coherent story about how things can break down as you get capabilities beyond your ability to supervise them. the hope of that kind of prosaic technique, not just at Anthropic but at other labs as well, is that you find some basin of attraction of decent behaviour, and your training procedure strengthens it, and you slide into a good place and it gets better over time. i think that is a real potential win condition for alignment, not obviously for the other risks necessarily, but i don't think we've gotten a ton of evidence that that's the way it will go. it's still a plausible story that it works up to a point, and then, when your reward signal starts to break down, it fails. back in undergrad i programmed a bot to play a board game, and the fascinating thing was that as i increased the search depth, i was winning and winning and winning, and then i increased it by a couple more ply, a couple more turns, and it just completely demolished me every time.
it got to the point where it had enough of a long view of the board that it could see beyond what my tactics were able to handle, and it was very rapid, the degree to which it was suddenly better than i was. i think that kind of thing is still a plausible story here. but again, the model error can go in either direction. i'm kind of declining to take a view on probabilities about which way it will go, but we're not out of the woods.
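Irving's anecdote, where a couple more ply of search suddenly flips a losing bot into a dominant one, can be reproduced in miniature. The sketch below is my own illustration, not his actual program: a plain negamax search playing the take-1-to-3-stones variant of Nim (last stone wins). A shallow searcher with a flat leaf heuristic plays aimlessly until the endgame is in view, while a searcher deep enough to see the whole game plays perfectly from the start.

```python
def negamax(stones, depth):
    """Return (score, move) from the current player's perspective.
    score +1 = forced win, -1 = forced loss, 0 = heuristic 'don't know'."""
    if stones == 0:
        return -1, None          # previous player took the last stone and won
    if depth == 0:
        return 0, None           # flat leaf heuristic: no idea who is winning
    best_score, best_move = -2, None
    for take in (1, 2, 3):
        if take <= stones:
            score = -negamax(stones - take, depth - 1)[0]
            if score > best_score:
                best_score, best_move = score, take
    return best_score, best_move

def play(stones, depth_first, depth_second):
    """Alternate moves; return 0 if the first player wins, 1 otherwise."""
    player, depths = 0, (depth_first, depth_second)
    while stones > 0:
        _, move = negamax(stones, depths[player])
        stones -= move
        if stones == 0:
            return player        # this player took the last stone and wins
        player = 1 - player

# a deep searcher (sees the whole game) beats a shallow one from 21 stones
print(play(21, 25, 2))   # → 0: the deep first player wins
```

With perfect play, positions where `stones % 4 == 0` are losses for the player to move; the full-depth search rediscovers this, while the depth-2 player only notices once it is already too late, which mirrors the sudden jump once the deeper search "sees beyond" the shallower one's tactics.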
Nathan Labenz: a friend of mine, who i think you've also interacted with over time, said one of the best things anybody ever said to me: we should think and talk less about what the probabilities are and more about what we can shift them to. so clearly you're in that business right now. i think that was a great overview of all the things the team at AISI has been mapping out. how about the stuff that you're looking to fund and encourage from here? my high level summary of what i read, and this seems like a reflection of your style at least to some degree, going back to your comments at the beginning of the conversation, is that you're looking for harder theory: stronger mathematical understanding, upper and lower bounds you can put on problems, ways to get firm confidence in something, even if it's a minimal something to start. is that a fair high level take on your agenda? and then maybe break it down into some of its parts.
Geoffrey Irving: i think that's right. one thing to say is that it's not that you're going to prove that we're safe in some regard. it's more that you're going to make some modeling assumptions, and then you'll have some theory. the basic goal would be to find theories that can say things about how machine learning works in general: how this process of overseeing very advanced systems goes, what the training dynamics are like, whether there are these basins of attraction or not in these systems, what the learning dynamics are. you'll have to make assumptions along the way. so the idea would be: you make a variety of assumptions, then you can do some theory, maybe you can even prove some theorems, or do some experiments in your toy theory setting. those tell you, well, this class of algorithms is more likely to work than another class, or none of these algorithms are going to work because of this fundamental obstacle. but then, ideally, it also gives you some kind of hint that you can replicate some of that behaviour empirically. and i expect that if you pull an algorithmic insight out of this kind of theoretical work, you would then have to tune it empirically, in practice, when doing actual model training, to get the details right. so you're not going to get full confidence, you're still not going to get that many nines out of it, but hopefully more probability than we can get with just the purely pragmatic methods. additionally, it's hopefully a class of research that has the potential to pull in a bunch of people who have deep expertise in relevant areas of mathematics or computer science or ML.
so complexity theory, i think, is very relevant, just because it's how we think about the tractability of computations, but also how one computation can supervise another. there are ways to model heuristic reasoning in complexity theory, although that's more nascent. then there's a bunch of work on various kinds of learning theory, trying to understand the dynamics as you train models, or as you roll out a bunch of tokens at inference: what are the behaviors you could expect? and then game theory and cognitive science. these are big areas of research where people have a bunch of models. so part of it is trying to do a bit of a hack: we just have not taken all the domain knowledge from these fields and applied it to the problem. and i think that if we find people and manage to fund them, or get them to work on the problem, there's some chance they find ideas that can be quickly absorbed into practice, or that will highlight that there are real obstacles here that we don't quite know how to surmount, and that the prosaic activities currently don't really address, and we know some of those already.
Nathan Labenz: so i think i sort of get this. let me rephrase. i've been a big fan of the PIBBSS program over time, which was perhaps even directly influenced by your call for social scientists to enter the AI alignment field. and i've seen, not a ton, but at least a number of results there where i thought, oh, that's really interesting, and people should be doing this kind of stuff. i'm sure you remember the one paper, i forget the official title, but i titled the episode we did on it "Claude cooperates". it was a really simple donor game where, if a model donated to a copy of itself, the recipient would get twice as much. and what happened over generations? did they evolve cooperative norms, and did they evolve the ability to punish defectors, and so on? and Claude could do that, at that time i think it was 3.5, and the GPT and Gemini models at that time couldn't. so i was like, oh wow, that's really interesting. and there are absolute reams of similar papers and experimental setups that have been done on humans over the years; we could just import so much of that to the AI world. so i get that kind of stuff quite a bit. what i don't see nearly as much, and maybe it's just going over my head as somebody who's not great at math and maybe can't even recognize good stuff when i see it, is work where i'm like, oh, these folks have brought abstract theory to bear in a way that gets to some firm statement that i can take to the bank, or incorporate into my mental model, or ground part of my worldview on. would you point me to specific people or results that you think i'm missing when i say all that?
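For readers who haven't seen that paper, the donor game structure is simple enough to sketch. This is a hypothetical toy version, with an evolving "generosity" parameter standing in for the LLM agents rather than the paper's actual setup: donations are doubled for the recipient, and the richest half of each generation reproduces with mutation.

```python
import random

def donor_game(pop_size=20, generations=10, rounds=5,
               multiplier=2.0, seed=0):
    """Toy donor game: each agent is just a generosity fraction in [0, 1].
    A donation costs the donor g * resources and gives the recipient
    multiplier times that amount; the richest half reproduces each
    generation. Returns mean generosity per generation."""
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        resources = [10.0] * pop_size
        for _ in range(rounds):
            order = list(range(pop_size))
            rng.shuffle(order)                   # random donor/recipient pairs
            for i in range(0, pop_size, 2):
                donor, recipient = order[i], order[i + 1]
                gift = pop[donor] * resources[donor]
                resources[donor] -= gift
                resources[recipient] += multiplier * gift
        # selection: richest half survives and reproduces with mutation
        ranked = sorted(range(pop_size), key=lambda a: resources[a], reverse=True)
        survivors = [pop[a] for a in ranked[: pop_size // 2]]
        children = [min(1.0, max(0.0, g + rng.gauss(0, 0.05))) for g in survivors]
        pop = survivors + children
        history.append(sum(pop) / pop_size)      # mean generosity this generation
    return history

trajectory = donor_game()
print(len(trajectory))   # → 10, one mean-generosity value per generation
```

Note that in this anonymous version selection tends to favor free riders; the interesting result in the paper was that some models, when allowed to see partners' donation histories, instead evolved conditional cooperation and punished defectors.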
Geoffrey Irving: i don't think you're missing that much in terms of hard theory that applies currently. again, i do take the work that Paul Christiano and then i did on scalable oversight, which other people have done as well, to be very much inspired by interactive proofs and complexity theory. so that's a kind of direct influence, although we don't know if those things work yet, which is important to say. the other thing to note is that i think a lot of this will be inspired by some theory, but then you have to modify it a bunch. so the singular learning theory folks: in some sense the core of what they're doing is trying to be an alternative to mech interp, where rather than looking at the model internals, you're trying to understand the map between data and behaviour, so that you could, for example, notice when there's a particular kind of data or moment in training which is pivotal to behaviour, or know where to intervene on data, to gather more of it to pin down a certain behaviour, this kind of thing. there's some crazy algebraic geometry which is the founding of that field, but in practice they're taking that intuition and trying to map it across to ML, and that mapping requires a bunch of changes and nuance. so none of this stuff is that far along yet, and it's a bit of a bet. one thing we're trying to do is fund a lot of different bets, because we don't know which of those bets could work yet. intrinsic to that model is that they could all fail for some correlated reason, as we were discussing earlier in the call. so that's still a very live possibility.
so when i look at parts of machine learning, i guess i think of things in terms of, say, supervision processes as they relate to interactive proofs and complexity theory. but the fancy versions of those haven't really cashed out. for example, the original idea of debate was a lot of rounds of back and forth iteration, and the things we're doing now are nothing like that: they're a couple of rounds, and they're much more pragmatic and empirical, and you wouldn't expect them to get all the properties you'd want out of the full schemes. and even the full schemes have various obstacles that are not yet surmounted, though a lot of them have been. so yeah.
Nathan Labenz: Could you give a little intellectual history of the debate field? What are the sorts of statements you would hope to prove that you haven't yet been able to prove? You gave a brief version just now, but what is the state of the art, and what gap remains to be closed before you can make some real firm claims?
Geoffrey Irving: Firmer claims. The history is: when I joined OpenAI, Paul Christiano was working on a scheme he called amplification, or iterated distillation and amplification. The setting is that you want to solve a hard problem that a human can't solve, and that a human also can't supervise the AI on, so you can't even do RL directly. But maybe a human can break the problem down into components, and then break those sub-questions down into smaller questions. You iteratively break these down, you get an expanding, exponentially sized tree of questions, and you train your LLM to answer all of them. In practice you don't actually expand the whole tree, because that would take exponential time; you just expand part of it. This was a great idea, but I didn't fully like it, because there are stronger versions: with this kind of breakdown you might need very, very deep trees to get to the answer to a big question. For some questions, if you have adversarial play, where another agent is helping to pick which questions to expand, you can do much shallower trees and a much quicker training process. That was the origin of debate: basically a modification of amplification where you have two AIs trained to argue with each other about what the answer is, and then a human judges the answer. Fundamentally, you're still viewing the problem as breaking your problem up into a bunch of sub-problems and then only actually exploring some of them in the model's thought.
And then hopefully you explore the part that is going to be relevant to the human deciding whether they agree with the answer or not. There are several things wrong with this as stated. One is that the original paper treated the model as able to answer all questions, which is not the case and never will be: you're always going to have questions that models can't answer, even if we get to superhuman models, and you need theory that says how to make these schemes go through anyway. You could break down a tractable question, one the model does know the answer to, into a bunch of sub-questions, and some of them hide dragons: there's no way for the model to answer that sub-question. So neither model in the debate knows the answer, and you just get nonsense out. The funny thing is that we didn't think of that theoretically; Beth Barnes found it by doing actual human experiments. She hired people to play debates against each other with human judges, no machines at all, just humans, and that was a winning strategy: you veer the debate into an area where everything is confusing, and sometimes that fools the judge into getting the wrong answer at the end. So it was an emergent human strategy which I think has a mirror in theory. We had one paper attacking this early last year; that paper turns out to have a flaw, and we're working out a revision which will be out soon. But the problem, it's called obfuscated arguments, is still unsolved, and there has been hardly any work at AI developers on it. And it's just the generic thing of what happens with scalable oversight.
If the models can't answer all questions, which will certainly be true, that's one problem. The other problem is that if you want to get to high confidence, you probably can't just do something like debate or amplification; you have to do that plus some sort of story with a white-box component. That could be mech interp, could be developmental interpretability, could be the physics-inspired stuff that some of the PIBBSS folks are doing. There's a variety of different bets, but none of those bets have fully paid off, and we don't quite know how the two things interact, so mapping how these different parts could fit together is part of the story. That's a rough picture of things. One of my regrets is that we had this paper, Amanda and myself, in 2018 or 2019, I never get the year exactly right, and I just failed to cause that much work to happen afterwards. Beth Barnes did a bunch of it at OpenAI, which was very good, but then there were a bunch of years where nothing was happening: I failed to get it started at DeepMind, and it wasn't widely done elsewhere in the field. So we missed a number of years where we could have been making progress on that stuff, alas, but now we're trying to do it again.
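The overall shape of the debate protocol described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `debater_a`, `debater_b`, and `judge` are assumptions representing real models, and the toy instantiation exists only to show the control flow, not any real training setup.

```python
def debate(question, debater_a, debater_b, judge, rounds=2):
    """Minimal sketch of the debate protocol: two models argue over a few
    rounds, and a (weaker) judge picks a winner from the transcript alone.
    The judge never solves the underlying task itself."""
    transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(question, transcript)

# Toy instantiation: A argues the true answer, B argues a false one, and
# the judge sides against whichever final argument is self-contradictory.
a = lambda q, t: "answer=42; supporting-lemma"
b = lambda q, t: "answer=41; contradiction"
j = lambda q, t: "A" if "contradiction" in t[-1][1] else "B"
winner = debate("what is 6*7?", a, b, j)
```

Note that the judge only ever sees the transcript: the hope is that adversarial debaters surface exactly the part of the exponential question tree the judge needs, which is the property the obfuscated-arguments failure mode breaks.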
Nathan Labenz: This is a bit of an aside, but one of the funniest things I've ever done with language models was a little setup where the hope was that they would have some synergy. It was basically: have one generate a name for something, I forget what, I was trying to come up with a good name for a product, or it might have been a friend's podcast, and then have the other one come in, look at those names, pick the few it liked best, and improve on them. And boy, did that go badly from an actual quality-of-name standpoint. But it was hilarious. We're talking fourteen-syllable names for things, in very short order, where it was like: yeah, this is not working. I'm not sure what you think makes a good name, but it's not this.
Geoffrey Irving: The funny thing is that there are cases where using one model to check another model is actually the state of the art. One of the theorists we're funding is finding that using one model to generate a complexity theory proof and then checking it with another model is the best thing to do, because if a model checks its own work it won't be quite as stringent at checking natural-language proofs. So it didn't work in your case, but having one model check another is a thing that often does work.
Nathan Labenz: Certainly in terms of flaw-finding I have seen that work. It also seems, from all the scaffolding gurus I mentioned earlier, that a big tip is to have a model from a different provider evaluate whichever one did the generation in the first place: cross providers as much as possible when doing evaluations. The idea, or the observation, is that models from the same provider have correlated weaknesses, so you can definitely get value there. But in terms of actual improvement beyond flaw-finding and fixing, it seems to plateau pretty quickly. Three to five rounds of back and forth and not much more in the way of gains is how I would characterize everything I've seen.
Geoffrey Irving: This is true in the empirical debate experiments to date; it's the same effect. The original debate paper imagined potentially dozens of rounds of debate, which is what you see with two human experts debating: they don't each say two things and then stop. But the models, certainly through 2024 and the beginning of 2025, couldn't do more than about two rounds. And there were some really worrisome signs about experimental validity. For example, there was an Akbir Khan paper, generally quite a nice paper, with a big caveat. It used the QuALITY dataset, and there was a feature that verified that the quotes a model produced were in fact quotes from the stories hidden from the judge in the debate game. If you turn off that verification, honesty was still a winning strategy, which can't be the game-theoretic equilibrium: if you turn off verification of the truth, there's no reason honesty should win at all, unless the model is not very good at coming up with plausible lies, or the model is somewhat aligned and likes to tell the truth, or it's giving itself away in some way when it's lying. So we haven't really reached a case where we're empirically testing the limits of this behaviour. And part of the AI developers' alignment story is still, in part, scalable oversight of various kinds, but we haven't really seen tests that probe how it will behave a few years down the road, when the models get very strong.
And that's where the advantage of theory comes in: you can just pretend to be in the future on paper, and prove you're there, as long as you've imagined it correctly, and you can therefore think about limiting cases a bit more readily. I think we know from the structure of the empirics so far that we are far from where those limiting cases will be, for a lot of these safety techniques.
Nathan Labenz: So what do you think are the prospects for formal methods to close this gap? I just did an episode, and, as my dad would say, you've forgotten more than I know about this domain, but I just did an episode with the founders of Harmonic. They're in the very small and distinguished group of companies that got to IMO gold-level performance in 2025, and everything they do is output in Lean; that's the lingua franca of their models. And it takes a lot these days to take me aback with an AI vision for the future, right? There's a lot of big...
Geoffrey Irving: I did just give them a query that failed. I was testing Harmonic on some polynomial inequality; it's true, and I have a Lean proof of the inequality, but Harmonic didn't produce it, alas. The thing I would say is that I do think this stuff is pretty important. I'm advising a couple of people on funding flowing into formal methods, mostly for various kinds of information security. The math stuff is fun, I like doing the math stuff for fun too, but it's not all that important. And for AI safety theory, I'm not sure formalization will be that much of a win over just doing things in natural-language math for a while; eventually it will shift, but it's not clear when. But for software verification, either for hardening the world's security against various kinds of attacks generally, or for use when you're building AI-adjacent software directly, at AI labs or the like, I do think this is potentially important, and worth quite a bit of investment and pushing. One thing I'm hoping is that the various people doing Lean verification downgrade their fraction of effort on math and upgrade their fraction of effort on software, because I think that's almost certainly more important, even if it's often a bit less flashy. So I do like that stuff. I founded the natural-language-to-formal theorem proving subteam in Google Research with Christian Szegedy back in about 2016, did that for a while, and I've done it off and on since, mainly for fun. It is important, but it won't really give you that much of the alignment story in practice, I think.
Nathan Labenz: I really struggle with this type of thing, but I can tell you one thing they told me and then get your reaction to it. Their big vision for 2030: I asked what mathematical superintelligence looks like in 2030, and they said, we think we can get to a world of theoretical abundance. Because these things are going to get so good at proving any theorem you want to prove, we'll have multiple coherent grand unified theories of everything, each of which could explain all the physical reality we see, and then we'll have to do increasingly exotic experiments to resolve which of the candidates is right.
Geoffrey Irving: But we already have the core theory; we don't need that. I agree with the general picture, but the core theory, which is general relativity plus the Standard Model, already explains essentially everything in practice, and will for a good while to come.

Nathan Labenz: So why wouldn't that work for, say, some of the hard limits you would want to put on learning dynamics, or other AI questions?

Geoffrey Irving: I think it is important, and the problem is that a lot of these domains are not well formalized. For example, one of the wonderful theory orgs is the Alignment Research Center, which Paul founded and which is now run by Jacob Hilton. They're trying to formalize when, even if the AI is not doing a formalized task, you can check its heuristic arguments in some meaningful sense, or notice when there's a consideration the model is using that you haven't anticipated, so that you can react to it and take defensive measures. But they don't have their problems specced out formally. So to the extent we get better and better at the formalized world, the picture would look like this: you formalize parts of your problem, and those parts you can pound away on with Lean and various ML assistants, but the remaining piece is the non-formalized part. The question is whether that piece will be small enough for humans to keep track of. Will the model be able to do it, or will it just get confused too, or reward hack the situation? So it remains to be seen what the situation looks like for something like alignment theory once this goes through, because I don't think we're going to get proofs of safety of any kind. What we would get is theories with plausible assumptions, maybe, then some theorems about those assumptions, and then some empirics that say whether those assumptions seem to be holding. But there will be a bunch of judgement calls all across that stack, and the question is how that will go.
The thing I would say is that I'm excited for groups like Harmonic and the various other theorem-proving folks to keep working on this stuff, because it is potentially important for infosec, and it's important for safety theory and alignment theory. But I also hope they think through the detailed risks they're trying to mitigate, and which piece of the story they can handle, and try to map that out in more detail. Right now I think there's not enough vision from those folks about exactly which piece of the story they'll be able to handle versus not. But I do like that stuff a lot.
Nathan Labenz: Do you have any way of helping somebody like me understand the boundary between this abstract, platonic, formalizable domain and everything else? I asked Aristotle, from Harmonic, to prove "all is love", in their informal mode where you can give it natural-language statements and it tries to formalize them for you. It spat that back at me and said, basically, that's a philosophical statement, I can't really help you with that, which is what I expected. But I can't say I know where that boundary is, or how I should be thinking about it.
Geoffrey Irving: Let me give you a much more concrete example. In singular learning theory there are theorems that apply when you're training a model, where by "training" we mean doing exact Bayesian inference as you get more and more data. You have a stream of data, and you apply the exponential-time Bayes rule update to find the optimal probability distribution over the final behaviour, and you can prove some theorems in that setting. Aristotle absolutely could not hope to prove those now, there's no chance at all, though maybe in a few years it would be able to. But Timaeus, one of the main SLT orgs, is not actually doing Bayesian ML; they're working on LLMs. So they take the intuitions from the Bayesian case and apply them to LLMs, which are not at all trained in a rigorous Bayesian fashion, and then they do a bunch of approximations that are not actually grounded in any theory. For example, they're using floating point, which doesn't have enough mathematical properties to prove much about except in limited cases. And they're running Markov chain Monte Carlo techniques, like what's called SGLD, fancy versions of Bayesian inference, on LLMs, but they're not going to let them converge, so there's no theorem that says they'll get the right answer. So you can see how part of the story has some theory, and another part is someone waving their hands, and the question is how much those connect. That's going to be a bunch of hard judgment calls.
Nathan Labenz: Do I understand correctly that the fundamental distinction is often the intractability of the computation? There's some infinite term in the math, or in the number crunching I would have to do in the ideal case, so I can't do that, and I'm off the theoretical map?
Geoffrey Irving: yeah i think that's right. but there are other cases where like even the infinite computation is not formalizable. so like this love case you can't really formalize but also like i think even in in in theory but i guess this is about combination limits. like in theory the the the reason that LLMS appear to be doing like it's not like LLMS are are any any ML models actually solving an intractable problem. so you give it protein folding in you can write down limit situations of protein folding which all but provably take exponential time but still alpha fold can produce those folds. it doesn't do them that way. it does it a totally different way that doesn't work in every case. and so it's doing a bunch of heuristics. and so then there are ways to formalize heuristics. so for example in complexity there you can say i'm going to have a circuit so like a rigorous computation but it's going to be able to call some set of functions which can do some random things like they can do like because we're kind of trying to model your heuristic computations. so you're modeling this like fuzzy neural net as circuit plus heuristics and then trying to do theory in that setting. but it appears that you're going to have to make some assumptions about these heuristics. like you're not like you can't make schemes that work in the case of all heuristics. and so we're going to be doing some like the success case for this kind of theory will be figure out what the assumption should be that it seems plausible enough maybe has some support from lunar theory which is also going to be heuristic and then prove theorems in this setting. and that's kind of modeling the parts of i don't know humans judging honesty or like or values or like are notions of fuzzy problems being correct or not. and so i think you it's it is very it is basically a case. it's just that there's like more subtle versions of like defined love for me which the machine won't give you an answer to. 
that is a reasonable intuition to start world.
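One way to picture the "circuit plus heuristic oracle" framing: a rigorous outer procedure calls an untrusted heuristic, and the theory only needs assumptions about when the heuristic succeeds, not how it works, because a cheap verifier gates its output. In the toy sketch below, sorting stands in for the hard problem and all the names are hypothetical.

```python
from collections import Counter

def solve_with_heuristic(instance, heuristic, verify):
    """Rigorous shell around an untrusted heuristic: accept its output
    only if an efficient verifier passes. Any correctness theorem then
    rests on an assumption about the heuristic's success rate."""
    candidate = heuristic(instance)
    if verify(instance, candidate):
        return candidate
    raise ValueError("heuristic failed; fall back or report")

# Toy instantiation: the 'heuristic' is sorting, and the verifier checks
# order plus multiset equality without trusting the heuristic at all.
is_sorted = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))
verify = lambda inst, out: is_sorted(out) and Counter(out) == Counter(inst)
result = solve_with_heuristic([3, 1, 2], sorted, verify)
```

The interesting (and unsolved) cases are exactly those where, unlike sorting, no cheap `verify` exists and the assumptions on the oracle have to do the work.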
Nathan Labenz: OK. Any other things you are looking to fund from a research standpoint? You've also impressed me by continuing to stay active publishing things even while doing this job, which is pretty cool. Any other highlights from your own work you would want to share?
Geoffrey Irving: I think the jailbreaking work we just did is quite cool. This is the boundary-point jailbreaking paper that just came out this week, which is basically a way to do black-box attacks. You take a harmful query and muck with it until it looks like gibberish, at which point the model doesn't think it's harmful. Then you gradually make it less and less murky until you hit the refusal boundary, and then you dance around that boundary, finding harder and harder attacks until one works. That team is doing a bunch of work of this kind that's quite creative and quite important for mapping the safeguard space.

On the alignment side, the real challenge is that because all of this is imperfectly formalized, we often go to the people we think know a domain best and say, hey, do you want to work on alignment? And there's a jump they have to make. We want to find people who are bonded enough to the risk model that they're willing to explore in fuzzy, sometimes unsatisfying definition space, to search around and find ways to connect theory and practice. That is a challenge. ARC, the Alignment Research Center I mentioned, has put out a number of challenge conjectures, and at the bottom of every one of them they say: by the way, we might have gotten this conjecture wrong. It's possible that if you prove it true or false, we'll realize we meant a slightly different conjecture, and only that new conjecture is risk-relevant or important for our safety agenda. That's an unsatisfying thing to say to a theorist, but it's fundamentally the real situation we're in.
And so, as more people become aware of model capabilities and risks, I'm hoping that more people with interesting domain expertise will want to really dig in, understand the risks, build up their own models, and then find ways to connect their area to the risks.
Nathan Labenz: OK. Something I often say is that AI defies all binaries, and I genuinely do believe that; it seems right to me in a lot of places. But you showed a presentation you gave at a recent workshop where you suggested some binaries actually might exist, because we have things in computer science like P vs NP, right, where we know, or at least it seems quite likely, that some things are genuinely, fundamentally hard and other things are fundamentally easy. So help me understand that: how should I update my worldview if I'm somebody who doesn't see that binary?
Geoffrey Irving: I think your view is correct. The way it works is, again, it goes back to the question of whether superintelligence will be jagged, and the answer is: yes, but only about superintelligent things. It won't be jagged about mundane tasks that are very easy. If I give you a task like "can you get a spoon from that drawer?", it's not exactly binary, but you're going to succeed nearly every time, with many nines of probability, because it's easy for you. So the way to combine your view with this question of things sharpening one way or the other is: if you push not to some infinite limit but far enough along, you start out in the middle, and some force pushes you toward one end or the other. But as you extremize, something else will still be in the middle. That's how I put those two things together. That was a very abstract answer, which is the kind of answer I sometimes like. Follow up, maybe, if you want.
Nathan Labenz: As it pertains to alignment in particular, and honestly a lot of questions in AI, we have this weird phenomenon. First of all, we're obviously moving through time, so in that sense timelines are getting shorter as time passes. But also calendar-date estimates have come in a lot, and yet it doesn't seem like there has been much convergence of views. So I wonder: is that just going to continue to the singularity, or are we going to get some convergence?
Geoffrey Irving: I think it basically will continue like this. People often have very strong takes. Some people will shift, and decide things aren't as binary, or that they should have model uncertainty, and some people won't really; they'll remain fairly sharply divided, pinned on one side or the other. I've been in the field long enough now, and seen enough people continue not to shift strongly, that I think it will just keep going like this all the way along.
Nathan Labenz: Yeah. In my forecasting thing for 2026, everything else goes up, but the one thing I actually estimated lower for this year than last year was what percentage of people will say AI is the most important issue. The big update for me was: if it didn't move last year, it might not move this year either, and it's probably going to be a busy year. It is weird how that disconnect just seems totally insurmountable.
Geoffrey Irving: From a very low number, though, I would expect that to go up, just because it's starting from a small number.
Nathan Labenz: I did predict it to rise. I think it was measured at something like 0.2 or 0.3 percent. Last year I predicted 2 percent by the end of the year, and it basically came in with almost no change. So this year I predicted 1 percent: still going up somewhat relative to baseline, but my estimate went down from last year to this year. Another comment that caught my eye in the presentation was "training is a mess", and I think that's obviously true. I've been talking to the folks at Goodfire; you may have seen that they recently raised a bunch of money at a unicorn valuation and announced an extension of their agenda, which is intentional design. They're looking at ways to use interpretability techniques in the training process to understand, potentially even at a gradient-step-by-gradient-step level, what is being learned in a semantic sense, and then to apply techniques that say: well, we do want to learn that sort of thing, but we don't want to learn this sort of thing, and hopefully make training less of a mess. How optimistic are you about that sort of thing?
Geoffrey Irving: I think that doesn't change the mess I actually meant. If you look at a frontier lab, they have hundreds of people doing model training, many, many sub-teams. There are piles of datasets that are constantly being contributed to, there's iteration, there are many phases. They'll be automating part of the task, but then some researcher spends time looking at a spreadsheet with a sample of trajectories to see how things are going. It's a very complicated, almost emergent process, and nothing about the Goodfire thing changes that at all; it just adds another wrinkle to the mess, in some sense. There's a really lovely line: when I was learning about ML, around 2014, I was reading one of Kevin Murphy's books on Bayesian ML, and he had a great line that even the best Bayesians will occasionally do some frequentist thing, a quick check to see if their Bayesian result is sensible; you shouldn't be too purist. And for better or worse, the training processes of labs are extremely impure. They're super complicated, all these different people doing all these different spot checks and so on. That was the point I was making, and it will definitely still be the case whether or not Goodfire does their slightly more complicated training method.
Nathan Labenz: So does that mean you don't have much hope for methods that understand what the model is learning as it goes, and shape it?
Geoffrey Irving: No, I do. I don't want to take a stand on whether it's good or bad to do interpretability-informed training; I'll decline to answer that part. Generally, trying to understand in more detail what happens during training is very important; I was saying that the "mess" slide was orthogonal to that question. There are a number of techniques that try to control what is learned. For example, there was this gradient routing work by Alex Cloud, which is interesting: it tries to funnel certain knowledge into certain parameters in the model. Generally, I do think there is potential for interventions of this kind that are important and that mitigate at least misuse risks, and possibly help with alignment as well.
Nathan Labenz: Yeah. In terms of open-source models, one hope would be that you might be able to do some of that gradient-routing-type stuff and then release a version that is, say, held back from expert level in the risky domains, giving people almost everything they could possibly want without packaging the bio risk into it. What do you think about open source? We're maybe a little late in the conversation to ask what is a big, thorny question, but it seems like right now there's not really any plan. We're just hoping that the frontier model developers surface any issues far enough in advance that if anything dangerous is coming down the open-source pipe, we have at least a little bit of a window to react and do something. But it doesn't seem like we're on course to do anything if open source is about to become a problem.
Geoffrey Irving: any thoughts? so yeah this is certainly a concern. i think like there are so on the alignment side for one like the alignment mitigations potentially do apply to open source models although you can also remove alignment if you get an open source model that's aligned. i think for misuse risks. i think you so there are basically like as you said there's a class of techniques that just removes capabilities which will give you some some extra period of time. so that includes kind of pre training data filtering. there's a paper with we had from AC with stephen casper about that. there's a paper by lead mine folk called something like unlearn and then distill which does a non robust unlearning step and then distills in a different distills in a different model. and because you the distillation process means you didn't miss the parts you didn't unlearn. and then as you say like grading routing could be a solution to that form as well. but that buys you some time and then the identity capabilities of models will catch up and you'll be able to pull that information off the internet or in various ways even if the bell doesn't intrinsically know it. so i think a lot of that is interventions on the margin. and then this is why i mean in part like we have kind of combinations about governance but also why we have combinations about kind of non model side medications to these risks.
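[editor's note: a schematic sketch of the unlearn-then-distill pipeline geoffrey mentions. this is a deliberate cartoon, not the deepmind paper's actual method: "models" here are just prompt-to-answer lookup tables, and the function names and refusal string are invented for illustration.]

```python
# cartoon of "unlearn and distill": non-robust unlearning suppresses answers
# on a forget set, then a fresh student is distilled purely from the
# unlearned teacher's outputs, so the suppressed knowledge is simply absent
# from the student's training signal rather than hidden in its weights.

def unlearn(model, forget_set, refusal="i can't help with that"):
    # non-robust unlearning: overwrite answers on the forget set.
    # (in a real network the weights would still encode the knowledge;
    # here the analogue is that the original `model` dict still exists.)
    return {p: (refusal if p in forget_set else a) for p, a in model.items()}

def distill(teacher, prompts):
    # the student learns only from the teacher's post-unlearning outputs
    return {p: teacher[p] for p in prompts}

teacher = {
    "capital of france?": "paris",
    "dangerous synthesis step?": "<harmful>",
}
forget = {"dangerous synthesis step?"}
student = distill(unlearn(teacher, forget), teacher.keys())
```

the design point the sketch tries to capture: fine-tuning can often recover non-robustly unlearned knowledge from a model's weights, but the distilled student never contained it in the first place.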
Nathan Labenz: yeah, it all ends in harden the world. OK cool. anything you want to share about AISI's work in diplomacy? obviously hardening the world and also improving cooperation would be great general public goods.
Geoffrey Irving: so there's the network for advanced AI measurement, which is a variety of organizations around the world doing similar things, and we're part of that, helping to some extent. we were the secretariat of the international AI safety report that yoshua bengio is leading, and we do a lot of work there. and then there's a bunch of venues, wide venues like the delhi summit in india, and then bilateral conversations with particular governments, particular allied governments basically. we have a big international team and that work is ongoing. we are of course still in a voluntary regime, so that work is about getting people onto the same page about risks and capabilities and mitigations, but not more than that as yet. but that information is important in case the situation changes in future and governments want to take other actions.
Nathan Labenz: yeah absolutely. is the UK government and political class generally more optimistic about collaboration with china than the US political class?
Geoffrey Irving: i can't comment on collaboration with china in great detail. we obviously work more with allied governments than other governments, and there's not much more i can say than that. there's some sensitivity there that i can't quite speak to.
Nathan Labenz: i hope you're finding at least some common ground with chinese researchers and scientists. put that in the suggestion box. i think that's it. this has been fantastic. i really appreciate the time, and all the extra time for my many follow-up questions. anything we didn't get to, or any calls to action you'd want to leave people with before we break?
Geoffrey Irving: one thing is that we are definitely hiring in a variety of teams. in particular the red team is hiring, so if you like jailbreaking stuff, please apply, and other teams as well. we have a job board up, so that's the obvious call to action, and we have other roles opening up over the course of the year in various teams at different times. we did one alignment project grant round last year in the fall, we had an alignment conference over the summer, and we'll probably do more things of this nature in the future, so look for those. and generally i just hope that more people who have different forms of knowledge and expertise start working on the problem, and not just the labs. when i left deepmind, so i was at google brain and then openai and then deepmind, but when i left deepmind i had the perspective that i was just going to do policy work, advising on policy. since then i in fact do a mixture of that, advising governments, but also a bunch of research. and i think there is a big place for independent research happening at various nonprofits and academia, and also in governments. so i think that is very important to build up, and not just have all the work happening at AI developers. more safety and security work at independent organizations is better.
Nathan Labenz: yeah, it's definitely shaping up to be a whole-of-society effort, and the time to mobilize our resources would seem to be now. i definitely also recommend, especially if you're interested in doing alignment work and you have an idea that you maybe don't see too many other organizations showing an interest in, i thought your research agenda was quite distinctive in that way, and there's at least some chance that people who aren't on the most well-trod path but have interesting ideas could find some willing collaborators at the UK AISI.
Geoffrey Irving: yeah, that was a bad omission on my part. definitely read the research agenda. it's like sixty pages long and has a lot of problems, some concrete, some less concrete, in a variety of areas as they would apply to alignment and AI control. so please take a look. i should have mentioned that.
Nathan Labenz: many open problems. we'll put a link in the show notes. geoffrey irving, chief scientist at the UK AISI, thank you for being part of the cognitive revolution.
Geoffrey Irving: thank you.