The Great Security Update: AI ∧ Formal Methods with Kathleen Fisher of RAND & Byron Cook of AWS
Kathleen Fisher of RAND and Byron Cook of AWS discuss how automated reasoning and formal verification can secure critical software and AI systems, from infrastructure and coding models to policy checks and future agentic guardrails.
Watch Episode Here
Listen to Episode Here
Show Notes
Kathleen Fisher and Byron Cook dive into automated reasoning and formal verification as tools for building truly secure software systems. PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more about the MATS Summer 2026 Fellowship and submit your name to be notified when applications open: https://matsprogram.org/s26-tcr. They explain how formal methods can harden critical infrastructure against AI-enabled cyberattacks, and how assumptions, specifications, and proofs combine to deliver real security guarantees. The conversation explores using these techniques to train coding models, enable a “great software rewrite,” and power AWS’s new automated reasoning checks for AI agents and policy compliance.
Sponsors:
MATS:
MATS is a fully funded 12-week research program pairing rising talent with top mentors in AI alignment, interpretability, security, and governance. Apply for the next cohort at https://matsprogram.org/s26-tcr
Tasklet:
Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
Agents of Scale:
Agents of Scale is a podcast from Zapier CEO Wade Foster, featuring conversations with C-suite leaders who are leading AI transformation. Subscribe to the show wherever you get your podcasts
Shopify:
Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive
CHAPTERS:
(00:00) About the Episode
(04:52) AI Reshapes Cybersecurity
(10:16) Formal Methods Foundations
(17:46) Security Properties Assumptions (Part 1)
(21:27) Sponsors: MATS | Tasklet
(24:27) Security Properties Assumptions (Part 2)
(28:31) Helicopter Formal Verification
(38:15) Proof Confidence And AWS (Part 1)
(41:52) Sponsors: Agents of Scale | Shopify
(44:40) Proof Confidence And AWS (Part 2)
(50:33) Automated Reasoning For Policies
(01:04:39) Generative AI Meets Verification
(01:19:42) Securing Future AI Systems
(01:31:19) Agentic Guardrails And Governance
(01:40:44) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
Introduction
Hello, and welcome back to the Cognitive Revolution!
Two quick notes before we jump in:
First, applications for the MATS 2026 Summer program are now open. At NeurIPS this year, MATS fellows presented many papers at mechanistic interpretability and alignment workshops, and 6 MATS fellows had spotlight papers, so if you're interested in a career in AI safety research, you should definitely consider applying. I'm also excited to say that I've booked MATS Executive Director Ryan Kidd for a full episode early in the new year, so listen to that and be prepared to submit your application by the January 18th deadline.
Second, we're planning to record another AMA episode, probably in the first week of 2026. Visit our website, cognitiverevolution.ai, and click the link at the top to submit your question, or feel free to DM me on your favorite social network.
With that, today, we're diving into the world of automated reasoning and formal verification of software.
My guests, Kathleen Fisher and Byron Cook, are legends in this underappreciated but increasingly important field.
Kathleen famously led the High-Assurance Cyber Military Systems or HACMS project at DARPA, is currently the Director of the Cybersecurity Initiative at RAND, and starting in February, will take over as CEO of the UK's Advanced Research and Invention Agency, ARIA, which is often described as the UK's DARPA, though its mission goes beyond military technology, with the goal of unlocking scientific and technological breakthroughs that benefit everyone.
Byron, meanwhile, is Vice President and Distinguished Scientist at Amazon, where he's made a major contribution to cloud security by leading the application of formal methods to distributed systems at AWS, which, despite being arguably the world's biggest target for cyberattackers, has maintained an amazingly strong security record.
In all honesty, I don't think I've ever felt more out-classed by my guests than I did in this conversation. My own math career topped out at differential equations and complex analysis in college, and I've never been strong when it comes to mathematical proofs or formal logic.
Nevertheless, I'm hearing more and more in AI circles, not just about the need to harden critical infrastructure against cyberattacks before AI-powered hacking becomes ubiquitous, but specifically about the unique power of formal methods to deliver true information security guarantees, so I was eager to learn as much as I could, and honored that these masters were willing to answer my remedial questions.
We cover a lot of ground in this conversation, from the nature of the cybersecurity threats that AI poses, to the relationship between software specifications and the proofs generated by formal methods, to the critical role of assumptions, and how we can be confident that many low-level logical statements do in fact add up to system-level guarantees.
We also discuss how these methods can be used to create reward signal for coding models, and why, despite what we've seen from AI coding assistants to date, we should expect that LLMs will, over the next generation or two, achieve superhuman levels of code security and thus create the opportunity for a great society-wide software rewrite that could dramatically and durably reduce cybersecurity risks.
Finally, we explore how AWS is applying these methods to AI agents in the form of their new "automated reasoning checks" product. In short, by translating natural language policies, like a company's HR handbook or a city's zoning laws, into a set of variables and rules which a user can iteratively run test cases against and refine, and then mapping an AI agent's outputs onto this structure, this service makes it possible to use formal methods to check the agent's work for policy compliance. While the translation from natural language to formal rules does create some gray area that isn't present in the traditional process of formally verifying explicitly-coded software, this approach stands out to me as the strongest paradigm I've seen for implementing AI agent guardrails.
So much so, in fact, that toward the end, for the first time ever, I found myself worrying that AIs might one day become so consistent in their adherence to policy that we could lose the flexibility to make exceptions when official policy doesn't fit a specific situation.
That, ultimately, is a problem for another day. For now, the bottom line is that, with threats ramping up everywhere, the techniques we discuss today constitute one of very few strategies I'm aware of that could "really work", in the sense that they're capable of resolving specific critical vulnerabilities once and for all.
This is challenging material. If you're like me and find that some of this conversation is over your head, I would encourage you to pause and chat with your favorite frontier language model to clarify key concepts. I'm confident that if you do, your AI worldview will be stronger for it.
And so, I hope you enjoy this technically demanding but deeply optimistic conversation about the future of secure systems, with Kathleen Fisher and Byron Cook.
Main Episode
speaker_1: Kathleen Fisher, Director of the Cybersecurity Initiative at RAND, and soon to be CEO of the UK's ARIA, and Byron Cook, Amazon Vice President and Distinguished Scientist. Welcome to The Cognitive Revolution. Thank you.
speaker_2: Yeah, glad to be here.
speaker_1: So I'm really excited about this conversation. The world of AI and its impact on cybersecurity is obviously heating up at a tremendous pace. And something that a lot of people are worried about is, like, what is this going to do to the sort of offense-defense balance when it comes to securing our digital systems? You two are both, you know, world-leading experts on this topic. And I'm really excited to develop my intuition and help other people develop their intuition for kind of where we are, you know, what threats we have, what threats are emerging as a result of AI. And then these sort of, I would say formerly maybe little-known, or you might even say esoteric, but increasingly becoming quite centrally important methods that you guys have pioneered, formal methods, that can hopefully give us a qualitatively different level of confidence in the systems that we're building and how much we can rely on them. And hopefully ultimately, you know, create that sort of d/acc future that we can all live safely and happily in. So teach me everything, but let's start with maybe: how do you guys see the impact that AI is having right now on the cyber landscape? We hear it's coming, and Anthropic has put out a report, but I really don't know how to think about it. One way would be chatbots help people that aren't that capable become more capable, but maybe they're just kind of attacking people that otherwise had security by obscurity, and it's not really that big of a deal. Maybe we're going to see these AIs coming up with sort of cybersecurity move 37 type, you know, brand new zero-day exploits, and it's going to be crazy. Maybe the way to think about it is that hackers have historically been limited by the fact that they can't collaborate in very broad networks, so they're only able to attack a few things at once, and maybe AIs could make that a super parallelizable sort of thing. Yeah. What do you think is real and what is fake when it comes to AIs changing the threat landscape in the cyber domain?
speaker_2: I think, sadly, it's all of the above. I think that AI is providing assistance at all stages of the cyber kill chain and at all levels of expertise. It is making people who don't know very much know more, it's making people who know some know more, and it's making experts more effective, right? And it's making hackers able to work more at scale and more in parallel. So all the things that you said, yes. Colleagues of mine who are really powerful reverse engineers, like kind of the best on the planet, they are shocked at how good AI tools are at helping them do better, or how well those tools are able to do on their own. So everything from the script kiddies to the nation-state adversaries, AI helps everyone do better at cyber attacks, or whatever part of the cyber kill chain you're functioning in, AI helps you do better at that skill.
speaker_3: If I could, I agree with Kathleen. If I could give a slightly more optimistic take on it, though.
speaker_2: It also helps the defender. I mean, I think-.
speaker_3: But I think there's more. Like, one of the things that we've depended on for a long time is the absence of mechanisms. Humans wrote code, and then they didn't really know what the programs were doing, and they sort of hired some good people to kind of figure it out as they wrote it. But as we're using these agentic AI tools, there's a movement away from socio-technical mechanisms to formalization. And I think there's now a feedback cycle where, as security events happen, we can say, oh, we don't want to ever be in that situation again, and here's the property we could write down and then show of the code. So each incident, I think, has tremendously more blast radius on improving the systems of the future. But I'm speaking to the future. I'm not speaking to Detroit at this moment.
speaker_2: I mean, to answer your question, I think right now the software we have is riddled with vulnerabilities, and AI will make it easier to exploit those vulnerabilities for all the different kinds of attackers, from the script kiddies to the ransomware gangs to the nation-state attackers, and to do that either in a subtle way or in an overt way, or for an AI agent to run amok, right? All of those threats are made more intense by AI. I think that's the doom and gloom scenario. I think the same technology that is producing that doom and gloom scenario, though, can be flipped on its head to make the software less vulnerable. And we're talking about software, of course; we have hardware, we have software, and we have people, right? And all of those things are potentially affected by AI systems, and formal methods paired with AI can potentially help with many aspects of the system, not just the software. Although I think the software is the sweet spot where we have the most ability to relatively quickly make a positive difference.
speaker_1: So for somebody who, you know, like me not very long ago, had either not heard of formal methods at all or had like kind of heard of them and had a very primitive understanding, can we just do like a little one-on-one? Like, what are formal methods? What is the formal verification of software? What does it allow us to do that's qualitatively different?
speaker_3: It's the algorithmic search for proofs. And so what is a proof? It's a finite argument of what is true. You can reason about the infinite in finite time and finite space. And to do that, you need sort of chess-like rules. You need an accepted, small set of rules, and to make an argument, you have to religiously follow those rules. And those rules have evolved over the past 2,000 years, but we have, sort of as a society, settled on a few of them, like modus ponens and so on. So basically you follow these rules, and these days we can algorithmically search for those arguments. And generative AI is actually very good at helping us with that. So you can use generative AI to go help find a proof and then check it with one of these checkers that only follows those rules.
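To make Byron's description concrete, here's a minimal sketch (not from the episode; all names and data structures are invented) of a proof checker that accepts only two rules, citing an assumption and modus ponens. Finding the sequence of steps is the search problem he describes; checking it is purely mechanical.

```python
# A minimal sketch (not from the episode; all names invented) of a proof
# checker that accepts only two rules: citing an assumption, and modus ponens.
# Formulas are either atom strings like "P" or tuples ("IMPLIES", p, q).

def check_proof(assumptions, steps):
    """steps: ("assume", formula) or ("mp", i, j), meaning: from step i (p)
    and step j (p -> q), conclude q."""
    proved = []
    for step in steps:
        if step[0] == "assume":
            if step[1] not in assumptions:
                return False, f"{step[1]!r} is not an assumption"
            proved.append(step[1])
        elif step[0] == "mp":
            p, imp = proved[step[1]], proved[step[2]]
            if not (isinstance(imp, tuple) and imp[0] == "IMPLIES" and imp[1] == p):
                return False, "modus ponens does not apply here"
            proved.append(imp[2])
        else:
            return False, f"unknown rule {step[0]!r}"
    return True, proved[-1]

# Derive Q from P and P -> Q; the checker accepts it step by step.
print(check_proof(
    assumptions=["P", ("IMPLIES", "P", "Q")],
    steps=[("assume", "P"), ("assume", ("IMPLIES", "P", "Q")), ("mp", 0, 1)],
))  # (True, 'Q')
```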
speaker_2: I would say, at a slightly higher level, it's a set of mathematical techniques that let us prove properties of software in a way that we can check-- the machine can check-- so that we can make sure that we did all of our work properly. And those properties, it's not a one-size-fits-all thing. There's a whole range of different kinds of formal methods-based properties, and the gamut goes from things that are really kind of easy to use and easy to understand, to things that are harder to use but where you prove richer properties. So a type system, like the type system in Java, is actually an example of a formal methods-based property. When you run the type checker in Java, it proves that the way your program is manipulating data, you're only going to ever put integers together in a way that is consistent with integer-like behavior. You're not going to try to add an integer and a function, for example. That would be a type error. So if you've ever written a Java program and run the compiler, you've used a formal methods-based property, and the type checker has checked your work and is like, yep, you've gotten the seal of approval. You've used that formal methods-based approach correctly. You didn't have to know very much about how it was doing it. You just had to know the rules of how various kinds of types could interact together appropriately, and the compiler checked that those rules were used properly. And there's the escape hatch of typecasts, which you have to not use to avoid breaking the soundness of the type system. That's kind of the one end of the spectrum, where it's actually pretty easy to use and you get a relatively weak guarantee. Type safety is a pretty weak guarantee. At the other end of the spectrum is an interactive theorem prover where you're proving full functional correctness, where you have a specification of exactly what the code is supposed to do. Like CompCert is a verified C compiler that proves that the C code maps down to assembly code that has exactly the same semantics. And that was proven in an interactive theorem prover where the people doing the proof had to use a huge amount of brain power to prove that it was correct, and each step of logic was checked, using modus ponens like Byron referred to. That's the other end of the spectrum, where you're talking about PhD-level smarts to use. And there's a whole bunch of things in the middle where you have an intermediate level of property and an intermediate level of how much elbow grease you had to use to get that property proven. All of them are useful. Everything from the very simple properties, like type safety, to the very high-end properties, like full functional correctness, are useful. It's just a question of, when do we want to use this level of effort to establish what kind of guarantee? And choosing which properties and what level of effort for what kind of thing is part of the art of appropriately using formal methods. Does that make sense?
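As an illustration of the cheap end of Kathleen's spectrum, here's a toy type checker, a hypothetical sketch not drawn from the episode, that proves one weak property of a tiny expression language without ever running the program: you can never add an integer to a function.

```python
# A toy type checker (hypothetical sketch, not from the episode) in the spirit
# of Kathleen's Java example: it proves, without running the program, the weak
# but useful property that you never add an integer to a function.

def type_of(expr, env):
    kind = expr[0]
    if kind == "int":                                   # ("int", 3)
        return "Int"
    if kind == "var":                                   # ("var", "f")
        return env[expr[1]]
    if kind == "add":                                   # ("add", e1, e2)
        t1, t2 = type_of(expr[1], env), type_of(expr[2], env)
        if t1 == "Int" and t2 == "Int":
            return "Int"
        raise TypeError(f"cannot add {t1} and {t2}")    # the "seal of approval" is withheld
    raise ValueError(f"unknown expression kind {kind!r}")

env = {"f": ("Fun", "Int -> Int")}
print(type_of(("add", ("int", 1), ("int", 2)), env))    # Int: program accepted
# type_of(("add", ("int", 1), ("var", "f")), env)       # TypeError: rejected before it ever runs
```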
speaker_1: Yes, although I definitely have some follow-ups.
speaker_2: Please.
speaker_1: My software background is basically more on the JavaScript side, where I'm not even into TypeScript so much, so I've kind of written code where I can add an integer and a string, and if I do that, it's on me. And then I sort of do what I think most rank-and-file web developers, including many building on AWS, are doing, which is like trying to throw some code together, fire the app up in a browser, click around, see if it's doing what I want it to do. If it's generally doing what I want it to do, then I feel like, okay, I guess I did a good job. Obviously, I'm not in that process able to be anywhere close to exhaustive in terms of all the possible things that could logically go wrong. So I definitely have an intuition for how type checkers, for example, can just help reduce errors, because things that I wouldn't have thought of, these tools can sort of flag: hey, you're breaking a rule here, and if you do that, you could get yourself into trouble. That much I think I have a decent intuition for.
speaker_2: When you use something like JavaScript, you're getting memory safety, which is another kind of formal-methods-style guarantee, right? There's an abstraction that the language is providing for you, which is that when you store something in memory, say an integer, and you take it back out again, it's still an integer. It didn't transmogrify itself into a function. That's a memory safety property that is being provided by the language that you are using. If you are using C, that's a language that doesn't provide memory safety. So memory safety is another property that is very, very useful for not getting hacked, and it's a property that the language you're programming in is providing and the interpreter that you're using is enforcing.
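Here's a hedged sketch, using the open-source Z3 solver, of what that language-enforced guarantee looks like when phrased as a formal query (the variable names and the buffer size are invented): with the runtime bounds check in place, the solver reports that no out-of-bounds access can exist.

```python
# A hedged sketch, using the open-source Z3 solver, of the guarantee a
# memory-safe language gives you. Variable names are invented. With the
# language's bounds check in place, no out-of-bounds access can exist.
from z3 import Int, Solver, And, Or

buf_len = Int("buf_len")
index = Int("index")                        # attacker-influenced index
s = Solver()
s.add(buf_len == 64)
s.add(And(index >= 0, index < buf_len))     # the check a memory-safe runtime enforces
s.add(Or(index < 0, index >= buf_len))      # the out-of-bounds read we want to rule out
print(s.check())                            # unsat: no such access exists
# Remove the enforced check (the And(...) line) and the solver instead answers
# "sat" with a concrete out-of-bounds index -- roughly the C situation.
```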
speaker_1: So help me kind of understand if I have this sort of basic understanding that these type checkers can help me, you know, avoid shooting myself in the foot. And, you know, I don't even have maybe that great of a sense of like, the taxonomy of attacks. Memory type attacks could be one. In my homework for this, I've noticed that parsers are maybe not the root of all evil, but like a very common source of weakness.
speaker_2: Memory safety type, like input safety, yes.
speaker_1: So maybe give us a little walkthrough of the kinds of things that people tend to exploit, and then maybe the kinds of approaches that people take to address them. And this is where I get so fuzzy, because I'm like, how do I really know when it was fully correct? How do we get to that "fully"? Because the layperson who doesn't quite get it can wonder, well, what if there was a mistake in that process? How do I know that it's really, truly proven in a way that I can accept it and take to the bank? So that's a fuzzy question, but I'd love to kind of work my way up toward having the confidence that I believe is well justified. But I can't articulate yet why I can sleep soundly at night because this process has, you know, been executed.
speaker_2: Yeah. So, I mean, to some extent, cybersecurity is analogous to the security that you would want on your house. So 100% security is not actually a goal that you really should be going for. You don't want to have a house that no one can break into ever. Because then it's like you lost your house keys, you would have to go buy a new house. You want a house that has an appropriate level of security. Like right now, the level of security we have is equivalent to the doors are open, the windows are open, they're not even closed, let alone locked. Right? So like the question that you're asking is a little bit, like we can walk through the example that you're asking for, but like total security is like way, way, way beyond where we are. And we're looking for just the equivalent of let's close the doors and lock them as opposed to like everything is wide open, which is kind of where we are right now.
speaker_3: Yeah, the mental model I have is that there are a set of properties that you want to hold of your systems. And they can be motivated by security concerns, past things you've seen, and also where things are going. So you might want to know that no credentials are ever logged, that all data at rest is encrypted. And you can kind of layer on more and more of these, like that maybe for sovereignty, you want to know that certain data never leaves the region. And to establish some of those things, you'll probably need to establish, you might need to establish memory safety. You might need to establish various things, but maybe you don't, right? And then the way I like to think of it is like, well, what properties do we want to hold? And then that creates an envelope on which the system can operate safely. And then the programmer or the LLM can choose to implement that in different ways. And so I think that notion of like proving everything of the system, I don't think is really true. I think that you want to under specify the system. and then allow there to be different implementations of that program. Yeah, so that's kind of it. And then the other point I'll make is that you're always having to make some assumptions. Like when you prove a program correct, you typically assume the microprocessor is correct. Okay, well, we could go prove that. But ultimately, if we prove the correctness of the microprocessor, we're going to be making assumptions about the principles of physics, right? And we don't really know that they're always going to hold. So we believe, historically, gravity has held, and we believe it will continue to hold. But we don't really know that. And you're not going to be able to do a mathematical proof of that, per se. And if you do, it's going to be under more assumptions. So there's always assumptions all the way down. So you're only ever raising assurance. You're never getting 100% assurance.
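A minimal illustration, not from the episode, of Byron's "envelope" point: the specification below says only what must be true of the output, not how to compute it, so two quite different implementations both satisfy it (the spec and implementations are invented).

```python
# A minimal illustration (not from the episode) of the "envelope" idea: the
# property below says what must be true of the output, not how to compute it,
# so different implementations can all live inside the same specification.
from collections import Counter

def satisfies_spec(implementation, xs):
    ys = implementation(xs)
    # Spec: output is sorted and is a permutation of the input.
    return ys == sorted(ys) and Counter(ys) == Counter(xs)

def insertion_sort(xs):
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

for impl in (sorted, insertion_sort):                   # two very different programs
    cases = ([], [3, 1, 2], [5, 5, 1])
    print(impl.__name__, all(satisfies_spec(impl, list(c)) for c in cases))
```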
speaker_2: But it becomes confident enough-- Yeah, exactly. You have assumptions that all the air in the room isn't going to go all of a sudden into the corner. And that's good enough, even though theoretically it's possible that all the air in the room will go into the corner and you will suffocate. It's a kind of thing like that. I mean, I think there are properties like, if your system is memory safe, if your system uses parser generators and validates its inputs, things like that will make it so that the likelihood that your system is vulnerable to certain kinds of attacks is much, much lower. So things like that are the equivalent of having the practice of closing and locking your front door, closing and locking your windows. It doesn't mean your system isn't vulnerable. It just means your system is likely to be much less vulnerable, and the people who are attacking maybe will attack your neighbors instead of you.
speaker_3: So what we've done at Amazon is we've listened to customers about the concerns that they have, and then have chosen to focus on proofs of some of those things. And then the other thing we've done is we've worked with engineers who are developing systems and identified the parts that we're worried about or that seem hard to get right. And then we've sort of focused there. So we have identified pieces of the system we've done proof of, and then what we're seeing in time is that they're beginning to spread and start to touch. So you'll have proofs of two systems. But those proofs begin to touch each other and they begin to blur into kind of a system that has, parts of that have been proved. And then the big thing is like, you know, what are customers worried about? And then both internally, but also their like policies and networks and that kind of thing, so.
speaker_1: I'd love to go through some examples of, you know, customer concerns and the types of assurances that you're able to give customers based on all this work, but maybe before doing that, just one more kind of beat on the, say the C compiler. So when you said that is proven to the point where the semantics of the C code are proven to be reflected in the semantics of the machine code that it's compiled into, what assumptions are still lurking there? Like I sort of have an intuitive sense of what it means to say that the semantics are the same, but I'm unclear on like what assumptions that would rest on or sort of, you know, what's the equivalent of the air all going into the corner for that scenario.
speaker_2: I think that there are, I don't remember off the top of my head what the assumptions were. They're probably assumptions about the underlying hardware.
speaker_3: Yeah, for sure. They're going to assume the correctness of the ISA. I only know the work secondhand, but for sure, they didn't go prove that. They're assuming the x86 or ARM instruction set, and they have a model of what that is. That's what we do in our work, too, at Amazon.
speaker_2: The original version had assumptions about the parser that turned out not to be correct, but they went and fixed that in a later version.
speaker_3: Yeah. They'll also probably make some assumptions about the code, like, I don't know what they do for concurrency, if they do anything, so it's probably sequential--.
speaker_2: The original version was sequential. They were focused on avionics software, and avionics software is assumed to be single threaded.
speaker_3: Yeah. So yeah, there'll be some assumptions along those lines. And then there'll be some issues around calling the operating system. There'll be some assumptions around what the operating system's APIs do, and there'll be other third-party APIs that they'll have to handle. The nice thing in this work is you can model the outside world, and you can often model it with demonic non-determinism. So you can say, I don't know what this thing is going to return; it's just a formula that represents "I don't know." Non-determinism isn't free, but it's already paid for in the formal reasoning tool, so it's actually quite nice. And then you can refine that as needed. So one of the practical realities when you're doing these kinds of proofs is that you end up refining your assumptions around the third-party APIs, like the operating system's system calls.
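Here's a hedged sketch, using Z3, of the "demonic non-determinism" Byron describes: the third-party call's return value is left completely unconstrained, and the caller's property is checked against every value the environment could produce (the API name and bounds are invented).

```python
# A hedged sketch, using Z3, of modeling a third-party call with "demonic
# non-determinism": we assume nothing about what the OS returns and check the
# caller's property against every possible value. Names and bounds are invented.
from z3 import Int, Solver, If, Not, And

api_result = Int("api_result")   # the OS call's return value: completely unconstrained
# Our code clamps whatever comes back before using it as a buffer size:
clamped = If(api_result < 0, 0, If(api_result > 4096, 4096, api_result))

s = Solver()
s.add(Not(And(clamped >= 0, clamped <= 4096)))   # search for any violating return value
print(s.check())   # unsat: the property holds no matter what the environment does
```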
speaker_1: So can you give me a little more flavor of like, and it might be useful to do this with examples, you know, starting from customer concerns, or maybe not, you tell me. How do these, what do these proofs look like? I have the sense that they're like exhaustive mapping out of like every possible logical state, every possible corner case. I've heard estimates that like the length of the proof might be like an order of magnitude longer than the code that you're trying to prove the validity of, maybe even longer than that. What do these things look like as we kind of, if we started to inspect the proofs, what do they look like?
speaker_3: So there's a huge diversity of tools, and they make a number of different choices. It's like looking at all '80s bands and then trying to characterize what is a pop song, right? Some of them are going to be really heavyweight, some are going to be super simple, some long, some short. And so, like Kathleen was alluding to, there's a spectrum of tools, and some of them are really easy, some of them are really hard to use. And then with Gen. AI, maybe they're a bit easier in the future. And then there's tools that are tailored to certain domains. So for example, there's a technique called predicate abstraction and counterexample-guided abstraction refinement, which is good at proving certain kinds of properties, like API usage properties. And so they run on the code completely automatically; you don't need to do anything. They discover a proof, and then if you were to want to investigate it, it has a particular shape and sort of representation. But then if you're trying to prove, say, the Collatz conjecture, or, you know, like at Amazon, we've recently announced the proof of a new hypervisor. So we've announced the existence of a new hypervisor. It's called the isolation engine. It'll be in the AWS stack, Graviton 5. And then with the Isabelle theorem prover we have proved a bunch of properties of that. And those are quite intricate. So there's really a spectrum, and for any data point you want, I could find an example.
speaker_1: Let's do a couple relatively simple ones.
speaker_3: So for proving the correctness of the TLS handshake in s2n, which is an open-source implementation of TLS that we wrote at Amazon and that Amazon uses all across Amazon, it's all completely automatic. Underneath the hood, the tool that's being used is SAT with the discovery of inductive invariants. Because in cryptographic code the loops are actually usually quite simple, and they're not over complex data structures and so on, fairly simple heuristics actually work. So that's one example that's quite easy. Whereas in the proof of the hypervisor, my mental model is there's sort of a one-to-one ratio of engineer to formal methods expert on the team. And so I would imagine, I haven't looked at it in great detail, but it's on the order of that size, right? It's going to be roughly a one-to-one ratio of proof scaffolding to code. So those are a couple of examples.
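For flavor, here's a toy version, not s2n and with an invented protocol and names, of the "inductive invariant" style of argument Byron mentions, written with Z3: the invariant that application data is never sent before keys are established is checked to hold initially and to be preserved by every protocol step.

```python
# A toy version (not s2n; the protocol and names are invented) of the
# "inductive invariant" style of argument: application data is never sent
# before keys are established, checked with Z3 as a base case plus an
# inductive step over the protocol's transitions.
from z3 import Bools, Solver, And, Or, Not, Implies, unsat

keys, sent, keys2, sent2 = Bools("keys sent keys2 sent2")   # state and next-state

init = And(Not(keys), Not(sent))
# Protocol steps: establish keys, or send application data once keys exist.
step = Or(And(keys2, sent2 == sent),
          And(keys, sent2, keys2 == keys))
inv, inv2 = Implies(sent, keys), Implies(sent2, keys2)

def valid(claim):
    s = Solver()
    s.add(Not(claim))                 # no counterexample means the claim is valid
    return s.check() == unsat

print(valid(Implies(init, inv)))               # base case holds
print(valid(Implies(And(inv, step), inv2)))    # invariant is preserved by every step
```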
speaker_1: So can we go a little bit further into, again, like what are the sorts of statements being made you know, the sort of this software does what it's supposed to do and only what it's supposed to do is like a great place to get to. But I still want to have a little bit better intuition for how do all these low level statements that we can make aggregate up to that? And how do we know that they do? And I apologize if this is like a really ignorant line of questioning, but I do struggle with, I can sort of look at these individual statements and say like, okay, well this, we've gone down this logic path and like that one's okay. And we could do that a lot of times. And I'm still like, oh, but how do I quite know that we're actually proving the things that we mean to be proving?
speaker_2: I think it depends on the system. So seL4 is a separation kernel, right? It's one of the exquisite artifacts that was proven to be fully functionally correct by a team in Australia. And the main property of a separation kernel is that you can configure it so that you have different compartments, and the code in one compartment cannot interfere with what's happening in another compartment. And they specified in formal mathematics that property. And that was a massive undertaking. It's about 10,000 lines of C code that is the kernel, and then there's about a hundred thousand lines of Isabelle proving that the property holds and that you have integrity of the system. And the reason why that level of effort was worth it was because that is a building block that you can use in many systems. In HACMS, which is a program that DARPA ran from 2011 to 2016, one of the things that the HACMS performers did was build a helicopter, the Boeing Unmanned Little Bird, that used seL4 to separate the code on the helicopter into different partitions. One of the partitions was the camera partition, which an analysis suggested was not important from a security perspective: it didn't matter if the camera partition was owned by the bad guy, for the flight operations of the helicopter or for the ability of the helicopter to communicate with the ground station. Then there was the mission control computer, the thing that talked to the ground station, and there was the flight computer, the thing that actually controlled the ability of the helicopter to fly. They analyzed the system and modeled the overall architecture in another formal language called AADL, the Architecture Analysis and Design Language. And they wrote the protocol to talk to the ground station as a grammar. Instead of writing the parser by hand, they wrote the grammar for what the message format had to be. And they proved system-wide properties, like: the only way to communicate from the ground station to the helicopter was through a pathway that had to be authenticated and encrypted. So all messages that got from the ground to the mission control computer had to be authenticated and encrypted. Properties like that, which are kind of the high-level system properties that you cared about. And they reasoned about those properties in this architecture-wide description, which is a formal methods tool. And using that combination of formal methods tools, things like the parser tool that they wrote the description of the message format in, the encryption tool that they proved properties of the encryption in, and the seL4 microkernel, which is how they got the separation property, they were able to prove system-wide properties, like the only way to get a message to the mission control computer was through an authenticated, encrypted channel. They were able to prove those kinds of system-wide properties that you were asking about. They then tested this with the red team. They let the red team put whatever code they wanted in that camera partition, which they had done the system-wide analysis of to decide that it wasn't critical to the security of the overall system.
At the end of phase two, they let the red team attack the system while it was on the ground. But at the end of phase three, they let the red team attack and try to disrupt the operation of the helicopter while it was in flight with two test pilots on board. They were basically trying to crash not only their partition, but the helicopter while it was in flight with test pilots on board. The test pilots, not only did they survive, they couldn't tell that they were flying the high-assurance version of the helicopter instead of the normal version of the helicopter. The camera partition kept crashing because the red team could crash their own partition. They could basically do a fork bomb, and that would bring down their own partition. But the rest of the system would be like, oh, the camera partition went down. Let's restart that partition. So that's an example of how you can use a combination of different formal methods-based tools to get system-wide guarantees of behavior, right?
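A minimal sketch of the kind of architecture-level property Kathleen describes; the graph below is invented and stands in for the real AADL model. The check is that every path from the ground station to the mission control computer passes through the component that authenticates and decrypts messages, regardless of what happens in the camera partition.

```python
# A minimal sketch of the architecture-level property Kathleen describes; the
# graph below is invented and stands in for the real AADL model. The check:
# every path from the ground station to the mission control computer goes
# through the component that authenticates and decrypts messages.
architecture = {
    "ground_station":   ["radio"],
    "radio":            ["auth_decrypt", "camera_partition"],
    "auth_decrypt":     ["mission_computer"],
    "camera_partition": [],                # untrusted; may be owned by the red team
    "mission_computer": ["flight_computer"],
    "flight_computer":  [],
}

def all_paths(graph, src, dst, path=()):
    path = path + (src,)
    if src == dst:
        yield path
    for nxt in graph[src]:
        if nxt not in path:                # avoid cycles
            yield from all_paths(graph, nxt, dst, path)

paths = list(all_paths(architecture, "ground_station", "mission_computer"))
print(all("auth_decrypt" in p for p in paths))   # True: the system-wide property holds
```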
speaker_1: Yeah, one question I guess I still have there is, and that is an awesome story. And you kind of also glossed over the beginning part of the story where, as I understand it, basically before this whole project began, the red team hacked right in and--.
speaker_2: Right, and surprised the Boeing engineers. So that was the baseline of hacking-- and this is a military helicopter, right? They also did the same thing with a quadcopter. No one was surprised that the red team could hack into the quadcopter. The Boeing engineers were quite surprised that they could hack into Boeing's helicopter. DARPA had the red team attack the platforms unmodified to demonstrate what the baseline security was, to basically show that these platforms were hackable without that much effort, right? The red teams were able to hack in with like six weeks' worth of knowledge in the first place. And then after the modifications, the red team was following along throughout the whole program. So one concern people have voiced was, well, the red team didn't know very much about the system, so why would you think they would be able to hack in? The answer is that the red team actually knew a ton about the system, because they were following along through the program the entire time. So not only did the HACMS researchers do this for the Boeing Unmanned Little Bird, they also did it for a quadcopter. DARPA took the quadcopter back to DEF CON several years later, took it to the Aerospace Village, and had hackers try to break into it, and no one was able to hack into the quadcopter. So that kind of demonstrates that these kinds of techniques really pass the test of time. There are still assumptions, so it probably is possible to hack into it, but probably you have to attack through the hardware instead of through the software.
speaker_1: Yeah, well, my one main remaining question on that story is, at the end of that process, for somebody like me who maybe doesn't get it, there's no substitute for flying the helicopter, right? But for you and the team, how much lingering doubt did you have about whether or not the red team was going to get in? When you say something is proven, in your mind, is that 100% proven? Or how many nines of proven is proven? And was the flight purely for the doubters and the people that don't get it, or did you need it for your own sense of, you know, epistemic certainty? Like, did you have everything you needed before the bird even went up?
speaker_2: Yeah. So the flight was, I think, mostly for the demonstration quality of it: the people running the program were willing to risk the lives of the pilots. But I think everyone in the program knew that it was going to be fine because of the level of guarantee, and it wasn't just the proof, right? It was all of the surrounding reasoning and artifacts that went into producing the evidence that was in that proof. I mean, it probably is still hackable through the-- there are still-- as Byron says, there are always assumptions. The hardware, probably--?
speaker_3: There's always more to prove.
speaker_2: There's always more to prove, right. But what the attackers could do in that case, given where the red team was and what the red team was going to be able to do, was not going to crash the helicopter in that situation. Could somebody come and attack it in a different way? Probably. But, you know, could somebody throw a rock at the rotors of the helicopter and crash it? There are other ways of attacking the helicopter that would cause a problem, because helicopters are fundamentally not very stable, right? So is the risk to the helicopter through another means now higher than the risk of the software attack? Probably.
speaker_3: If I could propose a way of thinking about it, if we sort of separate the how and the what: the problem is undecidable, so this won't be possible, but imagine there's just this magical tool that can answer any question and can find any proof. Now, there's still an infinite set of questions you could ask of that tool. So it's like a query tool, right? Imagine a Google search: you go and you type a question and you get an answer. So imagine we have a procedure that can answer questions about your program. You have a program, you have a tool that can answer any question about it, 100% reliable. There's an infinite, if not uncountable, number of questions you could ask, right? And so imagine you knew all the answers to the questions you had asked. You had 5,000 questions. There might be a 5,001st question you want to ask that might reveal something that you actually didn't like about your system. So that's kind of one way to think about it. And then what's traditionally been a problem, and made this very difficult, is that going and actually establishing the answers to those questions is also rather challenging. And there are quite remarkable steps of progress now happening to make that easier. And that's sort of brought it to the forefront. But the other thing that's brought it to the forefront is that with generative AI and agentic AI, suddenly a much larger set of people are asking those kinds of questions. And so it becomes much more interesting now.
speaker_1: So yeah, tell me where we are today. I mean, you've been working on this at Amazon for a decade, right, and there have been a lot of wins. Maybe kind of sketch out some of the big wins. And you know, for a little developer like me, I just trust you guys are doing a good job; I don't think too much about the how. I know your, you know, big enterprise customers that are making decisions to move highly sensitive data into AWS are asking a lot harder questions than I'm asking. So how much of the roadmap, you know, has been accomplished? How much remains to be done? And what does the acceleration look like today with the augmentation that we're getting from generative AI tools?
speaker_3: Yeah. So let's say pre-2022, the things that we had done are, first of all, we have tools to help customers reason about their configuration: their virtual networks, their policies. So in AWS, and in many cloud compute environments, there's a notion of policy that's quite flexible. You basically have resources on the internet, and then you have policies that define when those resources can be accessed and the constraints under which that can happen. So for a secure system, the configuration of those policies turns out to matter quite a bit. And when I joined in 2014, customers loved the flexibility of the policy language and loved the flexibility of the virtualized networking environment. But they didn't love not knowing whether or not they got it right. And so one of the first things we did for customers was to build tools to help customers actually reason about their own systems. And those are embodied in many AWS features, but the flagships would be IAM Access Analyzer and VPC Reachability Analyzer, to reason about the policies and to reason about the networks. We also have S3 Block Public Access, and there are also tools in Inspector and Config rules in a bunch of places. So that's externally. And then internally, we identified a bunch of places that we wanted to get stuff right: the cryptography, the virtualization infrastructure, storage infrastructure for durability, the identity infrastructure. And we have proved a number of sort of key lemmas in that space. What's been very interesting lately is, so we proved the "correctness", and I don't think the audience will be able to see me, but they'll hear me: there are quotes around "correctness", of the policy interpreter. The AWS policy interpreter gets called over a billion times per second. Whenever you call an AWS API, we gather all of the policies related to the principal and the resource, put them together in a collection, and then we call this interpreter and we make a decision: yes or no, should we allow this action to take place? And that code is now proved correct. Now, what does correctness mean? Ah, it's correctness with regard to the semantics of the policy language that we defined in that previous tool, which is the basis of IAM Access Analyzer. That's an example of these tools beginning to connect up. We first defined the semantics of the policy language, then we provided customers with tools to formally reason about their policies, and then customers began to ask questions like, well, how do I know that you're actually interpreting that policy correctly? Well, okay, so let's go prove that. And so that's an example, and you kind of see that in other places, in the virtualization space and so on. So then what happened, very interestingly, in late 2022 and early 2023, for obvious reasons, is that suddenly people became very interested in generative AI. Coming back from the holiday break of 2022 to 2023, boards and C-suite leaders were wanting to develop stories for what their organizational generative AI story was going to be. And so a whole bunch of people began using generative AI tools. And the questions we began getting from customers evolved, and it became much more about, first, correctness of chatbot applications, and then increasingly correctness of agentic AI, multi-agent systems.
And so under the hood, we've been applying many of the same techniques to agentic AI and chatbots. A policy about the Family Medical Leave Act is very similar to an AWS policy on S3 buckets or VPC networks. And then the reasoning about the composition of multi-agent systems actually looks a lot like reasoning about the composition of microservices to show that S3 provides strong consistency as opposed to eventual consistency. So we're actually able to take a bunch of those techniques and adapt them into that world. So that's kind of where we're at. And then the other area that's quite exciting is that, as generative AI can actually help you find proofs, when you're asking a generative AI tool to help you find a program, you can actually ask it to find a proof at the same time. And so you can begin to check those proofs as the programs are synthesized. And the big challenge, and the challenge has always been, and I imagine Kathleen will agree with me, is that the hard part is figuring out what you want to prove. And so a lot of the work that's been going on recently, and you'll see this in Automated Reasoning checks in Bedrock Guardrails from AWS, but also Kiro, which is an IDE that Amazon has launched, is helping customers figure out what their specification is. What do you want to hold of the system? And let's iterate together on that to make sure you got it right, so you're actually proving what you think you're proving. And so that's where a lot of the focus has been right now. So that's a snapshot of where we're at.
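To give a feel for the policy-analysis side Byron describes, here's a hedged sketch using a drastically simplified, invented policy and semantics, not the real IAM evaluation logic, of the kind of question Access Analyzer answers: does any request from outside the account get allowed?

```python
# A hedged sketch of the policy-analysis style Byron describes. The policy,
# names, and semantics below are drastically simplified inventions, not the
# real IAM evaluation logic; the question is Access Analyzer's: can any
# request from outside the account be allowed?
from z3 import String, StringVal, Solver, And, Or, Not, sat

principal = String("principal")
action = String("action")

allow = Or(
    principal == StringVal("arn:aws:iam::111122223333:root"),  # owning account: anything
    action == StringVal("s3:GetObject"),                       # anyone: read objects
)
deny = And(principal == StringVal("*"), action == StringVal("s3:DeleteObject"))
decision_is_allow = And(allow, Not(deny))                      # deny overrides allow

s = Solver()
s.add(decision_is_allow)
s.add(principal != StringVal("arn:aws:iam::111122223333:root"))  # an external caller
if s.check() == sat:
    print("reachable from outside the account:", s.model())
```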
speaker_1: It might actually be helpful to dig in a little bit on the automated reasoning checks as they relate to LLM and agentic system outputs. Because certainly, you know, this feed is by an AI obsessive for AI obsessives, but as much as I'm obsessed with all the latest developments in AI, I'm also, like, not very good even with my own, you know, password-management-level security for myself. So I think maybe the intuition will be a little bit easier to develop in some ways for the more familiar use case, and then we can maybe back-port a little bit of that intuition to the rest of the world. So one of the promises that the automated reasoning checks make is that you can minimize AI hallucinations and deliver up to 99% verification accuracy. Maybe it'd be helpful to just talk through how this sort of policy, because what people can start with, and I think this is really super interesting, is a natural language policy, right? You mentioned the family medical leave policy. So this is something that somebody has written, you know, in prose, presumably without even thinking about a software implementation of it when they wrote it, right? These policies typically will, you know, predate the ChatGPT moment.
speaker_3: Yep.
speaker_1: So there's this natural language policy that exists somewhere. Then somebody comes along and says, okay, you know, I could really save a lot of time and money if I had an AI evaluate everybody's FMLA applications.
speaker_3: Yep.
speaker_1: But then we have the question of, well, is it doing it right? So maybe take us through that loop of how this fuzzy language, natural language policy gets translated into a set of checks in the first place. And then there's also like an iterative loop, which is, you know, that's part and parcel of kind of everything I've seen in the AI game recently. So yeah, take us through that story of how it's like how a bill becomes a law, but it's like how a policy becomes like a formally or almost proven policy.
speaker_3: Yeah. So you can go today and ask Claude: hey, here's this PDF, can you please translate it into, you know, temporal logic, branching-time or linear-time temporal logic, or SMT, or Lean, or there's all kinds of formats, and then I want to answer these kinds of questions. And under the hood, that's essentially what the product is doing. So there are sort of two activities in the product. The first is to help you formalize what is true, and the second is to remove incorrectness due to hallucination at inference time. So we basically ask a generative AI tool to help translate it. Now you have the problem of: well, hold on, we just used generative AI to formalize, but I thought the whole point of this was to address the incorrectness due to hallucination in generative AI. So what we can do now is we can walk over the structure of that formula and identify interesting corner cases and walk you through what that formula is really giving you. So you have a non-expert who maybe doesn't understand logic, probably doesn't understand logic, but we can calculate all of the corner cases that are representative and walk you through what answers you're going to get. And if you don't like what you're going to get, then you can refine. You can say, I don't like this answer, here's why, in natural language. And we can go back to the generative AI tool and do that again. And that kind of models what we've been doing all along at Amazon, and in my entire career, and I imagine Kathleen will agree: you would traditionally pair up a formal methods person with a domain expert in the other domain. Biology, operating systems, device drivers, aerospace, railway switching, you name it. And then, together, you're going to build a formalization, you're going to look at a bunch of examples together, the domain expert is going to say, oh no, you got it wrong, you didn't understand the assumptions, you bake those in, you do it again, and you iterate until you can't find any more problems. And then you say, well, I hope that's what we're trying to prove. And then you go prove that of the real code. And so that's kind of what's happening in the system too. And there's a neat feedback cycle after deployment, which we'll talk about in a moment. So, okay, so now let's talk about the deployment. Once you've locked that in, you can now deploy it. And then, as your customers or the users of your chatbot ask questions, we translate those from natural language into the logic, and then we prove or disprove the correctness according to the formal model. Now, there's a couple of other gotchas. How do we handle the natural language? So again, we actually use generative AI, but now we don't want to do this loop. So what we do is we do multiple translations, and then we use a theorem prover to check that each of the translations is equal to the other translations. And if so, our assurance that we got the translation right is relatively high. That's why we're 99%, not 100%. And if not, then we can now do a form of active listening, right? So if one translation said they're talking about a first-class flight and the other translation said they're talking about a coach flight, then we can go back to the user and say, did you mean first class or coach? Or we can query that in a database. So there's various methods there. So there's a form of active listening to build a mental model about what the question is.
And then we can go prove or disprove that according to the model. So that's the kind of product. But by the way, this is for corporate policies and these kinds of things. You see very similar things happening in code. You see very similar things happening in agentic systems. This pattern of the human-in-the-loop that's translating data to knowledge is one you see quite a bit. And then the deployment into the inference system, either over natural language or over structured systems like programs, is sort of the downstream place where that gets deployed.
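Here's a hedged sketch of the cross-checking step Byron describes, with an invented leave-eligibility rule: two candidate formalizations of the same sentence are compared with Z3, and any case where they disagree becomes the clarifying question to take back to the policy owner.

```python
# A hedged sketch of the cross-check Byron describes: the same natural-language
# rule is formalized twice, Z3 looks for any case where the two readings
# disagree, and that case becomes the clarifying question. The rule is invented.
from z3 import Int, Solver, And, Or, sat

months_employed = Int("months_employed")
hours_worked = Int("hours_worked")

# "Employees are eligible after 12 months and 1,250 hours of service."
translation_a = And(months_employed >= 12, hours_worked >= 1250)
# A subtly different reading of the same sentence: 12 months OR 1,250 hours.
translation_b = Or(months_employed >= 12, hours_worked >= 1250)

s = Solver()
s.add(months_employed >= 0, hours_worked >= 0)
s.add(translation_a != translation_b)        # any case where the readings differ?
if s.check() == sat:
    print("ask the policy owner about this case:", s.model())
else:
    print("translations agree; confidence in the formalization is higher")
```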
speaker_1: Let me try to echo a little bit of that back to you. I think that was really good. In the traditional setting, you said, and it's essentially the same for the Gen. AI setting now, you start off with a fuzzy idea of what it is you want to be doing. And you have to bring in the domain expert who, you know, has, if nothing else, the ability to answer the question of, are we doing the right thing in this case? You're trusting that person's judgment to sort of be ground truth for what we want to happen. Hopefully they've been able to capture that in written language, but probably not a hundred percent. They get together with a formal methods person and translate natural language into a logical specification.
speaker_3: Yep.
speaker_1: They then run a bunch of examples through and validate that this all seems right so far, at least. And this is probably the biggest, in the traditional setting at least, this would be the biggest weakness, right? It would be: did you get the spec right?
speaker_3: Yep, by far. It's the hardest part. Even though, once you have a spec, the reasoning about programs sounds undecidable or intractable, NP-complete, by far the hardest part is that second piece: getting the spec right. And in my career, working in large organizations, I have spent a lot of time in shuttle buses between buildings trying to get agreement amongst teams on whether we got the spec right. I've spent a lot of time doing that kind of thing.
speaker_1: Okay, so that's interesting, and that obviously becomes one of the assumptions on some level: we're only as good as the spec.
speaker_3: By the way, I'll give you a concrete example that's pretty illustrative, right? Please. So for various compliance regimes around the world, like SOC, there's going to be one around "all data at rest is encrypted." Okay, sounds plausible, but then, well, what do you mean by encryption exactly? Like, if I use an encryptor that I get from, like, a cereal box, you know, like a little... that's encryption. Well, oh, that's not encryption? Well, how do you define encryption? Uh-oh. See, now we're talking about what do you mean by encryption, and so now we're kind of iterating on that. And then we could talk about, well, what is "rest"? What is "at rest," right? Well, if I put it on the table, is it at rest? Oh, no, oh, you mean digital storage. Okay, so now what you mean is... and so then you can talk about, like, what if the data's flying in a network and you're using the latency of the network such that it never actually gets stored to a disk drive, but it's just going around and around and around the world. Is that at rest or not? Because if that is at rest, then the property probably doesn't hold, because that's not what we're proving, right? So you have to define: "at rest" means these media or these APIs, and encryption means, you know, coming out of these APIs. You have to get real clear on what those are. And so we've already now done three or four refinements, and that's just for a very simple case.
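Once the refinement Byron walks through has pinned "at rest" to specific storage APIs and "encrypted" to specific approved APIs, the property becomes mechanically checkable; here's a minimal, invented sketch of such a check over a recorded sequence of API calls (all API names are made up).

```python
# A minimal, invented sketch of where that refinement lands: "at rest" is
# pinned to specific storage APIs, "encrypted" to payloads that passed through
# an approved encryption API, and the property becomes mechanically checkable.
STORAGE_APIS = {"disk.write", "s3.put_object"}    # the agreed meaning of "at rest"
APPROVED_ENCRYPTORS = {"kms.encrypt"}             # the agreed meaning of "encrypted"

def data_at_rest_is_encrypted(events):
    """events: ordered (api, payload_id) calls observed in the system."""
    encrypted = set()
    for api, payload_id in events:
        if api in APPROVED_ENCRYPTORS:
            encrypted.add(payload_id)
        elif api in STORAGE_APIS and payload_id not in encrypted:
            return False, f"{payload_id} reached {api} without approved encryption"
    return True, "ok"

print(data_at_rest_is_encrypted([("kms.encrypt", "p1"), ("disk.write", "p1")]))
print(data_at_rest_is_encrypted([("rot13", "p2"), ("s3.put_object", "p2")]))  # cereal-box crypto fails
```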
speaker_1: Yeah, that's fascinating. So in the traditional, pre-GenAI context, you have your spec, you go do a bunch of work that is above my pay grade to prove that the software as written satisfies the spec, and then you're done. Which you won't be, so you have to fix it.
speaker_2: Which you won't.
speaker_3: Yeah, you won't. Yeah, exactly. And then you're gonna have to translate the bugs back to the team. Oh yeah, and that's the other thing: there are cognitive biases around, like, no, that's not possible, right? So it's very hard to get agreement. That's another thing. Invariably, when you're doing this process, you find what you believe to be a bug, you go get three engineers from the team together, put it up on the board or whatever, show them the bug. One of them will say it's a bug and two of them will say it's not. Some people are going to be pessimists, some people are going to be optimists. They're like, oh, that's not possible. The environment.
speaker_2: The environmental conditions, right? The environment will make it so that could never happen.
speaker_3: That can't happen. And then they'll argue amongst themselves. And then really, the only way to actually get to ground is to create an exploit.
speaker_2: Proof of concept.
speaker_3: Yeah. To show that it's possible.
speaker_2: To show them that it does happen.
speaker_3: Yeah.
speaker_1: And you identify those exploits by finding places where the proof is incomplete or can't--.
speaker_2: No, you find inputs that trigger the condition, that make it happen.
speaker_3: And some of those tools are really good at finding those, and some of them aren't. Because we talked about all these different kinds of tools and all these different algorithms, and some of them lose so much information that you're like, the proof didn't go through, but I kind of don't know where the bug is. But some of those tools are really good at saying, oh, no, here it is, and here's the crazy path through all the code.
speaker_2: Those are harder to argue with.
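A minimal sketch of what "find inputs that trigger the condition" can look like in practice, using the Z3 SMT solver (pip package z3-solver). The path condition here is hypothetical, not from any real codebase; the point is that instead of arguing about reachability, you ask the solver for a concrete witness you can replay as a test.

```python
from z3 import Ints, Solver, And, sat

x, y = Ints("x y")
s = Solver()

# Hypothetical path condition the engineers believe is unreachable:
# the code only hits the bad assert when x > 1000, y == 3 * x, and x + y blows a budget.
s.add(And(x > 1000, y == 3 * x, x + y > 10_000))

if s.check() == sat:
    m = s.model()
    print(f"counterexample: x = {m[x]}, y = {m[y]}")  # concrete inputs to replay as a test
else:
    print("no input reaches the assert under these constraints")
```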
speaker_3: Yeah, so that's, yeah. Okay, but sorry, we interrupted you. So you were saying...
speaker_1: Again, I want you guys to do as much of the talking as possible, as long as I can roughly follow it. So you iterate through this process of defining the spec, and then you're again iterating through: here's why we weren't able to prove all this stuff, we're finding these edge cases, we're finding these vulnerabilities, we're closing those down. And then eventually we sort of get there and we're done, at least for this version or this generation of the system. Now, contrast that to the GenAI case, because we now have an additional level of fuzziness where--.
speaker_3: By the way, if I could just make one parenthetical statement: this field was proposed by, you know, Turing, who wrote "Checking a Large Routine," showing how to do proofs of programs. But the field went into decline in the '70s and '80s. So it was really only in the monasteries, if you will, that people kept this area alive; it wasn't really used in practice except in some very obscure cases. And so it became the discipline of academics. So from the academic view, you're now done and everyone's happy and everyone can go celebrate. The problem is that the code, particularly in a cloud world, changes all the time.
speaker_2: All the time.
speaker_3: So now the big thing is, well, you did CI/CD, right? You integrated this into the pipelines. And then you're trying to repair the proof because the code is constantly evolving. And that turns into a whole thing that you have to deal with. But yeah.
speaker_2: And Amazon does some really cool work that shows that sometimes, when the code changes are irrelevant to the main arguments of the proof, the proofs can be automatically updated and you don't need to bring the proof engineers back in.
speaker_3: Yeah, oh, and here's another beautiful thing. Because the tools are solving undecidable or intractable problems, they sometimes just go to lunch, right? When they give you an answer, okay, now we can believe the answer. But if they don't, because the problems are undecidable and intractable, surprise, from time to time they're just not going to come back, because they're off spinning forever. And unfortunately, it might be that you changed a variable name from x to y, and that messed with some hash table deep in the theorem prover, and suddenly something that took three seconds maybe never comes back, or maybe it takes 400 seconds. And so now your engineers are like, whoa, what happened? So that's a big challenge for us. So again, the generative AI world addresses this in an interesting way, but I'm trying to lay the groundwork for the pre-GenAI world.
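A minimal sketch of the three-way outcome described here as it shows up in a CI step, again using Z3. The property being checked, the timeout value, and the suggested pipeline reactions are assumptions for illustration; the point is that "proved," "counterexample," and "timed out" have to be handled as three distinct results.

```python
from z3 import Int, Solver, Not, Implies, sat, unsat

def check_property(timeout_ms: int = 5_000) -> str:
    """Verify a toy property in CI, bounding how long the prover may run."""
    x = Int("x")
    s = Solver()
    s.set("timeout", timeout_ms)  # don't let the prover "go to lunch" forever
    # Try to refute the property "x >= 0 implies x + 1 > 0":
    s.add(Not(Implies(x >= 0, x + 1 > 0)))
    r = s.check()
    if r == unsat:
        return "proved"  # no refutation exists
    if r == sat:
        return f"counterexample: {s.model()}"
    return "unknown (timeout): retry, bump resources, or bring back a proof engineer"

print(check_property())
```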
speaker_1: Yeah. Before we get to the GenAI world in terms of using these methods to help boost the assurances we can get around GenAI systems, how are GenAI systems helping with this work? I understand we're seeing all these results in math, which for somebody who's not a mathematician is kind of hard to parse; I'm sort of taking it on faith on Twitter that, oh, this is a novel problem, but it's not that big of a novel problem, or the insight here was key, or it wasn't key. I don't really know. What has your experience been in terms of bringing language models to the process of building out reasoning and these sorts of proofs?
speaker_3: Do you want me to say something, Kathleen, or do you want to?
speaker_2: I mean, I can start. They're showing a lot of promise. Byron was saying that formal methods was in the monasteries, in the universities, only a research problem for a long time. What changed was that it became possible for automation to solve a lot of the problems. SAT solvers, which is a technology Byron mentioned earlier, for example, made it so a lot of the hard work could be done in a completely automated way, and that kind of revolutionized a whole lot of formal methods work. That is some of what underlay the advances in HACMS, and that sort of started the spiral of: oh, actually formal methods can be used in the real world, and it underlay what Amazon has been doing, et cetera. And I think that large language models and generative AI have the promise of turning that up to 11 in terms of what can be automated, because generative AI is really good at language, and a lot of formal methods problems are language-based problems. Formal methods languages are kind of sparse at the moment; there's not a massive amount of training data in a lot of formal-methods-based languages. But they're a place where we can generate the training data and be sure that the training data is correct, because we can put it into formal-methods-based tools and get the "yes, this is good" signal, so we can turn the crank to generate the training data. Putting on my DARPA hat, I was the I2O director for four years. There's a massive signal here that we can get generative AI to be really good at formal-methods-based approaches, so that we can marry the best of both worlds of generative AI and formal methods and get not just code generated, but high-quality code generated. And that is a huge promise for the future. That's part of why one of the things we really want to do is get a high-quality benchmark generated, so that we can measure the quality of code being generated from a formal methods and cybersecurity perspective, and get frontier model companies racing to produce high-quality, formal-methods-based, secure code: not just racing to produce code, but racing to produce really good code. If a lot of the code in the future is going to be written by language models, can we get that code to be really, really good code instead of eh code from a cyber perspective or from a formal verification perspective? If it's going to produce code anyway, can we get it to produce memory-safe code? Can we get it to produce type-safe code? Can we get it to produce input-validated code? The model is not going to care. If we add the benchmarks and drive the frontier model companies to compete on that dimension, then we can get a future that is much more secure from a cyber perspective. And the threat we started this conversation with, back an hour ago, about how AI is going to make the cyber threat much worse, we could tamp down that concern a lot. And Byron has been working directly with some of these challenges of getting the frontier models to work better with formal methods, and he can add more details.
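A minimal sketch of the "turn the crank to generate training data" loop described here: candidate programs, standing in for model outputs, are checked against a formal spec, and only the verified ones are kept. The candidates and the absolute-value spec are invented for illustration; a real pipeline would use production verification tools rather than this toy Z3 check.

```python
from z3 import Int, If, Solver, Not, And, Or, unsat

def verified_abs(candidate) -> bool:
    """Does candidate(x) satisfy the spec of absolute value for every integer x?"""
    x = Int("x")
    y = candidate(x)
    spec = And(y >= 0, Or(y == x, y == -x))
    s = Solver()
    s.add(Not(spec))  # look for any x where the spec fails
    return s.check() == unsat

# Pretend these expressions came back from a code-generating model:
candidates = {
    "good":  lambda x: If(x < 0, -x, x),
    "buggy": lambda x: If(x < 0, x, x),  # forgot to negate
}

training_data = {name: c for name, c in candidates.items() if verified_abs(c)}
print(list(training_data))  # ['good']: only verified examples feed the next model
```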
speaker_3: I mean, to answer your question, I'm gonna identify a hierarchy. So imagine you have a program with no loops, no recursion, but you have conditional expressions. Just asking, hey, could this conditional expression fire? Like, if blah, then launch the rockets. The question of could the rockets be launched is an NP-complete problem. You can put arbitrarily complicated formulae in the if expression, with thousands or millions of propositional variables or integers or whatever, all kinds of crazy stuff in there. And so just finding a satisfying assignment, finding a way to get into that then-branch, is NP-complete. And under certain restrictions, like if you choose a language like C, for example, and we're not talking about the heap and a few other things, then it's basically the same question. It is actually the propositional satisfiability question, and that was the first problem proved to be NP-complete. So that is an NP-complete question, and when you use propositional satisfiability, you're solving that problem. And there's been amazing, amazing progress. I first saw this graph, actually, I have a report, I think, from the HACMS project, where they showed each winner of the international satisfiability solving competition compared on the same hardware and the same benchmarks. And it was like, wow, they're getting so much better. And we've had really dramatic success since then, because we've moved from sequential solvers to distributed solvers. There's this really amazing tool called Mallob, which runs many, many SAT solvers across the network, and they share lemmas as they try to solve the same problem together. And that's just unbelievably better. So there's really a scaling law, very similar to transformer models, where you put this in, you get this out. And that really explains why we've been so successful at Amazon, for example. It's why we can, in less than 20 seconds, reason about your policies and your VPC networks, very reliably. It's because of the success in that. So that's one piece of the hierarchy. Okay, now imagine you have loops. You're allowing loops or you're allowing recursion, but you don't care about anything involving "eventually." You're just saying we can't hit this state, like an assert that can never fail. So now you're talking about undecidable problems. And to prove that the assert can't fail, you have to find something called an inductive invariant. It's fundamental. You find this artifact, this weird thing called an inductive invariant, and then you show that it's true when you enter the loop, and that it stays true when you go around the loop, and then you show that it implies the assert can't fail. That's what they call safety; it's all basically assert verification. So that's the next level of the hierarchy. The next one after that is termination. So you say, oh, this loop eventually terminates. Or, if you use the word "eventually" at all, then generally speaking, I mean, there are some side conditions about not having negations and so on, but generally speaking, if you're thinking about the word "eventually," like if I call this API, then eventually I call this other API, that's ultimately a termination question. And to solve that, you need to find something called a ranking function.
So to prove termination, you need to find a ranking function, but to prove that the ranking function is a valid one, you also have to find an inductive invariant. And to prove the inductive invariant, you also have to do the combinatorial reasoning. And the last level of the hierarchy is concurrency. Now we have two threads, or n threads, that we're trying to prove things about, and there's a technique called rely-guarantee, where you basically find environment abstractions such that you only have to reason about one thread at a time. So there are these different levels of abstraction. And I'll now generalize: we have really amazing tools for handling the combinatorial part, that first thing, the no-loops case. But for reasoning about programs with loops, or termination, or rely-guarantee, we had techniques, for sure, but very often the successful way to really make these things work was to just hire a human. So you have a PhD chained to their desk: get us our proofs, right? There's abstract interpretation, there's predicate abstraction, there are various techniques, but they're limited to certain domains and so on and so forth. What's really neat about generative AI is you can ask it to find these artifacts. You can say: generative AI, find me an inductive invariant, find me a ranking function, find me rely-guarantee constraints such that this whole thing holds. And all of that now reduces to just the combinatorial reasoning, which we have really good tools for. So now you can iterate the combinatorial reasoning, the checking, together with the search that the GenAI tools can do. And then you can combine that with all the techniques we had before. And because we're on the cloud, you can run all the different combinations of those on different machines and take the answer you like the most. And that really changes the game.
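A minimal sketch of this hierarchy on toy programs chosen for illustration, using Z3: branch reachability as a satisfiability question, an inductive invariant check for a loop, and a ranking-function check for termination. None of this is Amazon's actual tooling; the loop, the invariant, and the ranking function are invented for the example.

```python
from z3 import Int, Solver, Not, And, Implies, sat, unsat

def valid(claim) -> bool:
    """A formula is valid iff its negation is unsatisfiable."""
    s = Solver()
    s.add(Not(claim))
    return s.check() == unsat

x, x2 = Int("x"), Int("x2")

# (1) Loop-free: if (x*x - 5*x + 6 == 0 and x > 2): launch_the_rockets()
s = Solver()
s.add(And(x * x - 5 * x + 6 == 0, x > 2))
r = s.check()
print("rockets reachable:", r == sat, s.model() if r == sat else "")  # x = 3

# (2) Loop: x = 0; while x < 10: x += 1; assert x <= 10
inv = lambda v: v <= 10  # candidate inductive invariant
print("invariant ok:",
      valid(Implies(x == 0, inv(x)))                                  # holds on entry
  and valid(Implies(And(inv(x), x < 10, x2 == x + 1), inv(x2)))       # preserved by the loop
  and valid(Implies(And(inv(x), Not(x < 10)), x <= 10)))              # implies the assert

# (3) Termination of: while x > 0: x -= 1, with ranking function r(x) = x
print("ranking ok:",
      valid(Implies(x > 0, x >= 0))                                   # bounded below inside the loop
  and valid(Implies(And(x > 0, x2 == x - 1), x2 < x)))                # strictly decreases each iteration
```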
speaker_2: Have you seen examples of this that correspond to, like, the Move 37 in Go?
speaker_3: Oh, the inductive invariants and ranking functions that GenAI tools find are pretty, pretty great. Yeah.
speaker_2: Yeah.
speaker_3: I mean, and I realize this breaks the Move 37 analogy just a little bit, but basically, getting a proof to go through requires incredible insight. There are very few people who can find those. So you're on the verge of a Move 37 miracle every time you get a proof to go through.
speaker_2: All the time.
speaker_3: Yeah. And what we're finding now in Amazon is that the teams that did the scaffolding for proof are just super happy with GenAI, right? Because now they can pair up a programmer with a generative AI tool. And there's a whole methodology being developed in Amazon, and you see this in startups and so on too, where they're changing their development methodologies to use generative AI to write the code. And then that scaffolding they built around the proof provides semantic guardrails that allow them to deploy with generative AI and not have to roll back very often. And they're really happy to put the investment in.
speaker_2: What does the Gen. AI need to see about the code base to be able to do this?
speaker_3: Ah, so it works better, and there are blog posts about this from some folks in Amazon, if you actually put all the documentation and the threat models and all of that stuff in the same place, so that the generative AI, when it's generating the code or finding the proof, also has the documentation and the ops plan and everything else they've written. If you throw all of that in, it's actually better able to find the code.
speaker_2: So it has the entire code base, and then it also has all the thinking about the code base and all the documentation. So that's really cool: it has not just the code, but also the kind of human-level understanding of the code.
speaker_3: And then for proof search, think about how humans do proofs. Well, they often read a bunch of other proofs, and then they're essentially doing matrix multiplications in their head to derive: oh yeah, this proof looks a lot like this other thing, but it's a little different, so I bet this is the inductive invariant. They pull it like a rabbit out of a hat and, voilà, the thing goes through. The people who are really good at that, like John Harrison, have read all the proofs. And the generative AI can do that too. So you can actually pass it old proofs of programs, and it's going to be better at finding new proofs.
speaker_2: So it does that too. It kind of reads old proofs.
speaker_3: Yeah, you can do that.
speaker_2: It reads the proofs of software construction and kind of digests all of the literature of here's how you prove these kinds of things.
speaker_3: And because most model providers today are training their models over proofs, like in Lean, for example, they actually understand some of those old proofs. But then, yeah, you can also provide your own code and proofs you've done into the context window.
speaker_1: So do I understand the path from here? Because this has obviously been a huge critique of AI-generated code: it's insecure, right? We've heard that all over the place, and I think not incorrectly, at least so far. If I'm understanding correctly, the vision for getting to a world where AIs generate secure code at a much higher rate is basically this. First, we're taking advantage of a pretty familiar trick: it's hard to come up with the proof, but it's easy to verify the proof, right? That's a cornerstone of this. The AIs are at least in the game now where, based on all this pre-training and all the proofs that they've read, they have some non-zero hit rate. So you've got to run your AIs, especially in the beginning when you're just starting to spin the centrifuge, run them a lot, with a high failure rate, but you get some successes. Those successes can fold back into training data. With that overall flywheel starting to turn, we're getting the AIs to be quite good at the proofs. And then we can apply that proof capacity as a reward signal for the coding problem itself. So then, downstream, we're rewarding the core coding objective in part based on correctness as evaluated by these proofs. So we've got multiple roles for AI, kind of bootstrapping this thing into success.
speaker_3: I think it's easier to help developers encode into linear temporal logic what they mean by "all data at rest is encrypted" than it is to encode the Family and Medical Leave Act, right? It's much smaller in scope. So now we have these tools that are already doing what's called autoformalization, from natural language to logic, and we can deploy those tools in these domains. It even makes it remarkably easier. Yeah.
speaker_1: So we can envision a world, over a generation or two of future models, where, as long as this is prioritized, given what we're already seeing in terms of logical reasoning ability, mathematical ability, coding ability... does it stand to reason for you that GPT-6 and Gemini 4 should be basically superhuman in their ability to write secure code the first time? Is that where we're headed?
speaker_3: Yes. And also Nova and the open-weights models. Yes. So, yeah, I think we're there. I mean, we're seeing that right now, and there are open math conjectures being resolved; last week there were open Erdős problems proved. So we're quickly heading there already.
speaker_2: We need to incentivize the production of that code and not the bad code. I mean, we need to not just create the capacity for the good code, we need to move away from the bad code. And I think we are likely to see increasing incidences of cyberattack, which would create the motivation to get rid of the bad code. So, right, we need to incentivize the creation of good code, disincentivize the creation of bad code, and then incentivize the use of good code to replace the pervasive bad code in our critical infrastructure, et cetera, so that society is much less vulnerable to cyberattack than we currently are.
speaker_3: In good news, there are tools to help you translate your code, right? So generative AI can help you move from one system to another. You can upgrade the version of your language and then enjoy the benefits of the fixes and the runtime of the new version. And one of the easiest things to do in formal methods is actually prove the equivalence between two programs, because now you have a spec: showing that one version of the program implements the same thing as the other program is something we can do. Often when you're updating, you're moving from Java to Rust, for example, so there's a little bit of work to make sure you've got the semantics of Rust and Java right. And often people use different APIs, and it's a little harder to do that mapping. But it's much easier to move between APIs and languages now than it used to be. So that's also good news for security.
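A minimal sketch of the equivalence-checking idea: two versions of the same function are modeled over the integers, and Z3 searches for a distinguishing input. A real Java-to-Rust equivalence proof would need faithful models of both languages' semantics; this toy, with functions chosen purely for illustration, skips all of that.

```python
from z3 import Int, If, Solver, unsat

# "Old" and "new" implementations of the same function, modeled over the integers.
x = Int("x")
legacy  = If(x >= 0, x, -x)  # branchy absolute value, the original version
rewrite = If(x < 0, -x, x)   # the rewritten version after translation

s = Solver()
s.add(legacy != rewrite)      # search for any input where the two disagree
if s.check() == unsat:
    print("equivalent: no input distinguishes the two versions")
else:
    print("diverges at", s.model())
```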
speaker_2: Or moving from C to Rust. DARPA has a program called TRACTOR, which is about automatically translating from C to idiomatic Rust, which is about moving from a non-memory-safe to a memory-safe language, which is about establishing and maintaining the property that memory is safe: the abstraction you intended when you stored something in memory stays the way you intended it to be.
speaker_1: I guess, because we're really hitting our stride now, I think... I sometimes try to coin these phrases, like "the great implementation" was sort of my phrase for: we're going to unbundle jobs into tasks and get AIs to do all these tasks, sort of implement AI in all these different corners of society. And here I'm getting the sense that there's a great verification, or a great rewrite if not fully a verification. So what do the next few years look like? How fast do we think this needs to happen? It seems like it might need to happen pretty fast if we really want to be secure before the bad actors are using the same capabilities to break into all the critical infrastructure. What is this next period of time? How long is it? And what does it look like when we need to go through, I don't know, 60 years' worth of software and fix all these holes that our feeble human minds have left us?
speaker_2: I think that's a question of motivation more than a question of technology, right? We're seeing, with the results of the AIxCC cyber competition, which showed that AI plus cyber could find and fix bugs at speed and scale, Google released the CodeMender project, OpenAI has a similar thing, and what we've been talking about really shows the promise that AI-enabled code generation can rewrite code at speed and scale to a much higher standard of correctness in the relatively short term. And the way AI is improving super fast, where we are right now is the worst it's ever going to be. And the conversation we started out with, which was that AI is going to be enabling cyber attackers at all levels of skill at all parts of the cyber attack chain, means that the level of attacks is just going to increase. That said, the statement that cyberattacks are increasing in severity has been true for the past, I don't know, 20 years, and we've just been the frog getting boiled; we keep getting boiled and we haven't responded yet. So I don't know when we are going to say: holy crap, we actually need to do this now. The capabilities are there, the tools are there. When the year 2000 date problem came around, society did mobilize and did go fix those bugs, right? So there wasn't a huge problem when 2000 rolled around and all of the dates that were stored in two digits would have suddenly meant something completely wonky. So I don't know when we're going to be motivated enough to use these technologies. I think the technology will be there when we get serious about doing it. What's your sense, Byron?
speaker_3: I mean, I think the commercial marketplace kind of drives this, right? How much are people willing to put up with, and how much are they willing to pay for their infrastructure? I personally choose my machines and phones based on their reputation for security. Certain makers of machines and phones take it more seriously than others. I pay a little bit of a premium for that, and I would continue to do so, and would advise others to do so. And then I think the cost and difficulty of these tools is dropping pretty substantially. I'm seeing that in Amazon with teams who previously couldn't have used them. Teams in Amazon also have different threat models from one another, so for some teams it's really important to use these tools, and for other teams a little less so. So there are teams that choose not to use formal methods and some teams that do. And that barrier to entry is going down, so teams that weren't using them before are beginning to adopt them now, and I think that will continue to happen. And then there's a big commercial push: Amazon released the Kiro IDE, and its big differentiator is that it's specification-driven. So there are companies trying to innovate, making products, and this is a new dimension in which to innovate. And if those products are successful and people flock to them because of the capabilities, then that flywheel will drive things. So I'm actually fairly optimistic. And then if you look at agentic AI, agentic AI is kind of a way of saying we're doing declarative programming. If you look at AgentCore Policy, which we announced last week at re:Invent, it's Cedar; it's a formal specification of the envelope of what you're okay with an agent doing and not doing. And that gives you a new way to declaratively specify what the systems can do. So I think society is also beginning to appreciate, through the lens of generative AI, the notion of declarative programming as opposed to imperative programming. And I think that'll change how programs are written, and we'll have different threat vectors, and we'll have to figure that out as we go along. So it's pretty unstable as far as I can tell, but I'm no less worried now than I was 10 years ago. Say it that way.
speaker_1: Okay, well, hopefully the next year will bring some progress. Okay, so I think I maybe have three more questions. They're all kind of substantial; hopefully we can get all three of them at least partially answered. One is: how far does this go, and how long does it take? If we said as a society, okay, we're going to stop writing new software, we're going to do the great rewrite, here are all the formal tools that we have and the language models that can help you maximize their use and rewrite stuff, how long do you think that would take, and how far could we get? There are, as you well know, many other approaches, including companies that are trying to create AI hackers to go find the vulnerabilities. Could we get to a world where those strategies are rendered unnecessary, because we're so confident everything is buttoned up that we don't need an AI agent poking around to try to find vulnerabilities? That's sort of imitating the old human way, right? Poke and click and find things, and then come back and say, okay, I found something, now we can fix it. Could we ever get to a point where we don't need to do that anymore, because the whole system broadly has become sufficiently robust? Or is that too pie in the sky, and you think we'll always have to have these different angles of attack on the problem?
speaker_2: I think belt and suspenders is a fine idea. I think it doesn't hurt very much to have the AI bug-finding things poking around. But I think you could get to a place where they weren't finding things very often. We talked about how you have people, you have software, and you have hardware. I think we can get to a place where the hardware and the people are the source of almost all of the vulnerabilities, in the not-too-distant future, if we went all in on fixing the software.
speaker_1: Yeah, that's really interesting. I mean, one of the big worries, of course, with AI broadly is: what happens if the AI gets out of control, without digressing into how plausible that is. I'm always on the lookout for any sort of theory that could really work. And this seems like one line of attack where you could imagine getting to a sufficiently high level of assurance that you could say, look, the AI is not going to be able to hack its way out of this box, because we've done this level of assurance, and we can at least take that sort of self-exfiltration risk off the table.
speaker_3: So my answer is that I find that every safety technology just allows us to push even harder. I think we drive faster because our cars are safer, and the commercial marketplace also drives features and speed and all these kinds of things. I remember I worked for a company that had device drivers in user space, but because some games were slower, they moved the device drivers into kernel space, right? And so they lost protection. And then video device drivers just weren't checking error conditions: if the operating system failed to allocate memory to you, they just didn't care, because they wanted to be high-performance in the common case. So there's also a balance with availability and features. The commercial marketplace is navigating a bunch of dimensions, and it's not going all in for security. So I think if you did...
speaker_2: I 100% agree with Byron. I think if you were 100% focused on sandboxing a super-aggressive AI, formal methods would have things that would help with that.
speaker_3: Yeah. And that's, I mean, literally, that's the meeting I was having before this call, right? That's definitely the frontier: virtualization, sandboxing, and policies for AI is kind of where we're pushing right now.
speaker_1: Sandboxing superintelligence. Yeah, I mean, that's crazy, but we might need that. How about, in closing, let's go back to the automated reasoning checks. The most common profile of listener to this feed, as far as I can tell, is the AI engineer. If you are somebody who is building generative AI solutions for businesses, maybe give me a little bit more of the pitch on what kinds of problems they can be solving, what kinds of things they can take to their stakeholders and say, hey, if we use automated reasoning checks, we can be sure, to a certain level of confidence at least, that we're going to implement the family leave policy correctly. And then I want to do one more thing on that, to zoom out a little bit to the societal question of how you envision this existing, in the sense that a lot of society's wheels right now are greased a little bit by the fact that people can go off-policy a little bit. It's sort of, well, you maybe don't quite qualify for the family leave, but we'll make it work in this case because you've been a long-tenured employee or whatever.
speaker_3: Yeah, one could argue that's because the rules often have different interpretations, and we have socio-technical mechanisms for that right now, right? So for example, if you want to build an accessory dwelling unit in Portland, Oregon, the little small house in your backyard, what are you going to do? You're going to file with the city, and someone's going to look at it. And to be successful there, you're probably advised to get an architect, and probably an architect that works in Portland, Oregon, not Portland, Maine, because the rules are going to be different. And then there's this feedback mechanism: how do you know the architect really solved for the best outcome? There are various setback rules: if the house is a certain height, it needs to be five feet away from the property line, but in this other case, it can be right up against the property line. How do you know they got all that right? How do you know they didn't just maximize what you see? You're using these intermediaries through their socio-technical mechanisms. And I think the dream of AI, and its failure currently, is that it has not been able to deliver on democratizing access to information, right? Because of incorrectness due to hallucination, no one is going to use a transformer-based language model, ask it about the city of Portland's zoning rules, and then just start digging and start building a house, right? They're ultimately going to have a bunch of people look at it, because they don't trust it. So it's tantalizingly close to allowing everyone to have access to information anytime, 24/7, super fast, super cheap, which democratizes access to information; it's not just the rich who can hire an architect. But to do that, we need to do the last mile, right? We actually need to make the answers correct. And automated reasoning checks is designed to fill that gap. It's not creative at all, right? It merely defines the envelope of the answers that are true and the answers that are untrue, and it uses logic to define that typically infinite space. And then it allows the language model to be as creative as it wants, but it's going to map the answer back over to logic and prove or disprove it according to the rules. And then what one can imagine is that these rules are open-sourced, or provided by the city, or maybe they're licensed, maybe different organizations provide their rules and you can conjoin them. There are various models that are possible, but I think it gives access to accurate information about what's true and untrue. And it also gives us the ability to argue with that, right? To say it's this axiom that makes this rule unfair. And then you can discuss as a society, as an organization, should we change this rule? And you could save all the past interactions, propose to change the rule, and replay those chats to see whether people are going to be happy with the differences in the answers. So it gives us tools, as a society and as organizations, to appreciate what our rules are, who they impact, and to rationalize those rules. And so my hope and belief is that people will now have a much nicer way of pushing back against the rules rather than just sort of cheating.
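A minimal sketch of the prove-or-disprove pattern described here, with made-up zoning numbers and without the Automated Reasoning checks product itself: the rules live as logic, a model's natural-language answer is translated into structured facts, and a solver checks whether the combination is consistent. The variable names and thresholds are assumptions for illustration, not Portland's actual zoning code.

```python
from z3 import Real, Bool, Solver, Implies, And, unsat

height, setback = Real("height"), Real("setback")
allowed = Bool("allowed")

# Hypothetical rulebook: tall ADUs need a 5 ft setback; shorter ones do not.
rules = And(
    Implies(height > 15, allowed == (setback >= 5)),
    Implies(height <= 15, allowed == (setback >= 0)),
)

# Structured form of a chatbot's claim: "a 20 ft ADU, 3 ft from the line, is allowed."
claim = And(height == 20, setback == 3, allowed)

s = Solver()
s.add(rules, claim)
if s.check() == unsat:
    print("claim contradicts the rules: block or correct the answer")
else:
    print("claim is consistent with the rules")
```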
speaker_2: Well, I think it's an interesting question: where does the wiggle room come in? Which rules absolutely can't be fudged, and which ones should have wiggle room? And where does judgment get to come in? Does the large language model get to provide judgment, or does the judgment happen at a higher level, by actually going and talking to a person? Or is there no slack at all?
speaker_3: Right, and that's the other thing: not all axioms are equivalent. You can do proofs under certain axiomatizations, and there can be other axioms that are second-class citizens: these are the ones we'd like to avoid, but if the only way to get to the answer is to use this one, that's fine. And the final one could be: go talk to your human. What does Steve say? That can be the last axiom. If Steve agrees, then, okay, great. So that allows us to fudge things, but...
speaker_2: A get-out-of-jail-free card.
speaker_3: Yeah.
speaker_1: Yeah. Interesting. I find the promise of this obvious: the scalability, the speed of all these things. I mean, talk about wanting to get to some sort of housing abundance; the ability to expedite so much stuff in the case you outlined there is incredible. The fairness would presumably be dramatically improved. And then I do wonder what new mechanisms we will also need to put in place. I think Europe is doing some of this with the right to appeal to a human. We don't have that stuff in the US yet, but we might come to need it.
speaker_3: But there is also a societal marketplace. I have moved countries to get access to resources I didn't have in the other country, and so people kind of move with their feet a little bit. It may be that certain countries regulate in ways that slow things down and add cost and latency, and other countries regulate in a way that actually makes a lot of sense: it provides fairness or transparency, but also speeds things up. And then those countries will have faster growth, or they'll meet the goals of their society, and other countries will see them succeeding and will maybe change their regulations. So I think we're all trying to figure out how to work with these tools. There are a lot of things that can be good, and there are risks, and we need to figure out as a society how to deal with them.
speaker_2: I mean, sometimes human judgment is good. It adds flexibility that clearly should be there. And other times human judgment is bad because it adds bias.
speaker_3: Yeah, exactly.
speaker_2: In a way that contorts the process in a way that is bad.
speaker_3: Yeah.
speaker_1: Yeah, definitely. No easy answers. I think that's a constant theme of the cognitive revolution as a whole.
speaker_3: The big challenge for society is that we're learning how hard it is to figure out what truth is.
speaker_1: Yeah. Well, that's all I had. Do you guys want to maybe each give one closing thought or touch on anything we haven't touched on? Keeping in mind the AI engineer profile, if there's anything else you would want to impart or a call to action or even just one more pitch for the product would be welcome too. And we can leave it there.
speaker_3: Yeah, so I'll start. So there's this notion of neurosymbolic, which is the combination of the automated reasoning we've been talking quite a bit about with the cognitive techniques. And I think much of the magic is at the intersection: how do you do inference and reasoning over non-deterministic systems at the same time? It's a peanut-butter-and-chocolate moment, where if the tools combine, they really have a multiplicative effect on each other. Traditionally these have been siloed: there are the automated reasoning, formal methods nerds, and then there are the nerds who work on the statistical approaches, and they didn't speak the same language, they didn't go to the same conferences. And what's amazing right now is that those areas are blurring back together in a way that was envisioned in the early '50s, and it's a pretty exciting time.
speaker_2: Yeah, I would say the same thing. Formal methods and AI are two great tastes that go better together. They have differing strengths. Formal methods gives you guarantees, but it's not very flexible and it's hard to use. AI is very flexible and easy to use, but often not trustworthy. So if we can figure out how to put them together, you can get the best of both worlds. DARPA's I2O has a number of programs exploring different combinations of those technologies to get exactly that, the best of both worlds. There are lots of different ways of combining them, and the magic is in exactly what combination and in what circumstance: to get exactly that, the chocolate and peanut butter.
speaker_1: Yeah, love it. It's an inspiring vision. I really like it. I've got a lot more to learn, as you guys can tell from my baseline coming into this conversation, but I really appreciate you both for taking the time to educate me and all of us. Kathleen Fisher, Byron Cook. Thank you for being part of the Cognitive Revolution. Thank you.
speaker_2: Thank you.