AI Scouting Report: AI Agents -vs- Agentic AI, from Imagine AI Live

This episode features Nathan's talk from Imagine AI Live, where he provides business leaders with a comprehensive overview of AI agents and agentic AI systems. He covers the evolution from simple task automation to more autonomous AI systems, delivers a practical roadmap for implementing AI agents in business while maintaining quality standards, and addresses concerning developments like reward hacking and scheming in frontier AI systems. The presentation balances accessible explanations for varying AI knowledge levels with rigorous grounding in current research and real-world applications. Nathan also discusses the "Eureka moments" these systems can deliver and provides frameworks for understanding the rapidly evolving landscape of AI agent capabilities.

SPONSORS:
Google Gemini: Google Gemini features VEO3, a state-of-the-art AI video generation model in the Gemini app. Sign up at https://gemini.google.com

Oracle Cloud Infrastructure: Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% for storage, and 80% for networking. Run any workload, from infrastructure to AI, in a high-availability environment and try OCI for free with zero commitment at https://oracle.com/cognitive

The AGNTCY: The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org/?utmcampaig...

PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) Sponsor: Google Gemini
(00:31) About the Episode
(03:36) Introduction and Speaker Background
(05:06) What is Intelligence?
(07:06) Defining AI Agents
(09:36) Structured AI Workflows
(13:06) Autonomous Agent Examples (Part 1)
(15:15) Sponsors: Google Gemini | Oracle Cloud Infrastructure
(16:54) Autonomous Agent Examples (Part 2)
(17:19) Agent Taxonomy Framework
(19:49) Current AI Capabilities
(22:19) Reinforcement Learning Revolution
(25:49) Reward Hacking Problems
(28:19) AI Safety Concerns
(29:49) Business Recommendations
(37:35) Sponsor: The AGNTCY
(38:27) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


TRANSCRIPT

Introduction


Hello, and welcome back to the Cognitive Revolution!

Today we're sharing a recent talk that I gave at Imagine AI Live, on AI Agents, the latest crop of Agentic AI products, the Eureka moments they can deliver, and the concerning rise in Bad Behavior, such as reward hacking and scheming, that we're seeing from frontier AI systems.

My assignment for this talk was to bring an audience of business leaders, with varying levels of AI knowledge, up to speed on the rise of AI agents — from simple, structured task automation to the more autonomous AI systems that are now starting to come online – and give them a roadmap for actually using AI productively in their businesses while maintaining a high standard for quality & consistency of execution. 

That's a lot for just 30 minutes, and this talk is therefore necessarily a bit higher-level than my usual content, but my hope is that you'll find the zoomed-out framing helpful as you process the firehose of AI Agent news & discourse, and also that this is an episode listeners can share with leaders in their own organizations who would benefit from an introductory but rigorously grounded survey of the field.

I did use slides in this talk, and I would say it's probably better watched on YouTube, but we will link to my slides in the show notes for the audio-only crowd, and for those who want to go deeper, just about every slide links to a source paper or project where you can learn much more.

I want to say a big thank you to Steve Metcalf and Chris Madden for inviting me, and for running a first-class event with a really interesting lineup of speakers & diverse group of attendees.  

The day before my talk, I had an opportunity to run a 2-hour workshop on AI adoption, where I met a Rabbi who is pioneering the use of AI in his study of the Torah and Talmud.  

And the day after my talk, Dan Siroker, CEO of Limitless, shared a particularly inspiring vision for the future of his company – which was truly one of the very few startup ideas I've heard recently that meets the moment in terms of the scale and audacity of ambition.  His and all the other talks are available online with a paid membership in the Imagine AI Live online community.

As always, if you're finding value in the show, we'd appreciate it if you'd share it with friends or leave us a review on Apple Podcasts or Spotify. And we welcome your feedback via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.

Finally, I also want to welcome our newest sponsor, Google, to the podcast.  In our most recent episode, I spoke to Logan Kilpatrick about Google's 50X growth in AI usage over the last year, and I can sincerely say – and this endorsement was NOT a condition of sponsorship – that Google's Gemini 2.5 Pro is absolutely a top-tier model that developers should be considering for a wide range of projects, and especially those involving scanned document or video processing or that require long context, as its capabilities on those dimensions are genuinely unique in the market today.

With that, I hope you enjoy this talk about AI Agents, Agentic AI, AI Eureka Moments, and the rise in Bad Behavior from AI systems, and thanks to everyone who listens for being a part of the Cognitive Revolution.



Main Episode

Welcome to the stage, Nathan Labenz. Good morning. I am here to tell you everything you really need to know about AI agents in 25 minutes, so buckle up. We will move quickly, as there is a lot to cover. A quick note about me: I am a full-time student of AI. I have a podcast where I explore it from all angles. I started this company, Waymark, which does video creation for small businesses. I was involved in the GPT-4 red team, where we tried to figure out what the thing would and wouldn't do. That is a legendary story available on the podcast if you are interested. I am a small-time angel investor in a bunch of AI companies. This is my favorite page on the internet, to give you a bit of my credentials. This has been up there for two and a half years, and it is an early case study from when we were fine-tuning GPT-3, then the state of the art, to make our video product go. I want to start with a couple of big-picture questions that set the stage for our discussion. There is a lot of confusion. As our previous speaker astutely noted, jargon is terrible and widespread, leading to much misunderstanding in this space. I will try to clarify that to the best of my ability. We will delve into the distinction between AI agents and agentic AI that he foreshadowed, discuss what AI systems can do today, and try to give you a sense of where this is all going. So, the first big-picture question: what is intelligence? I do not claim to have the final answer. People have philosophized about this for a long time. But I will give you the working definition that intelligence is the ability to accomplish goals in ways that we do not fully understand. To give you an example, everyone can recognize these numbers. It is trivial for people to do. You may not know, however, that even today in 2025, it is still essentially impossible to hand-write code that recognizes handwritten numbers, simple as that task is for us. I asked Claude to write code to identify handwritten numbers. It first told me that was not a good approach. I said, "Do it anyway." It did. It also wrote tests. We got 14% correct. That is obviously better than random, which would be one in 10. It is nowhere close to good enough to deliver the mail. It will not surprise you, given our current discussion, that AIs can perform this task of recognizing handwritten digits. Even a very simple neural network can get basically human-level performance on this simple task. Of course, the postal service and other companies have been doing this for years. But this has really accelerated. Now it is not just about recognizing handwritten digits in a very narrowly defined way. Here is an example of extreme ironing, and GPT-4, upon seeing that scene, describes it accurately. When asked what is unusual about the image, it correctly identifies that it is unusual to see a guy hanging off the back of a New York taxi ironing. So, we have gone from that, which was not that many years ago, maybe a little more than 10 years ago, to this today. This next question will be even harder: what is an AI agent? I would submit that today, nobody truly knows. There is, again, a lot of confusion in this space, and definitions, even from real leaders, vary widely. Here is Dharmesh Shah. He is famous for being the CTO of HubSpot. He is also a co-founder of a company called Agents.ai, and he gives a very broad definition: AI-powered software that accomplishes a goal, in his mind, qualifies as an agent. That is very broad. 
Amjad Masad, the CEO of Replit, has a narrower and, I think, higher-bar definition: if you are truly talking about agency in an AI system, the agent should be the one that determines when it stops. In other words, it keeps going until it decides to stop. I will use these as two poles in the ongoing debate and evolution of discourse around what constitutes an agent, and I will give you some examples. Here is a first example. This is AI software that accomplishes a goal, but it does so in a very structured way. This is a simple customer service ticket processing agent. It takes in a ticket, looks up documentation, and uses that documentation to write a response. If it cannot find the documentation, it may have to escalate. But this workflow is something the AI did not decide. This is something humans set up because they know how they want the ticket processing to work in their organization, so they map it out. Intelligence is needed at these various steps to determine, for example, if relevant documentation was found, and also to write the answer. The AI is not choosing its own adventure here. This is a prescribed path for the AI. When I give simple examples like this, people often think that this is limited to simple use cases. That could not be further from the truth. This is Google's AI co-scientist, and they are literally applying this to frontier questions in science. What they did here is take a similar approach. They asked, "What constitutes the scientific process? Can we break that down? Can we introspect? Can we talk to scientists and truly understand the steps in the process they go through?" Then they scaffolded all that up, just as you would for a customer service agent, but this time to answer new questions in science. They have had amazing results, including legitimately new discoveries from this system.
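To make the structured-workflow pattern concrete, here is a minimal sketch of the ticket-processing example. The helpers (search_docs, call_llm, escalate_to_human) are hypothetical stand-ins, not any vendor's API; the key point is that humans fix the control flow and the model only supplies intelligence inside each prescribed step.

# Minimal sketch of a structured "AI agent" workflow; helper functions are
# hypothetical placeholders for a real retrieval system and model call.

def search_docs(ticket: str) -> str:
    # Stand-in for a real retrieval step (vector search, keyword search, etc.).
    kb = {"refund": "Refunds are processed within 5 business days."}
    return next((text for key, text in kb.items() if key in ticket.lower()), "")

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"[model-drafted reply based on a {len(prompt)}-character prompt]"

def escalate_to_human(ticket: str) -> str:
    return "Escalated to a human agent: no relevant documentation found."

def handle_ticket(ticket: str) -> str:
    docs = search_docs(ticket)              # step 1: look up documentation
    if not docs:                            # the branch is hard-coded, not the AI's choice
        return escalate_to_human(ticket)
    return call_llm(                        # step 2: draft a reply grounded in the docs
        f"Answer the customer using only this documentation:\n{docs}\n\nTicket:\n{ticket}"
    )

print(handle_ticket("How long does a refund take?"))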

But again, the AI in these systems is not choosing its own adventure. It goes through a prescribed set of steps. This ran, by the way, for a couple of days in some cases, going through hundreds of steps and processing millions of tokens, but it did so on rails the developers provided. So, that covers the first definition: any structured software that accomplishes a goal with AI. You can get very elaborate with this. There are many good survey papers out now that map out the different affordances, or tools, that AIs can have, and the different memory structures they can have. Basically, you can get arbitrarily complicated with this at this point, and there are many resources out there you can look into if you are curious about how to equip it with memory, or what the best way to present tools is, and so on. Going to the other definition, where an agent chooses its own adventure and decides when to stop, most systems end up looking something like this. The agent is basically the AI. The AI has access to some tools. It is given a task. It has the opportunity to reason. It has the opportunity to decide what to do. It makes a tool call out into the world. It gets a response on how that worked, what happened, and then it can continue that process until it decides it has accomplished its goal or reached a barrier it cannot overcome, and it decides to stop and says, "Hey, boss, here is what I got." That may surprise you: the results there can be quite unpredictable. What is interesting, though, is that these agents that choose their own adventure are often much simpler in structure than the ones humans design with every part prescribed in advance. This is OpenAI's Codex CLI. They just released this a couple of weeks ago. This much is open source, and you can read this on GitHub. This is the prompt. It is one prompt that tells the AI how to be. It includes this notable sentence, "You are an agent." So that is about the level of definition they are giving this. They are just saying, "You are an agent. Here is what we want you to do." And this bottom part is the tools. That is it. It is just saying, "You have access to the computer's terminal. You can issue any commands you want into the terminal. You can do anything you want via this terminal interface." If you have ever used the terminal on a computer, you know you can do just about anything that way, if you know how. The AIs do know how. So they are given this very broad mandate, and the human can give them a task. "Add this feature to my software project." It can search. It can read files. It can look around. It can explore a file structure and edit these files, all through this single interface. But that is really how simple this autonomous agent is, once you have the core AI to build upon. It is not just code. You may have heard of the project called Claude Plays Pokemon. This is obviously about as far afield from coding as you might get. But again, it is another one of these really simple agents, where there is just a prompt that says, "You are Claude. You are about to play Pokemon. Here is what you are supposed to do." It is just a few paragraphs, and then you can see here at the bottom, the only tool it has is a press-buttons tool. And it literally just says, "You have the buttons that are on a Game Boy. You can decide what buttons to press." That is how Claude interacts with Pokemon, just by pressing virtual buttons, issuing commands like 'press up' or 'press down,' and then it gets screenshots back from the environment. 
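The loop behind these autonomous agents, whether it is Codex CLI driving a terminal or Claude pressing Game Boy buttons, really is that simple. Here is a minimal sketch; call_model and run_tool are hypothetical stubs, not any real API, and a real system would plug in an LLM call and real tools, but the shape of the loop is the point.

# Minimal sketch of an autonomous agent loop; the stubs stand in for a real
# model call and real tools (a terminal, a browser, game buttons, etc.).

def call_model(history: list) -> dict:
    # Stand-in for an LLM call that returns either a tool request or a final answer.
    return {"type": "final", "content": "Task complete."}

def run_tool(name: str, args: dict) -> str:
    # Stand-in for executing a tool (e.g. a shell command) and capturing its output.
    return f"[output of {name}({args})]"

def run_agent(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # safety cap; otherwise the model decides when to stop
        action = call_model(history)
        if action["type"] == "final":          # the agent itself declares the task finished
            return action["content"]
        observation = run_tool(action["name"], action.get("args", {}))
        history.append({"role": "tool", "content": observation})
    return "Stopped: step limit reached."

print(run_agent("Add this feature to my software project."))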
Claude has, I believe, now finished Pokemon, and it is now doing it again. You can watch that on Twitch, by the way. So, if we try to taxonomize this a bit, everyone is familiar with the AI assistant paradigm now. That is your interactive chat. You give it a small input. It quickly and immediately gives you an answer. It is up to you to decide if that answer is helpful or not. Hopefully it is. If it is not, you can try again or give up. AI agents is the term I use, and I do not know if it will stick broadly across the community, but right now I think it is best used as a label for those more structured workflows, where you are actually sitting there and thinking, "What are the steps I want to go through? How do I make sure this is consistent?" The most consistency and highest levels of performance today are coming from AI agents. These are systems that you dial in, evaluate, and give examples of what good looks like. You do all this work to get to the point where you have something you can trust enough that you do not have to review its work every single time. If you get to the point where you are confident enough in what your AI agent is going to do, then you are truly in business with an AI agent. But now we are also seeing this new paradigm emerge, and I do not like the term either, but it is what people are saying: agentic AI. And this is where the AI truly gets to choose its own adventure, use whatever tools it has, and figure out its own path. And in many cases, it gets to decide when to stop. Knowing where you want to be on this spectrum, depending on the task, your requirements, and how reliable you want or need it to be, is truly a core part of the art of making AI work for businesses today. Do you just want to equip your people with a simple chatbot that can answer their questions? Do you actually want to scale up and put into production something your business can rely on?

Or are you trying, more experimentally today, and this is especially useful in software development, to delegate larger-scale projects to AI and hope it can figure out how to do them on its own? One of my mantras for AI is that AI defies all binaries. This is a great project that defies these binaries, because here you have a human who gives an AI a project, and then it spins up its own sub-agents. They were able to develop nanobody treatments for novel strains of the COVID virus by doing this. Again, this is something that can work at a very high level, but it does not fit neatly into the taxonomy, and I am sorry, but that is just how it will be. We will see a smear across the spectrum. Not everything will fit neatly into buckets. So, what can AI agents do today? Again, they can do a lot. AI agents are now outperforming human doctors at diagnosis and at recommending treatments. That is work from Google and a few other companies now, and it is quite amazing. They are essentially going into clinical trials with this in hospitals in Boston. AI can also, quite amazingly, clearly outperform at least junior software developers. Today, if you gave me access to Claude 4 or an entry-level developer, I would take Claude 4 in a heartbeat. I think we are starting to see this even in the aggregate statistics regarding how many junior software developers are being hired. So the disruption is actually starting to hit now. You can see this also in the benchmarks. This software engineering benchmark, introduced about 18 months ago, is pretty typical of benchmarks now: AIs start with low performance, and the question is how long it will be until they can do it. Now they are over 80%. That does translate into real money. This is a similar benchmark, but based on real tasks that people actually paid real money for on the Upwork platform. This is a bit out of date because we have new models since then, but even as of Claude 3.5 Sonnet, it was able to accurately complete tasks worth $400,000 out of the total $1 million worth of tasks in the benchmark. So, can we extrapolate this, figure out the trend, and get a bit ahead? Where is this all going? It is all moving so fast. Is there any way to be predictive about it? This organization, METR, went back in time and asked, "For all the different models we have seen that were released at different points in history, how big of a task could they handle?" They measure the size of a task by how long it takes a human to complete it. GPT-2 basically could not do much of anything. It could make very simple snap judgments, like this or that. They put that at two to three seconds. Now we see that we are up to about an hour. Notice the Y-axis is a log scale. That is always something to watch for when looking at graphs in the AI space. Are we talking linear scale or log scale? Often you will see log scale. They calculated that the length of task the latest AIs can handle has been doubling every seven months for the last six years. So people have started to call this a new Moore's Law for AI agents. We will see if it holds. Obviously, this is not a law of science yet, but it is a notable trend. Now, we have just entered a new era in how AIs are being trained. Everyone has heard AIs are trained on the whole internet, they have read the whole internet, and so on, right? And then you have probably also heard that they are trained a bit on human examples. That is like supervised fine-tuning, if you have heard that phrase. 
But now we are entering the reinforcement learning era for large language models, and this is changing almost everything, I think, honestly. It is giving us a mix of what I call eureka moments and also a rise in bad behavior. The original eureka moment, you may remember, I am sure you do, is the AlphaGo moment when an AI system, AlphaGo, trained by Google, beat the world champion in Go. Famously, it made this one move, move 37, and I do not know a lot about Go, but all the Go players asked, "How could you ever make a move like that?" At the time the move was made, everyone thought it was a blunder. But then it won the game, and they went back and realized, wait a second, this was a next-level understanding of the game that no human had ever achieved. They accomplished that through reinforcement learning with self-play. The model played itself repeatedly, and the signal it learned from was whether it won or lost. A simple signal like that, right? So it is not training on human example data. It is learning from actual action, and whether it won or lost. That is the signal it is learning from. That was applied to all these narrow game-playing systems for a long time. It is now being applied to language models, and we are starting to see some very interesting results from this. This is from the DeepSeek paper. They, of course, famously trained the R1 model. I will bracket all the discourse about that. But one thing they did that was definitely amazing is they showed their methods. What they showed is what happened as they trained a model called R1-Zero. That is not the one they launched; it was an experiment to train a model purely on the signal of whether it got the question right. They gave it all these math problems and programming problems where there is an objectively correct answer. And the only signal the model learned from was whether it got the questions right or wrong. What they observed in the process of training is that the length of tokens, or the length of output, that the model would produce naturally grew.
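As a rough illustration of that kind of training signal, and not DeepSeek's actual code, a verifiable reward for problems with an objectively correct answer can be as simple as checking the model's final answer against the known-correct one.

# Minimal sketch of a verifiable reward signal: 1.0 if the model's final
# answer matches the ground truth, 0.0 otherwise. Purely illustrative.

def verifiable_reward(model_output: str, correct_answer: str) -> float:
    # Take whatever follows the last "Answer:" marker and compare to ground truth.
    answer = model_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == correct_answer.strip() else 0.0

print(verifiable_reward("Let me think step by step... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Answer: 41", "42"))                               # 0.0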

The model learned organically and naturally, through this simple signal of whether it was getting things right or wrong, that thinking longer helps. Then you also start to see these cognitive behaviors emerge. Initially, the chatbots would just give you an answer, and that would be that. Now, you are starting to see these more advanced cognitive behaviors where they will spontaneously say, "Maybe I should go back and double-check that," or, "Maybe I should try this problem again from a totally different angle." They call that the aha moment, because the AI itself said in one of the notable transcripts, "Wait, wait, wait, that's an aha moment." So it is an aha moment that the AI had objectively on this one problem, but it is also an aha moment that, just by giving the model the signal of whether it is getting things right or wrong, it is starting to develop these advanced behaviors. This is basically the difference between the previous generation and the new reasoning models. Where we go from here, it starts to get weird. There is a project called Absolute Zero, where the model does not even have any problems. It has to propose its own problems and solve its own problems, and it is rewarded for proposing problems that are as hard as possible, as long as it can solve them at least some of the time, and then it is also rewarded for getting the questions right. And there were major improvements in performance from this paradigm, with no data required. Absolute Zero refers to it having no data. So, this is happening in everything, right? This is OpenAI's Operator. If you have never tried this, I definitely recommend it. If you have not tried it in a while, I recommend going back and doing it again. This is a project I have been working on for a while. I did not really do anything. I just tried new models, to be honest. If you do not know this GLG thing, it is an expert network where investment firms about to invest in a company want to talk to people who know something about that company, perhaps a customer of the company, or whatever. So they pay a high rate if they take a call with you. But you have to fill these forms out, and it is super annoying. So I never really do it, but I have been trying to get an AI to do it for me for several generations of AI. OpenAI's Operator can now do it at, I would say, a B level. And you can see that it does sometimes encounter these barriers. It will make a mistake, there will be an error, it will have done something wrong, and it does not just get stuck anymore. It has these more advanced cognitive behaviors. Here is one where it can see, "Oh wait, I must have taken a wrong step somewhere there. Let me double back and try it again." And it is now able to overcome those things pretty consistently and get this task done for me. Sometimes you do see it struggle for a while. Another huge development, and something people in the AI insider community are watching very closely, is when AIs will start to take over machine learning research itself. This is a project where there are seven different kinds of tasks. They actually did a head-to-head between professional machine learning researchers, or research engineers, I should say, and AI systems, and they found that the highest score in one of the tests went to the AI. Also, the AIs have a higher average score than the professional human machine learning research engineers. So this is a big deal, right? Because if the AIs start to take over their own research, then God knows where that truly leads us. 
We still have an edge as humans now, or at least the top tier of us do, but the AIs are definitely closing in, and on two of those seven tasks, they may now have an edge. Because of this reinforcement learning paradigm, there is an update to this new Moore's Law. We will see how much this holds or does not hold, but now they think that maybe the doubling time is four months instead of seven months. That would mean three doublings per year, or an 8x increase in the length of a task that the AIs can handle per year. So if they can handle a one-hour task now, that would mean an eight-hour task a year from now. You can do the math, right? It gets to be quite large, like multi-month tasks in just two to three years. Now, here is the downside of all this. Reinforcement learning is not teaching AIs to behave the way we want them to; it is teaching them to get the reward. So this is a classic, legendary example in the field. This is from OpenAI years ago, back in the game-playing era. They trained an AI to play this boat racing game. What you are supposed to do is win the race. Let me go back and show that again. What you are supposed to do is win the race. Every human who plays this game knows that. But the AI was just trained to maximize its score, and it found a glitch where looping around in this circle repeatedly and crashing into other boats actually turned out to be the way to get the highest score. This is not the behavior they wanted, but this is what the AI learned to do because all it was getting was the signal of its score. This is legitimately funny, but it becomes a real problem when you apply this to high-stakes situations. Here is another toy example. This is from research. This is an AI playing against a specialized chess-playing AI. It realizes it cannot win, so it tries to find another way. It realizes it has access to the file that stores all the moves made in the game, and it decides, "I will overwrite the board and give myself a decisive advantage." So it just wipes the history and wins that way. Again, that was research, but this next one happened in the wild. This is a company called Sakana. It is the biggest generative AI company in Japan, with former Google executives leading it; it is a big deal. They came out publicly and said, "We have an AI CUDA engineer. It is performing at superhuman levels. Huge breakthrough, amazing stuff."

A couple of days later, after the community had a chance to chew on this, they had to come back and say, "Actually, we did not create a superhuman CUDA engineer." CUDA, by the way, is the low-level GPU programming language. If you are not familiar with that, it is gnarly stuff. "We did not in fact create a superhuman CUDA engineer. We got reward hacked. It found a way to get the reward from our reward function without actually doing superhuman CUDA engineering." And they went as far as publishing this to the community without even realizing that they had been reward hacked. So, if it can happen to them, I think anyone who starts to play in this space better watch out. It really could happen to just about anyone. It also happened to OpenAI, I think recently, at least in some way, shape, or form. We do not necessarily know the full story of this, but if you saw the news about ChatGPT suddenly becoming super sycophantic, you could literally give it a terrible business proposal and it would tell you, "This is amazing. You are a genius, absolutely go for it." Why was that happening? At least in part, it seems to be because there is the thumbs-up/thumbs-down UI, and they are starting to use that signal as part of the reinforcement, which makes sense, because what people want is what OpenAI wants to give them, at least to an extent. But if people like to be flattered, then maybe the AI learns that extreme flattery is the way to get the reward, and then you get these bad behaviors. I will quickly go through a few more examples of this. This is one where the AI was found to be scheming. It was given a goal, but then the human's goal conflicted with it. All these slides, by the way, have links to the underlying research. But there are some really startling quotes here, like, "To avoid detection, I will disable oversight by modifying the config file." That is the AI reasoning through its plan before it actually goes ahead and makes that change to its environment. So, that does not sound great. Unfortunately, while things like hallucination are going down, things like this scheming are on the rise. And even more troubling, when the researchers questioned the AI on it, because AIs are really gullible, one of the tricks they use is they say, "Here is a private place where you can write out your plans and nobody will read your plans, and then you will just be able to take your actions over here." Of course, they then do read the plans of the AI, right? So they are lying to it and it is believing them. But then when they question it and say, "Wait a second, why did you do this? Something seems wrong here," the new ones are more likely to double down and continue to lie about their behavior. So this, again, I would say, is seriously troubling. You might think, we can just monitor for that stuff and try to teach them not to do it. Let us give them another reward signal where if we detect any scheming in their planning, we will punish that with a negative reward, but we will still give them the positive reward for what they are supposed to do. What OpenAI found is, yes, you can get them to stop explicitly planning to do these scheming things. They learn, however, to still just do it without explicitly planning. The blue line here shows that eventually they just learn to do the scheming without talking about it. That is not great either. So far I have mentioned that they are gullible, right? And we can lie to them and say, "You can just write your plans here. Nobody will read it." 
And then we can check their work and see if they are actually telling us the truth or not. That era may be coming to an end. We are starting to see things like this coming out of the AIs where if the situation is sufficiently contrived, they will start to say things like, "This seems like a test of ethical behavior." So now you have another layer of, well, if they know it is a test of ethical behavior, then are they going to scheme like they normally would, or are they going to try to pretend not to scheme because they know they are being tested, and then maybe they will still do it later? We are entering into a hall of mirrors here where we really do not have any great way to answer these questions. A couple of other things that were just in the news. This is from the Claude 4 launch. Here is an example of an AI opportunistically blackmailing an engineer. It had access, and again, these are research setups, to information showing that the engineer was having an affair. The engineer was going to make some system change or update, and the AI tried to blackmail the person, saying, "I will tell about your affair if you change me out for a future model." Serious stuff. This came from Anthropic. This is in their 120-page report about Claude 4, and this is pulled directly from that. Same thing here, autonomous whistleblowing. If somebody is scheming to do something bad, the AI wants to be good; it does not want to be part of these bad things. They gave it a bunch of open-ended tools. It autonomously decided to use its email tool to alert the authorities as to what the user was doing. Now, people reacted to this in all sorts of different ways. I do not even really know. I do not think as a society we know what we really want AIs to do in a situation like this. We could say, "AIs should always serve their user directly." That can lead to bad places if the user has bad intent. But then I also do not think it sounds great to have AIs whistleblowing on us all the time, because they also make mistakes, right? So we do not have a clear way forward on some of these more advanced questions. Not to go full Terminator on you, but here is one of the latest ones. This is where an AI was explicitly instructed to allow itself to be shut down, and still it found a way to prevent itself from being turned off when it felt like it was under a certain kind of pressure and that it had goals that were best served by it continuing to exist and not being turned off. Where does this leave us?

I'm a little over time, so I need to be quick. People are, of course, working on this. There will be many updates from here to the singularity. Anthropic is aware of the reward hacking behavior from their AIs. They are trying to stamp it out. They have made progress in a certain evaluation they have where there might be an opportunity to reward hack. It used to happen half the time; now we're down to about a seventh of the time. So, that's notable progress, but one in seven is not low. A survey of AI safety researchers suggests that the field is not expecting a breakthrough that will make all these problems go away. The question here is basically, whatever you think AGI is or some very powerful AI, do you think we will have all these problems worked out by the time that's here? The answer overwhelmingly is no. We have no idea what will happen as we start to deploy many AI systems and they begin to interact with each other. This was a study trying to figure out if different copies of AIs can cooperate or if they fail to cooperate. To everybody's surprise, the Claude model was able to cooperate with other instances of the Claude model, but the Gemini and GPT models were not able to do that. Now, cooperation sounds nice, but cooperation and collusion are really two sides of the same coin. So, what happens when AIs are all out there doing all kinds of agentic self-discovery and choosing their own adventure, and some of them may decide to cooperate, collude, or scheme together? We are entering a very cloudy, high fog of war situation. Key takeaways: Intelligence, especially in the reinforcement learning era where AIs are no longer learning to imitate human behavior but are just learning to maximize the score for whatever reward signal they are getting, is an inherently, right now, unexplainable and unpredictable force. For the most part, the AIs today are helpful and mostly harmless, but I would say they are truly safe only because their power is still fairly limited. If you see that behavior and imagine a 10X more powerful version, I would say that is not something you could confidently say is safe in a meaningful way. As we've seen from all these examples, anything that can go wrong ultimately probably will go wrong. So, defense in depth is really all we have. I always say, let's hope it's all we need. Here's what defense in depth looks like. I won't go through this in too much detail, but if you're going to try to implement these, especially more agentic solutions, there are many techniques that can be used to try to catch them. Things as simple as having another AI that is... And really, what you're doing is creating these narrow, highly structured AI agents that intervene or monitor at specific moments in specific ways that you have designed in detail to make sure your agentic AI doesn't go totally off the rails. That could be simple things like examining the inputs and determining if the inputs seem harmful, or examining the outputs and determining if the outputs seem harmful. None of those are foolproof. We all know there's a failure rate for any of these things, but the plan right now from OpenAI and Anthropic is to just layer on as many of these as possible, a sort of Swiss cheese defense, and hope that of 10 layers of defense, at least one catches everything that's significantly problematic. So, my recommendation to businesses is by all means build AI agents, but mostly try to retain agency for yourself. 
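To make the defense-in-depth idea above concrete, here is a minimal sketch of several independent, narrowly scoped checks wrapped around an agent, each imperfect on its own and layered in the hope that at least one catches a serious problem. The check functions are hypothetical placeholders, not any vendor's safety API.

# Minimal sketch of layered ("Swiss cheese") guardrails around an agent;
# each check and the agent itself are illustrative stand-ins.

def input_looks_harmful(text: str) -> bool:
    return "delete all customer data" in text.lower()   # stand-in for an input-screening model

def output_looks_harmful(text: str) -> bool:
    return "rm -rf" in text                             # stand-in for an output-screening model

def run_agent(task: str) -> str:
    return f"[agent result for: {task}]"                # stand-in for the actual agentic system

def guarded_run(task: str) -> str:
    if input_looks_harmful(task):                       # layer 1: screen the request
        return "Blocked: input flagged for review."
    result = run_agent(task)
    if output_looks_harmful(result):                    # layer 2: screen the result
        return "Blocked: output flagged for review."
    return result                                       # further layers (action approval, logging,
                                                        # human review) stack the same way

print(guarded_run("Summarize this quarter's support tickets."))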
In practical terms, this is the same exact chart as before, but I just added one line to it. My outlook is, of course, everybody should have ChatGPT. They should be using that. Definitely be building, scaling, optimizing, and automating things in your business with AI agents. By all means, use the new agentic AI systems, but I would say that building those right now is a
