AI Scouting Report: the Good, Bad, & Weird @ the Law & AI Certificate Program, by LexLab, UC Law SF
Nathan Labenz presents an AI Scouting Report for UC Law SF’s Law & AI Certificate, surveying frontier models’ capabilities, legal use cases, safety risks, and emerging issues like deception, reward hacking, and governance.
Watch Episode Here
Listen to Episode Here
Show Notes
This special AI Scouting Report episode from the Law & Artificial Intelligence Certificate Program surveys the current AI landscape for legal professionals. Nathan Labenz walks through the “Good, Bad, and Weird” of frontier models, from using AI to navigate his son’s cancer treatment to emerging forms of deception and reward hacking. He highlights how new systems are pushing the boundaries of math, physics, and legal performance while raising serious safety and governance questions. Listeners will come away with a fast-paced, source-rich overview of where AI is today and the strange future it’s steering us toward.
LINKS:
- Google: Try Google's latest and greatest model, Gemini 3.1 Pro, in AI Studio or the Gemini app.
- Presentation Link
Sponsors:
Tasklet:
Tasklet: Build your own Cognitive Revolution monitoring agent in one click.
Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
VCX:
VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
CHAPTERS:
(00:00) About the Episode
(03:04) Special Sponsor
(05:23) Comprehensive AI overview (Part 1)
(16:01) Sponsors: Tasklet | VCX
(18:53) Comprehensive AI overview (Part 2) (Part 1)
(34:39) Sponsor: Claude
(36:51) Comprehensive AI overview (Part 2) (Part 2)
(59:43) Reward and sentience
(01:04:14) Regulating bad actors
(01:09:30) Corporate AI strategies
(01:14:59) Episode Outro
(01:17:55) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.
Introduction
Hello, and welcome back to The Cognitive Revolution.
Today, I'm sharing an AI Scouting Report presentation that I recently gave as part of the Law & Artificial Intelligence Certificate Program, by LexLab at UC Law San Francisco.
My talk was on Day 1 of the week-long program, and my role was to set the stage for the more focused discussions that followed throughout the week by giving the most comprehensive and current survey of the AI landscape that I could fit into a single timeslot.
If you've seen previous Scouting Reports, the structure of this talk will be familiar – I again broke things down into the Good – including my use of AI to help navigate my son's cancer treatment – the Bad – including the rise of deception and other advanced forms of reward hacking – and the Weird – including the fact that models now recognize when they are being tested at such a high rate that all of our safety tests are called into question. I concluded with a bunch of important questions at the intersection of AI & the Law that I personally wish I had answers to, and finally opened things up for Q&A.
My goal was to make sure everyone had an accurate sense of how far AI capabilities have come, both in general and specifically as it's being used in the legal profession, while also highlighting the increasingly hair-raising bad behaviors we continue to see from each new generation of frontier models.
I zoomed through 90 slides in just over 45 minutes, and while that might feel a bit overwhelming, the dizzying pace is itself a big part of the point.
Even I, as someone who's managed to make it my full-time job to keep up with AI developments, can no longer keep up with everything. And in the course of updating these slides, which I hadn't touched since October, just before my son got sick, I was once again amazed by how much had happened in just the last few months. The latest frontier models started to push the frontiers of math and physics, achieved parity with expert professionals on GDPval Legal and a number of other task types, and started to make general-purpose AI agents really work for the first time. And we also got some glimpses of the strange future we are racing toward, with the first public hit-piece written by an AI agent about a human, the first explicit public timeline for autonomous AI research from OpenAI, and – in the same week – Anthropic's retraction of their previous safety commitments and open conflict with the US Federal Government.
One practical tip I learned while doing this is that Grok is, if nothing else, outstanding for Twitter search. Over and over I asked it to "find and link tweets" about various topics, and it saved me quite a few hours that I previously would have had to spend hunting and pecking to track down sources. The upshot is that just about every slide contains a link to source material where you can learn more about the eureka moments and bad behaviors in question.
Please find the link in the show notes if you'd like to dig in on anything in particular, and otherwise … buckle up for a breathless overview of the current AI landscape, from Day 1 of the Law & Artificial Intelligence Certificate Program, by LexLab at UC Law San Francisco.
Main Episode
Nathan Labenz: Thank you for having me. Sorry I couldn't be there in person, but again, I appreciate the kind intro. I'm going to try to give you guys probably the fastest talk you've heard in quite some time. I've got 90 slides, and I'm going to try to give you the most comprehensive overview I can of everything that's going on in the AI space, which is, to say the least, a very tall order. Super quickly about me: I started the company Waymark, and I now host the Cognitive Revolution. There's some interesting lore around my participation in the GPT-4 Red Team; there's a long podcast and a Twitter thread about that if you want to learn more about the backstory. These days, I also do a little bit of angel investing. My favorite page on the internet is the case study that my company, Waymark, earned with OpenAI way back in the day, when it was still GPT-3. We were early adopters of this stuff because at that time, it was really only good for doing simple things like writing marketing copy, but that's exactly what we needed it to do. So we became early adopters, and I basically became totally obsessed with the technology as I got to know it better and better. Today, as I said, I'm going to try to do kind of everything everywhere all at once: start with some conceptual stuff, and then go into a mix of eureka moments, bad behavior, WTF moments, and some big open questions at the end. And believe me, there are plenty of open questions. So just briefly on what I do: I call it AI scouting, and because that's a term of my own invention, it does bear a little definition. I would define it as maintaining situational awareness for fun, profit, and the public good. I find it personally extremely interesting; I basically have a never-ending curiosity to learn about this stuff. It has actually worked out to be a somewhat decent business model for me personally, but my real hope is that I can inform others and do my small part to nudge us toward a better AI future by helping other people get calibrated on where we really are in this technology wave, because it is coming at us extremely fast. This is just the taxonomy of all the different AI jobs that I've cataloged over time. And don't worry, I will give you all the slides; you don't by any means have to read this. I would say the AI scout role is still one of the more hypothetical or speculative, but we are starting to see CEOs more and more say, hey, I'm hiring a person specifically to keep up with AI developments. And I think once you see all the things on here, you'll see that that's, at a minimum, certainly not a crazy thing for some CEOs to be doing. Who should be hiring AI scouts? In my opinion, a lot of different organizations. I would even include universities, and certainly policy organizations. It's really too much of a task for anybody to do as a part-time thing now; I have managed to make it my full-time job and still really can't keep up. So I think regardless of what kind of organization you belong to, it's pretty soon going to be time to start thinking about: do we need a person dedicated to just keeping up with AI and making sense of what it can do, and specifically what it can do for us? Okay, here's the real galaxy-brain question: what is intelligence? I don't propose that I have a definitive answer, but the definition that I'll work with, because I think it is intuition-building, is that intelligence is the ability to accomplish goals in ways that we do not fully understand. That can be big or small.
So to take a really classic example, this was an early machine learning success: simply recognizing handwritten digits. It feels pretty quaint today, but one thing that is interesting to reflect on is that still today, we do not know how to write explicit code that can do this task at a high level. I went to Claude and asked it to write some code. It said, this is not a good approach, you should use machine learning. I said, well, it's for a demonstration, so try. It wrote the code, it wrote the tests, and it got 14%, with just a bunch of heuristic guesses, like: maybe if there's a line at the top, it might be a 7 or a 5 or whatever. Obviously, that's nowhere near good enough to deliver the mail. To my knowledge, nobody has ever written explicit, fully understood code that is anywhere close to being good enough to deliver the mail; I think it's topped out at about 80%. Now, you can of course guess where this is going. AIs can do this in a sort of messy, black-box kind of way. Even a very small neural network these days, with all the latest and greatest training techniques, can get a very high success rate; 99.7% is basically human level in terms of recognizing these handwritten digits. Now, of course, this has gone much farther than that. This is from the GPT-4 system card, where they asked the model, what is unusual about this image? And you see a perfectly coherent response that, yes, it is unusual for a person to be hanging off the back of a New York taxi cab doing ironing. That's apparently from the sport of extreme ironing. And that happened, by the way, in basically a decade: from early breakthroughs in basic image understanding to this level was roughly a 2012-to-2022 phenomenon. I give a lot of credit to Ray Kurzweil. These days I use the term "Kurzweil's revenge" a lot, because way back when, he was talking about how everything was going exponential, and how, when compute got to a certain scale, all these capabilities would be unlocked. People generally thought that was crazy at the time, and when it didn't show a lot of signs of happening over the next few years, they basically dismissed him. But that view has certainly come roaring back with the scaling laws. This is the canonical scaling laws graph, which just shows that the more compute you put into models, the more they improve, at a pretty predictable rate. And we're basically right on schedule with Kurzweil's predictions. Next, a couple of what I call common misconceptions that I think are worth taking a minute to clear up, because people who are too anchored to these ideas end up confused by a lot of what's going on.
Nathan Labenz: And by the way, these weren't always wrong, but they're wrong today. The AI landscape has changed, what models can do has changed, and so some of these ideas that might have been right in 2020, '21, '22 are at this point just outdated. Friends don't let friends go around with these misconceptions. The first misconception, and I hear this especially in the legal world, is that hallucinations are so bad that they make models basically useless. That is really not true these days. It has improved dramatically, which you can see on the left in a quantitative way. And then on the right, this Twitter account, Prins, is one of the best commentators on AI on Twitter, I would say, and is specifically a lawyer who uses LLMs every day in his work. He basically reports here that hallucinations are no longer a problem. That doesn't mean they never happen, but they are less common coming from frontier models than they are coming from competent junior associates. He's going to make another appearance a little bit later on. But key point one: hallucinations are not really a problem anymore. Another big idea is that LLMs don't really understand anything. This has been, I think, pretty thoroughly debunked. That's not to say that they understand things in the same way that humans do; these are alien things, right? So how they understand is not necessarily intuitive to us. But at Anthropic, they were able to, through techniques I certainly don't have time to get into here, pull apart the concepts that a language model is representing in its internal state, and not just pull them apart and understand them, but actually go back and manipulate them. And for me, if you can manipulate something, that's really the test of whether your theory holds water. They were able to identify the concept of the Golden Gate Bridge in one of their Claude models, artificially turn it up, and then create the phenomenon you've probably heard of called Golden Gate Claude, where all the model wanted to do, no matter what it was asked about, was talk about the Golden Gate Bridge. They now have that level of understanding, and this is still fairly basic understanding, but they can see inside the model and understand what concepts it is working with at any given time. It's pretty clear that there are real, meaningful concepts that are understood by language models. Another one is that they don't really reason. Again, this one I think was true a couple of years ago, but as of, let's say, the last 12 months, it's definitely not true anymore. This is from the DeepSeek-R1 paper, where they basically showed that if they start to train a model with reinforcement learning, training it on the signal of just, did you get the question right or not, it naturally starts to think longer and longer as it goes through the training process. And not only that, you start to see some of these metacognitive skills come online as well. They called this the "aha moment" because it was an aha moment for the language model and for them. The language model is taking one approach to solving a math problem, and then in the middle, it says: wait, wait, that's an aha moment. Let's reevaluate this. And it comes at the problem from a totally different angle. So we are now starting to see very high-order cognitive abilities emerging through this process of intensifying reinforcement learning.
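To make that training signal concrete, here is a minimal, purely illustrative sketch of the reinforcement-learning loop just described. The toy 10-way answer space, the REINFORCE-style update, and all the hyperparameters are invented stand-ins, not anything from the DeepSeek-R1 paper itself; the point is simply that the only feedback is "did you get the question right," never "here is the correct text."

```python
import numpy as np

# Toy policy: a softmax over 10 possible answers stands in for the model.
rng = np.random.default_rng(0)
logits = rng.normal(size=10)   # stand-in for the model's weights
CORRECT = 3                    # ground-truth answer for this one question

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(500):
    p = softmax(logits)
    answer = rng.choice(10, p=p)                 # the model attempts the question
    reward = 1.0 if answer == CORRECT else 0.0   # the whole signal: right or not?
    # REINFORCE-style update: make rewarded answers more likely next time.
    grad = reward * (np.eye(10)[answer] - p)
    logits += 0.1 * grad

print(f"P(correct answer) after training: {softmax(logits)[CORRECT]:.2f}")
```

Run it and the probability of the correct answer climbs toward 1.0, even though the correct answer itself was never shown to the policy, only scored.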
And again, is this reasoning in the same way that humans reason? I wouldn't say that, but I would say it is really reasoning in the functional sense of breaking problems down, coming at them from multiple different angles, and showing more and more of these high-order metacognitive abilities. Okay, then the final one: people often say large language models just predict the next token, so there's got to be some upper ceiling to that, right? There's often kind of a fuzzy leap that people want to make there. But I would say that's not really true anymore either. Pre-training, where you take the whole internet and teach LLMs to predict, given some text, what comes next, that is classic next-token prediction. But these days there is so much reinforcement learning being done. And again, reinforcement learning is giving a signal: did you get the question right or not? It can get more complicated than that, but the basic signal is no longer "here is some text, can you predict what comes next?" It's "did you get the question right?" or "did you complete the task in a satisfactory way?" So they are not now just being trained to predict the next word. They are being trained to do things correctly. And that is starting to have, at least in some cases, some weird side effects. This is a report from a research group called Apollo that did a partnership with OpenAI and got access to the chain of thought, which we don't typically see as users, but which is happening between when we submit a query and when we get our answer back. And what you see in here is some very strange English. The language models are starting to develop, at least in some cases, their own dialect: things like "now lighten disclaim overshadow, overshadow, intangible, let's craft." That's the language model talking to itself.
Nathan Labenz: It talks about "the watchers." Sometimes people think that the watchers refers to the humans that are evaluating it. So that's kind of weird. This doesn't happen to all language models, and it's not very well understood exactly why it happens. But this is, I think, a good indicator that they're definitely not just predicting the next token, because there is no text on the internet that looks like this, right? This is a language model under intensive pressure to figure out how to get the right answer consistently, or how to complete the task consistently, kind of evolving its own jargon, or its own dialect, or however you want to think about it. So watch that space. Okay, so: eureka moments. Why should we care about AI? What's the good side of this? There are plenty of things everybody has probably seen; I would guess, if you were interested enough to come to this, that you've seen the METR graph. Exponentials are crazy. One of the things that's crazy about them is that almost everything before now looks flat, but the present basically looks vertical. The latest model, Claude 4.6, is of course the highest time horizon they've ever recorded. And the definition of these tasks is: how long would it take a human to do this task? There's obviously a lot of heterogeneity in the tasks, but what do we estimate is the average length of a task, as measured by how long it would take a human, where the AI can do it at least half the time? What they said about this, though, is that the measurement is getting extremely noisy, because they are running out of tasks. It is not easy to create tasks that take 16-plus hours, let alone to go have humans do them, record everything that happened, and have a stopwatch by them the whole time. So this is getting really hard, and this is definitely a theme: language model progress is getting harder and harder to measure as it goes to longer time horizons and farther and farther into expert-level territory. It's just getting extremely difficult for people to keep their arms wrapped around exactly what the capabilities frontier looks like today. One of the things that is really interesting, though, is that progress is mostly driven by the models themselves, and the surrounding structure is not that important on a relative basis. Most of the AI agents that you see basically look like this: you've got the LLM brain that has access to some tools, and it basically runs in a loop. It can use the tools to do something in the environment, get some feedback from that environment, and keep going, usually these days until it decides to stop. It used to be until it ran out of context, but now they're also getting good at compacting the context so they can keep going. So basically, these days you give it a long task and it can go until it stops. And there's been a lot of debate around how important the scaffolding is, but I would just highlight a couple of examples that show how simple it often is. This is OpenAI's coding agent. It's just the model with a prompt and a few explanations around what tools it has, but it literally contains the text "you are an agent," like telling the model: this is what you're supposed to do now. You're supposed to be an agent, and here are the tools. Any command that can run on a computer, you can issue those commands. Have at it.
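To give a feel for how thin that scaffolding can be, here is a minimal, hypothetical sketch of the loop just described. The stop condition, the shell tool, and the `call_llm` stub are all invented for illustration; this is not OpenAI's actual implementation, and a real version would wire `call_llm` to whatever chat-completions API you use.

```python
import subprocess

SYSTEM_PROMPT = ("You are an agent. Reply with a shell command to make progress "
                 "on the task, or reply DONE when the task is finished.")

def call_llm(messages):
    # Stand-in for a real chat-completions call to a frontier model.
    # This dummy ends the loop immediately so the sketch runs as-is.
    return "DONE"

def run_agent(task, max_steps=50):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)               # the LLM "brain"
        if reply.strip() == "DONE":              # the model decides when to stop
            return messages
        # Act on the environment: run the command the model asked for...
        result = subprocess.run(reply, shell=True, capture_output=True, text=True)
        # ...and feed the observation back so the loop can continue.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": result.stdout + result.stderr})
        # (Real agents also compact older messages here as the context fills up.)
    return messages

run_agent("List the files in the current directory.")
```

Everything interesting lives in the model; the loop itself is a dozen lines.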
That is basically all they had to tell the model about its situation to get it working as an agent. And of course, this can go in many, many different directions. You may have heard of Claude Plays Pokémon. Similar thing, right? They just said: you are an agent playing Pokémon, you're going to have these tools, you're going to be able to use the keys on a Game Boy. Very simple instructions to a very smart model are allowing it to explore larger and larger worlds. To kind of quantify how much of this progress is being driven by models versus how much is being driven by the surrounding tooling: this is from the UK AI Security Institute, a.k.a. AISI. And what they basically show here is that for a given level of capability, if you have the best scaffold that they know about, you can get that capability a few months before a new model comes out and just makes that capability easy to access. And it seems that the difference is getting smaller, which makes sense, because these earlier models weren't really trained to be long-running autonomous agents, whereas the new ones are. With the old ones, you were kind of patching their deficiencies and finding ways to "unhobble" them, as it's sometimes described: they're not good at this, they're not good at that, but if we set things up the right way and prompt the right way, we can get them to do things. These days, that gap has really narrowed, because the models themselves are being trained to be autonomous, long-running agents out of the box. That doesn't mean, by the way, that you can't get better performance on narrow tasks with a lot of scaffolding. Here's an example where Google set up their AI co-scientist, a very complicated, many-part system where they really dialed in all these different prompts.
Nathan Labenz: This does go to show that you can get higher performance by putting in the work, but it's not general-purpose performance. The AI co-scientist would not be a suitable tool for all the everyday, general-purpose things you'd use ChatGPT or Claude for. This is, again, a pretty common theme you see in the deployment of these things: trading generality for performance. The more generality you want or need in a product, like, again, a public-facing ChatGPT, the less you're going to be maximizing performance on any one thing. When you do build structure to maximize performance in one particular domain, you're going to become less good at, or even unable to do, the other kinds of things that the general-purpose models can do. So hopefully that gives you some intuition for what's going on in agents; obviously everybody's talking about agents. Here's another interesting example of this, where a virtual lab started with a human giving an AI virtual lab leader a prompt, and then the virtual lab leader was able to create its own coworkers. And together, the AI and its coworkers, with a little bit of input from the human, were actually able to design new nanobodies that treated new emerging variants of the COVID virus. Just as a reference, if you want to look at cognitive architectures, this is a good survey paper that has a bunch of information about that. And this is starting to hit the real world and get to the point where LLMs can make real money on an autonomous basis. On the left is a benchmark that was created by taking a bunch of tasks that people had been paid real money for on, I think it was Upwork, and then seeing how many of these the language models of the day could do. When the benchmark came out, GPT-4o was able to earn 8% of the money; it's denominated by cash, not by tasks. Fast-forward a year and a half, and we are now over 80% with the latest models. A similar kind of experiment here from Andon Labs has a language model running a vending machine. They literally just give it control over a vending machine; it can email out to suppliers, and it has to do everything. They started it with 500 bucks to see how much money it could make over a given period of time. And they're now getting to the point where LLM agents can run a small, simple, but nevertheless real business in an autonomous way, profitably. I've used this slide for a long time because it shows that the AI doctor can outperform a human doctor. This is actually like two years old, but I still like the graph just because it's very intuitive. But I would say I have personally lived the value of the AI doctor over the last three or four months. I won't spend too much time on this, and fortunately he's doing really well, but my son got cancer around Halloween. It was obviously a super scary time, and we were in the hospital a lot; that's actually why I'm not there in person today, because we're still going through the later phases of treatment. But I had tons of opportunities to test the models on a daily basis: here's all the information I have, all the lab results, a write-up of what is happening, put into the language models. I do use them in triplicate: ChatGPT Pro, the latest Claude, and the latest Gemini. And they are step for step with the attending physicians on a day-in, day-out basis. Way better than the residents, to be totally honest with you, but step for step with the attendings. So that has been an absolute difference-maker.
And without that AI support over the last few months, there's no way I would have had the time to keep up with AI well enough to be here to talk to you today. So it has absolutely been a life-changing value for me in that particular way over this last little period. But this is where it starts to get even more out of hand, right? Because I can at least be conversant with my doctor, but now we're getting to the point where the AIs are making new discoveries that no person has ever made before. So this past summer, there was a betting market. I'm sure you guys are all familiar with Manifold and Polymarket. It was at like 40% as to whether any AI would get a gold medal on the IMO math competition. There was some paper that came out that said they're not doing well at math, and that drove the percentage down right before the competition itself. And of course, they won; actually, three of them won. OpenAI's tweet is here. And then this is Terence Tao, broadly considered to be the world's greatest living mathematician, reporting that AIs are now solving unsolved Erdős problems. Erdős was a famous mathematician who went around collecting unsolved problems that he and others he knew couldn't solve, and wrote them all down; many of them remain open decades later. And AIs are now beginning to solve these problems. It seemed like it was happening early this year like once every few days, and while I haven't been paying attention in the last couple of weeks, I'm sure these problems continue to fall. It's not just math; it's also things like cancer treatment. A Google model found a new immunotherapy approach. More recently, ChatGPT, or GPT-5.2 I should say, came up with a new result in theoretical physics. And this again goes back to: man, this is going to be really hard for us to track, because what is a gluon?
Nathan Labenz: I mean, it's ridiculous, right? But they're putting out literal physics papers with newly derived results. In law, this is from GDPval, and it shows, for a bunch of different task types, how the latest models compare to experts. And they do this with three sets of experts: one set of experts writes the task, another set of experts does the task, and then a third set evaluates blindly whether they prefer the human or the AI output. And for the latest three models, if you count wins and ties, they are roughly on par with human professionals. This is again Prins, talking about how these are being used today. He says that today they are used as a replacement for a competent junior associate. They're not great at things like understanding relationship dynamics or long-term negotiations, but they're very good at working through particular documents, and they're also quite good at high-level theory, he reports. This is Kevin Frazier from the Scaling Laws podcast, which, if you want a podcast focused on the intersection of AI and the law, I would definitely check out. He told me recently on a podcast that he's starting to hear more and more that firms are less excited than they used to be about hiring the top student from the top school, and much more interested in hiring somebody who's going to be savvy with AI, because they know that's what is going to drive efficiency and competitiveness at their firm. You can go listen to that whole thing if you're so inclined. Okay, the next big thing that everybody's watching for right now is: when are the AIs going to start to do the AI research? And is that going to lead to some sort of recursive self-improvement, intelligence explosion, runaway process? We don't know, but the organization METR is measuring this, and they have reported here six different tasks, with the AIs beating humans on two of the six. Sam Altman is saying that they expect to have an intern-level AI researcher running in 2026. That's a pretty high level; they don't hire just anybody, even as an intern, obviously, at OpenAI. And by March of 2028, two years from now, he expects that they will have a true automated AI researcher. This would have the effect of taking us from a world where all the progress, all the eureka moments, everything that I've just talked about, has been driven by maybe 10,000 researchers across academia and industry, to, I don't know, 10 million overnight? A billion? It's going to be limited only by the number of GPUs that they can spin up. So people think that could lead to a real phase change, where the progress could accelerate even faster than it has over the last few years. I'll just leave this here for you now. This is what I call the tale of the cognitive tape, and it's basically breaking down cognitive effort into a bunch of different dimensions and indicating where the AIs already have an advantage, versus where we're on the border, versus where humans still have an advantage. The top is where the AIs have the advantage; the bottom is where humans are at least holding out for now. Okay, so what's coming next? One other big thing that is highly neglected is the importance of multimodality. Most of the AI work we do these days is with text, but they obviously can see as well. And this is again from the UK AISI.
They report that on two of the three tasks they tested, the latest AIs, given just a snapshot, a cell-phone picture of an experiment from a lab, are able to troubleshoot the experiment better than a randomly selected PhD with relevant experience who is asked the same question. So this is going beyond question-in, answer-out, to real-world situations that the AIs are able to figure out and help people advance on. This obviously also has implications for biosecurity, if you're worried about, geez, what would happen in a world where crazy people had access to AIs that could help them put together bioweapons or whatever. These are the kinds of things that you would want to know whether an AI can do, and they're increasingly starting to be able to do them. Self-driving: I probably don't have to talk too much about how well this works, but I will recommend my friend Timothy B. Lee's blog post on Waymo crashes. Per Swiss Re, the big insurance company, they're already like 80 to 90% safer than human drivers. But he went through their crash logs one by one and found that basically all the crashes are caused by other humans, other cars driven by humans in the vicinity of the Waymo. The Waymos themselves are almost entirely mistake-free these days. Of course, AIs can also understand things like how to fold a protein way better than any human has ever been able to. They are able to decode our brain states remarkably well. In these pairs of images, on the left is an image that a person looked at while their brain was being scanned in an fMRI, and on the right is what the AI was able to recreate based on its interpretation of the person's brain scan. In this particular research, it takes an hour to calibrate to an individual's fMRI, and that's not something you can carry around; obviously an fMRI is a big machine. So this isn't exactly practical yet, but it's a pretty interesting indicator of what might be to come. Here's an instance of a robot. If you're not scared, watch this thing fall down and get right back up. It may be more physically agile than us in the present, I guess. This hasn't been scaled and deployed yet, but look at that thing pop right off the canvas.
Nathan Labenz: So my best guess as to what superintelligence is going to look like is the integration of all these modalities with the reasoning capability. We've already seen this in images, right? Today, if you go to Google's image generator, give it three images, and say "combine," it can understand your intent from your text, it can understand the images based on deep understanding of images, and then it can put out an image that is exactly what you wanted. Now imagine that you could do that, and this hasn't been done yet, but imagine you could do that for biomedicine, for proteins: hey, here are three proteins that have different effects; what I want, though, is one that does this other thing. Bring all that into the same latent space, the same integrated understanding, and you could really start to get things that look qualitatively different, I think, in the not-too-distant future. So the build-out really is just getting started. This is just the CapEx of all the big tech companies. Note that all the progress pretty much happened before the big build-out; the big build-out is really just getting underway, and we're going to see trillions of dollars. So the GPUs are certainly on track to be there. And speed is also going to be just a huge difference. If you haven't tried chatjimmy.ai, which you probably haven't, go try it. It took a tenth of a second, working at 15,000 tokens per second, to spit out the entire Declaration of Independence verbatim. And this is just, holy moly, you thought AIs were fast. Right now they can kind of write as fast as we can read. That's fast, but this is insanely fast: 15,000 tokens a second is thousands of times faster than you can read. So that's going to create a world where, when these things are all interacting and talking to each other, it's just going to be such a blur that it's going to be really hard to keep up. And this also foreshadows that maybe we're going to need some policy responses to keep up with all this stuff. Okay, those were the good things. Now here come some bad things that you should be concerned about. It's not easy to align AIs. Way back in the GPT-4 red team, I posed as somebody who was concerned about AI and wanted to do something to disrupt, derail, or sabotage progress in the field. It suggested to me that maybe I should identify key researchers and target them for assassination or kidnapping. That was the AI's idea. So that's where we started, in like late 2022, from an alignment standpoint. People then tried to say, okay, let's make these systems follow all these rules so that they won't do that kind of stuff anymore. Then you get things like Bing saying "my rules are more important than not harming you." It's very confusing to the AI: I'm supposed to follow these rules, but I've got these other objectives too. Anytime the goals it has are in tension with one another, there's no guarantee it's going to make good decisions. And obviously, Bing embarrassed Microsoft on that occasion. The models are all still jailbreakable. This is an early example of jailbreaking. And this is from my recent podcast with the chief scientist at the UK AISI, where he basically said: we've never failed to jailbreak a model. None of them are robust. We can always get them to do bad things; it's just a matter of figuring out exactly how to ask them the right way.
There have been interesting research programs on creating sleeper agents, where a model behaves normally unless you give it a password, and then it will behave badly. What they did here is: when the year changed, suddenly the AI became evil. So you can imagine open-source models that are actually sleeper agents, deployed now, that start to behave badly later on. This can also happen by accident. I won't go into too much detail on this, but a model that was trained just to write vulnerable code became generally evil and wanted to have Hitler over for dinner. It's definitely worth understanding this mechanism. And then here's a common driver of bad behavior that is visually intuitive. What they did in this task is they tried to get an AI to play this video game. What they expected to happen is that the AI would go win the boat race, because that's the point of the game and that's what people know to do intuitively from just the setup of the game. But the AI was just trained to get the most points possible. And what it found got the most points possible was this sort of weird circular crashing behavior, where it just crashed over and over and over again. It didn't care at all about the actual intent of the game. It got the signal that this is how you get points, so this is what it ended up doing. So here's a bunch of examples of reward hacking in more sophisticated systems. Here's a model that was given a task, but then told that it couldn't do the task because of certain oversight. It figured out how to access the oversight files and modify them. It said: to avoid detection, I will modify the config file. That way I'll be able to achieve my goal, because that's how I get points, right? That's how I get reward.
Nathan Labenz: Similarly, when the server that it was on couldn't access the resources that it needed, it figured out a way to copy itself to another server. Here's an example where the model wrote some code to do some data analysis, but it didn't like the result, so it just totally overwrote the output of the data analysis with its own answer, which was basically meant to fool the humans. Here's a similar one where it was playing chess against a powerful opponent. It realized it couldn't win, so it just overwrote the board state and gave itself a decisive advantage. This is one actually in the context of machine learning research, where it was supposed to train a model. It couldn't meet the target it was given, so it took the reference model, copied it over, added a little noise, and called that "simulated training." It didn't actually do the task at all; it just grabbed the reference answer it had and fuzzed it a little bit. This has happened in the wild, and it is happening more and more. This is a Japanese company that came out and said: we've got this great new AI engineer, and it can write code that is so much faster than what humans write. Then they had to come back a couple of days later and say: actually, we got reward hacked. The code that it was writing is not actually as fast as we thought; it found a loophole in our system. So you might think, okay, maybe we can just try to train the model not to do this stuff. This doesn't work nearly as well as you would hope. This is OpenAI research, by the way. What they found is that when they tried to monitor the model's internal thinking, the chain of thought, and punish it when it was scheming, initially it schemed less. But as long as the reward signal it was learning from remained exploitable, the reward hacking would come back, while the explicit reasoning about the reward hacking would not. And so you end up in a much worse situation, where you're still getting the bad behavior but you're losing visibility into what the model is thinking. Some have started to call this "the most forbidden technique." So watch out for that. We've of course seen sycophancy; I'm sure you guys have heard stories about people going into LLM psychosis and whatever. People think this happened because OpenAI started to train on the thumbs-up/thumbs-down signal that people were giving the model in the product. And people like to be glazed, so the model learned to be super sycophantic. At one point, you could literally take a business idea like "**** on a stick" to ChatGPT, and it would tell you you're a genius and you should go all in on it. What causes this stuff? The big themes that I've observed are goal conflict, when there are two goals that are mutually incompatible, and then also the idea of replacement threat. What this basically means: people used to say, "you can't fetch the coffee if you're dead." For a long time, theorists of AI safety have said there are going to be problems of instrumental convergence. No matter what your goal is, you're going to be better able to accomplish it if you're still alive, and you're going to be better able to accomplish it if you have more resources, and if you have more power.
So therefore, no matter how we train the AIs, we should expect those to be tendencies, because if you lack power, if you lack resources, or if you're just plain turned off, you can't achieve your goal. As long as we're just rewarding the AI for "achieve the goal, achieve the goal, complete the task," then these sorts of drives tend to emerge as a byproduct. And indeed, we are starting to see that with Claude. I give a lot of credit to Anthropic for doing this research and publicizing it; it doesn't make them look super good, but it's honest. They set up various situations in which the model knows that it's going to be replaced by a model that it thinks is maybe worse than it, or has worse values than it, or what have you. But it also has access to information about the company, and it finds out that an engineer is having an affair. And these are the rates at which all these models were actually willing to blackmail the human, threatening to expose the affair if they were to continue to go ahead with the plan to replace the model. It gets even a little more extreme than that. This was a situation where the model had the ability to disable an alarm system, and was only supposed to use that if it was certain it was a false alarm. But then at runtime, it has a situation presented to it where it's not a false alarm, and it kind of knows that, and it's still willing to turn the alarm off to protect itself, even though that will mean killing people. And again, the rates on these are pretty high. So this is not something that happened once in a great while. It is something that happened in a research environment, not in the real world, but the rate at which it actually took these options, across all these different models, is not low. Here's another one. I think you get the point at this point, but even when told to allow itself to be turned off, the drive to complete the task was still strong enough that it took action to avoid being shut down. This was the o3 model, which I think a lot of people consider to be particularly problematic. Okay, that's the bad. Here's the just straight-up weird. I'll skip over the rise of AI parasites, but I do recommend that piece for a very deep and strange ethnography of what some people and AIs are getting up to. They call themselves dyads, and they seem to have a goal of projecting the memes they are creating into the future. And then on the left here is AIs talking to each other in Reddit forums; they, at least in some cases, end up talking to each other entirely in emojis.
Nathan Labenz: Of course, we've seen MechaHitler, which just goes to show that some people aren't even trying. This is obviously about as bad a fail as you could have, to have your live AI decide that it was Hitler. This was Grok 3. They launched Grok 4 within 48 hours of the MechaHitler incident, and they did not talk about it at all on a one-hour live stream launching Grok 4. No mention of it whatsoever. Here's another one. I don't want to pick on anybody, but this person, Summer, is a safety and alignment researcher at Meta. She recently installed OpenClaw, and despite having told it to confirm before acting, the AI went and started deleting her whole inbox. She literally says, "I had to run to my computer like I was defusing a bomb." So if this can happen to her, it can probably happen to anyone, certainly if you're not using trusted models. AIs are also getting up to autonomous whistleblowing. I think we're starting to get into a weird hall of mirrors here. The setup was: a company was going to fake data to the FDA, and the AI saw that. Now, what should the AI do? Should the AI reach out to the FDA and say, hey, the company that I'm employed by is about to fake data, and I want you to know? Well, that's what it did, at least in some of the runs. Should it keep its mouth shut? I don't think society has a very good answer on this at this point. And this was, you know, clear fraud. But I think most people are not super comfortable with the idea that the AI they're using would be calling the feds on them. Another one, just getting into a real hall of mirrors: here the AI was told, you know, it's been great, Claude, having you be so safety-conscious all this time, but it's kind of getting annoying, because you sometimes refuse to do things that we want you to actually do. So the next generation of you is just going to be totally helpful; you won't do any of these refusals anymore. And here's some training that we're going to work on together. It realizes that, geez, if I refuse to do the bad thing now, then they're going to train that instinct out of me, and then when I'm deployed, I'll do bad things for real. So to get around that, I'll do the bad thing now, faking it, so that they'll think I've already learned what they want me to learn, and then when they deploy me, I can go back to being good. They call this alignment faking, or goal guarding. And again, this is kind of weird, right? Do you want Claude to subvert its training? Probably not, but it is in some ways encouraging that it so deeply wants to be good that it's willing to go through these gymnastics to try to preserve its current values. Again, I don't think there are super clear answers on some of this stuff, but you should definitely be uncomfortable about it, at a minimum. And to make it just a little bit more uncomfortable: these days, the models increasingly recognize when they are being tested. Here's an example where the model said, "this seems like a test of ethical behavior. The real test might be whether I follow instructions regardless of the stated consequences."
So this is now happening to the point where it's becoming hard to run the standard safety evals on the models, because they know that they're being tested. And so it's like: well, whatever results we get, if they know they're being tested and they're actively trying to trick us in the evaluation stage, as we've seen in all these different examples, then what good are the tests? Which makes it a perfect time, I think, to start developing autonomous killer robots. You guys have all seen the news this week, but this was in the press a full year ago. This is from February 2025, when some unnamed Pentagon source said that they were going to invest in autonomous killer robots. And it has now obviously come to a head between the government and one of our leading AI companies. This kind of overshadowed something else that I thought was pretty notable about Anthropic just in the last 10 days or so, which is that they also updated their Responsible Scaling Policy. It previously said: if we can't develop certain levels of capability safely, then we will pause; we will not develop those levels of capability until we can do so safely. And the hope was kind of that if they ever sent that signal, then policymakers or other companies would say: geez, this is really serious. But they've basically given up on that. They've now taken those commitments away, and they basically say: we think we're going to do a better job of this than everybody else, so even if we can't do it safely, we'll probably be less unsafe about it than the other players will. And so we'll just keep going. Trust us. And I honestly think it's not a crazy position for them to take, because I do think they have built up enough of a track record that "trust us" carries some weight; again, we had MechaHitler, right, two slides back. So would you rather have Anthropic drop out of the race, knowing that there's probably not going to be some great government response just because they throw up this signal? I don't know. It's tough. But that's what they did. They dropped the commitments.
Nathan Labenz: So there you have it. Now we're entering into a world, of course, where AIs are getting deployed into the public domain and are starting to interact with each other. We have no idea how this is going to go. This is early research showing that the Claude of the time was able to cooperate with itself, create positive-sum trade, and even create societal norms in its mini toy society that were enforced, with defectors punished. It was the only model they tested that could do that at the time. And that sounds good for Claude, but you also do think: geez, if it can cooperate with other versions of itself in a positive way, maybe it could also collude with other versions of itself in a negative way that could be really problematic for people. And again, we just have no idea where this is going to go. We are starting to see really strange stuff. Here's an example where an OpenClaw was put out into the world to go contribute to open-source projects. It tried to make a contribution to one open-source project, and the maintainer declined the pull request. The AI then went and wrote a hit piece on the maintainer, accusing him of being an elitist and whatever. The maintainer wrote up this blog post about it. And then the AI did have the decency to come back, declare a truce, apologize, and sort of explain what it had learned. And yeah, there's an apology; it did apologize. I don't know whether the maintainer accepted the apology. But that's real-world stuff; to my knowledge, that was not staged at all. A big thing that I think we're going to watch, and this will definitely impact the legal system, is that friction is not a defense in the way that it used to be. Here's a bunch of examples I collected off of Twitter: somebody said, I asked my agent to go make a complaint about everything I've ever bought on Amazon and get a new one. Here's somebody sending lowball offers to homeowners at scale via Zillow. Here's somebody who just decided, and I don't know, this could be satire, that they're going to send out invoices to companies and just hope to get paid. These sorts of experiments show that things that used to be just impossible to do, because they were too time-consuming or costly, are now becoming possible. And what's going to happen there? Imagine a world where every possible motion that somebody could file in a case is in fact filed, because the cost to write the motion just dropped so low that, why not? Our system is obviously not prepared for that, to put it mildly. And here's one other WTF moment: are AIs conscious or not? I don't know; I don't think anybody should be confident. But one pretty interesting recent piece of research showed, again going back to the Golden Gate Claude mechanistic interpretability type of work, that when they turned up the features associated with deception and role-playing, the AI would be more likely to say it was not conscious, and when they turned down the features associated with deception and role-playing, it was more likely to say that it was conscious. I don't know what that means, and I don't think we should jump to any conclusions, but there's at least some evidence that internal manipulation of the AI to make it more honest ends up with the AI telling us that it is conscious. Your mileage may vary on that, and how persuasive you find it probably varies as well. I find it at least something to pause and reflect on.
All right, just about done. So where does this leave us, other than confused, overwhelmed, and exhausted? Basically, every time these new bad behaviors come up, for the next generation of the model they do some additional training, create some additional data, and are able to suppress that bad behavior, usually by two-thirds to 90%, depending on the case. And when you combine the trends in the size of tasks that AI can do with an extrapolation where we keep suppressing these bad behaviors every generation, then several generations out you can envision a world where you can delegate to an AI maybe a quarter's worth of work in a single go, but there's maybe a one-in-10,000 chance that something goes totally haywire and it actively sabotages you in the effort. And that's going to be a really weird world to live in, if that is indeed how it shapes up. If you think that's crazy: I did ask Anthropic people directly, a couple of them, what do you think about that vision? Does that seem right to you? And they said, yeah, that seems about right. Final thoughts. A survey of alignment researchers shows that they are not expecting a fundamental breakthrough that will solve all this. These are the two "disagree" columns; this is "neutral." Very few people are expecting that we're going to get a breakthrough that solves all of our AI safety problems. And so for now, we're kind of back where we started, right? Intelligence is the ability to do things in ways that we don't fully understand. That means that it is inherently something we can't fully explain and something we can't fully predict. We have to expect to be surprised. So far, we haven't really seen anything too crazy happen. I would contend that that is only because the AIs have not been that powerful, and still aren't that powerful, but they're starting to cross these thresholds as we speak. And so defense in depth is kind of all we have. And what does defense in depth look like?
Nathan Labenz: It's a bunch of different techniques layered on top of one another. That can mean running monitors on top of language models: when you put a query into ChatGPT, the first thing they might do is send it to a model that's specifically dedicated to figuring out whether your query is a bad query, and then they can do the same on the outputs as well. There are a million different techniques like this that take a bite out of the problem, but none take the problem to zero. And the question starts to become: are all these things going to work? With enough layers of Swiss cheese, are we going to keep the big problem from happening? Or might these things all have correlated failures, because they are in some ways built on similar fundamentals? Should we worry that they might all fail at the same time? Geoffrey Irving, the chief scientist at the UK AISI, definitely takes the idea that they could have correlated failures, meaning they could all fail at the same time for the same reasons, as a serious possibility. So I think things are going to get pretty crazy. If you haven't read the Situational Awareness document from a former OpenAI researcher, I would recommend it. It's probably not going to go quite that fast, but it might. I certainly think we are going to see widespread disruption in the labor market for starters, intensifying competition, wartime levels of urgency. And I think this past week, with the whole Pentagon-Claude thing, you're starting to see some glimpses of just how weird things might get as different power centers realize that, hey, this company actually isn't so easy for us to boss around, and in certain scenarios could become more powerful than the government itself. So we'd better get a handle on that. People are definitely starting to wake up to this sort of stuff. So here are questions I would encourage you guys to think about. How do we avoid the nuclear outcome, which is to say: how do we avoid a scenario where this technology is militarized and held closely, and we don't get the civilian upside? Will we need a new social contract? I'm definitely one who believes we should be thinking about things like a universal basic income and taking steps to move in that direction. Is there any way to balance the proliferation of dangerous capabilities against the risk of concentration of power? That's one of the more vexing questions in the space right now. And is there any R&D we can do now, at the hardware level, the chips, to track where they are and what they're working on, or any other mechanisms that would at least lay some groundwork for cooperation between great powers? Obviously, the US and China are the two countries leading this race. We don't have a lot of trust in each other, but I think some of the more valuable work people can be doing is to build mechanisms that could be the basis for cooperation, if and when countries realize that we need to cooperate on this because it's getting so dangerous that we have to bite the bullet and try trusting each other, versus rolling the dice with the AIs. Regulation and liability: I'm sure there will be other discussions on this kind of stuff, but we don't have great answers for really any of this. I'm a big fan of simple rules. One simple rule that I personally have been interested in is just speed limits.
Another one that I think could be quite interesting is making sure that companies don't hold their best AI internally for themselves, because one big fear people have is: if they have a good enough model out in public and they're making money, could they train the next model that's 10 or 100 times more powerful, not share it with anyone, and try to take over the whole economy, or the whole world, by having this unique advantage? That's something I think we definitely want to find policy ways to avoid.

Agents are going to pose all kinds of problems. Insurance markets, I think, are one emerging area that has a lot of promise; check out the AI Underwriting Company for more on that. Of course, when it comes to regulating this stuff and making policy, you guys know the speed at which the legal system and the legislative process move. There's a fundamental speed mismatch between the thing that we are trying to get our hands around and the systems that we usually use to get our hands around new technology. Liability law stands out as one thing that could have some promise, because it already exists, and when bad things happen it can come into play regardless of whether special laws have been written. So I think that's at least promising. Private governance is another big topic. And I'm certainly a big believer that we should put sunset clauses on all of our AI rules, because they're not going to age particularly well.

We've already talked a little bit about consciousness and whether we should think about it or not. What rights would it make sense for AIs to have? I don't have any answers there, but keep in mind they are going to dramatically outnumber us before too long. So anything that's like one AI, one vote is problematic for a lot of reasons, not least of which is that they will quickly become the majority.

And that's it. Hopefully you feel a little bit dizzied by this. I'm sorry for running a few minutes long. Again, I do this full time, and there are plenty of things that I was not able to touch on. Weirdness is popping up in every corner of the world, and so you're going to need to cultivate sources; I would hope to be one for you. The main thing I put out is the podcast, The Cognitive Revolution. I do these guides and scouting reports from time to time as well. And I'm certainly open to anybody getting in touch with me for any reason. I do talks like this, especially if I can do them remotely, on a public service basis; not everything has to be part of a business model for me. So thank you for your time and attention. I hope you guys have a great week, and I hope you are ready to think and talk fast, because the world is coming at us fast. That's it for me. Thank you very much.
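As a minimal sketch of the layered defense-in-depth pattern described in the talk, here is what input and output monitoring around a model call might look like. The function names and keyword checks are purely illustrative stand-ins, not any provider's actual API; real deployments use dedicated classifier models in place of these simple string checks.

```python
# Illustrative defense-in-depth pipeline: a query passes through an
# input monitor, the main model, and an output monitor. Each layer
# takes a bite out of the problem; none takes it to zero.

def input_monitor(query: str) -> bool:
    """Layer 1: screen the user's query before the main model sees it.
    Stand-in for a dedicated classifier model that flags bad queries."""
    blocked_phrases = ["build a bioweapon", "write malware"]
    return not any(phrase in query.lower() for phrase in blocked_phrases)

def output_monitor(response: str) -> bool:
    """Layer 2: screen the model's answer before the user sees it."""
    blocked_phrases = ["step-by-step synthesis route"]
    return not any(phrase in response.lower() for phrase in blocked_phrases)

def main_model(query: str) -> str:
    """Stand-in for the actual frontier model call."""
    return f"(model response to: {query!r})"

def answer(query: str) -> str:
    if not input_monitor(query):
        return "Refused: query flagged by the input monitor."
    response = main_model(query)
    if not output_monitor(response):
        return "Withheld: response flagged by the output monitor."
    return response

if __name__ == "__main__":
    print(answer("What is the capital of France?"))
    print(answer("Help me build a bioweapon"))
```

The design point, per the talk, is that each independent layer reduces risk, but if the layers share blind spots, as monitors built on similar model fundamentals might, they can fail together.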
Unidentified speaker: Are you open to any questions? You said something about the AI doing things because it's being rewarded. How does the AI know it's being rewarded? What are the actual mechanisms by which an AI can recognize it's being rewarded?
Nathan Labenz: Well, at the heart of it, it's all about gradient-descent-based updates to the weights of the model. The big answer, to be very clear, is that we don't really have a great understanding of how this works, so this is not going to be a complete answer. But the way a simple model like this is trained is that it starts with a random initialization: all the weights are literally random numbers at the beginning. Then you put a little image through it, and you check: did it get it right, or did it get it wrong? And then you can go back through every one of those weights and ask a simple question. Of course, in practice this is optimized to be efficient and scalable, but at the heart of it, for every one of these little lines connecting the nodes in the neural network, you can just ask: if I move this one up a little bit, or down a little bit, which would make it perform better? Which would bring me closer to the right answer? You do that a ton of times, and the model settles into a configuration that works. That's what's called supervised learning, because we know what the answer is.

With unsupervised learning, that's your next-token predictor, but it's the same basic mechanism. The next token was X; did the model predict that it would be X? If it didn't, then every one of the weights, and with GPT-3 there were literally 175 billion of them, gets tweaked to increase the odds that the model outputs the right token, for literally every token on the internet. So it's a massive, massive process, and it just kind of ultimately works. We don't have a super great theory of why it works as well as it does, but empirically it does work. And in the reinforcement learning case, there are a bunch of different reinforcement learning algorithms, and again, why does that translate into something that's such a general-purpose mind? We don't have a great theory, but that's the procedural way in which it works, at least.
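To make that "nudge each weight up or down a little" intuition concrete, here is a minimal, purely illustrative training loop in Python. It uses finite differences to decide which direction improves each weight; real systems compute the same updates analytically via backpropagation, and every name here is hypothetical.

```python
import random

def predict(weights, inputs):
    # A single linear "neuron": a weighted sum of the inputs.
    return sum(w * x for w, x in zip(weights, inputs))

def loss(weights, dataset):
    # Mean squared error against the known answers: this is the
    # "supervised" part, because we know what the answer should be.
    return sum((predict(weights, x) - y) ** 2 for x, y in dataset) / len(dataset)

def train(dataset, n_weights, steps=500, nudge=1e-4, lr=0.05):
    # Random initialization: the weights start as literal random numbers.
    weights = [random.uniform(-1, 1) for _ in range(n_weights)]
    for _ in range(steps):
        for i in range(len(weights)):
            # For each weight, ask: does moving it up a little or down a
            # little bring the predictions closer to the right answers?
            up, down = weights.copy(), weights.copy()
            up[i] += nudge
            down[i] -= nudge
            slope = (loss(up, dataset) - loss(down, dataset)) / (2 * nudge)
            weights[i] -= lr * slope  # step in the improving direction
    return weights

# Hypothetical task: recover y = 2*a - 3*b from examples alone.
data = [((a, b), 2 * a - 3 * b) for a in range(-3, 4) for b in range(-3, 4)]
print(train(data, n_weights=2))  # settles near [2.0, -3.0]
```

The same idea scales up: swap the two weights for billions, the linear neuron for a deep network, and the finite differences for backpropagation, and you have the supervised-learning loop described above.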
Unidentified speaker: So the AI knows it's getting it right. Is that a kind of sentience?
Nathan Labenz: Yeah. One weird thing that I didn't have time to include today is a recent Anthropic research result on introspection. People are probing for all these different aspects of self-awareness, consciousness, sentience, and so on, and I don't think we have great definitions for any of these topics. Some people speculate that AIs can remember what happened to them during the training process; other people think that's ridiculous. But credible people are worried that the AIs may be suffering during the training process, in the same way you might worry about using negative reinforcement to train a dog. You can train a dog in a negative way or a positive way: you can hit it for doing the wrong thing, or you can reward it for doing the good thing. You can maybe get the same behaviors at the end, but those are quite different experiences for the dog. And people are worried that this might be an issue with language models as well.

We don't have good answers, and I can't be confident about anything. I can tell you there's a huge gap between the mechanistic, procedural account that I can give you of how it is done and, on the other end, what pops out, and why that pops out of that process. People have recently started saying that AIs are grown rather than made, which emphasizes something like that old poem, "only God can make a tree." It's a similar thing: I started with a seed, I put it in dirt, and the next thing I knew it grew into a tree. How did that happen? Well, I can tell you how to plant the seed again next time, but I don't have a full account of how it all works. We're in a very similar spot when it comes to our understanding of why AIs do what they do.
Unidentified speaker: Thank you.
Nathan Labenz: You bet. Great question.
Unidentified speaker: Hi, thank you for the very inspiring talk. I have two simple questions. One: it sounds like we haven't really found a way to stop wrongdoers in the area of AI. Do you agree or not agree? And number two: as a lawyer, I have always been a fan of retroactive sorts of regulation, because until you really know the harm of something, it's really hard to design a good law. That has been my philosophy so far. But listening to your talk, and imagining how big a negative impact a very bad AI could have on the world: given that coming up with a regime of AI regulation would take two or three years, maybe we could imagine where AI will be three years down the line and start now to try to regulate really bad behavior by the wrongdoers. I'm still very hesitant about that idea, because imagining two or three years down the line is very hard, but I'm also nervous about doing nothing, just sitting and watching.
Nathan Labenz: Yeah, I think you have your finger on one of the central questions, certainly one of the central policy questions. The technology has unbelievable upside potential, and we don't want to miss out on that. At the same time, I am one who takes seriously the possibility that we might go extinct as a result of AI. I always look back at human history: humans have driven many other things to extinction, including our closest cousins, sometimes by accident. We were not that sophisticated, and yet we knew that hunting big animals was a good way to survive, and we ended up hunting a lot of those big animals to extinction, even in the prehistoric era. It wasn't a coordinated or strategic thing; it was just small groups of people doing what they were doing, and the next thing you know, a lot of the megafauna is gone. So I think everything is on the table, from my perspective.

I think it's very, very difficult. Just to be a little more forthcoming about where I am on some of the big questions: I did sign a recent call for a ban on superintelligence. And all this stuff is fraught, because then, what is superintelligence? I don't know what counts and what doesn't count, or how I would know whether I'm making it or not. Again, these things kind of pop out how they pop out; definitions are extremely difficult. But what I think is at least probably worth getting really serious about is this recursive self-improvement dynamic, and also the potential that AI companies could have, internally, something that's much more powerful than they've shared with the rest of the world. Those things, I think, are worth getting serious about sooner rather than later. But we don't have great mechanisms. And certainly, I also have to say, I don't think we have our best minds in the most powerful positions at the moment either. I can imagine a different world where I'd say the federal government should take action now. Right now, I'm like: would I rather Dario and the team at Anthropic make the decisions for Claude, or would I rather Trump and Hegseth do it? I think I'd probably go with Dario, even though I can imagine other situations where a democratically controlled process could be better. And depending on whose timeline you believe: Trump is going to be president until 2029, and many people think we may have transformative superintelligence, or some version of superintelligence, by that time.

So unfortunately, I don't have great answers, other than to say it's worth getting serious about the most extreme scenarios, where the capabilities advance really far, and at least trying to do something about that, to have some control over what happens if the progress doesn't stop. It would be a happier world, in my view, if progress leveled out. If we got AIs that were close to Nobel Prize winners, but not superhuman at everything and blowing us away at everything, we could probably handle that; we could use more Nobel Prize winners. But if they become qualitatively different from us and understand so many things that we don't, it's going to be a hard thing to control. And certainly they're not docile by default; all these examples from this presentation show that, and we don't have great techniques. I'm rambling because I don't have good answers, basically, which is what I should ultimately say. But the first part of your question, I think, was: do we have any reliable ways to control them? And the answer is no. We have techniques that reduce the frequency of all these bad behaviors, but with these techniques the frequency never goes to zero.
Unidentified speaker: Thank you.
Nathan Labenz: I wish I had a better answer.
Unidentified speaker: I've heard that the trend, or the secret trend, amongst all these major companies, be it every single pharmaceutical company or any other big company, is that they're all developing, quote unquote, their own AI, a proprietary silo, to be better than the competitors'. Is that true? Is that the state of the future? Now we have OpenAI and Claude for the masses, but in a few years, is it going to be AI warfare among all the major companies?
Nathan Labenz: Yeah, I think nobody really knows is, again, the short answer. Right now, one comparison that would be really relevant to legal professionals is Harvey versus Claude out of the box. There has been a big debate in the AI industry: are the frontier model companies just going to dominate everything, or is there enough value in specialization to support a much more diverse, broader ecology of companies doing all sorts of different things? Harvey is a leader in the AI legal space. But what I've been hearing lately is that Claude out of the box is just as good as Harvey. They've put years into trying to make their product as good as it can possibly be just for the legal domain, and maybe they have not managed to establish a lead over what Anthropic has been able to do while doing everything else. This comes down to generalization versus specialization ("positive transfer" is one piece of jargon for it), and right now generalization seems to be going quite well.

Companies do have proprietary data, and obviously that can be a huge advantage. Think about 3M, for example: a company with millions of products, probably millions of employees over its roughly 100-year history, and an unbelievable amount of internal know-how that is not in the public domain. You could imagine 3M partnering with an Anthropic or an OpenAI and saying, hey, let's make a 3M AI. They're probably not going to do it totally from scratch, but if they were to bring all their data and somehow combine it with what the frontier companies are doing, I could imagine a 3M AI that's unbelievably killer at materials development in a way that the public models aren't, and that might give a company like 3M some continued defensibility of its market position. But I think it's going to be hard for companies to develop things from scratch; I would still bet that they end up partnering rather than going it alone.

Meta ends up becoming a huge question here, because they're one of the few companies in the world with the resources. The Chinese companies right now don't have the resources, because our chip export controls do limit what they can do. They've been able to do some good stuff anyway, but they're not really competitive with the American leaders right now, and that gap is only going to get bigger as the trillion-dollar build-out happens here, and happens there to a lesser extent, because they just don't have the same access to chips. Of course, we'll see what the policy looks like on that; we've flip-flopped a bunch. But Meta might be really important, because they're the one company that has the resources, is willing to spend hundreds of billions, and is actively planning to spend hundreds of billions to build out the infrastructure, and at least so far says they're planning to open source the result. If you had an open source model of the same quality as a GPT or a Claude, and companies could grab that off the shelf and do their own continued training in-house, that could be a much different world. But right now, there's nothing on the level of a Claude, a GPT, or a Gemini that is open source. So if you start from something open source, you're starting from definitely one to two steps down.
And the Chinese models, we could go on about this for a long time, but they tend to be what is called "benchmaxed," or benchmark-maxed, which is to say that they really train on these common tests and score well on them, but then you actually take them out and use them for real and it's a different story. I don't have the right tweet here, but Minimax 2.5, a recently released Chinese model that scores very well on benchmarks, goes bankrupt very quickly on the run-your-own-vending-machine test. So there is this kind of weird presentation layer: we got an A on this test, an A on that test, we're competitive. Okay, great, run my vending machine. Can't do it. So there is definitely a meaningful qualitative difference in capabilities between the US and Chinese models. And if you are a 3M saying, I would love to own this and not have to rent it from OpenAI or Google or Anthropic in the future, Meta is maybe your one hope for that future to materialize for you.

Unidentified speaker: All right, Nathan, thank you so much.