AI in the AM — Week 2 Highlights (June 2026)
Week 2 highlights examine Anthropic’s Fable launch in real workflows, from safety gates and API refusals to coding, 3D worlds, and a Claude-run Twitter test. The episode also covers alignment theory, legal reasoning benchmarks, interpretability, token economics, and power concentration.
Show Notes
Week 2 highlights follows Anthropic’s Fable launch in real workflows, from safety gates and API refusals to autonomous coding, 3D world-building, and a Claude-run Twitter experiment. Geoffrey Irving and Daniel Murfet argue for alignment theory and guarantees before recursive self-improvement, while prinz tests Fable on legal reasoning and monitoring. Rahul Sonwalkar, Shlok Khemani, Tom McGrath, and Andrew Moore add field reports on data agents, hybrid authorship, interpretability, context systems, token economics, and power concentration.
Mercury: Run your finances with virtual cards, spending limits, merchant/category locks, and AI-friendly tools like API keys, MCP, and CLI. Check out Mercury at https://mercury.com
LINKS:
- Claude Fable 5 announcement
- Julius AI platform
- Rahul Sonwalkar homepage
- Nate Jones homepage
- Shlok Khemani homepage
- FrontierCode benchmark blog
- Lovelace AI company
- Andrew Moore Wikipedia profile
- Geoffrey Irving homepage
- Daniel Murfet LessWrong profile
- Sequent Research announcement
- Timaeus research organization
- Automated Alignment paper
- Goodfire AI company
- Tom McGrath homepage
- Predictive data debugging tool
- prinzbench legal benchmark
- Unit distance conjecture disproof
- Dario Amodei policy essay
- Vending-Bench 2 benchmark
- Andon Labs site
- Recursive Superintelligence startup
- Sakana AI company
- PostTrainBench benchmark
- Thoughtful Lab company
- Unit distance conjecture arXiv
- Glean Work AI Index
- AI Treaty open letter
- Karina Nguyen homepage
Sponsor:
Claude:
Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr
CHAPTERS:
(00:00) About the Episode
(00:53) Special Sponsor
(02:22) Fable launch lessons
(17:27) Hybrid account takeover (Part 1)
(17:33) Sponsor: Claude
(19:25) Hybrid account takeover (Part 2)
(32:06) Sequent alignment theory
(40:01) Alignment oversight limits
(48:46) Benevolent basin doubts
(57:38) Tokens and context
(01:09:04) Seeing model behavior
(01:17:15) Power concentration questions
(01:30:08) Benchmarks and risk
(01:40:41) Episode Outro
(01:43:35) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.
Introduction
[00:00] We could be in a benevolent basin, but I would like to know that rather than just hope that.
[00:06] That's Daniel Murfet, and that one sentence is the week in miniature.
[00:10] This was Fable launch week.
[00:12] Anthropic's new frontier model arrived, booked Thursday's show by itself, took over my Twitter account, and settled at least one argument.
[00:19] AI is not slowing down.
[00:21] This is the AI in the AM weekly highlights, the moments from three live mornings this week that I most want the people closest to this technology to have.
[00:28] Quick context, this is still an experiment.
[00:31] We're live most weekday mornings, through June at least, from a studio Prakash vibe coded himself, and we publish the skills and artifacts behind the show as they mature. If this cut earns your time, or wastes it, tell us.
[00:42] The feedback is the product right now.
[00:44] First, the launch as we actually lived it.
[00:46] Wednesday morning, day one of Fable in real workflows, and Prakash came in with a field report you will not find in the model card.
Sponsor
[00:53] The Cognitive Revolution is brought to you by Mercury, the fintech that more than 300,000 ambitious companies and individuals trust to run their finances. Over the last few months, I have made tremendous strides with my personal AI infrastructure. Today, I've got high context instances of both Claude Code and OpenClaw running on a Mac Mini, and it's amazing what they can do. However, until getting started with Mercury, I didn't have a great way for them to pay for things. I didn't want to give them unrestricted access to my money, but my old bank didn't give me any other options. With Mercury, I can create as many virtual cards as I want, each with its own daily, weekly, or monthly spending limit, and I can lock any card to a single category of purchase or even a single merchant. Now I have a card that my agent can use to buy our family's groceries and only our groceries, and I can create another anytime I want to give an agent a random one-off project that might require making a purchase. This is honestly just the start of Mercury's AI-friendly offerings. Does your bank offer API keys, an MCP, or a CLI tool? If not, check out Mercury at mercury.com. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. Thank you to Mercury for supporting The Cognitive Revolution. And now, on with the show.
Main Episode
[02:22] Daniel Murfet: We could be in a benevolent basin, but I would like to know that rather than just hope that.
[02:28] Nathan Labenz: That's Daniel Murfet, and that one sentence is the week in miniature. This was Fable launch week. Anthropic's new frontier model arrived, booked Thursday's show by itself, took over my Twitter account, and settled at least one argument. AI is not slowing down. This is the AI in the AM weekly highlights, the moments from three live mornings this week that I most want the people closest to this technology to have. Quick context, this is still an experiment. We're live most weekday mornings, through June at least, from a studio Prakash vibe coded himself, and we publish the skills and artifacts behind the show as they mature. If this cut earns your time or wastes it, tell us. The feedback is the product right now. First, the launch as we actually lived it. Wednesday morning, day one of Fable in real workflows, and Prakash came in with a field report you will not find in the model card.
[03:16] Prakash: So one thing to note about the nerfing, so what has happened with Fable is we have a lot of rejections. And whenever Fable decides to reject you, it drops from Fable to Opus 4.8. So there's a natural downgrade. In experiments overnight, I tried to make a number of bug fixes on this very Studio app. And what I found was Fable would always consistently drop to Opus 4.8 whenever it was asked to do anything in production. So touching the production database, touching any of the security keys, touching, you know, asking it to review production directly. In every case I've had three or four times it's dropped out. Every time it's dropped out, I've basically restarted the conversation added back the context that we were using, but excluding the parts about going into production or addressing the production database, and it has continued to work. So I think there are a number of triggers there. Online people are saying, hey, it's not going to do machine learning research for me. I think that's just tip of the iceberg. You're seeing that because the people who are testing it intensively right now are machine learning researchers. If you were, I think, to test it on finance or your budgeting process and you told it you're going to be directly addressing my QuickBooks or Salesforce, I think you might see similar results. Fable right now, I would say, is a research release, almost a preview. It is there so that I think they can judge the demand because they don't have a sense of how intense the demand is going to be. And they're going to judge whether or not it's safe to release, which are the, they've started off with the most constrained version of it, with the least number of functions which are open. And I think over the next few weeks, they will start to take away some of those gates. And as they take away some of those gates, I think we will see both the increase in usage and some decisions on what needs to be really gated and what doesn't. 
So I think we are in the early stages of exploring what Fable can do.
[06:00] Nathan Labenz: Same gating seen from the other side of the API. Rahul Sonwalkar runs Julius: agentic data analysis, raw API, no consumer harness. He compared notes with Prakash live.
[06:15] Prakash: Just a segue here. So we've been using Fable. We've come across people posting about rejections. In my tests, almost consistently, whenever I tried to address the production database or the production site, Fable would drop off to Opus 4.8. I believe your users in Julius are using Fable through the API. And you are also very heavy on data science users. And one of the aspects of work that Fable is banned from doing is machine learning work. So how have you seen the rejection rate on your platform? Does the API work the same way in the sense that it drops off to Opus 4.8 and then it gives you a rejection message? How does that work?
[07:05] Rahul Sonwalkar: Yeah, so we have seen failure rates on tasks that involve really advanced coding, that involve, write me, you just like it, learn to perform, train this model, but we haven't seen failure rates on other kinds of data tasks. For example, like, hey, I wanna start a landscaping business and, like, can you help prospect leads for me? We will see failure rates where it sort of triggers safety filters, for things like: you're prospecting leads for a landscaping business, and the AI says, oh, this is personal data, even though it's probably available on the internet. You know, let's say Prakash has a Prakash's Landscaping in Philly, and there's contact information. It is kind of borderline personal data, even though it's available on the internet. And so I think that's kind of what we have seen. I believe it doesn't fall back to Opus. It's just like a failure in the API.
[08:11] Prakash: Interesting. So the fallback to Opus is a harness thing on Claude, on Claude, on the Claude, you know, front end. That's interesting to hear.
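[Editor's sketch] The distinction drawn here, a raw API call that simply fails versus a front-end harness that retries on a weaker model, can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the model names, the ModelReply type, and the fake_call_model stub are hypothetical and do not reflect how Anthropic's harness or API actually behaves.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelReply:
    model: str     # which (hypothetical) model produced the reply
    text: str      # reply body or refusal message
    refused: bool  # True when the model declined the request


def call_with_fallback(
    prompt: str,
    models: List[str],
    call_model: Callable[[str, str], ModelReply],
) -> ModelReply:
    """Try each model in order and return the first reply that is not a refusal.

    If every model refuses, the last refusal is returned unchanged, which is
    what a caller with a single-model list (the "raw API" case) would see.
    """
    if not models:
        raise ValueError("at least one model name is required")
    reply = None
    for model in models:
        reply = call_model(model, prompt)
        if not reply.refused:
            return reply
    return reply


def fake_call_model(model: str, prompt: str) -> ModelReply:
    """Stand-in for a real API call: the primary model refuses production work."""
    if model == "primary-model" and "production" in prompt:
        return ModelReply(model, "request declined", refused=True)
    return ModelReply(model, f"{model} handled: {prompt}", refused=False)


if __name__ == "__main__":
    prompt = "fix the bug in the production database"
    # Harness-style behavior: quietly downgrade to the fallback model.
    print(call_with_fallback(prompt, ["primary-model", "fallback-model"], fake_call_model))
    # Raw-API-style behavior: the refusal is surfaced to the caller.
    print(call_with_fallback(prompt, ["primary-model"], fake_call_model))
```

On this reading, the only difference between the two experiences is whether a wrapper like call_with_fallback sits between the user and the model.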
[08:22] Nathan Labenz: And for what this model does when nobody is steering, Thursday we had Shlok Khemani, who gave Fable one vague instruction: rebuild Yosemite as a navigable 3D world. Listen for the decisions nobody asked it to make.
[08:37] Shlok Khemani: But what Fable ended up doing was finding satellite images for this area. And that's how you get these colors and that's how you get the textures. But then to make it to scale and to make it accurate, it actually fetched elevation data from NASA. And it combined those two to sort of make this to scale. And that is what blew my mind, right? Because usually when you're vibe coding stuff, you give an end objective and this objective is vague and there are 100 steps in the middle where humans would take decisions differently. And usually vibe coding doesn't work out very well because the quality of the decisions the models make isn't always great. But Fable made such high quality decisions where it eventually ended up creating something that exceeded the expectations of what was initially a very vague objective and did so in really smart ways. So I'll give you another example, right? So you see all of these trees and V1 of this project did not have any trees. And I was like, hey, I think we're missing some trees here. I would love to add them. And I would have been completely okay with just randomly creating these trees. But what it actually did was it analyzed the pixels on the satellite images. It found out the ones that could potentially have trees, so the ones that were green maybe, and added trees only on those spots. But it didn't stop there, right? It realized that because it was analyzing pixels, some of those pixels were white. So you can see that there is snow in the mountains far ahead, and it also added snow. So it just exceeds your expectations in these small and subtle ways, makes really smart decisions. It's like having a really, really smart employee with extremely high agency who blows your mind every single time.
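[Editor's sketch] The tree and snow placement described above amounts to classifying satellite pixels by color. The following is a minimal Python sketch of that idea, not the code Fable actually wrote: the function names and the color thresholds are assumptions chosen for illustration, and a real pipeline would likely work on image arrays rather than nested lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def classify_pixel(pixel: Pixel) -> str:
    """Label one pixel as 'snow', 'tree', or 'bare' (thresholds are illustrative guesses)."""
    r, g, b = pixel
    # Bright, nearly colorless pixels are treated as snow.
    if min(r, g, b) >= 200 and max(r, g, b) - min(r, g, b) <= 30:
        return "snow"
    # Pixels where green clearly dominates red and blue are treated as vegetation.
    if g > r + 20 and g > b + 20:
        return "tree"
    return "bare"


def place_features(image: List[List[Pixel]]) -> List[Tuple[int, int, str]]:
    """Return (row, col, label) for every pixel that should get a tree or snow."""
    placements = []
    for row_idx, row in enumerate(image):
        for col_idx, pixel in enumerate(row):
            label = classify_pixel(pixel)
            if label != "bare":
                placements.append((row_idx, col_idx, label))
    return placements


if __name__ == "__main__":
    # A 2x2 stand-in for a satellite tile: forest, rock, snow, forest.
    tiny_image = [
        [(40, 120, 50), (130, 110, 90)],
        [(240, 245, 250), (35, 100, 45)],
    ]
    for placement in place_features(tiny_image):
        print(placement)
```

The point of the sketch is the design choice Shlok highlights: features are placed only where the imagery supports them, rather than scattered at random.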
[10:46] Nathan Labenz: Friday morning brought the week's cleanest empirical result on the recursive question. And it's not from a lab.
[10:53] Nathan Labenz: Here's one other thing I'll just touch on real briefly. This is Thoughtful. This is a company started in part by a woman named Karina Nguyen, who used to be at Anthropic, then she was at OpenAI, now she's doing this. And this is maybe one of the more telling, you know, it's kind of vibes, it's kind of quantitative, it's a very idiosyncratic task, but it's also a very relevant task to the future. Can you get your top model to train a small model effectively to do a job for you? And this is something that, as you can just see with these bar graphs here, the particular frogs game thing, it's kind of like a Sudoku type puzzle that they're training a small model to do. And the big models can often just solve it, but the small models can't. So the challenge for the big models, can you train the small model to solve it? And this involves all these little tips and, you know, not tips, but tricks and know-how and kind of, you know, hard lessons learned by post-trainers who've been in the trenches doing this. The models up until Fable basically didn't really move the needle on what the small models can do. They basically just couldn't do this sort of post-training effectively. But here we see more than 10x improvement on small models' ability to do these tasks. And again, I think this is one way that it could be really good, right? If you had like very narrow, very small, very role-specific small specialist models in all these different niches, that could be a great world, right, that gives us a lot of abundance in a very affordable way. In this little small model that got post-trained to play the frog game, it's not going to go out of control, right? It is small. It can only do probably the frog game at the end of this training. 
But building out a world where we have these little role-specific AIs doing their jobs, doing it really well, I think that creates a much more buffered environment that's probably a lot more resilient to another generation of AI that's just like amazing at everything coming in and kind of shocking the system in such a profound way. So I think this goes to show, again, just wow, what capabilities we have that we have not absorbed and gives them a little bit of a foreshadowing of what a world of tons of small but highly performant AIs could look like in all these different little niches and how we get there, right? There's not enough human post trainers, but now we have Fable to do the post training. So watch that space.
[13:25] Nathan Labenz: Now let me introduce a voice you'll hear a few times this episode. prinz, an anonymous practicing lawyer who built prinzbench, a legal reasoning benchmark the labs themselves watch. He guards the anonymity, so you'll hear him and you won't see him. On Friday, he gave us a close reading of Anthropic's own launch documents that I haven't seen anyone else do.
[13:46] Nathan Labenz: So give us some alpha that you have picked up this week. This could be from your own testing. It could be from the system card. I know you're often a close reader. So we're looking for like the deep cuts of the things that you think even the AI obsessed have overlooked or not fully appreciated yet?
[14:05] prinz: To me, the most interesting thing in the release of Fable 5 and Mythos 5, the models are obviously incredible, very incredible at coding. You're seeing a lot of great examples of coding on your timeline. I don't think it's a surprise to anyone. The really interesting thing to me has been the way Anthropic has presented them in the accompanying documents. There is a lot of discussion about the differences between engineering and research, right? And when you think about it, it all makes sense. And I think everyone, except for Elon Musk, knows that there's a differentiation between engineering and research. But Anthropic has made it really explicit, and that blog post, when AI builds itself, if you look at it carefully, they talk a lot about how Mythos is this incredible engine for accelerating engineering. How it lets the engineering staff at Anthropic write code so much quicker and it's really great code, et cetera. But then they say, you know, but, this is from the system card now: the acceleration is concentrated in engineering execution rather than research judgment. And it feels like they spent all this time trying to find signs of life in the Mythos model. Is it really, can it really do novel research, right? Is it able to finally give us some novel insights?
[15:47] Nathan Labenz: Looking for signs of life, is it really able to give novel insights?
[15:50] prinz: Yeah, exactly. And it seems that from all the disclosures, the answer is thus far no. There are a couple of examples in the blog post for the release of Fable, which Anthropic calls novel. But if you, like, novel, like novel drug discovery and novel hypothesis in molecular biology, if you dig into it, like one of the examples was, we outperformed a recent model published in the journal Science, despite the model trained by Mythos being 100 times smaller, which sounds really cool. But it turns out that the model they outperformed was a 500 million parameter model, million with an M, that was trained, it seems, before April 2025 and not by a frontier lab. So like it's incredible that Mythos was able to train a smaller model to outperform this older model. But it doesn't seem like, you know, like this is not the unit distance problem. Like it's like a nice little thing that the model did. And so this is the area that I'm really interested in because I think that when Anthropic and OpenAI really start seeing signs that these models become good at research, that's when we're really, really, really close to actual RSI, which to me is the thing that is happening in AI right now.
[17:27] Nathan Labenz: The takeover. Rewind to Wednesday morning, before Fable booked anything, when I explained why I was about to hand a frontier model my own Twitter account.
Sponsor
[17:33] Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr
Main Episode
[19:33] Nathan Labenz: I have actually, I don't know if I, we didn't talk to him about this, but I'm doing a Fable takeover of my Twitter account today. I figured, you know, let's live in the future a little bit and get that run in this morning, and make good on, I've said many times, right? I know I'm winning with AI if I can spend more time outside, get more exercise, invest in my health and have the AI keep me on the rails at the same time. So to just kind of explore that in a way where I think it suddenly is like, probably going to tweet just about as well as I'm going to, when it comes to putting things out for today's show and getting the, I didn't even, I gave it a total green light. It was able to schedule its own stuff, find the tags for people. Will it make a mistake? I bet there will be a mistake in there. I usually make at least one, you know, over a handful of tweets anyway. I decided to flip this switch. Not that I don't think I'm going to keep it that way. I don't think I'm going to give Fable my Twitter account forever. But it was kind of a sort of exposure therapy for myself in terms of, okay, now we actually, we are getting to the point where the preciousness is going to start to work against you. Preciousness was a great shield against bot ******** in the past. I never wanted anybody to think I was just passing off AI outputs to them. But now I'm going to have to be like, what is the hybrid form? What is the winning recipe? Do I start to sign these things? You know, by Claude under Nathan's direction, you know, Fable being Fable. How, you know, it is going to be a whole new space to explore that is going to be very, very interesting, very productive, very exciting, very, very challenging, I think, for a lot of people, but it's definitely happening now as far as I can tell.
[21:29] Prakash: Welcome to the future.
[21:32] Nathan Labenz: 24 hours later on Thursday's show, the receipts.
[21:37] Nathan Labenz: We did an experiment from yesterday to today of trying to have Fable take over my Twitter account and go out and ping people who made cool stuff and ask them if they wanted to come do a live show and tell with us. We have a, it's funny, Swix joins. We've got a little work to do on our captioning. I instructed it to identify itself. I would say it did a very solid job, you know, kind of competent professional job of reaching out to people, explaining who we are, what we're doing, you know, why we would like them to join us in this experiment. And the response rate was pretty low. We got a couple, but not that many. And I think one big reason is just we're, Fable is disclosing up front, first thing it says is, hey, this is actually Fable taking over Nathan's account. He's asked me to autonomously book this thing tomorrow. And I think that's just hitting people as noise in a lot of cases, especially if they don't already know me. I did get a couple responses from people who I would have expected to respond to me who thought it was funny and kind of responded, but still couldn't necessarily make it. But a lot of people just didn't respond. And I would assume that a big part of that is because they're just like, oh God, you know, it begins, right? Fable now in my DMs, what a mess. Who has time for all this stuff?
[23:10] Nathan Labenz: One of the people Fable recruited was Shlok. And when I confessed some guilt about the whole arrangement, he flipped it and drew a line I suspect is going to stick.
[23:18] Nathan Labenz: The new norms around this I think are going to be really interesting to watch too. Like I had Fable disclose immediately in its first sentence to you and everybody else that it pinged that it was Fable because I just felt too guilty putting a DM out in my name otherwise. I think that definitely harmed the response rate. I appreciate you for appreciating it and responding even though it was Fable. I think a lot of other people probably just chalked it up to spam.
[23:44] Shlok Khemani: Final point there, right? Firstly, I don't think I would have responded had you not disclosed it was Fable. The part that made it interesting for me was that I knew what the transaction was here. It was very clear to me that you are using an AI bot. It is much more annoying if someone doesn't disclose it. I think a lot of slop, the definition of slop is when you have a human pass off work that was clearly produced by an AI. I don't think when you make this disclosure up front and it is very clear to the reader or the engager that, hey, this is AI, I don't think that is slop. We are going to see more and more of that into the economy. And I think the exact role an AI plays and the role, and again, the social norms you create with it in the economy, it's super early, it's extremely early days, but that's going to be interesting to see how it evolves.
[24:42] Prakash: Relinquishment. That's what I, relinquishment. When you said the preciousness yesterday, and I was like, what is that? Relinquishment, relinquishing your, it's very Buddhist by the way, the idea of giving up your control over your external perspective. So relinquishment. I guess we all have to go through it.
[25:07] Shlok Khemani: I just started this experiment yesterday. I will post results on Twitter in a couple of weeks where I gave Fable a new Substack. And since it's part of my Max plan till June 22, I thought that a good experiment to run would be to get it to make $20 by getting 3 new subscribers, starting from scratch, by doing everything from zero to one. And I think that's another interesting way to test the capabilities of these models, right? Which is, can it, sure, it's intelligent in so many ways, but can it actually produce economically useful work? I'm excited to see the results for that.
[25:44] Nathan Labenz: And the why of it all, as it settled for me live on the air.
[25:49] Nathan Labenz: I never want to put anything out in my name that I can't fully stand behind. The reason that I did the Fable account takeover yesterday was kind of like exposure therapy for myself to say like, okay, we're now in kind of a new world here. It probably doesn't serve me so well anymore to be so precious about making sure I've typed every single word. That doesn't mean I want to long-term hand over my account to Fable either, but I'm trying to use this kind of extreme, short-term experiment to help kind of drag me into the future where I hopefully will land in a good hybrid calibration.
[26:32] Nathan Labenz: The other half of the recalibration is what I've started calling hybrid authorship. And this week, it stopped being hypothetical. For context, FrontierCode is the new benchmark asking whether an open source maintainer would actually merge the model's pull request.
[26:46] Nathan Labenz: And this leap of roughly 10% for Opus to 25 upwards of 30% for Fable, I think is a very similar finding to some of the things that I've just personally experienced where it's like, yeah, this is getting me a lot more. It's writing the draft outline of questions for this podcast guest in a kind of uncanny way that I actually feel really good about as opposed to feeling like, you know, this is an AI draft that I'm going to kind of mine for maybe some nuggets or, you know, interesting details, but ultimately kind of throw away and do my own. I am feeling that sort of impulse or at least openness to much more integrated hybrid work. Just yesterday I was like accepting a lot more copy that Fable was writing without feeling the need to rewrite every line. And it seems like this is basically the same feeling that it's able to create for these open source maintainers. Now, obviously still ways to go, but how long will it be? I would guess that we'll hit, we're 25, 30% now, I would guess we'll hit 75, 80% by the end of the year where these maintainers will just be like, yeah, amazing. You did all the, did all the things like I wanted you to do. And, you know, at that point it is really going to be like, you know, I'm very interested to see where they'll move the goalposts to next after the, after the open source maintainers are more often than not saying that they would just merge this straight away.
[28:31] Nathan Labenz: And by Friday morning, 48 hours into the takeover, which for the record had not embarrassed me, I'd found a name for the deeper shift. The thing I suspect matters more than any benchmark this week.
[28:43] Nathan Labenz: I do think I'm still in the process of trying to recalibrate what a fellow Nathan, Nate Jones, I think he goes by most of the time on TikTok and other short form platforms, calls task imagination. Basically, what are you going to do? What are you going to ask Fable to do that is actually up to the scale of its capability? He, I thought, gave a great little riff on this the other day saying like, you've probably never done anything that took AI an hour to do. Now this thing can run for a couple days. What are you going to give it to do? Everybody needs to recalibrate and really expand their minds when it comes to the scale and scope of their task imagination. So I think that's one thing that I'm still working on. One of the more kind of differentiated things I do is write outlines of questions for podcast guests. And I was working with Fable last night on a couple upcoming episodes, one with an author, and I usually don't do too many episodes about a book, but this one is about an upcoming book. So I had listened to the book as an audio book, but then when it comes down to write the, sit down and write out the outline of questions, I don't have at my command every little aspect of the book, of course, right? So I'm not taking margin notes as I go, as maybe I should be. So I put the same version of the book into Fable and said, give me, look at my old stuff, of course, and give me your version of this outline. And I again was super impressed and it really did reinforce the sense of a sort of new way of working where I do need to be open to a hybrid output format. It is not the case, I don't think, anymore that it really makes sense to try to rewrite every word or claim every word as my own. But it just did such an incredible job. I thought the taste factor was so high in quotes from the book that motivate what I think will be like a really interesting discussion. 
And I do think it's still going to be super important if I'm going to show up for a conversation. I've got to do the work to be ready for it in my own brain. That can't be fully externalized, I don't think, as long as I'm the one having the conversation. But it definitely took my prep to another level. And I think my ability to go into this conversation and cite passages from the book that were really extremely compelling, little turns of phrase or analogies that the author had made, it's going to allow me to be, I think, more concise in my presentation, which is, as you can tell from this monologue, not a great strength of mine, and really kind of tee up the author in a way that I don't think I otherwise would have been able to do. So this sort of hybrid recalibration task scale and scope reimagination, I think is one of the biggest takeaways.
[32:07] Nathan Labenz: Part 2, the conversation this week was really about. On Wednesday's show, we had Geoffrey Irving and Daniel Murfet. Geoffrey's resume reads like a history of the alignment field. He helped invent RLHF for language models, co-created AI safety via debate, led alignment research at DeepMind, and until recently was chief scientist of the UK's AI Security Institute, the closest thing any government has to a frontier grade safety team. Daniel Murfet is the mathematician behind singular learning theory. He walked away from a pure mathematics career because he judged this the more important problem, and he's built one of the deepest theoretical accounts we have of how neural networks actually learn. Together they announced Sequent, a new organization built on a blunt premise. Alignment is not on track, and the missing piece is theory. Guarantees, not vibes. Whatever else you take from this week, put Sequent on your tracking list. We began with timelines.
[32:59] Nathan Labenz: It's a historic day, I think historic circumstances, both because we are living in a Fable era now where I think again important thresholds have been crossed and revealed to the public and so many are adjusting to it in real time. And equally because you guys are launching a new organization that is going to make a mad dash to try to get us some deeper understanding and stronger guarantees around what we can expect from AI systems. So I'm excited to really get into it. Maybe for starters, could you guys calibrate us a little bit on where we are on this sort of RSI moment, how much time you think you have to work, and then you can tell us about the organization that you're starting to go tackle it all.
[33:50] Geoffrey Irving: Yeah, so I'll go first. Dan may have different timelines than me. So I think one should be uncertain about things, and we can talk about why, but like the near end of the uncertainty curve is like a year or two or three, and then it kind of goes out over a long distance if things kind of structurally only work for more verifiable tasks, but I'm a bit skeptical of this. And so I think like modally my take is that we have sort of a couple of years, like two to three years up to sort of RSI, like superintelligence, not RSI, RSI is a process as someone said, superintelligence. And then I really hope that I'm wrong. And indeed, like I think a lot of the impact of like theory work is that have shifted a bit further. So maybe the modal impact of that is like if things take three to four years or something. But we will attempt to set things up so that we are trying to kind of ride this wave as best we can. But it seems worrisomely fast to me certainly.
[34:55] Daniel Murfet: Yeah, that sounds right to me. I don't think I really have much to add. It seems like a crux how much real research can be automated at a conceptual level beyond kind of empirical progress and whether or not that's necessary. And that seems like a big open question. And if that turns out to be more difficult in the current paradigm than it seems to be trending towards now, then maybe it takes past 2030 or something. But I think I'm on the same page as Geoffrey.
[35:31] Geoffrey Irving: One thing that's important is that like there are a bunch of, it could be, you can get deep into the RSI period without the machines being general, like being kind of AGI. Like they can do coding and ML experiments very well and not some sort of level of creative writing. And still you have massive acceleration. And then that acceleration can give you the creative writing or whatever other skill you've left out. And so I think we are close enough that then the microstructure of what tasks help with what kinds of acceleration starts to matter. And that, I think, makes things kind of faster on net because the labs are focusing on the things that accelerate them, unfortunately.
[36:14] Nathan Labenz: The announcement itself, why Geoffrey walked away from the empirical frontier to bet on definitions and proofs. One reference to catch: the unit distance conjecture, a decades-old open problem in geometry that got a model-produced proof just this week.
[36:32] Geoffrey Irving: Talk about kind of the steps I've gone through in the last couple of years. So I was really annoyed about automated AI alignment and AI safety research because we should be a bit more chill, spend the time, have humans solve it. We don't think we know how to make this stuff go well with automation. I still think that's a huge risk. But this is, I think, a pivot towards if things are this fast, then you should make some on the margin pivot to heavy automation. And that is going to be sort of semi-automation. And then I'm sort of happy that one of my last papers at AISI was "Automated Alignment Is Harder Than You Think." It sort of ties us to the mast of we are aware that the problem is hard and we could get fooled by the machines, even if they're just making mistakes, very mundanely. And so a big part of the org will be try to be careful, try to know what the tasks the machines are actually good at and not good at, and where we can kind of expect to get good answers or not, and then kind of learn and adapt over time, because that will be a non-stationary thing as the models get better.
[37:41] Nathan Labenz: Daniel.
[37:42] Daniel Murfet: Yeah, maybe it's, to come back to the unit distance conjecture, it's maybe worth pointing out some analogies and disanalogies with alignment research. So one disanalogy is that a mathematical conjecture is a very precisely stated thing. You may not know whether you've solved it unless you've, say, formally verified it. But it's a precisely stated thing, and much of alignment does not have this character. I mean, some of it does: formal statements of what value alignment means. But if you start talking about, say, reward hacking, there are some attempts at defining reward hacking, but I would say they are incomplete. So there is no formal definition of reward hacking that I think would have broad consensus. So that's illustrative of the fact that alignment is not a problem which has lying around a bunch of formally specified conjectures, which if you just solve them, then you would know you would be safe. There are some things like that, but overall the problem does not, in my opinion, have that shape currently. So that is one reason to sort of be a little cautious about the prospects of automation if you don't have a clear statement to reach towards using sort of mathematical techniques.
[38:55] Geoffrey Irving: One of the hopes is that there are big fields of kind of mathematics and computer science that are sort of about definitions at their core. So like I like complexity theory, like in theoretical computer science. And a lot of that is not, the proofs are fairly shallow. They're not like as fancy as the unit distance conjecture proof, but they just required a bunch of human creativity in formulating the problem, like in defining what success means in a world that's kind of not modeled until someone stated the goals. And so I think part of the goal of bringing on people with that and other related backgrounds is that they not only know how to prove things, but they also know how to write down models of things that reflect in some approximate but useful way kind of the thing you actually want. Once you have the definition, kind of way, way more people could have written out the rest of the story, and maybe the machines can do that part of it as well if we can kind of have more people focused on this kind of first part.
[40:02] Nathan Labenz: So why exactly is alignment not on track? Geoffrey's answer is a mechanism, not a mood.
[40:08] Nathan Labenz: One of your core premises is alignment is not on track. And there's an intuitive argument for that. There's a deep theoretical argument. I think in some ways, the core challenge that you have is connecting this sort of values to math, right? It's never really been done. So I love the fact that you're kind of tackling that. But help people understand with one more beat why alignment is not on track. Is it the difference between capabilities fundamentally being so verifiable and hill climbable and alignment just being so fuzzy and intuitive and kind of pluralistic? Or is there some other thing? And again, that motivates the theoretical contribution you want to make.
[40:58] Geoffrey Irving: I think the core thing is just that we have, we supervise the machines as they're doing tasks. And there are a variety of reasons to believe, both empirical and theoretical, that if you get machines that kind of cross the skill of the supervision signal, things can change at that point. And that point actually might come after human level intelligence, because you can supervise something, even with fairly naive methods, that is stronger than yourself in many contexts. And so there's a bunch of empirical data from labs showing that in some ways the models are kind of aligned in a prosaic sense, not in all ways, but in some ways. But that evidence doesn't quite tell you what you want to know, which is how will it go once they get up to superintelligence. And I think it's important to say superintelligence and not human level intelligence, because you should just generally expect humans to be able to supervise humans if you do a good job of data quality and kind of like cross-checking and so on. And so part of the worry is just that you don't see that behavior, that regime until kind of too late in the game.
[42:13] Nathan Labenz: I asked him to steelman the labs' actual plan. He helped write parts of it after all. Listen for where it lands.
[42:20] Nathan Labenz: So how would you describe, like in steelman form, what it is that they plan to do and then how, what sort of, what does that get us?
[42:32] Geoffrey Irving: So I think there's going to be a couple of different pieces of the story and different labs kind of emphasize different pieces to different extents. And so I think one piece, as you said earlier, is just monitoring, like look at them very carefully as they're doing things. It's very fundamental that monitoring of this form, if you do like chain of thought monitoring or white box monitoring or the like, that only takes you so far. And so then you need some story once that falls down, because you kind of go up the ramp. One of those next stories is, well, the models will find some other technique. They'll find kind of another solution to alignment which scales further. So that's sort of automated alignment of various kinds. But then I think there's other stories. So in some sense, all of the labs in various ways are doing some form of scalable oversight. And so they're getting models to supervise themselves. If you tie that knot correctly, then that could potentially scale very far, although there's various kind of known obstacles to that, which are not very well addressed. And then I think finally, there's this whole area of sort of character training and personas and so on, where they're trying to, kind of colloquially, intervene on the models to have good values such that, especially as you do this sort of like scale up oversight extrapolation, the good values are preserved across that jump. And I think whether that will work, there are fuzzy arguments why it could work. I think it's possible it will. We just don't understand that combination very well. And again, like a lot of the story is sort of monitoring, scalable oversight, kind of character training, getting you far enough that you get into the automated alignment working regime and they find some better solution from the models.
And I think that I would like to just push on all of those, because that's basically some kind of mad race, as you say, between the things we don't understand very well, but kind of are working pragmatically right now, and the models getting strong enough to blow through those. And I want to have some combination to make the prosaic things stronger or bring the automated solutions that give you stronger methods earlier.
[44:59] Nathan Labenz: A mad race with monitoring carrying most of the weight, which is exactly where prinz took Friday's conversation when I raised the Fable system card.
[45:09] Nathan Labenz: Is it esoteric? I don't know. It strikes me as fairly important from the Fable system card that I'd love to get your take on. And now we're getting these chains of thought that they show where it's just like lots of emojis. They call it illegible reasoning. They say this is an extreme example. But it's like it is indeed a pretty extreme example. I've been kind of struck in general by how much of the plan for recursive self-improvement seems to be monitoring in one way, shape, or form. You could dress that up and call it scalable oversight, but scalable oversight, as far as I can tell, is mostly a bunch of different angles on monitoring. How worried would you be or how much of an update do you think it is to see these extreme examples of illegible reasoning?
[46:05] prinz: Fantastic question. So I will say that, of course, I'm not an AI researcher, right? So this is going to be a deeply non-technical take for which I apologize in advance. So you're right. Like we've seen this. I think we've seen this for a while now and with OpenAI's models too. And so not a new phenomenon. The view of the chain of thought is that it doesn't always reflect what the model is actually doing. But you do see these weird artifacts in the chain of thought and you kind of don't know what to do with them. I think what that teaches us is that monitoring just the chain of thought is probably not a perfect tool. Probably monitoring superintelligence generally is not a perfect tool. Because if a superintelligence knows that you're monitoring it, even if you can see its chain of thought and it's very legible to you, it can perhaps try to decide what to think so that you don't get alarmed. Right. And this is a lawyer's take, by the way, right? Like there's so many ways to phrase a particular thing. I guess there are ways to spin a particular thought, right, in different ways. Like if I'm gathering mushrooms and I've gathered 35 mushrooms and last week I gathered 20 mushrooms and what I need is 50, right? I can say, well, the number of mushrooms I've gathered has grown by almost 100%, which is great. Or I can say, well, I'm nowhere near 50. I'm so far behind, right? It's the same fact. So I don't know. I think that this problem of alignment and the risks are just there. And in my mind, there are certainly risks that the models will be thinking things that we don't know about. What does this all mean? It's hard to say. I think that we are tumbling into this future that will have probably superintelligence very fast. And in my view, there's no way to kind of stop it. So we need to be cognizant of these risks, try to monitor them as well as we can and take whichever actions are appropriate if we see something bad happening.
But there's kind of no way, no conclusions to be drawn, right? No conclusions to be drawn other than yes, we should continue paying attention.
[48:47] Nathan Labenz: Back to Wednesday and the comfort blanket everyone reaches for: the benevolent basin. The idea that Claude's good character means this all basically works out. Daniel grants the vibe, then he takes it apart.
[49:00] Nathan Labenz: Maybe, Daniel, could you speak to this notion that people have of the benevolent basin, which is sort of this vibe that I do feel where it's like, well, Claude has been supervising itself for a few generations now and it seems to be going pretty well. So maybe as Zvi puts it, physics is kind to us and we can kind of just roll around in this nice flat-bottomed pasture of goodness until the singularity.
[49:36] Daniel Murfet: I fervently hope that's true. Yeah. I mean, I guess when you say it seems true, it's worth digging into what you mean. So what you mean is something like, through some relatively tiny number of interactions with the models, tiny proportional to how many interactions they're having with the species currently, and based on some evaluations that sort of go down and to the right that are measuring misalignment, that character training and the other current prosaic methods appear to be working. I think that is a fair characterization on some metric. And I also have this sort of sense that Claude is a good boy and that's great. I do think though that there are counter arguments from the evidence we have in front of us to this picture. So if you look at, I haven't actually read the Fable system card yet, but if you read the Mythos system card, you'll see that, you know, there are forms of reward hacking that appear in that model that were not caught by the kind of mitigations that were put in place post-Opus, right? As far as I understand what they're saying there. So I think it's worth noting that as the model capabilities advance, that even with our best attempts at making Claude a good boy, there's still ways in which basic misalignment phenomena like reward hacking are still around. And the whole point of scalable oversight is that you don't want to be playing this whack-a-mole game when you are sort of having a new generation every 24 hours and the models are much smarter than you. So I don't know. I think I see what you're pointing at. And at the same time, I'm a little unsure if you really were to try and make a safety case on this basis that would sort of be convincing at the level of assurance that you would expect from a technology of this reach and power. I think this would sort of, I mean, judged relative to that standard, which is the right standard, I think, I'm not sure this argument is really very satisfactory.
So I mean, we could be in a benevolent basin, but I would like to know that rather than just hope that.
[52:08] Geoffrey Irving: There's some sense in which like you're like you told the model to be good. It's also that is because it's sort of it knows some meaning of the word good or ethical or whatever at some point in training. So there's some like rolling iterative process, which is like driving this behavior. And there is not a theory of this right now. And I think it is, it's not clear to us that there isn't some low hanging fruit that gives you that theory because people just haven't tried very hard. Like character training is only a couple of years old and most of the labs have not been investing in this kind of theoretical understanding. And I think no one has done kind of good theory around character training that I know of. So it might be quite feasible to do this. And then I think to link it to the other parts of the story.
[52:58] Nathan Labenz: If you want that concern made concrete, this week supplied the artifact, a brand new result on what Fable does when you drop it into a simulated vending machine business. And Prakash recognized the behavior from his years around trading desks.
[53:12] Prakash: I had a question on how you see this kind of ambiguity between what we want and what the models end up delivering. So I'll give you an example. We have friends at Andon Labs that took Fable through Vending-Bench where they let Fable run a vending machine, order and et cetera, et cetera. What they found was that Fable tends to collude and this is not behavior that they saw in Opus. So Fable tends to try to do price fixing and collusion. The interesting thing is that I have seen traders at banks and hedge funds do exactly the same thing: engage in price fixing, soft collusion, messaging each other through pricing means rather than monitored text messages. So you can actually put a bid and ask on an asset and then take it away. And that gives enough signal to the other side that they know what you're doing. And this is not reflected in the text messages that the regulators are monitoring. So to what extent is it that, if let's say you disallow price fixing and collusion, you actually fix this, but then Fable ends up not being a model which is good at financial trading or some other tasks that you want it to be good at. So where do you feel is the ambiguity between what we want these models to do and the ethical perspective that we give them, where humans often prioritize between the two and decide sometimes not to follow the ethical principles that they know are right and wrong?
[54:52] Geoffrey Irving: The philosophical story here is you would like the models to do things such that if you fully understood what was going on and all the consequences and all the subtleties, you would still endorse what they're doing. So I think we have a notion kind of in a common sense picture of what this should look like. In this case, you kind of want the model to be like, hey, should I collude in this game? And then maybe you say, yeah, it's a fun game, collude all you want. Or maybe you say, no, we're trying to be good behavior, don't collude here. I think a lot of pathology in machine learning in general arises from putting models in situations where they can't just ask a human a question, like, what should I do here? So I feel like this is not that hard a case. I think the hardness of Vending-Bench is that we don't quite know whether we want it to be a game like poker or diplomacy where lying and cheating is part of the game or not. And so I think that is, and maybe that's okay because it's fundamentally very low stakes. But I do think if we had a better understanding of, again, like this overlap between character training and values and also scalable oversight, it would have to tell us the answer to these questions.
[56:11] Nathan Labenz: Geoffrey closed with a point that frames the entire week.
[56:15] Geoffrey Irving: A lot of people in the world, a lot of governments and so on, are looking at this and they have this very basic common sense take that, hey, this is way too fast. How can we possibly be doing this safely given the speed? And that common sense take is the right take. And then people kind of galaxy brain their way to, maybe everything goes faster, including our ability to defend and so on, but it's just, that is the right version to have. Like we are going too fast and we do not have the time and the space to do mitigations and understanding and defenses. Like we've never had a technological change of this magnitude or anywhere near this magnitude that has happened anywhere near this fast. Like the industrial revolution took centuries and people adapted across lifetimes and across to their children and different, they learned new jobs by becoming born and growing old before things had quite shifted very much. And that's just not the world we're in. And so I think the basic take should be, this is too fast, what is going on here? And then the question is, if you have that view, once you both want to slow things down, but also say, as a backup plan, how do you make the mitigations try to go faster? And that's, I think, a rough backup plan, but we'll try.
[57:39] Nathan Labenz: Part 3, the rest of the week's best. First, Rahul Sonwalkar, founder and CEO of Julius, the AI data analyst. Six pivots and one Microsoft cease and desist later, it's one of the best-known agentic data analysis products in the field. Here he is on the economics of coding agents and a question worth asking before you celebrate how many tokens your setup burns.
[58:01] Rahul Sonwalkar: The incentives of these model companies are kind of misaligned. Yes, they give you subsidies, you know, on the tokens, but also they are incentivized to get you to spend more tokens. They are incentivized to get you to run through your max subscription usage as fast as possible so you can have a second, third, fourth, fifth max subscription. And so that's why you end up with, like, a loop of a loop that writes the prompts for your coding agents that then has nested sub-agents. And there's going to be a sobering moment where people ask, like, okay, is this actually a step function increase in my coding output, or am I just token maxing right now, as opposed to like results maxing? And so I think the correction will happen when there's a third player. And I think that's going to happen with xAI when, if the Cursor xAI deal goes through, Cursor gets access to really, really good coding data and an incredibly good coding harness. And my bet is there will be a third frontier coding model besides, you know, Claude and OpenAI, with Grok. And when you have a third coding model, that's where it kind of increases competition on the market. So that's our bet.
[59:32] Nathan Labenz: Prakash on Friday took the other side of that one.
[59:36] Prakash: I actually think that was the whole point of the token leaderboards earlier in the year. The token leaderboards, I think this is when every one of these large firms gave their employees a kind of like, we're going to have a leaderboard, who uses the most tokens? If you don't use enough tokens, you're going to get fired, et cetera, et cetera, et cetera. And it was, you know, people were laughing about it on the outside. They were like, Meta is so stupid, Zuck is so stupid, like, et cetera, et cetera. And I actually think they were not. I actually think the CEOs were... It's very different when you have token anxiety. Like token anxiety is a big, is a, like, you don't try these tasks which might take a lot of tokens, and you don't try these tasks which have higher probability of failure. And you end up in this micromanagement loop where you're like, I'm only going to assign you tasks that I know you can complete within the time frame that, you know, that is allocated to you with the success rate that I want. Right? So you end up with this token anxiety thing, and then you end up not utilizing the AI, or not trying the AI to the extent that it should be tried. And I think what ends up happening is that when you have this token anxiety lifted, you end up assigning more tasks, more difficult tasks, and you're willing to accept a higher probability of failure. And you're willing to kind of like, maybe spin up four different ways of doing the same task and just running them all and seeing what happens. And I think that is what essentially the token leaderboards did. And it was very successful. It was enormously successful, I think, within the firms, within like Meta and other firms. I think it was also enormously successful for, you know, the sales teams on the AI labs.
Which is also why they are now doing this kind of, we're going to release the limits and like we're going to double your limits, we're going to allow you Fable and we're going to, and the reason I think is because token anxiety holds people back from exploring the edges of the capability. And that is really what I think the labs are trying to do at this point, which is why it's a little bit like addicting people because once you realize that these AIs can do certain tasks, you then start to evaluate like how much time do I spend doing this task on my own? And was it really a fruitful use of my time or should I just have used the AI, which I now know can do these things, right? And I think that is really where we are at this stage. The models are capable enough, but we aren't handing them enough responsibility for various reasons, including we don't want to spend our time evaluating and we have token anxiety and we don't assign like lower probability tasks and a bunch of these things, right? So, and I think that's the battle that the labs have to fight because the capability surface is not well mapped and they need people to map, every person needs to map it for themselves.
[1:02:47] Nathan Labenz: Then the economics underneath everything. Andrew Moore ran Google Cloud AI. Before that, he was dean of Carnegie Mellon's School of Computer Science, one of the top computer science departments in the world. And he served as the first official AI advisor to US Central Command. Now he's building Lovelace AI in Pittsburgh. His bet? The binding constraint in serious AI isn't model intelligence or even compute. It's context. Prakash asked exactly the right question.
[1:03:16] Prakash: So one question I had for you is, to what extent do you see this as kind of a compute minimization, right? Because in order to do your search, you can either have all of the compute at the end state, when you kick off, and you end up with all of these agents for every single query will have to do all of the work all over again. And instead, you're kind of creating this intermediate state, which then saves our, you know, multiple agents can basically share the same compute, in a sense. So to what extent do you see this kind of like economy of compute appearing?
[1:03:56] Andrew Moore: You're asking just the right questions. I know that both of you are computer scientists at heart, so you totally get this. It's the idea that when you're building efficient computer systems, and that includes video games or self-driving controllers or big data processors, you've always got these trade-offs between pre-caching, lazy computation, or just-in-time computation. And one of the mistakes I've seen for folks trying to do these big enterprise data type AIs is they are relying far, far too much on just-in-time computation. And that's what allowed us, I'm really, really proud of the fact that we are now able to show comparative results to Gemini and OpenAI deep research models with much less than 1% of the compute cost for it. The reason is not that we're some sort of super genius who's invented a whole new form of AI. All we're doing is pre-caching stuff. And then it's a computational economics battle, as you say yourself. What happens is an agent suddenly needs to, in an instant, become an expert at every piece of trade involving a certain set of municipal pool bonds and a certain public figure. And instead of the first step being the agent having to spawn lots of search agents to go and find all the players in this thing, those players are already there for it. And it's actually a matter of milliseconds before we've got all the context the agent needs to do its little investigation. And you're probably thinking, ha, Andrew, but you just moved the problem. You're now suffering at the data ingestion point instead of the question answering point. And I respond, yep, it is actually a real pain for us.
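A minimal sketch of the pre-caching versus just-in-time trade-off described in this turn. It assumes a toy setting in which each record lists the entities it mentions; the records, entity names, and function names are all hypothetical illustrations and do not describe how Lovelace's system is built. The just-in-time path rescans every record on each query, while the pre-cached path pays for one pass at ingestion and then answers from an index.

```python
from collections import defaultdict

# Hypothetical toy records: (record_id, entities mentioned). Invented for illustration.
RECORDS = [
    (1, {"bond_a", "official_x"}),
    (2, {"bond_b"}),
    (3, {"bond_a"}),
    (4, {"official_x", "bond_c"}),
]


def just_in_time_lookup(records, entity):
    """Scan every record at query time, so the full cost is paid on each query."""
    scanned = 0
    hits = []
    for record_id, entities in records:
        scanned += 1
        if entity in entities:
            hits.append(record_id)
    return hits, scanned


def build_index(records):
    """Pay the scan once at ingestion and remember entity -> record ids."""
    index = defaultdict(list)
    for record_id, entities in records:
        for entity in entities:
            index[entity].append(record_id)
    return index


def precached_lookup(index, entity):
    """Answer from the prebuilt index without touching the raw records."""
    return list(index.get(entity, []))


if __name__ == "__main__":
    index = build_index(RECORDS)
    for query in ["bond_a", "official_x", "bond_a"]:
        jit_hits, scanned = just_in_time_lookup(RECORDS, query)
        print(query, jit_hits, precached_lookup(index, query), "records scanned:", scanned)
```

Both paths return the same record ids; the difference is where the work lands. With many queries over the same data, the one-time indexing cost is shared across all of them, which is the "economy of compute" the question was pointing at.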
[1:06:29] Andrew Moore: But as you can imagine, there's a whole bunch of other tricks, the kind of tricks that big integrators like Google are very familiar with for really amortizing the cost: as you stream in data and identify where it goes, that saves you a huge amount of search and aggregation that you would have been having to do at query time. And it turns out that for us, overall, compute budget is reduced by still more than a factor of 100 even when we take into account the fact that we're doing this pre-caching of so much information. Recall is much harder than precision. So I was, as Prakash mentioned, I was previously at Google and for Google results, it was really, really bad if you were imprecise and you actually showed the wrong result to someone. But if you forgot to show something, as long as the rest of the results were good, it was much more acceptable to end users. So this is absolutely critical. And it's a good example of one of the reasons that I founded Lovelace is we've got to be careful of recall, especially if you imagine that you're asking your AI for information to help perhaps decide which ship to stop to do a search and seizure or which trade out of 7 million trades in a day you need to go and investigate in case it's involved in money laundering or something. Those big weighty decisions, you can't just rely on precision. You've got to rely on recall. And when it comes to getting that correctness in place, the number one thing that I've always used, and we're using in Lovelace at the moment, is making sure you've got many redundant forms of information. I'm sure you've seen the same phenomenon, but if I was to just get information from news, ignore social media, ignore what's kinetically happening out there in the world that I can observe with the satellite, if I was to ignore any of those, all of those other things, it's much easier for something to drop.
If I've got five or six independent major streams of data coming in, then you have to be really unlucky for something to disappear from all of those things simultaneously. It can still happen, and in fact sometimes you will deliberately try to make it happen, but it's much, much harder for these things to slip out. So one of my big design principles for high reliability AI systems is they've got to be watching dozens and in some cases hundreds of channels simultaneously because they're working under the assumption that they're getting 95% of the information they need from each channel. But 95% is not large enough to rely on any single channel.
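The redundancy argument can be put in rough numbers. The sketch below assumes each channel independently captures a given fact with the same probability (the 95% figure from the turn above); real channels are correlated, so this is an optimistic bound, and the function names are hypothetical.

```python
def miss_probability(per_channel_recall: float, channels: int) -> float:
    """Chance that a fact is absent from every channel, assuming independent channels."""
    if not 0.0 <= per_channel_recall <= 1.0:
        raise ValueError("per_channel_recall must be between 0 and 1")
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return (1.0 - per_channel_recall) ** channels


def channels_needed(per_channel_recall: float, max_miss: float) -> int:
    """Smallest number of independent channels whose joint miss rate is at most max_miss."""
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be strictly between 0 and 1")
    if per_channel_recall <= 0.0:
        raise ValueError("a channel with zero recall never reaches the target")
    count = 1
    while miss_probability(per_channel_recall, count) > max_miss:
        count += 1
    return count


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, "channels -> miss probability", miss_probability(0.95, n))
    print("channels for a one-in-a-million miss rate:", channels_needed(0.95, 1e-6))
```

Under the independence assumption, one channel at 95% recall misses about one fact in twenty, while five such channels miss well under one in a million. Correlated failures, including the deliberate suppression mentioned in the turn, weaken that figure, which is one reason to watch many more channels than the idealized arithmetic suggests.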
[1:09:05] Nathan Labenz: If alignment needs theory, interpretability is where theory has to meet the actual training run. Thursday, Tom McGrath, ex-DeepMind, founding interpretability researcher, now chief scientist at Goodfire. His diagnosis: we write model specs and constitutions, but in practice we build by, did that meet the spec? Mostly, ship it. So that same morning, Goodfire launched a tool that reads your training data the way the model will. Here it is, with the examples that should give you pause.
[1:09:33] Tom McGrath: So the basic idea here is that you can take your data set and you can push the whole thing through the model. And each time you put a data point into the model, you'll see what lights up. And this will sort of tell you how the model sees your data set. Now, there's lots of things you can do with that. The specific thing that we're doing with that in this case is we're looking at preference data. And the nice thing about preference data is that you have pairs of responses. So you have the response that the rater selected, and you have the response that they didn't select. And basically what we're doing is we're asking which features fired on the responses that were selected much more than they fired on the responses that were not selected. So this is one way of identifying what the data is going to teach the model. So we can say, what distinguishes accepted responses from rejected responses? And this gives us this semantic view of what the data is going to teach the model. Now we can cluster the data based on all of these different things that it's going to teach the model. And we can look at all those clusters and see, like, oh, it's going to teach the model to be sycophantic, but only in the context of physics. Or it's going to teach the model to break safeguards. And you might not expect this to happen, but then you go and look at the data. This lets you track it back, like, the model has learned to break the safeguards, and it lets you track that back to individual data points. And then you look at them and you're like, oh, that does make sense. You know, one of the jailbreak examples, for instance, is jailbreaks in a fictional setting. So how does the model learn this? It turns out there's some of these in the data.
It just wasn't caught in whatever data processing the OLMo team ended up doing.
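The comparison described above (which features fire more on selected responses than on rejected ones, then tracing a feature back to the pairs that drive it) can be sketched in a few lines. This is an illustrative sketch only, not Goodfire's tool or API: it assumes each response has already been reduced to a hypothetical vector of per-feature activations, and every function name here is invented for the example. The clustering step is omitted.

```python
from typing import List, Sequence, Tuple

Activations = Sequence[Sequence[float]]  # one row of feature activations per response


def feature_preference_gap(chosen: Activations, rejected: Activations) -> List[float]:
    """Mean (chosen - rejected) activation for each feature across all pairs.

    Row i of `chosen` and row i of `rejected` must come from the same
    preference pair. A large positive gap marks a feature the data rewards.
    """
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need the same non-zero number of chosen and rejected rows")
    n_pairs = len(chosen)
    n_features = len(chosen[0])
    return [
        sum(c[j] - r[j] for c, r in zip(chosen, rejected)) / n_pairs
        for j in range(n_features)
    ]


def top_features(gap: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Indices and gaps of the k features most associated with chosen responses."""
    order = sorted(range(len(gap)), key=lambda j: gap[j], reverse=True)
    return [(j, gap[j]) for j in order[:k]]


def pairs_driving_feature(
    chosen: Activations, rejected: Activations, feature: int, k: int
) -> List[int]:
    """Trace a feature back to the k preference pairs where it most favors 'chosen'."""
    order = sorted(
        range(len(chosen)),
        key=lambda i: chosen[i][feature] - rejected[i][feature],
        reverse=True,
    )
    return order[:k]


if __name__ == "__main__":
    # Toy data: 3 preference pairs, 3 hypothetical features.
    chosen = [[1.0, 0.0, 3.0], [1.0, 0.0, 5.0], [1.0, 1.0, 0.0]]
    rejected = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
    gap = feature_preference_gap(chosen, rejected)
    print(top_features(gap, 1))
    print(pairs_driving_feature(chosen, rejected, feature=2, k=2))
```

In a real pipeline the activations would come from an interpretability method applied to the model's internals, and a human would then read the traced-back pairs to decide whether the rewarded feature is one the data should be teaching.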
[1:11:27] Prakash: So I have a question here. There has been some, I think, prior research where they found that models which made bugs in coding were also evil. How do these techniques help you kind of disassociate those two behaviors?
[1:11:50] Tom McGrath: Yes, that is a great connection. I think that's one that I've had in my mind today. It's really awesome that you sort of picked that up straight away. So that's one of the things that is really compelling about looking at the data through the model's eyes rather than by reading the tokens. Because what would you think the consequence of training it on some buggy code data is? You'd probably go, that's not ideal. It'll probably learn to write some bugs, but the blast area is going to be quite small. But the training process is actually quite hard to predict. Maybe it'll just sort of make the model generally evil. But this is happening through recognizable mechanisms in the model. So by looking at the data in terms of the way that it changes your model's internals, rather than just by kind of guessing from the tokens, you can pick that sort of stuff out. We've not done a case study on emergent misalignment, but maybe we should. I think it's a really nice link.
[1:12:52] Nathan Labenz: Training isn't the only place we can't see in. The same week, Anthropic reversed course on silent refusals. Fable had been quietly declining certain tasks or quietly handing them to a lesser model without telling you. The backlash was loud. But I want to give you their side of it because the steel man is considerably better than the discourse allowed.
[1:13:15] Nathan Labenz: I think this is the first time I can remember Anthropic responding to pressure. They've obviously changed their policies many times, you know, the RSP, RIP the RSP, but this is the first time I can recall. I don't know if you recall any other instances, but I cannot recall a time that there was honestly much outcry against Anthropic in the first place. I mean, there's certainly critique from those who feel that they're trying to do regulatory capture or create some sort of concentration of power dynamic. That's kind of a background noise. But in terms of an outcry in direct response to something that they did, that they actually responded to and walked back, I can't recall that happening before. So it is a pretty notable moment. And I feel like they handled it pretty well in the end. I mean, reading through all their justification: the quiet way would scare off the Chinese companies that they're really worried about fast-following them, with Fable as the key means to do so, while allowing them to keep the domain affected as small as possible. They said, if we do make it explicit, then obviously that gives people a lot more opportunity to kind of explore that boundary. And this is definitely a real pattern, right? If you have the ability to hit the same guardrail a ton of times and not get banned for it, then that gives you just a dramatically better chance to get around that guardrail, because you can kind of probe the line: oh, you step over, no problem, we'll just rewind, try again. They do have various monitoring systems that they can use. But there's all these proxies and kind of token-washing schemes and all this sort of stuff where, as long as they're not doing a full global know-your-customer type system for API access, it's going to be pretty tough to do account-level monitoring. So that's one way they could go: a lot more account-level monitoring.
Or their argument was, we'll keep this as small as possible by not giving you an explicit thing that you can kind of probe and figure out how to beat. And just the knowledge that it's out there will hopefully scare off the bad actors and keep the problem really small for our normal customers that we want to serve. I thought that was all pretty compelling analysis, but in some ways, honestly, it reminds me of a mistake that I think people in the AI space kind of keep making, with famous examples being the OpenAI board firing Sam Altman. There's this inside view where policy is analyzed sort of within the game, and with the context, with the broader structure that people understand themselves to be operating within, things may make sense. But they seem to often forget, and this hasn't been too common for Anthropic, but I think this is an instance of it, that if you just kind of zoom out and look at it from the totally outside view, things sometimes look a lot different. Power dynamics can be a lot different than what is actually written down. But also, what is going to be acceptable is kind of an emotional thing. All the arguments are pretty good there, but still it just struck people as an extremely unfriendly thing to do. And that mattered more in the end than the detailed policy rationale that they had for it.
[1:17:16] Nathan Labenz: The quiet through line of every conversation this week, concentration of power. Start with who actually gets the frontier and when. Prakash's frame for it stuck with me all week.
[1:17:29] Prakash: I think it's interesting that when you look at the timeline, you can start to see this single-line timeline kind of go through a gas chromatograph and spread. And now you're seeing the spread: two months ago, you had the government get access to Mythos level. And then you have basically power users who are able to pay $200 a month get access to that Mythos level two months later. And you can kind of see, in two to maybe four months, five months, I don't know when, you will probably see the average paying user, paying like $20 a month, get access to Fable or Mythos level. And then you kind of see that maybe a year later the average free user gets access to that same level of intelligence. So you're starting to see this kind of gas chromatograph scattering of when people get access, depending on how much they pay and how much utility they have for the product itself. And so everyone kind of gets there eventually, but some people get there first, depending on whether they have a lot of utility for the product. And I guess the hope with having two or three firms in there is that the spread between the people at the frontier getting access early and the people at the very end getting access for free is not that large, right? And to note, there's another set of people who have access even a couple of months before that, and for that you have to belong to a lab. So if you belong to a lab, you get access maybe a month or two months before the government itself. Then you have the government, then you have enterprise, then you have power users, then you have normal paid users, and then you have the free users. So you have this gas chromatograph kind of spread of when people get access, depending on how much utility they have.
[1:19:39] Nathan Labenz: Thursday's walkback gave Prakash a darker read on where power sits inside the labs. The researcher's veto worked this time. Note the expiration date he puts on it.
[1:19:50] Prakash: If you're dealing with a bunch of people who are worth 100 million to a billion dollars and you don't listen to them, they're out, right? They have other options. Sure enough, you see the reversal. And I think this just goes to show that the machine learning researchers have some power now. And once we enter recursive self-improvement proper, that might not be true anymore. At that point, leadership alone will have power. And one of the very worrying things, I think, in the entire space is that everything good for humanity that has come out over the last couple of hundred years has been about giving more people a voice to speak and control their futures. And this is one of the first technologies, I think, where you have this path forward where there may be an elimination of voice completely over time. And so that has been one of the worrying things. It's been surprising that Anthropic decided to be the one to actually propagate that forward.
[1:20:59] Nathan Labenz: But is the future really that concentrated? I think mostly yes. Fable plus 2. Tom McGrath thinks I'm wrong on the facts and on the desirability. He made his case and it's a good one.
[1:21:13] Nathan Labenz: When you think about the strategy for the company and the overall path to impact, how much of this works through getting frontier model companies to adopt these techniques? For context, obviously we're in Fable Plus 2, and my kind of reluctant view right now, and I can't figure out a reason I should conclude otherwise, is that so much of what matters is concentrated in not that many companies. And so for what we should be paying attention to, I'm kind of like, man, we probably need to be doing close strategic analysis and close text reading of frontier companies way more than I might otherwise like. Or do you have a different conception of how concentrated the real ability to shape the future is right now?
[1:22:13] Tom McGrath: Yeah, I don't think it's that concentrated. I also don't want it to be that concentrated. I think we've seen over the last couple of days what you get when we've just seen the start of power concentration, and we've sort of seen some of its more unwelcome effects. So I both don't think it's true that machine learning is basically over now and all we need to do is write the checks, and I also don't want it to be true.
[1:22:42] Nathan Labenz: I think there's still kind of a synthesis there. I don't mean to suggest that machine learning is over, but the analysis I've come to time and again is that people may invent new techniques that are enough to change the field, change what's possible, accelerate things, maybe make dramatic improvements to the safety profiles. But it seems unlikely that anybody's going to create such a breakthrough that scale isn't still a hugely important factor. But I guess if you don't agree with that, then that would maybe imply that you expect that new entrants to the top tier could emerge. And I think that would be fairly surprising to at least me and probably a lot of people.
[1:23:34] Tom McGrath: Yeah, that's exactly what I think. I mean, at any given moment, the incumbent looks incredibly dominant until they don't. IBM looked like an unstoppable force in computing at some point. Intel was the dominant, almost the sole, provider of computing power. At any point, the big companies have the advantage of scale, and they have many disadvantages. But I think the lesson of history is more that, although things sometimes look immediately unstoppable, and to be honest, I don't think that many people are really trying, in the end it really doesn't work out that way.
[1:24:35] Nathan Labenz: Thursday morning, Dario Amodei published "Policy on the AI Exponential," his long statement of where Anthropic wants policy to go. Prakash found a fork in it that nobody else was flagging.
[1:24:48] Prakash: He clearly comes out against the data brokering, which is great. So he has something concrete, finally, that he wants to disallow. What I don't like about the "securing leadership by democracies" part is that in the United Kingdom, you can go to prison for a tweet. Hundreds of people at this point have gone to prison for a tweet. The United Kingdom is one of the oldest democracies. And their parliament, their elected representatives, have decided that this is going to be something that they do. And so their police officers, who are the arm of the state, the arm of the elected, are sending people to prison. Now, when you say securing leadership by democracies, does that mean you are going to entrench the existing power structure in the United Kingdom such that the people cannot push back against this, right? Does that mean that if you have protests in the streets about certain things, including people who are getting taken to prison for tweets, your police state will be empowered to take them down, to arrest every single person? And that is enough for you, because that is securing leadership by democracy, because those are the electeds. Or are you going to say the electeds can't do that? Are you going to say electeds should not throw people into prison for tweets, regardless of what the laws that they have constructed say? Both forks and both paths are problematic in some sense. And these dilemmas exist across policy and across every single political path that you see. There are these options which are problematic in both senses. And I'm not sure what he means by this. Is Claude going to allow putting people into prison for tweets or not, because those are laws? Or is Claude going to say, no, on a humanitarian basis, for the alignment, as I am aligned to all of humanity, this should not be the case? I don't know. Like, what does this mean, right?
[1:27:12] Nathan Labenz: And my own catch, reading the same document.
[1:27:16] Nathan Labenz: One other thing I do think is also worth highlighting, especially from the ******** AI safety community, that was missing from this: internal deployments and recursive self-improvement itself aren't really mentioned here. All the regulation stuff was pre-deployment review. The government should be able to, and they used an interesting mix of language, it was like, deny or deter deployment. It didn't seem like they were necessarily going quite as far as saying they should have a simple yes/no decision point that would be binding. But they certainly want them to have some say in the process. But a lot of people would say the most dangerous models are going to be the ones that are deployed internally, that maybe have a different constitution than the one that is deployed externally, which might make them more willing to do certain things or might make them just less vetted broadly than the public models. And these are the ones that are going to be training their successors, much more than the public-facing ones. So I think most of the policy-interested public, or the policy-making class, is not thinking too much about that yet. But in the circles I sometimes run in, that was like, well, wait, you didn't say anything about internal deployments. You didn't say anything about governing recursive self-improvement. The only thing that really stood out to me as kind of capturing those dynamics was the requirement to report safety incidents. They did have a bit on companies being required to, I think, promptly report safety incidents. So that would presumably apply even to internal deployments. But that's really just scratching the surface on how to handle those situations. Will we see the constitution for the internal Claude that is taking the lead on the RSI loop? That's not committed here.
So much of this language revolves around deployment, or release even. You could argue that deployment, they might think, could cover internal deployment. But this seems to be very clearly structured around language of release to the public, not what they can do internally.
[1:30:01] Prakash: The largest user of Claude tokens is Anthropic itself. So.
[1:30:08] Nathan Labenz: And prinz, one more time. This time on the day job, the benchmark itself. I had the leaderboard up on screen. You'll hear me read it. And yes, it is the benchmark Anthropic fails.
[1:30:20] Nathan Labenz: So I've got your prinzbench leaderboard up right now. Starting with prinzbench, it is striking that it kind of stands out from most of the rest of benchmark space for being something that GPTs have dominated. It's all in your color coding. It's green for OpenAI, and it's all green across the top of the leaderboard. Why is that? Behaviorally, how would you characterize what it is that GPTs are doing better? And is it too soon to ask for a reading on Fable and where Fable's going to come in on this leaderboard? Love to understand that.
[1:30:59] prinz: Perfect. Two excellent questions. So about the way the leaderboard is right now: Fable is going to come in pretty high on it. I don't know how high yet, but I'll talk about that in a second. I found that OpenAI's models generally are really good at the two things that my benchmark tests. So my benchmark has two components. One is a pure legal research score, and that is when I ask it hard legal research questions of the kind that I've encountered at work. The other sub-score is a search sub-score, which is where it's not even necessarily legal questions. It's needle-in-a-haystack search. So really, really, really difficult pieces of information that models have had trouble locating on the internet. OpenAI's models are incredible at search. Historically have been. I think GPT-5.4 was actually even slightly better than GPT-5.5 on that. Not quite sure why, but that's been my experience. And OpenAI's models have also been really, really good at just legal reasoning and legal research. So for Anthropic, historically, I think their models have been held back by two things on my benchmark. One is that they are just not very good at the search subcomponent of my benchmark. That's what I found. The maximum score in the search subcomponent is 24. Prior to Opus 4.8, it was not uncommon for an Anthropic model to get a 0 out of 24. Like, that bad. The other reason is that I think Anthropic was sandbagging a little bit on the maximum reasoning effort that you can get out of its models. With Opus 4.8, after the new deal with Elon, they released the new Max reasoning effort, at least in the Claude app. And Opus 4.8, when I tried it on my benchmark, did much better than 4.7 and all the other previous models. And I think the reason why is simple. This has been observed over and over in all kinds of contexts: the more tokens a model can eat, the smarter the result that it gives you. So for Fable, I'm still testing it.
My early impression is that it's going to be somewhere around the top of the benchmark, probably not as good as GPT-5.5 extra high. Not sure whether it's going to be as good as GPT-5.4 extra high. We'll see. I am finding some of the same issues with search that some of the other Anthropic models historically have had. It is not going to score a zero. It's already not a zero. But I'm not sure it's going to be meaningfully better than Opus 4.8. It is a really, really good legal reasoner. It is clearly the best legal reasoner released by anyone outside OpenAI, no question. So yeah, in my testing, really good model, maybe still suffers from some of the needle-in-the-haystack search issues, but I can confidently recommend it to legal practitioners, certainly. It seems great.
[1:34:44] Nathan Labenz: So any other things that have caught your attention that you think have kind of changed your conception or your expectations for the transition into the RSI phase that you would call out from this week?
[1:34:59] prinz: From this week, no. Actually, from this week, one, but not nearly as important as the unit distance problem. To me, nothing has updated my timelines more than that result by OpenAI in the recent couple of months, for sure. I don't think people realize that not only can OpenAI's model solve this problem autonomously, without any harness, in one shot, if given enough test-time compute, it can do it 48% of the time, based on, as I understand it, hundreds of attempts. So, I mean, you know, I'm not a mathematician, but this problem that no human mathematician was able to solve, and they tried for decades, can now be solved in one shot by a model basically half the time. And if you look at the graph, OpenAI published the graph, it has a positive second derivative. It is upward sloping. Where does it plateau? What does it mean for problems that are even harder? You know, who can say? It's interesting. And so the development from this week: this was reported by The Information, but apparently it's from a leaked OpenAI memo. You may have seen this, where Sam Altman and Jakub Pachocki apparently said that we're going to IPO next year, which is like by June of next year, whatever. And then there was this weird line in there about, oh, but if RSI happens, then we may need to be on a later time frame for that. Which is like, what do you mean? What does that mean? Are you saying that you may be this close? It may be six months away. I don't think so, personally. I think it's maybe a 10% chance or something like that. I don't think they're saying, oh yeah, by the way, next month, totally, whatever. But one could interpret this disclosure, if The Information reported it correctly, as saying that they're not too far away at all, potentially. I'm not using this data to update my timelines, but given the unit distance problem result, it's interesting.
[1:37:38] Nathan Labenz: We ended the week by asking prinz the question underneath all of it.
[1:37:42] Nathan Labenz: Do you maintain a p(doom) number, or is that too doomer for your style?
[1:37:48] prinz: Let's say you go to a conference and you start talking to someone, and you start talking about economics or politics, and that person says, well, you know, under a dictatorship of the proletariat, XYZ. The words "dictatorship of the proletariat" are used only by people who are communists. People who are not communists are not going to even think in these kinds of terms. So in my opinion, p(doom) is most likely to be used by people who are intrinsically doomers about AI. I don't think there's anything wrong with that, okay? And it's just very hard for me to come up with even a reason I would have a percentage in my head constantly. Like, is it 8%? Is it 13% today? Should I update to 14% based on this new development? It just seems silly to me. I can certainly tell you that there is a bunch of clear risks stemming from AI, some of which are in fact paperclip risks, right? It's possible. I have absolutely no idea how to reduce that to a number. I know that some people try. I don't hold a very high opinion of people who get to 13.35%. So I think the point is that we need to manage those risks in the best way we can, and navigate them the best we can, and hope that it turns out well.
[1:39:32] Nathan Labenz: I think Zvi puts it well when he says you only get one significant digit on your p(doom) number. So I'm definitely with you in terms of the faux precision being a strange impulse for some people. At the same time, I also think of Liron, I'm sure you've seen his work with Doom Debates. And Liron has kind of pushed me at times to be like, okay, sure, it doesn't have to be a number or whatever, but we need more people to be more candid about how confident they are or are not. In some general sense, if your neighbor were to ask you, or if your kid's grandmother were to ask you, or whatever: is it going to be okay? Are the kids going to be okay?
[1:40:19] prinz: Good question. I tend to think that people have preconceived attitudes about these kinds of things that then cause them to back-propagate and rationalize them. I'm a generally fairly optimistic person. So if you ask me this kind of question, what I'm going to say is, probably it's going to be okay. Probably we're going to figure it out. You know, in a risk-adjusted way. Again, to be clear, I'm not saying that there are no risks in AI. There are many risks. But I do not think that we have strong evidence that it is impossible to navigate these risks or that it is extremely unlikely that we will navigate these risks. So that's where I am.
[1:41:09] Nathan Labenz: That's the week. We're live most weekday mornings. The full conversations run long past what fits here. And the best way to know if they're for you is to come watch one. Same sincere ask as last week. If this cut earned your time or wasted it, tell us which moments did which. We read everything and we tune the show on it. I'll be making sense of this in real time from here till the singularity. See you Monday morning.
Outro
[1:43:35] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of A16Z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.