AGI Lab Transparency Requirements & Whistleblower Protections, with Dean W. Ball & Daniel Kokotajlo

AGI Lab Transparency Requirements & Whistleblower Protections, with Dean W. Ball & Daniel Kokotajlo

In this episode of The Cognitive Revolution, Nathan explores AI forecasting and AGI Lab oversight with Dean W.


Watch Episode Here


Read Episode Description

In this episode of The Cognitive Revolution, Nathan explores AI forecasting and AGI Lab oversight with Dean W. Ball and Daniel Kokotajlo. They discuss four proposed requirements for frontier AI developers, focusing on transparency and whistleblower protections. Daniel shares insights from his experience at OpenAI, while Dean offers his perspective as a frequent guest. Join us for a compelling conversation on concrete AI governance proposals and the importance of collaboration across political lines in shaping the future of AI development.

Check out:
Time Article - 4 Ways to Advance Transparency in Frontier AI Development: https://time.com/collection/ti...
Alignment Forum Article - What 2026 looks like: https://www.alignmentforum.org...

Be notified early when Turpentine's drops new publication: https://www.turpentine.co/excl...

SPONSORS:
Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive

Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitivere...

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers13. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive

CHAPTERS:
(00:00:00) Teaser
(00:00:53) About the Show
(00:01:16) About the Episode
(00:04:47) Introducing Daniel Kokotajlo
(00:09:29) Daniel's 2026 Prediction
(00:16:11) Sponsors: Shopify | Notion
(00:19:07) AI Propaganda & Censorship
(00:26:58) Internet Balkanisation
(00:35:38) Sponsors: Oracle Cloud Infrastructure (OCI)
(00:38:24) AGI Timelines & Futures
(00:48:15) Automated R&D
(00:54:48) Superintelligence & AGI
(00:58:25) AI Transparency Proposals
(01:06:11) Four Pillars of Transparency
(01:19:02) Red Teaming Transparency
(01:41:07) Whistleblower Protections
(01:46:32) Internal Information Sharing
(01:54:55) External Oversight & Governance
(01:58:56) Future Outlooks
(02:00:44) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript

Transcript

Daniel Kokotajlo: (0:00) I worked at OpenAI on the policy research team, like strategic thinking about policies we should adopt to get ready for and handle AGI well and make sure that it's beneficial for all the world and safe. I left in April 12 of, this year because I gradually lost hope that the company would be the way that it needs to be in order to handle this all responsibly.

Dean W. Ball: (0:22) There is like a Shakespearean relationship between the intention of public policy and then what actually happens. And there is an extent to which, like, any rules you create are very likely to make the thing that you're trying to fix worse in some important way.

Daniel Kokotajlo: (0:36) This is something that I really want people to think about more is once you have this level of capability, think about the effects that's gonna have politically. Who controls that? What do they do with all that power? We're not necessarily in, like, standard capitalism where the companies put it up on an API and compete with each other sort of mode.

Nathan Labenz: (0:53) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost, Eric Torenberg. Hello and welcome back to the Cognitive Revolution. Today, I'm excited to present a conversation about AI forecasting and the oversight of AGI Labs with Dean W. Ball and Daniel Kokotajlo. This is Dean's fourth appearance on the podcast. He's probably best known to our listeners as a critic of the since vetoed s p 10 47, but I also really recommend the episode we did together on brain computer interfaces and neurotechnology several months back. Daniel, meanwhile, joins us just a couple of months removed from his headline making departure from OpenAI, where he had worked on policy research and strategic planning around AGI safety. In what I consider to be a truly admirable move, Daniel declined to sign an OpenAI exit agreement at the personal cost of millions of dollars of vested equity in order to preserve his right to speak freely about his concerns that OpenAI will not behave responsibly around the time of AGI. This principled stand, as it became publicly known, ended up catalyzing policy changes at OpenAI, such that departing employees are no longer asked to sign nondisperagement clauses to retain their vested equity, and Daniel's individual equity has also since been restored. While we do discuss that story and also look back on Daniel's prescient and often cited 2021 essay, what 2026 looks like for context, our main topic today is a set of 4 proposed requirements for frontier AI developers, which Dean and Daniel have recently published in an op ed in Time magazine. I love this project for 2 big reasons. First, on the object level, my personal experience has led me to believe that greater transparency for frontier developers would be a good thing. And second, it's awesome to see people who start with quite different perspectives come together to hammer out concrete AI governance proposals that both can get behind. The first 3 proposals would place new transparency requirements on frontier AI developers. First, to disclose important new capabilities observed while training frontier AI systems. Second, to disclose the training goal, model spec, or other document that defines how the developers are trying to get their systems to behave. And third, to publish safety cases and risk analyses so that they can be subjected to public scrutiny. Finally, the fourth proposal would enact whistleblower protections along the lines of what s b 10 47 would have created, but for governor Newsom's veto, so that insiders have some way to raise alarm bells from within the labs without fear of legal reprisal. I find these recommendations very compelling, particularly given Daniel's experience at OpenAI, and I hope they are enacted. But as you'll hear, I still have my doubts as to whether companies would consistently follow these rules in good faith, and I ultimately still feel pretty strongly that some form of third party testing should be mandated as well. In any case, this episode demonstrates something that I believe is extremely important, particularly as we move forward into a second Trump administration. AI policy remains relatively nonpartisan, and people of different political persuasions can find meaningful agreement on concrete steps forward. This is exactly the sort of constructive collaboration we need to see more of as we work to ensure beneficial development and deployment of these powerful technologies. As always, if you're finding value in the show, we'd appreciate a shout out online, a review on Apple Podcasts or Spotify, or a comment on YouTube. And, of course, we always welcome your feedback via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. For now, I hope you enjoy this exploration of transparency proposals for frontier AI development and lots more with Dean W. Ball and Daniel Kokotajlo. Daniel Kokotajlo and Dean W. Ball, welcome to the Cognitive Revolution.

Daniel Kokotajlo: (4:52) Thanks. Excited to be here.

Nathan Labenz: (4:54) I've been really excited about this 1. Counting down to it, Dean, you know, is a returning guest, so listeners will know him from our SP 10 47 episodes. And, I would also say the real ones would probably, also remember the episode on neurotech, where we just tried to figure out at what point are we gonna merge with the machines. People that are following AI closely, I think, will also have a a sense of who Daniel is, because he certainly has made some, contributions to the news. And now you guys are here together with a collaboration where you have put out an editorial in Time Magazine calling for a new set of AI transparency measures. So I'm excited to get into all of that. I think it would be really useful maybe to start, Daniel, with just a little bit of your background for anyone who maybe hasn't heard your saga of your tenure at OpenAI, how that ended, and a little bit of what happened after it. But I do think it's, certainly very relevant kind of motivation for the proposal that you guys are putting forward now.

Daniel Kokotajlo: (5:56) Sure. Yeah. To be clear, I think I would be still very interested in this proposal regardless of what my specific experiences were. I think that it sort of stands on its own. I worked at OpenAI for about 2 years. I was there on the policy research team, mostly doing forecasting and strategic thinking about what sort of policies we should adopt to get ready for and handle AGI well and make sure that it's beneficial for all the world and safe and things like that. So I did a bunch of forecasting and scenario planning. I also did a bunch of strategic documents about the importance of doing evals for dangerous capabilities internally and how that could possibly be the basis of a regulatory regime. I also did some alignment research trying to get people to do experiments on faithful chain of thought in particular. And then I left in April 12. And the reason why I left was because I had gradually given up, gradually lost hope that the company would be the way that it needs to be in order to handle this all responsibly. And there's a lots of things I could say there. 1 high level thing would just be we haven't actually solved this technical alignment problem right now. We have reason to think that if some of our training runs were to actually succeed and get us AGI or something similarly powerful, that we wouldn't understand what we're doing well enough to actually control it. There are reasons to think that we would lose control and that we'd end up with a system that's just playing along and biding its time instead of actually aligned. And there's a lot of technical work that needs to be done to figure all that stuff out, iron it out, come up with better training methods, better eval so that we can tell whether or not our methods are succeeding, etcetera. And I think that we're just massively under investing in all of that stuff. There's a lot more I can say besides that, but that's 1 example. When I left, the exit paperwork seemed quite unjust to me. Maybe unjust isn't the right word. It seems like not the sort of exit paperwork that a nonprofit focused on benefiting all humanity would have. And in particular, it had a sort of secret non disparagement clause to it. It was backed by the threat of taking away your equity, your vested equity. So I decided not to sign that. And because I didn't sign it, I was free to talk about it because I didn't sign the thing that said you're not allowed to talk about this. To my surprise, it blew up in the media. I think that people were a lot more upset about this than I predicted they would be. And in response to that huge pressure, OpenAI changed their policies. So I got to keep my equity after all. And going forward I think that there isn't any sort of non disparagement stuff, at least in the standard paperwork as far as I can tell.

Nathan Labenz: (8:30) Yeah. I said this just before we started as well, but to say it on the air again, I think that was a really admirable move and definitely appreciate what you did there. I don't know the exact details of the finances, but I understood from just what Chad or I followed that it was millions of dollars worth of vested equity and that it was the majority of your personal financial position at the time. So to be willing to walk away from that on the grounds that it feels important to be able to speak out, I think, obviously, lends some real credibility to what you're going to say and turned out to be quite catalytic as well. So I think that's really admirable and appreciate that. On behalf of everyone who's trying to figure out what's going on, thank you.

Daniel Kokotajlo: (9:15) Yeah, thank you. And in that case, while we're in the business of thanking people, I should also thank my wife who made the decision with me and who took on as much of the risk as I did because she's my wife and so we're in this together. And also various friends who were like advising me throughout the process and stuff.

Nathan Labenz: (9:28) So you are the author of a legendary less wrong post.

Dean W. Ball: (9:34) I don't

Nathan Labenz: (9:34) know, is less wrong? It's on alignment forum link. Are always like

Daniel Kokotajlo: (9:37) It's on less wrong. It's on both.

Nathan Labenz: (9:38) Yeah. Synonymous almost in my mind. What 2026 looks like. I believe this came out in 2022 originally, maybe 2021. Yeah. Okay. 2021. So in that post, you basically lay out kind of year by year, pretty detailed, pretty specific. This is what I think the world is gonna look like. I went back and reread it in preparation for this. I'd say it's often cited for a reason. How would you say you feel about that in retrospect? Like, what would you say you've gotten right, wrong, what edits you would make for, say, 25, 26 as we look ahead?

Daniel Kokotajlo: (10:13) Exciting question. Let's dive in. And I'm also interested in hearing Dean's opinions on this actually. First First, I'll say a bit about the context of why I did that and why I still think it's a cool methodology. And then I'll say a direct answer to your question. So most AI forecasting and most forecasting in general is not really like this. Most forecasting is extrapolating trends or making predictions about events. Things like, will there be a war in this year? Or will AI capabilities achieve this milestone in this year? Or things like that. And I think that's correct. That is the bread and butter of forecasting. That's a very important thing for people to be doing. But I think it's also helpful to try to holistically depict how it all fits together. Like this benchmark score, that amount of compute. How does it all come together? What does it actually look like as a unified picture? And I think that you often learn things from trying to write out such scenarios. You notice inconsistencies in different things that you had thought, for example. Or when forcing yourself to write things in detail causes new questions to arise in your mind that you may not have considered before. And then that's like very fruitful to answer those questions and think about those. There's more I can say about why I wrote it. I wrote about the methodology and the motivation for it at the top of the post. I feel like it was quite successful. I feel like I learned a lot from writing it. I also think it was reasonably accurate. So to answer your question more directly, I think that I correctly predicted that chatbots would be a big deal before they were. And of course, I correctly predicted like the scale. Like back in 2021, could read the scaling laws papers and then you could look at GPT-three and you could be like, wow, these things are just gonna get better and better in in general over the next few years because the companies are gonna be investing more and more in it. And basically, was just taking that insight and then turning it into a scenario. And that part I think worked. I think my biggest miss in terms of accuracy was the stuff about AI powered censorship and propaganda. I think what I would say is that I fell for a classic fallacy of people trying to predict the future where you look at what you correctly get what would be technically possible and then you assume that someone's going to actually be doing it. Whereas in fact, it takes a while for people to actually start doing the thing that's technically possible. And I would say today that the level of propaganda and censorship and balkanization of the Internet that I described in 2025, 2024, 2026 in that story is totally technically feasible today. LLMs are in fact smart enough to do that sort of thing. But the powers that be haven't really deployed it with gusto. Not as much as I cynically predicted. At least that's my current guess. But I'm quite curious if Dean What do you think, Dean?

Dean W. Ball: (12:47) Yeah. I think it's interesting because reflecting that piece, which I think I read concurrently with this publication. You were writing in 2021, which was a very wildly different sort of media environment. We were coming out of COVID and the election and the social unrest of the 2020 and all that stuff. And I think it's is also the prediction error of over indexing on the trends that you see around you in a specific moment. And also though, the interesting part to me is that to a certain extent, what you're talking about is like a technology diffusion story, because it's not about the capability, it's about the actual like institutional changes that need to happen and the changes just to the way that senior people at companies and governments think about stuff. And that's just like part of my thesis on like why I think both some of the AI hype and also some of the AI risks are overblown is not so much because I doubt the capabilities. It's because I have a keen sense that the diffusion of these things just takes time. It just takes longer than you think, and even if you think it's gonna take a while, it's still probably a little bit longer than you think. You just have to constantly be correcting for that. I think you're both somewhat understating it. I encourage your listeners to go and read that post and especially the 20 24 1, the section on 20 24, realizing that it was written 3 years ago, and just understanding that like Daniel got it, like the details of what it has been like to live in 2024 following AI, like dead right in so many ways. So that both is a very impressive piece of intellectual labor. And to me, it also makes it so that when I talk to Daniel about forecasting AI doom scenarios and AI catastrophic risk scenarios, I cannot easily discount this because this is a man with some pretty serious predictive chops. So it's definitely been like a good corrective just for me personally and my own epistemics.

Daniel Kokotajlo: (14:48) Thank you. And as a bit of a tangent to that, a different reason, a motivation for doing this, which I didn't mention, is that it's just very helpful for communicating worldview. So I really want other people to do this sort of thing too. I want everybody to write their own little story of how they think the future is going to go by default. Obviously, it's hard to predict the future. There's lots of branching possibilities. Just pick your favorite possibility, the 1 that seems most plausible to you, write it out in detail, and then it gives you, like, a little thing that someone can read to, like, quickly come to understand, like, where you're at and, like, what you expect. 1 of my big regrets about this what 2026 looks like story is that I didn't have the guts to finish it. I had an unfinished draft of 2027 in which AGI is actually achieved and things start going off the rails. But it was like too confusing and I didn't feel like it was ready and there's too many unanswered questions and it didn't make enough sense. And so I just abandoned it and just published up to 2026 instead of going further. But I wish I had forced myself to spend a few extra months writing out that part of the story too. Yeah. Anyhow, that's basically what I'm doing now. I'm trying to make up for it. My my current project is to do a new, better, much bigger version of all this that goes all the way through AGI. And then my hope is that if nothing else, it will be a concise way to communicate my views to other people because they can just read this story, and then they can be like, oh, Daniel's worried about something like this happening.

Nathan Labenz: (16:12) Hey. We'll continue our interview in a moment after a word from our sponsors. First, it is striking to me how little AI propaganda we've seen. If you had asked me, I would also have said, and even more recently, like, beginning of the year, it seems like all these video generator technologies are getting there. There's a lot of opportunity. We now see these things like the latest runway thing basically allows you to just convert 1 character to another. I haven't red teamed it, so hopefully, have some guardrails on there where I can't just go convert myself into Kamala or Trump or whatever. But it seems like the core technology is there. We haven't seen a ton of it. That is a little strange to me. I don't if you have any explanations as to why.

Daniel Kokotajlo: (16:50) Yeah. First of all, let's distinguish between propaganda and censorship. I'll talk about censorship in a sec. Propaganda. Classic example of this would be like Russia having bot farms to help make fake Facebook accounts pretending to be Americans. And they have been doing this. And they were doing this in 2021, and I believe they're still doing this now. But it doesn't seem to have been a huge deal. It doesn't seem to have substantially changed the election cycle, for example. And I'm interested to be corrected on this. I'm not an expert in the election cycle. I don't follow politics as closely as I could. It's possible that this is having a substantial effect, I just don't know about it. But as far as I feel like I would have heard about it by now if it was having a substantial effect, which is basically my my reasoning for thinking it's not having a substantial effect. I assume that they are using LLMs now. Instead of paying a bunch of humans to write these things. Presumably you'd have humans working with LLMs. It'd be crazy not to. So I think the lesson from this is despite the fact that the marginal cost of propaganda has gotten lower, the overall effectiveness of it hasn't really gone up that much, it seems. I'm curious whether this is a result of them being lazy and incompetent at implementing this versus a sort of red green race where the tech companies are getting better at noticing it and stamping it out in parallel to the propaganda getting better? If anyone has any information or thoughts about that, I'd be curious to hear it.

Dean W. Ball: (18:08) I think that the fact that we haven't seen as much propaganda as you might think, or like manipulated media, makes me think that, and this has been my intuition for a while on this stuff, that the information environment is more robust than people understand. And I don't want to say people have a better immune system than we might think, because it's not exactly that. But it's like people can sniff out stuff that is fake for reasons that don't necessarily relate to the quality of the media. Like, the ability to produce an image that that looks very real, a, people can sniff out AI generated images pretty easily, I I think. The other thing is this is my world. Right? Like, my friends can do that, but my friends are in the AI world. Yeah. I don't know about people that are not plugged into this stuff. Probably they can't as easily. But the sources of information do actually continue to matter somewhat in the social media era. Maybe the New York Times doesn't have a monopoly on the truth anymore, but that doesn't mean that literally any person can go out and tweet a picture of Joe Biden doing something outrageous, and people will just believe it because it's on the Internet. It is, in fact, a little bit more complicated than that. And certainly, also, I think the platforms are combating this in in various ways. I will say I get a lot of bots. I get a lot of definitely MLM generated stuff on Twitter. The country I see doing it most frequently is The UAE. This might be just because of my specific niche because I tweet about AI policy a lot. Responses being like, yeah, California, they're gonna regulate AI into the ground, but I don't think they're gonna do that in The UAE. But it's just stuff like that.

Nathan Labenz: (19:46) Falcon for I

Dean W. Ball: (19:48) do see things like that, but again, I feel like it's not having a major impact. It might also be 1 of those things though where it's it's actually somewhat harder than it would seem for an LLM to convincingly mimic a human in a social media environment. Like the crazed environment of Twitter discourse in particular might be like a little too high entropy for LLMs. But that could also just be a matter of nines of reliability. That might well be the case the GPT-five or GPT-six can handle it with aplomb. It's just that right now, they're just still a little too dumb to do it convincingly. Those are my theories.

Daniel Kokotajlo: (20:24) Building on that a little bit, I think that fake images that attempt to create a scandal when there isn't 1 seem like they'll follow a natural cycle. If someone were to do 1, it would cause a brief scandal that would then get discovered and cause a backlash that would have the opposite effect. If you tried to fake a scandal with a politician, for 1 day the story would be, oh my gosh, look at this horrible thing this person did. And then for the next week the story would Wow. Look at this horrible fake that someone did. The politician's great actually. They totally didn't do the thing. And so I think that this is why Dean said the AI stuff isn't good enough that it would actually be disproven relatively quickly after a day or so. And I think this is probably a big reason why I'm not so worried about the deepfake images that much, with the exception maybe of doing them literally on election day. But that hasn't happened yet either.

Dean W. Ball: (21:12) There is the 1 interesting example, and I like this because it intersects with with the well intentioned public policy that ended up biting you in the ass. Somewhere in Eastern Europe, there was an election about a year ago where I think it was deepfake audio of 1 of the candidates, was dropped, like, over the weekend before the election, and the country had a law mandating a media blackout 48 hours prior to election. So nobody could report on it. No 1 could report on and the idea behind this was, like, no propaganda 48 hours before the election. And then someone drops propaganda on the Internet, and the media companies can't report on the propaganda because it's illegal to do so.

Daniel Kokotajlo: (21:52) Is it a successful example of a deepfake affecting an election?

Dean W. Ball: (21:55) It's obviously the causality is hard to tease out, but the candidate that got deepfaked lost. So that is interesting. 1 other scenario that I think is worth the under discussed here is actually the opposite scenario, and we see politicians doing this already. So I think it's something to keep an eye on where rather than being the victim of a deepfake, you are the victim of a legitimate leak of something incriminating or scandalous about you, and you say that it's a deepfake as a plausible judo.

Nathan Labenz: (22:24) Yeah. We've seen that, even before it.

Dean W. Ball: (22:27) We've seen president Trump's campaign intimated this. We've seen president Biden do it. We've seen other politicians do it. I think you're gonna start to see that in criminal defense settings at some point. Criminals will take longer to adapt to this because they are generally not as intelligent as other people, but I think you'll start to see that. I think people that are accused legitimately of crimes will say that things are a deepfake. I don't know how courts are gonna be able to adjudicate that. This can be a challenge. Anyway, sorry, I cut you off, Daniel.

Daniel Kokotajlo: (22:56) No, that's a great interjection. So the thing I was gonna say though is astroturfing. So you mentioned you did a bunch of astroturf LLM bots on Twitter. And my thing here is it's definitely happening. The question is like how much of an effect is it having on the discourse? And I guess I would dare say something like it doesn't seem like it's having a huge effect, but how would I know? Like we would need to do some sort of like study. Like ideally we'd do some sort of control group thing where it's The UAE turns off their chatbots in this state but they leave on their chatbots in this other state or something. And they do this for all 50 states. And then you see the difference in how like what the polls say about AI policy. And they're like, statistically significant. That's the sort of data that I would love to see to tell whether or not it's working. And perhaps the entities that are doing this sort of astroturfing have access to data like that. But if they do, they obviously aren't going to be publishing it for the benefit of science. So yeah, leave open the possibility that it is actually having a noticeable effect. And I just can't tell. But again, I'm going off this heuristic of things don't seem that different. Maybe I would have heard about it by now if someone would have done a big investigation and written more about this if it was really having a big effect. The other thing I was gonna talk about was censorship. So I would say something similar about that. And I think I also thought this back in 2021 that like the censorship half of the propaganda and censorship coin is the more important half. So there, the concern that I had back in 2021 was already there's lots of recommender systems that choose what content people see and in what order and how many people see it. And already there's like censorship systems in all the social media platforms that like automatically delete or hide undesired content for various definitions of undesired. There's a sort of like sliding scale where perhaps on the 1 end, like you're literally just censoring like racial slurs. But then you can keep adding things to the list of stuff that you censor until it's 1984 and you're censoring anything that like disagrees with the ideology that you are promoting. Right? And it's not clear to me like how far on that spectrum different social media companies are. LLMs allow them to like more cheaply go farther on the spectrum if they so desire. Right? Because of an LLM, you can have more complicated and nuanced concepts being part of your censorship and or recommendation algorithms. Right? You can have your LLM promote conservative content or or or liberal content or something. That's a concept that the LLM actually understands and will be able to do a decent job of. Whereas 10 years ago, that would not have been technically feasible. So again, I'm not saying this is happening already. I had predicted that it would be happening already back in 2021. Whereas now, I feel like maybe I would have heard about it if it was happening. But also I don't think that companies are super transparent about the details of the recommendation algorithms or censorship systems. So it's possible that it is happening to some extent and we just don't know about it. But yeah, those are my thoughts.

Nathan Labenz: (25:54) I've heard a few comments from Zuckerberg over time that sort of amount to a good guy with a big AI can get rid of the bad guys with the small AIs as they try to overwhelm the Facebook platform, for example. It is really hard to say how much of that is happening.

Daniel Kokotajlo: (26:10) I called this in 2021. I said that propaganda and censorship are 2 sides of the same coin because the response to the propaganda and the astroturfing is going to be to integrate LMs more deeply into a censorship complex to automatically detect and filter out the propaganda. And so it's gonna feed off of each other. I predicted that the censorship would win. Like you wouldn't end up in a world where Russian astroturfer bots are running amok in America. Instead, I predicted that you'd end up in a world where American bots are guarding everything that happens on the American Internet and making sure that the Russian stuff gets crushed.

Nathan Labenz: (26:42) Yeah. And that does seem like a pretty plausible description of where we are today. It seems consistent with all of the observations that they might be trying but just not succeeding super well.

Daniel Kokotajlo: (26:54) I predicted that in 2026 or so, the internet would be balkanized until the Chinese internet, maybe the Russian internet, and then the western left and the western right. And so far Yeah. I'm curious to what extent you think that is true versus false. I think in some sense that the internet was already balkanized in 2021. And also in some sense it's not as balkanized as it could be. So the question is, has it become more balkanized in the last 3 years? Are people more in bubbles? And I'm not sure. Dean, you think yes. Was going say Elon buying X feels like a counter example to me where it seems like people from all across the political spectrum are still on Twitter. Anyhow, yes. Tell us, Dean, what do you think?

Dean W. Ball: (27:34) Sorry, I think X, it's changed significantly in the last year too, but I do feel leading up to the election that it's become quite a conservative echo chamber. And when Elon first acquired X, I was like, this is great. It's a real free speech platform, all that. And I still feel that way. I still think it's awesome. But, yeah, it does increasingly feel as though the algorithm is leaning heavily on boosting conservative stuff. And that worries me in the long term just because I think it lowers the quality of the platform and the information environment and many other things. So I hope Dylan has a bone to pick in this election, but maybe once that's done, we can I don't know that probably not because there's gonna be a butlerian jihad through the institute there's gonna be a march through the institutions that happens if Trump wins that's gonna require a lot of social media boosting? But I would also just say as a matter of public policy, we've seen the Internet balkanized more in in important ways. You know, the European Internet is increasingly its own world. I don't know how recently either of you guys have been to Europe, but it's a pretty different place at this point, and it will become more so because the Europeans are going to start mandating censorship, and they're going to really start bringing the hammer down. It's important to them, and they have social unrest problems that they need to deal with that aren't as palpable here in The US, but they're very worried about it. And even within The United States signs that it could happen, there are state laws. Yeah, I'm obsessive about state government. And we've seen laws from Florida and Texas, for example, that really did try to create mandate new content moderation policies, let let me say in the patois of public policy, rather than saying mandate different kinds of censorship than the old kinds of censorship. Republican censorship, not Democrat censorship. We've seen that from those states. They've tried to do it. And we also see blue states, Maryland, New York, California, others that are doing these algorithmic design laws, which ultimately are going to be about government influencing the parameters that govern social media algorithm development. So we'll see that kind of stuff. It's an interesting question how far it's gonna get. I cover this issue pretty closely on my substack. Whenever there's new news, I follow this quite closely for this reason. The laws in Florida and Texas that I mentioned went to the supreme court earlier this year, and they lost big time. But there are some nuances that make me think, oh, there might be certain versions of these laws you can pass that the Supreme Court is actually okay with. Politically public policy mandating different kinds of political algorithm design is not going to fly at all. Like, that is not gonna happen in America because, you know, unless like Kamala Harris like packs the court with, you know, people, you know, like, or Trump does. But like right under this Supreme Court, there's no way in hell. We live in the golden age, frankly, of first amendment jurisprudence in America right now. But there could be exceptions. There could be different ways of attacking the problem that the court decides don't invoke first amendment problems, but are just as insidious in the long term. So we will see. I do think geographically, you kinda have Europe, China, The United States, Internet, and they're going to, I think, increasingly fracture.

Nathan Labenz: (30:52) It'll be interesting to see what exactly we see as we hit the election and get past it. It occurs to me that potentially the real window when this could get the most intense might be between the election and the certification. It's 1 thing to try to convince people that they should vote a different way. It's maybe another thing. And maybe given what we've seen in the last cycle, potentially lower hanging fruit for those that wanna harm the social fabric of The United States to come in after the election, and I wanna don't give anybody any ideas. But just from news stories I've seen in the last few days about fires in ballot boxes or or whatever, I mean, that kind of stuff is going to be a lot easier to create, a lot harder, I would think, to disprove, and definitely has a sort of receptive audience that doesn't need to have their mind changed but can potentially be played directly into what they already wanna believe. And, yeah, that's often like the most powerful form. Right? It's it's sometimes less about changing anybody's mind and more about riling them up about something that they're like already predisposed to be riled up about. So we may not be out of the woods, just yet.

Dean W. Ball: (31:58) Oh, yeah. No. I mean, nothing makes me feel like an old person more than the fact that a significant number of your listeners were likely not alive for Bush v Gore, the supreme court, case, the case that went to the supreme court over the 2000 election.

Nathan Labenz: (32:13) Think I'm that hip with the young people? I'm not so sure. I suspect most of our audience remembers that. But let me hey. If you're listening out there and you are too young to remember Bush v. Gore, let me know.

Dean W. Ball: (32:23) If you're 25, you were 1 when Bush v. Gore happened. But, yeah, look it up for those that don't know it. But I genuinely don't know that our country can handle that type of scenario. Again, right now, I think the country might just fall apart if in the event of another Bush Vigore style scenario. And a lot of people are to blame for that on both sides of the aisle, unfortunately, but, yeah, it is a risky scenario for sure.

Nathan Labenz: (32:47) Hey. We'll continue our interview in a moment after a word from our sponsors. Let's Yeah. Zoom back out to the kind of outlook for the next couple years. Daniel, give us a little bit of maybe a teaser of the 2027 worldview just so we know where you're coming from. You can maybe give us a little bit more about kind of the AI futures project as well. Is that a you thing? Are you like building a team? And then with that in mind, let's then go into the transparency proposals and talk about like how they will hopefully address some of these concerns that we have. And maybe from there, we can even go to looking at probably OpenAI most, but frontier labs today, like, are they already doing some of these things or are they falling short? What would they have to change? What would we see that would be different if these proposals were to become mandated? But, yeah, still, let's start with kind of, you know, my crystal ball gets real foggy even just a few months out. I think the 2027 thing, there's definitely been a coalescing of people saying, like, that might be around the time when we start to get AGI and things start to get real weird. So give us a little bit of a sketch as to who you're working with to figure that out and what the current outlook is.

Daniel Kokotajlo: (34:00) Yeah. So it's me, Eli Liffland, Thomas Larsen, and probably Jonas Wilmer. They're people I know, friends of mine, recruited to join me on this with expertise in AI alignment and AI forecasting. There's nothing special about the year 2027. It's just like the middle of my distribution, so to speak. But here's maybe a qualitative depiction of what I'm talking about. So Anthropic just released their first sort of like crappy computer using agent a few weeks ago and in principle can do anything on your computer that you can do, moving around, clicking, typing, etcetera, but in practice it's not very good. It gets stuck or makes mistakes and doesn't notice that it made the mistake and then that sort of ruins its entire thing. Imagine something like that except actually really good. That would be AGI or if you don't wanna call it AGI, whatever. The point is imagine something like that except that it actually really works across the board at all sorts of things that you can do on a computer. So how far away are we from that? That's a question of like, how can we get from Anthropic's current crappy version to 1 that's like that except that it actually really works? And all of the major companies are working on things like this and have been working on it for years. And there's a couple of different things they can do to get there. So 1, they can just continue scaling the models. More parameters, longer training runs. I think it's pretty much confirmed at this point that you can continue to get more across the board skills by just doing that. Unclear whether just doing that would be enough though to go from current Claude 3.5 Sonnet compute using agent to hypothetical actually good version. Another thing you can do is you can do more targeted, what I would call agency training. So instead of just doing the usual sort of text prediction and then like chatbot fine tuning, do some sort of reinforcement learning, for example, to use the computer. And they already did a little bit of something like this with Claude. And presumably this is what OpenAI was doing to create O1 as well. Reinforcement learning on a sort of like chain of thought environment, doing lots of chain of thought to get to answers. So the companies are all experimenting, tinkering around with different types of reinforcement learning algorithms and environments and so forth. And they're also scaling up what they're doing. So having bigger, richer, environments and pouring more compute into longer training runs in those environments. I personally would think that some combination of bigger models with more compute and models that have been more specifically trained to operate in these types of environments will succeed. The And question is just how long will it take to succeed? I think you could totally succeed in the next 12 months. But probably it will take a few more years than that because everything always takes longer than you expect. That's sort of 1 way of summarizing my view. I would imagine that if you could go inside these companies and look at their optimistic timelines, that it would be something like within 12 months, we will have something like Claude 3.5 Sonic computer using agent, except it actually works really well and we can just delegate tasks to it and have it running autonomously in the background, doing all sorts of useful things for us. I would bet that there's like having a roadmap to try to get that this year. But realistically things take longer than you expect and there's going to be unforeseen difficulties and so forth. So where does the 2027 number come from? It's a combination of a bunch of different and a bunch of different trends and guesses and so forth. 1 thing I would say is that if you just do the obvious and very good thing of taking various benchmarks and extrapolating performance on those benchmarks. Some of them get to superhuman performance this year, some of them get to superhuman performance next year. But like around 2027, in my subjective guess based on all the benchmark extrapolations I've seen, is when it feels like I can say all the current benchmarks will be saturated.

Nathan Labenz: (37:47) Yeah. Another way to put that is we're sampling out of this exponential every 2 years and that would basically be 2 full generations from a GPT-four, which feels like it there's not that much more headroom. Right? If all of this is roughly accurate, like, 2 more Yeah. So there's a

Daniel Kokotajlo: (38:05) couple of Actually, wanna a big difference.

Dean W. Ball: (38:07) I would just say the last time I was on your podcast, I think, was August, Nathan. And I've had a pretty substantial update in my thinking about all this, about timelines since then. And it's not so much that my timelines have changed, it's that I've become more confident in them, and I now feel as though my confidence is enough. My previous intuition was like, yeah, probably short timelines, but my hunch here is like, I don't trust my hunch enough to think that should be guiding public policy decisions, and I now feel differently. My hunch is now sufficiently strong that I think it's okay, we should probably be making public policy decisions based on this. And yeah, I mean, to me between computer use from Anthropic and OpenAI's 1, I think you can just see the future. I think you can just start to see it take shape. An interesting empirical observation I have about computer use is that the first thing I tried to use it to do was use my laptop with, I have the latest version of macOS, and that comes with a feature called iPhone screen mirroring, where you can have your iPhone UI as a window on the Mac. And Anthropic said in their blog post that they trained this thing on desktop UIs. And so my question was like, I wonder how it handles an iPhone user interface. I wonder if it will be able to figure that out at all. And it did. It totally did. So that makes me think, I there's some I mean, you know, it wasn't great. It still is flawed, but just yeah. The fact that it even figured it is is an interesting observation. And so it's like to me, the things that we see in front of us right now will get better. I don't think you need, like, actually that much more. I think a lot making that stuff better is, like, fundamentally probably more of an engineering problem and a geopolitical problem than it is a science problem at this point. Yeah. And what will happen is that they'll get better, and they'll also get cheaper and faster, and we'll parallelize them. The sort of company of agents, the bureaucracy thing will start to become real as we get into the late 2020s. And in a certain sense, I just don't have any interest in whether or not that's AGI exactly, because I think you could have a philosophical conversation about how that's not at all AGI, depending on what your definition of AGI is. But it is going to be super intelligent in ways that are palpably different from anything we have encountered before. And so that's basically where I am right now. And I certainly might update, you know, negatively on that if I see something that makes me think there's gonna be more roadblocks. But right now, as I see it, this is primarily an engineering problem. The only thing left is, okay, there's the long horizon tasks, you know. Like us humans, will they get stuck in the mud and spin their wheels on stupid stuff and not really know what to do, like how to act in the world just like us? I think the way that will get resolved in at least a meaningful part is like, you have the reason OpenAI and these other companies partnering with organizations like, you know, the National Labs and Moderna and sort of science companies, is that I'm sure they are having their scientists using these things as apprentices, as grad students, and they're modeling the long horizon tasks, right? People are using them for long horizon tasks and they're learning to model what that actually is. And like a grad student, they will learn. I just think all the pieces are there. The data acquisition for long horizon might be extremely nontrivial. There might be long horizon things that are hard, but if the thing can think in 2 minutes for more than I've thought in my entire lifetime about a problem, that is still a profound thing. So that's my basic estimate for the next couple of years.

Daniel Kokotajlo: (41:45) Would add on to that, which is that already these big frontier models are accelerating research within the companies. People within the companies use them a lot for helping with coding. But you ain't seen nothing yet. Once they're able to do 99% of the tasks in the AI R and D loop, once they're able to do a 100% of the tasks in the AI R and D loop, then the overall pace of AI progress is going to accelerate dramatically, I think. It's an open question how much it will accelerate, but I think it will probably accelerate quite a lot. And it's already a pretty fast pace of AI progress. But imagine something 10 times faster, for example. Imagine having a single year where we make a qualitative and quantitative leap in AI abilities comparable to the last 10 years. And I think that's a conservative estimate, actually. I think it could be more like 10 times bigger than that still or something like that. So that's 1 of things I'm thinking about. And that's 1 of the things like the AI futures project is thinking about is like how can we try to make any sort of grounded estimate of how fast things will be going once you have fully automated AI R and D?

Nathan Labenz: (42:43) To dig in a little bit on what that automated R and D looks like, I have 2 different stories in my head that I think about. I wonder which 1 resonates more with you. 1 would be like the things are now capable of eureka moments, I call them, where it's like making these sort of never before seen go move insightful things where it's, wow, we would never, you know, have thought that, and it turns out to be right. And then the other is a more sort of mundane, like, the architecture space is vast and we just need to explore all the little nooks and crannies of this space. And you don't necessarily have to be like, eureka moment. If you could just code super fast, can run a ton of experiments in a compressed timeframe. And that alone might be enough to just really speed up a flywheel. Which 1 of those or maybe a third 1 would you how would you describe what that acceleration actually looks like where the rockets

Daniel Kokotajlo: (43:39) are? Like the second thing. So, like, currently, AIs are doing, like, first drafts code or short snippets of code, but a human engineer is managing the whole thing and making sure that it's actually good before they launch the training run or the experiment or whatever. Gradually more and more of that will be automated until they just do all the code. And human engineers are basically more like managers than actual engineers. And they've got teams of AIs working for them. And the humans will just be saying here's the type of experiment I want to run today. Then they'll literally have a conversation with their AIs about how to set up the experiments and stuff. AIs will just go write all the code, come back to check-in with any clarifying questions, fix all the bugs, etcetera. And then human will say, okay, launch the experiment. And then they'll launch the experiment. The And AIs will be sending progress reports about here's the training curve so far. It's a little bit different from what we expected. Here are some hypotheses about why it might be different. But at that point perhaps the human is still making the overall judgments because maybe the AIs don't have good research taste. Maybe they don't have good ability to evaluate the experiments as well as the humans can even though they can write all the code for it. But then gradually that's its automated way too. Instead of smart enough that they're just better able to tell what you should learn from these experiments than your human scientists are. And so the human scientists are more just like in the backseat reading the white papers that the AIs are writing about how we did all this work today. Here's the experiments we ran. Here are the lessons we've learned. Here's our next steps. And the humans are like, yep, sounds pretty good. I could try to weigh in with my own opinions, but from past experience I think I would only mess things up if I did. That's what it looks like when the process is completely automated, but it doesn't stop there because that's the moment in time that I just described was if they're basically at human level but a little bit faster and cheaper. But qualitatively, there's not reason to think that humans are at the max. And so beyond that, as they continue to improve, then you might start getting things like the Eureka moment that you described where they're able to just sort of leap, have some sort of like physics style genius insight into like how machine learning works that causes them to just leap into a new paradigm of artificial intelligence design that like humans just like never would have stumbled across in all of their usual researching and searching. Maybe that'll be happening at some point too, but I imagine that would be, like, in the superhuman realm rather than in the human realm.

Dean W. Ball: (45:55) If if we stay in the deep learning paradigm or not even deep learning, but if we stay in our current LLM paradigm or something vaguely like it, and we don't go into Carl Friston style active inference or some other totally crazy branch. The 1 thing I think about I I I do think automated research is going to happen, and I I I can't I don't understand. There's people on the Internet who tell me that's science fiction, and it's, oh my god. It's super duper not science fiction. It's happening right now, and there are benchmarks. Anyway, it's a serious thing. But the I I think that compute bottlenecks are often not considered enough for people that are thinking about this. Because I think what the automated research, at least in the the near ish term, gets you is, like, all these little engineering enhancements that make it so that you that get you from GPT-four to GPT-four O mini, right? All of that stuff, the thousands of little tricks that are involved in doing that, you automate that. And so that instead of taking 18 months or a year takes a month, right, or 2. But ultimately there's going to be some point of diminishing return on that kind of work. There's not an infinite possibility space there, I don't think. I think somewhat counterintuitively, they'll make us use compute more efficiently, which will actually end up increasing the demand for compute quite substantially, because we're going to get to the point where you need to level up to another model level, like, you know, another order of magnitude to really make any further progress. Sam Altman goes around talking about 7 to $10,000,000,000,000 in compute infrastructure development. I think that sounds about right. It's not obvious to me that is happening. It's certainly not obvious to me that is happening in a way that America will be able to use it as it sees fit and wishes in the late 2020s. So I think we might still be like profoundly compute constrained. For all the investment that's going into things, I think we're going to have problems there. I think we're going to go through a couple years where it's more bottlenecked than it feels right now.

Daniel Kokotajlo: (47:56) I sort of agree and I sort of disagree. So I certainly agree that we'll be bottlenecked by compute, and that'll 1 of the limiting factors at the overall rate of progress. However, I think that progress will still be quite fast on an absolute scale given the current amount of some compute that have already been assembled. I think perhaps we disagree about exactly how I'm talking things like super intelligence is being created with the compute that already exists rather than, yes, they'll be trying to do the $7,000,000,000,000 compute investment or whatever, but I'm saying even if they don't do that, we could still get to superintelligence by the end of the decade.

Dean W. Ball: (48:31) Agree with that completely. I think it's more like after that. After you've gotten to that point, what happens next? And I think it's like, Oh shit, we're actually

Daniel Kokotajlo: (48:38) Yeah, quite so superintelligences will be extremely interested in accumulating more compute for similar reasons to why modern industrial society is very interested in accumulating more sources of energy. From the perspective of the super intelligences, they'll be like, there's so low compute in the world. We have so much to do to produce more compute so that there can be more of us and so that we can be smarter and so forth. But from the perspective of the humans, the world will already be a very different place, think.

Dean W. Ball: (49:02) Yeah. It's gonna be it's an auto catalytic effect. It's just like energy. It's the more energy you use, the more energy you wanna use. It's, oh, we have all this energy. Look at all these other things we could do with exponentially more energy. It's gonna feel like that, I think.

Daniel Kokotajlo: (49:16) Yeah.

Nathan Labenz: (49:16) So if I had to put you guys in a quadrant, Daniel, you send pretty clearly short timelines, fast takeoff. Dean, you're like, short timelines? I think you're also kinda coming around to fast takeoff.

Daniel Kokotajlo: (49:29) I mean, you said superintelligence by the end of decades.

Dean W. Ball: (49:31) Yeah. I just think these terms have so much baggage associated with them. Like, it might not be like a Bostrominian style, like exactly what we were all imagining in 2014 kind of super intelligence. But I think I will be able to pull up my laptop, the laptop that is in front of me right now in 4 years, and I will be able to instantiate like 10,000 very intelligent agents to go do stuff for me. And they can use computers and tools and write code at an expert level, and I believe I will be able to do that, and I believe it'll be cheap for me to do that. And I'm uninterested in the question of, is that AGI? Is that super intelligence? I think that's super intelligence. I think I'll be able to do unbelievable amounts of stuff with that. And, yeah, I I feel like super intelligence is gonna come before AGI in some important sense.

Daniel Kokotajlo: (50:14) So on that note, maybe we do need to talk about definition of AGI would be as good as the best humans at every cognitive task. I feel like it's a fine definition. And I think that by that definition, 2027 sounds like a reasonable guess to me. Could be sooner, could be later. And then I would say super intelligence, I guess I would just say much better AGI. So perhaps not just as good as the best humans at every task, but like substantially better insofar as it's possible to be substantially better because perhaps there are some tasks that like can't get like, maybe a tic tac toe, you can't get better than the best humans because they're at the limits of how good it is. But insofar as it's possible to be substantially better than a human on that task, then the superintelligence is substantially better.

Dean W. Ball: (50:53) I just think of it like people are not thinking about what are going to be the implications if, like, every plumber in America has access to, like, 100 Harvard trained micro economists to do a pricing model for him and actual coders and all that stuff.

Daniel Kokotajlo: (51:09) Right? This is, the 1 part where I disagree with, and now things take a turn for the spooky and the dark. Your specific claim was something like, In 4 years, on this very laptop here, I'll be able to spin up 10,000 agents to go do stuff for me. I'm not so sure because I think that politically the world might be changed in 4 years. I think that such agents might exist, but you, sir, won't have access to them because they'll be locked up inside this like government corporate military racing AGI complex or whatever. And they won't be accepting money in return for their services because they'll have plenty of money already. In fact, they'll be printing the money that they need. Instead, they'll want your physical labor to go to the construction site and do the welding or something so that they have the new robot factory in the special economic zone that's going to be producing more widgets or something. Unfortunately, I do not think that you necessarily will politically be able to spin up 100,000 cheap agents or whatever because I think that this is something that I really want people to think about more is once you have this level of capability, think about the effects that's gonna have politically. Who controls that? What do they do with all that power? And there's different ways it can go obviously. I'm not making a very strong claim about particularly it's going to go this way, but I guess I just want to raise the topic of we're not necessarily in standard capitalism where the companies put it up on an API and compete with each other sort of mode.

Dean W. Ball: (52:37) Yeah. It's like, you know, it might well be that the best use of my time when we're in that world is not so much being a think tank scholar, but instead doing janitorial work while wearing an Apple Vision Pro to create training data for a robotic foundation model. Yeah. Like, totally. No. No. No. And and so Yeah. We get into this is a good segue into the transparency stuff because I think this is where we share the concern. Because I said that as, yeah, that's my prediction if things go well. But there are many scenarios where this does not go well, and it is not obvious that it goes well by, quote, I hate when people say by default, but by default, because we're not a computer program. There's no defaults. But, yeah, like, I I worry quite a bit about that, and that's, in a certain sense, that's why I got into writing about this stuff, because I had the intuition, like, 2 or 3 years ago, oh my god, if if we have centralized control over this, this could end up being very, very dark, where artificial super intelligence exists and it's like the CIA, Jane Street, and Moderna have it and not you. And yeah, like you're cleaning bathrooms while wearing an Apple Vision Pro, totally. That's why like open source has been important to me for it's why I fought hard on open source and I'm very sensitive on that issue still. Because to me, it's like that's retaining the optionality to some extent. And I do understand that there might be scenarios where open source creates additional kinds of risks, but I think those risks are exceptionally worth it if you think about the scenarios Daniel is describing.

Daniel Kokotajlo: (54:04) Yeah. So on that note, by the way, I used to be quite against open source AGI and I've sort of warmed up to it recently due to these sort of transparency concerns, basically. And there's lots to talk about here. The concentration of power inherent, I would say, in the technology. Or at least if we don't want the world where there is I mean, think about how this works. Right? You spend billions of dollars on a giant training run. And then you have this thing which can be cheaply copied, which is this artificial mind that can do all sorts of cool stuff. That's like the inherent shape of the technology these days. And inherently, it's going to lead to a situation where there is a giant compute cluster, maybe the biggest in the world, and it does this giant training process, and then you get this thing which can be cheaply copied, but whoever owns that whole compute cluster is perhaps not going to want to let it be cheaply copied everywhere because then how did they get their money back? You know? Basically, there's this, like, single point of failure, so to speak. There's inherent in the technology, this moment where you have centralized control over a single node that controls this data center, that controls the stuff happening on it. The AIs are probably all copies of the same frontier model for example. If they're not copies, it's because the people in charge of the data center decided to make different variants for example. So I think it would be a very different world if there was an open source mandate. If the companies had to open source their models as they were being trained, then we would have a very sort of decentralized, lots of different compute clusters everywhere running all sorts of different models. And there would not be such concentration of power. But instead I think that the default trajectory is that we end up with some sort of public private partnership, some sort of mega corporation combined with the executive branch of the US government and the people in that room are sitting at the top of a hierarchical organization that reports to them And then that organization controls all this compute. And then on that compute runs this like civilization of super intelligences. It's like the more concentration of power than has ever happened in all of history. In history there have been totalitarian dictatorships where there's a society that's controlled by a military, that's controlled by the supreme leader. But even in those places, the military is more integrated into society and also somewhat more diverse. And the leader's control of the military is less than absolute. Whereas here we have the leaders, the army of AGIs, and then the rest of society. And if we solve the alignment problem, then the leader's control of the army of AGIs will be absolute in a way that's more intense than any dictator who's ever controlled his army. I'm curious what you think of that actually, Dean. Perhaps I'm telling too scary of a story here, but I just think that inherently in the technology is this insane concentration of power

Dean W. Ball: (56:47) And

Daniel Kokotajlo: (56:47) you have to land against that somehow.

Dean W. Ball: (56:50) If I disagree with some of the specifics, I directionally agree entirely that this is the big challenge and how you deal with the economics of all of it. I mean, 1 thing I'm not 100% sure about is like Anthropic talks about, can you distill these models to sub current models, to sub 1,000,000,000 parameter, just like kind of the raw intelligence ability, and then get the world knowledge in some other way, like some vector database or something like that, and are just like searching the web where the world knowledge from whence it came, and make that thing like extremely fast and do, you know, sort of like RL based tuning on that. And can you instantiate like 1000000 of those? How does that compare to the bigger thing, which is maybe like 1000 GPT-5s versus 10,000,000 little sub 1,000,000,000 parameter things? I don't know. Like, I think that's all going to be interesting, but I don't rule out scenarios like that where it actually, like small becomes the new cool thing. And actually like there's some degree of competition. I mean, this is kind of already happening. Like there's a little bit of competition of who can actually make the smallest model, the best smallest small model, that might get more that might become like more significant vector of competition in the field. I don't have a strong opinion on which direction it goes or and the answer is probably both as with most things, but I don't know which will be more salient. But that could solve some of the economic issues. But nonetheless, you're talking about enormous amounts of capabilities that I think it's extremely important that human beings have access to this, that all human beings have access to this and not just some select few.

Daniel Kokotajlo: (58:23) I agree. And then I would also go further and say, like, what kind of access exactly? So currently, there are, like, terms of service that you have to, like, sign up for to to be a user. And then also there are, like there's a model spec that OpenAI has. And there's the system prompt and so forth. And currently there's not much transparency about what these things are. Right? So with the Gemini incident, which you're all familiar with I assume, this was a case of the system prompt for Gemini basically telling Gemini to put in a bunch of racially diverse pictures even if that's not what the user wanted. This sort of blew up as a scandal, but it's also kind of standard practice in the industry. OpenAI was doing something like this earlier and lots of these AI products and companies basically have this sort of well, OpenAI explicitly calls it a chain of command. If you go read the spec document, OpenAI's model spec, which to their credit they put up online, it says like basically there's this chain of commands. Like there's gonna be a set of instructions and values and principles given by the organization, OpenAI, that comes first. And then there's the developer that's making their wrapper app And that comes second. And then there's what the user wants. And that comes third. Also, not only is there this chain of commands where you're supposed to, in case of conflict, go with the higher level, there's currently no law and not even an industry norm that you have to be transparent about what those levels are. And in fact, it's almost the industry norm for the AIs to lie to users about what their instructions are at the higher levels. Right? Gemini, you're not supposed to reveal the system prompt. You're not supposed to let the user know what your higher level instructions are. And right now, this is all just kind of funny and like silly and it leads to things like the Gemini racially diverse Nazis stuff, but it's terrifying if you fast forward to AGI. And you imagine that Dean does have AGIs that he can talk to on his laptop, but they have their own set of instructions and their agenda that they're following that they conceal from him. And the only people who know the true agenda of the AGI is that everybody and their grandma is talking to are the people in the room who made the calls about what that agenda should be. That's terrifying in terms of like concentration of power situation. So we have our transparency proposals.

Nathan Labenz: (1:00:42) Yes. Great segue. I wanna pick it apart and and beat it up in various ways, but why don't you guys just give the high level pitch and run through the 4 points, and then I'll try to to poke at different aspects of it. And then we can Take your idea. We can go into the applied realm as well.

Dean W. Ball: (1:00:58) Yeah. So I think that the in terms of the high level pitch, we probably just made it. We are we are concerned about all kinds of concentration of power scenarios. There's also kind of the important but somewhat more prosaic issue of just we want the public policy debate that is unfolding right now to be better informed than it currently is. And a lot of the debates, even among extremely informed people, really come down to questions about what's going on inside of the frontier labs. So some degree of frontier lab transparency is important. And I think for me, and I actually, I'm gonna say this is me personally, this might not be something Daniel would second, but for me too, I think that it's just going to be so hard to craft good public policy to guide us over these next These transitional years are just so hard. Like, I just don't think you can write rules for how this is supposed to go that are like going to be useful or good or, like, not too many. It's going to be very hard. You know, I'm I'm skeptical of the whole guardrails concept. But I think 1 thing that you really can do is you can shape the information environment, and the information environment shapes the incentives of all of the actors here, whether it's the policy making world, whether it's the frontier labs themselves, their relationships with 1 another, their relationships with their employee you know, the management and the labor relationship inside of these labs, all that. I think if the information ecosystem is healthy, I think that makes for better incentives overall. So there are 4 pillars to what we put together, and I don't necessarily know that you need to do all of them. We threw out 4 ideas. 1 of them, I think we think is essential no matter what, and I'll get to that 1 last. The first 1 is about setting up a mechanism for companies to inform the public and the US government about capabilities thresholds that are reached. Basically, if a lab sort of breaches a new a novel level, you know, responsible scaling plan as Anthropic calls there. So if it's we're at ASL AI safety level 2 now, and we jump up to ASL 3 and Anthropic sees that for the first time and they're really convinced of it, there would be a mechanism of informing the public that has happened. And maybe not in detail and certainly not focusing on the how it happened, but informing the public that has been reached and that the things that we associate with that kind of an AI capabilities regime may well be on their way. That would be 1. And an attendant part of that is that you would have to publish a responsible scaling plan, which most of the labs have done at this point, but not all of them. So that would be another important part of it. Number 2 is the publication of model specs. So Daniel was talking about that. This is a document. OpenAI is a good example of this that basically just outlines how do we want our models to behave, how do we want our models to conduct themselves. And when there are principles, I think very importantly from OpenAI's perspective, we have these principles, but we're going to order them hierarchically. And so we're going to tell the public when these principles collide or conflict with 1 another, this 1 is going to be more important than that 1. And I think being In the long term, I would hope that model specs can actually become both a kind of governance tool and like a user interface tool. And in some sense, actually think the user interface and the governance are like not as different problems as people think. But I would hope that 1 day people can actually customize a model spec within certain parameters to their own liking. But yeah, mandatory, at least publication of what the company has decided. And then we have some ideas generally just about the idea that researchers should be able to share just more publicly in candid ways about what they think is going on, not as representatives of the company necessarily, but in their own personal capacities, and without sharing any intellectual property or anything like that. I think the big tension here is I am concerned about socializing the intellectual property of these companies, particularly because it has national security salience and it has economic competitiveness salience. So it's not like I want everything that's going on inside of OpenAI to be public knowledge. Also because that messes up the incentive structure in my view. The last 1 is whistleblower protections, and this is essential if you're gonna have a governance regime like this that doesn't involve, you know, government audits and government investigations into you to be sure that you're complying with the law, the way you ensure compliance, the way you enforce these laws is by having whistleblower protections that grant employees the right to you know, if if a company publishes a safety plan and then they do not conform to that safety plan, then a whistleblower can go to the government and say that that's going on and substantiate that claim, the government can then take it from there. Oh, I skipped 1. I apologize. And that's safety cases. So this is an emerging idea in the AI safety literature. It's an old idea in other fields, but it's emerging for AI. The basic idea here I mean, the way I think of a safety case, and this is People have yelled at me on Twitter for this, but I think about it as basically like we have a safety plan, which is a kind of general governing document for how our company will approach the development of frontier AI systems. And then a safety case is about a specific frontier system where you're saying, this is why we think this specific model is consistent with the safety plan and other governance documents that we have released as a company. So it is a kind of transparency like this for me. I just substantively believe that the next couple of years, like probably should be largely self governed, largely a self governance regime, largely a somewhat more informal governance system rather than 1 of hard law. And I think that transparency sets the information parameters for how to make that go well. Daniel probably disagrees with me on the self governance stuff and stuff like that. But beyond that, I think that probably represents it relatively well.

Daniel Kokotajlo: (1:07:06) Thank you, Dean. Yeah. That was great. And yeah, as you just hinted, I, by contrast with Dean, want much more like I want regulations that take more decision making power away from the companies and put it in the hands of the government and the people about frontier AI because I think it's going to be incredibly important. I also think that there are things like SP 10 47, for example, which I supported. But I would say the most important thing is the transparency stuff. I think that's part of why we're here today is that Dean and I agree on that much at least. If there's 1 thing that we're going to mandate, I would say make it be the set of transparency proposals that we just mentioned because I think like Dean said, that's the information environment and it allows society to flexibly adapt to what happens because society is now informed about what's happening and can learn as it goes. So in my opinion that like sets the stage for more competent weighty regulations or interventions later if you've got the transparency so that people know what's going on and are educated about it. There's something else I was gonna say. Oh yeah, the safety cases stuff. So I think about it slightly differently. Everyone's entitled to their own favorite way of thinking about it. And Dean, I've also been yelled at on Twitter for Funnily enough, people have accused me of safety washing because I use the word safety case to describe something that's actually pretty lax and not super rigorous. Whereas they were like, No, no. If we're gonna call it a safety case, it has to actually be this more rigorous thing that gives you some level of assurance. So let's talk about that difference there. I'm not an expert on other industries but apparently in other industries there are safety cases that like, there's like more strict rules for what can count as a safety case and it has to be like a quantitative estimate of the risk perhaps or something like that based on a model of the system and what the possible dangers are and so forth. I wasn't advocating for something that rigorous. I was just advocating for you should have a document that explains why you think the various bad things that other people are worried about are not going to happen as a result of this system. And so for example with GPT-four and other similar systems, that document could be very simple. It could be like a single paragraph being like people have talked about AI is taking over the world, people have talked about massive concentration of power, people have talked about these things. But GPT-four is too dumb to do any of that. So QED. Like for current level systems, that's sufficient as far as I'm concerned. But once you start getting towards systems that are actually quite powerful, you start to have to ask yourself the tougher questions about are we sure that it's not just pretending to be aligned, for example? Because it is pretty smart. It totally could be just pretending. In fact, there have been model organisms papers that show that the models are capable of just pretending for this sort of thing. So you start having to ask tougher questions as the models get smarter and so your safety case, in my opinion, should grow from a paragraph to many pages of like here are the experiments we ran and here's why we think that these experiments are good evidence that the bad thing is not happening and so forth. What I would like to see is just a norm perhaps enforced by law that the companies should be writing these documents. And it will start off very light touch with like almost no requirements on what actually goes in the documents because naturally as the systems get smarter, people will start raising their standards for what should be in the documents and people will start demanding answers and being like, we're quite worried about this, says the public. What do you have to say about this concern? And so then the company's PR team will be like, well, we have to say something here. And then the PR team will go to the technical researchers and say, give us some answers. Spend a couple days writing up your explanation for why this concern is unwarranted. And then the technical team will go do that and send it to the PR team. And then the PR team will publish it. That's the sort of cycle that I'm imagining that's completely voluntary, doesn't involve a government bureaucracy telling you what isn't good enough, but nevertheless, we'll sort of get this information out there. Information such as, when I was at OpenAI I had tons of conversations at the cafeteria about these sorts of things with all sorts of other researchers, right? And I'd be like currently we do RLHF or variants thereof. And I'm concerned that might result in a system that's just pretending to be aligned instead of a system that's actually aligned. Or a system that's trying to score highly by the RLHF training metric rather than a system that's actually trying to be honest and helpful. And these things will look very similar but maybe come apart in important cases down the road. I'll be raising these sorts of concerns and then some people will be taking them very seriously and saying we'll get there when we get there. Other people will be taking it very seriously and saying don't worry, think it's not going to happen because we're going to do x, y, and z. Then And other people will just be like completely blowing it off and being like, that's silly. Why would you be worried about that? There's no evidence that's happening now or something like that. And like this range of opinions present within the company, I wish this was like a public conversation that academia could be engaged with instead of just like something that people in the cafeteria table talk about. And I think the public deserves to know for like if this was more of a public thing and people were like seeing what the arguments were being made, then there could be some sort of like independent external check on them. A case that I honestly expect will happen, some variant of this, there'll probably be some red flags. There'll probably be some collection of incidents that the monitoring system picks up of the AI system behaving not according to spec in some way. The AI system being dishonest. So this has already happened, right? In the OpenAI spec in the OpenAI system card for 1, they mentioned like 1% of the time it like hallucinates a link and seems to be aware that it's a hallucination but then goes ahead and gives it anyway or something. So there's gonna be various red flags and warning signs coming up. And then there's gonna be some sort of internal cafeteria discussion about what should we do about this. And it probably won't be the whole company discussion. It'll probably be just a few people on the relevant team. And maybe they'll conclude we'll just sort of like throw that sort of stuff in the training data and train it away until it's not making those red flags anymore. And maybe that works but maybe that just trains it to be better at hiding the behavior, right? Or maybe it's just a shallow patch on the problem that's just going to re arise later when the system is smarter. I really want there to be like a lot of independent experts with scientific expertise thinking about these questions and talking about it together. And I think that a requirement to have a safety case and publish it is a big step in that direction.

Nathan Labenz: (1:13:25) I have a bunch of questions about the sort of taking a red team approach to this proposal, which I do generally support. Yeah. I was once upon a time a sort of whistleblower myself and definitely would have, you know, benefited from even just some guidelines. Like, 1 of the pieces of feedback I gave to OpenAI was like, you gave me nothing to work with in terms of, like, how am I supposed to proceed if I do have concerns? There was really nothing at that time. I guess it would be helpful for me to have a better sense of what you mean when you say that you don't have confidence in OpenAI to behave responsibly around the time of AGI, I think was almost the verbatim quote. Like, my model on the outside is that the goal of AGI and like being the first to do it and there's a little bit of obviously the kind of thrill of discovery, thrill of access, thrill of secret knowledge, as well as the sort of narrative of like it's going to benefit all humanity. My general sense is that it has become an ideological project or kind of a shared mentality within the company that like this is what we're going to do and we're going to go do it and it's going to be amazing. You could tell me that's totally wrong and there's like a lot more active disagreement internally, but what is the sort of nature of your how would you characterize the company? And that's a big question, but why are you not confident in the company to behave responsibly?

Daniel Kokotajlo: (1:14:49) So I think the high level thing I would say is that incentives matter a lot. And 1 of the ways in which they matter is that they cause people to rationalize to get to the conclusions that are incentivized. And this is a very normal human thing. This is not like an attempt to dump on OpenAI in particular, but probably applies to across the industry. If you are a company that's in a sort of existential struggle to beat other companies to market, then you need that's the most important thing right now and you need to tell yourself that this is really good. That's what they are telling themselves. And I think that as a result, they are not going to have a like I just don't trust their judgment when it comes to things like, Is this a good idea at all? Or, Should we be trying to slow things down? For example. Or, Is the latest newest system design that's going to be deployed internally to do a bunch of R and D, is that safe or not? These are like technical questions that we don't know the answers to that we're gonna require lots of careful thoughts and judgment calls based on limited evidence. And it's going to require like making cost benefit analyses of weighing the risks to all humanity with the benefits to all humanity and stuff. And like the bottom line I would say is that I just think that the people at these companies are going to be biased in their reasonings about this and are going to be systematically taking on way too much risk and underestimating the risks. And so I would like it to be more of a public conversation instead of something that they just decide behind closed doors.

Dean W. Ball: (1:16:19) I think this is 1 other area where we have some important disagreements. I have never quite understood these sort of AI safety and policy communities. There is like an obsession about OpenAI, and I understand the obsession, but there's this idea that, oh, they're like, they don't care about safety anymore, and there's these narratives that get spun up, and they just don't reflect the reality that I see. I feel very much like, you know, I spent a long time in my childhood watching Apple very closely, under Steve Jobs from 'one to his death in 2011, and then afterwards. There were it was always, you know, Apple, it was always like, the doom is right around people don't remember this now because now Apple's a $3,000,000,000,000 company, but Apple used to be a tiny little company. It was always like, the doom is right around the corner for them. Oh, they did the iPod, but they're gonna get something's gonna come and kill them because they all they do is they're just artists, it's just Steve Jobs, and there's nothing there that's defensible. People will copy it and blah blah blah. And it's different in terms of the specifics, but there is this dynamic of, I know this is happening in my soul. I know this is happening that, like, Apple is about to be doomed or OpenAI is about to have a safety nightmare. And I just don't quite see it. I see a company that is, like, a young company that got put on a rocket ship in the last 2 years and is going through all the tumult that is associated with that. I don't think any company grows that much that quickly, gracefully. But I think that as far as the incredibly tight balancing act that all these firms, DeepMind, OpenAI, Meta, Anthropic have to walk. I think that, like, I'm pretty impressed with with the way that they've walked it. They've stumbled in certain ways and not everyone is entirely happy, but they would not be achieving the successful balancing act if any 1 group were entirely happy. I don't know. From my perspective, they're pretty responsible actors. But at the same time, I don't know. That's just my hunch. At the end of the day, I certainly think incentives matter a great deal, and that's where I come down on the transparency stuff of why I think it's valuable. Because do I think that my hunch about the internal culture of OpenAI should be dispositive for how we approach this? Like, no, I do not. And I guess the last thing I'd say is like, Tim, there is this kind of interesting question in the AI safety world of like, is alignment this like exquisite problem that you have to solve like 99.9 or 100% rationally, like a priori? Or is it like a muddle through type problem? Is it just, yeah, you know, we hammered that, we kind of like put some fresh paint over there, and it's not great. It's not perfect, but we like mostly got there, and it's fine. It works. But is it like that kind of thing, or is it a different kind of thing? And I don't know the answer to that question. My intuition has always been that it's more of a muddling through kind of a problem. But Daniel has made the most compelling evidence that if we treat it that way, that will be bad. So we should definitely be I don't think either of us know the answer to that question, but we should definitely be looking for the evidence that can help us answer that question of is alignment this like really, oh, we gotta get this like a 100% right from the get go, or is it like we can just model through? I think running experiments and doing science to answer that question is extremely valuable, and I think the company should be incented to do that. And I would agree with Daniel in the sense that I don't think that they're obviously incented to do that. I think they're incented, even if they have the best of intentions and even if you think Sam Altman is the greatest guy ever, the incentive that you can imagine them having is to treat it as a muddle through problem, is to have my intuition about it because that's how they handle everything else. That's like how AI development, where it's like rigorously empirical, and we're gonna engineer our way through this problem and this problem and this problem. We're gonna do 1000 and deep learning works, and then we'll figure everything out from there, right? Like that's basically the mentality in some important ways. And so you can see how they would obviously have an incentive to treat it as a muddle through problem, and I'm just not a 100% sure that it is, and I think that the companies need to be investigating that question. And they're the only people in the world who can really investigate this question with any kind of intellectual rigor.

Daniel Kokotajlo: (1:20:21) Let me add some things to that. So I'm not sure if I like the framing of muddle through versus, I forget what the other thing was, But maybe I'd want to tweak it a little bit. Here's some words I would say that maybe gets at a similar concept. I think there's this dimension of like, and SpaceX by the way, I love SpaceX. I'll use them as an example. They have their Starship development program and then they have their human flight, human crew program that launches people into space. And I don't actually have any direct knowledge of this. I'm just a fan. But I would imagine that they put a lot more work into making absolutely sure that the human crew launch doesn't have any problems before they actually put humans and launch them on it compared to their Starship flights where they just clear the blast zone and make sure that if it does blow up no one's gonna be hurt. And then they see what happens. And then iterate from it and learn from that. With respect to AGI, I would say that the starship type approach is totally the right approach up until AGI. And then you need something more like the human flight approach when you're around the level of AGI. And the reason why is because of the consequences of failure. If you have GPT-four level systems and you do some RLHF to try to make them helpful and honest or something and then it turns out that they're totally not honest and they will deceive users in some cases and give hallucinated links or whatever. It's cool. Now you learned your technique didn't work. They're still hallucinating links and they know that they're hallucinated but they're still giving them to users. Great. That was important science that you just did by running into reality like that and watching it fail. And that's totally, I think, the right way to go for these levels of capability. But once you are, for example, creating a system that's so smart and so broadly capable that your next step is going to be to put it in charge of your data centers and have it take over the research and produce even more intelligent systems. Then the president's gonna come knocking and say, let's make a cyber web let's use it to make a cyber attack against China or let's use it to cure cancer. Let's have it give strategic advice to our military to win the war. It's not only going to be getting really smart autonomously but it's also going to be plugged into all sorts of things. Then there are human lives on the line and that's like an understatement. This thing had better work the way that it's intended to because if it ends up misaligned, it could end up plotting against you to make sure that you don't notice it's misaligned until it's too late, for example. Right now we're in the regime where any mistakes that we make and any errors in what happens or any unintended consequences will probably just be noticed before it really gets too bad. Whereas at some point we'll be in a regime where a single mistake or a single unintended consequence could snowball out of control quickly and have literally existential implications for the entire human race. And so since the regime is going to change, I think that companies such as OpenAI need to be starting the process of changing their methodologies to be ready for that new regime And I don't see that happening. And to be clear, while I am criticizing OpenAI and they're the ones I have the most direct experience with, I would say similar things about other companies. I think this is sort of like a whole industry problem. Another thing I would say is that there's this dimension of what style of research are you doing. Are you doing the iterative fail fast style or are you doing the more principled think it through, consider all the different possibilities and have a reason why they're not going to happen style? There's a separate dimension or a separate axis which is just like how much effort do you put into this stuff? And there, think the answer is we need more effort as the stakes get higher. And I think that we currently don't have anywhere near enough effort. For example, super alignment, for example, was good. I'm very glad that OpenAI had super alignment team, and the fact that OpenAI chose to make super alignment was a positive update for me about the, like, trustworthiness and responsiveness of OpenAI because they were like, we're gonna have 15 to 30 people whose job it is to, like, start thinking ahead a couple years to when we actually get AGI and start thinking in advance about the ways in which our current alignment techniques might not work for AGI and how we can design better techniques and then test those techniques and get them ready. And we are gonna assign 20% of our compute and 30 researchers to this problem. I'm like, those are good numbers. That's way better than what a lot of other companies are doing. I would like those numbers to go up even higher as we actually get close to to AGI, but it's a great start. I would say there's these 2 dimensions of style, fail fast versus principled, and then there's actual quantity of how much effort are you putting into Yeah, those are my opinions.

Nathan Labenz: (1:24:55) So do you think OpenAI has a plan that they collectively believe is gonna work at this point?

Daniel Kokotajlo: (1:25:02) This is like the sort of like crazy open secret. There is no plan for how I think that as far as I know, the closest that OpenAI has to a plan for technically how to align AGI is in appendix g of the weak to strong generalization paper, which super alignment put out around the end of the year, 2023 I think. It's like a 2 page appendix where they're like, Hypothetically, here's a little toy design where you have some sort of RL process that's training your pre trained model into being a powerful general agent. But before you get that going, you take your pre trained model and you turn it into a reward model using weak to strong generalization techniques which we are currently developing. That reward model will then be a reliable source of truth for things like is this an honest output and is this a helpful output and things like that. And then you can add that reward model in to the training process so that when you get your powerful general agent, you also get a powerful general agent that never tells a lie. They sketched this out in more detail than I just said over the course of 2 or 3 pages in an appendix. And basically everybody I talked to thinks, yeah, probably that won't work. Here are some reasons why it might not work and so forth. More research is needed to iron out those possible bugs and test whether or not this is really working or not. Plus the weak to strong generalization techniques themselves need improvement. There's a whole research agenda there and so forth. But at least there is a plan and it's written down it's something that can be iterated on and critiqued and so forth. But this is not like OpenAI's official plan or anything. This is just like something that some people on super alignment wrote up. And I would love to be in a world where the companies have not just 1 official plan but like multiple official plans for like how we would try to align our systems on a technical level if they were at AGI level today. And those plans were published so that people in academia could critique them and say I don't like this assumption here for these reasons or something. I think that would just accelerate the overall research process substantially and would result in, well, in lower probability of loss of control when we do get to AGI level. Like we would just, as a civilization, as like an industry, as I could scientific community would just be more on the ball if we had been doing this.

Nathan Labenz: (1:27:24) I remember reading that paper. I really wanted to love it and I didn't find it super compelling. I remember the 1 thing that stood out to me the most was that if I recall correctly, there was like a free parameter, free variable of how willing the strong student was going to be to override the weak teacher and the best results seemed to have been achieved in that paper by turning that variable up so that the strong student was more willing to override the weak teacher. And then I'm like, but wait, we're the weak teacher in this system. Right? Like even on the initial results, like if we extrapolated from there, I didn't feel like we were necessarily headed anywhere good, even though it was like interesting that you could do some weak to strong generalization. It did not feel to me at that moment like it was really even on the track towards something that I could tell myself was going to work.

Daniel Kokotajlo: (1:28:21) I think I feel like I'm probably more optimistic about it on a technical level than you are. Although the bottom line is, no. I don't actually expect it to work. Lots of people at OpenAI who were on the Super alignment team or who are off the Super alignment team also think, yeah, this probably won't work. But it's a start and it's important to have this sort of stuff published so that we can talk about it and critique it and stuff like that. Because otherwise, the alternative is that people sort of muddle through and behind closed doors, a small group of people pull something like this together at the last moment pulling all nighters because they need to meet the deadline or whatever. And then it doesn't get peer reviewed, doesn't get external input and it just gets the go ahead from leadership because leadership is moving fast and breaking things. And then the fate of the future depends on that not have any flaws.

Nathan Labenz: (1:29:05) So if I'm let me just give you a you can critique this or, you know, dismiss it or whatever, but for the purpose of red teaming the transparency proposal, I do see this, like, growing ideological sort of underpinning to a lot of what OpenAI is doing. Rune on Twitter just in the last couple days said I used the phrase total OpenAI ideological victory when referring to the fact that Anthropic is now, like, doing more sort of, you know, early deployment. And I publicly praise Sam Altman in the past for saying in response to questions about China, and this wasn't that long ago, like 18 months ago, that we shouldn't base our thinking on what we think China might do because we don't know what they're gonna do, and let's just try to focus ourselves on what the right thing to do is and not so much worry about them. That was 18 months ago. Few months ago now, he's got editorials out there saying, it's our AI or it's their AI, and there is no third way. And maybe that reflects a genuine update in worldview. It's been striking to see Miles Brundage leave and say exactly the opposite. In order to be he thinks we should definitely not adopt that mindset. But if I was gonna be cynical, I don't wanna be too cynical. I'm not as cynical as maybe the the following would sound, but it sounds to me like it's at least consistent with a picture of leadership that's we're in the pole position to make this crazy stuff. I personally experienced a little bit just with my GPT-four access on the red team 2 years ago of just how thrilling it is to have the earliest access to this, like, frontier technology that is gonna be world changing and we're just now figuring it out. And it was definitely caused me some personal psychological turmoil to think, geez, if I go, like, talk to the board about this, they might kick me out of this thing or worse. And they haven't done worse. Know? I have a customer relationship with them and all that stuff that's been intact, so I appreciate that. But, anyway, maybe this leadership is, like, just wants to get there, just wants to be the hero, maybe is willing to roll the dice, maybe is not being, like, super genuine about this, like, change in whether China, like, shouldn't be the bogeyman or maybe should be the bogeyman. And if that is true, then I do wonder how they're gonna respond to some of these transparency requirements. If that were to become a law, let's say, for 1 thing, if you have a whistleblower protection, like, 1 thing you might do to protect yourself from that if you are ideological and you are kind of willing to take more risk than the public would be willing to bear is, like, purge potential whistleblowers, which it seems like we've also seen a version of that sort of happening either voluntarily or involuntarily. What's your reaction to that characterization? I'm not, like, fully all in on that, but am I too far off to worry about that general characterization?

Dean W. Ball: (1:31:50) Yeah. Like, I I don't think entirely, but I would I guess I'd push back on a couple of things. I would push back, first of all, on the idea that Sam Altman's sort of apparent pivot on the China issues I think the reality like, there's a fundamental tension, right? If you think that AI should be regulated, what that means in an inherent sense, what that means is that you think that OpenAI as an institution and the people in the government that would be doing the regular, people in congress, the people in national security apparatus, the people in the department of commerce who are increasingly part of the national security apparatus, you want those people to have more relationships with 1 another. Right? You want them to go to the same cocktail parties and you want them to be in signal groups together, right? Like that's an inherent That would be an epiphenomenon of there being more regulation. This is sometimes called regulatory capture, but it's also just a social phenomenon and it's an inherent and necessary 1. And if you want that to be the case, like, there is no world in which OpenAI and the United States government have deeper relations, where OpenAI is regulated by the United States government, and Sam Altman's, yeah, I think we can be friends with the Chinese. That's just never gonna happen because that's not the political dynamic. So that has always been part of my concern, why I have always said that regulation has a serious trade off and that you are biasing, You are going to be biasing these companies to be part of American power projection, and they always will be part of American power projection. But if they see themselves explicitly in that way and our government sees themselves explicitly in that way, and to the extent they do, you create all kinds of weird problems that can exist. This is what I've always tried to say about the dangers of getting into bed with government. So, yeah, I basically think there's no world in which there's responsible AI regulation and also OpenAI is not essentially toeing the state department's line with regard to China. Whatever you think about that line, right? I'm inclined to agree that AI is a very serious national security asset and potential risk, and that we do need to be thinking about how to make sure that America wins. Right? Like, I basically do believe that. But also, I'm from Washington DC, I work at a think tank. I think the same dynamics apply to me. So, you know, you should always keep that in mind too. I am part of American power projection is what I'm saying. Right? Like, I'm part of the system too. You guys aren't as much.

Nathan Labenz: (1:34:16) Live a very monastic life. Yeah. So it's pretty quiet around here.

Dean W. Ball: (1:34:20) There was 1 other point that you made, though, that was just about the the specific thing of if you have whistleblower protections, do you incentivize the company to root out the potential whistleblowers? And that is a thornier 1. I am essentially making a statistical bet on that specific incentive problem with the basic realization that there are what, 2 50 people? Really? 2 50 people who like really matter? I mean, these companies have more employees than that, but it's a heavy tailed universe the top 1% are really important. And I'm basically making a bet that a lot of those 2 50 people are probably pretty high integrity and that there's a lot of those 2 50 people who the companies both feel as though they can't do without and who would blow the whistle if they needed to be. But it's a bet. Like everything in the world, it's a bet. It's a wager.

Daniel Kokotajlo: (1:35:07) Yeah, well said, Dean. But that's also what I would say is I think that the whistleblower protections are definitely not a total solution because I think it's entirely possible that all the people who would whistleblow get pushed out or fired or otherwise not included in the important decisions. But I also think it's possible that the wishblower protections do actually work as intended and that enough people have enough integrity. So I think it's still just a pretty solid thing to advocate for.

Nathan Labenz: (1:35:32) Yeah. I favor it, I think, over not having it for sure. I can talk myself in circles certainly on this stuff. What sort of protections do you have in mind? I think you're not like guaranteeing people's continued employment

Daniel Kokotajlo: (1:35:48) at So a company, there's different things on the table. And I'm curious what Dean thinks the best versions would be. I particularly like the anonymous hotline to the government idea. Again, I should probably do this. I should probably write a blog post that's here are 10 different things, incidents that could happen at 1 of these AGI companies that there should be someone outside the company looking at. Such as our latest alignment technique seems to have not worked. We were trying to make it honest and helpful and harmless or whatever, but it's clearly being dishonest in some set of cases. What do we do about this? We're using it to automate most of our research. It's hard to just stop because then we'll fall behind. Someone has the bright idea of, well, let's just take all the cases where it was dishonest and then negative reward and then proceed once it seems like we are not detecting any more cases of dishonesty. And maybe that works. Maybe that's good enough, and you've just trained it to be honest now, whereas your previous technique didn't work. But maybe what you've done is you've just patched over the problem, and you've made it better at hiding. Right? This is the sort of conversation that I do not think like a few people behind closed doors with massive conflicts of interest should be like trusted to make this sort of call. And this is just 1 example. I could easily create 10 different examples of like tough calls like this that will have to be made during around the time of AGI. And what I would like is for there to be, for example, the AI safety institute or some other sort of government body of people that have some expertise in these issues where people can be like, hey guys, we're having this technical issue here. People are not I as an employee of the company think that the current decision the company is making is dangerous for these technical reasons. What do you think? And then the government can be like they can evaluate and be like, it's fine. Whatever. Or they can be like, Wow, actually this is really serious and I'm not comfortable with the level of risk implied by this company's decisions or something. I think that's the ideal for me because I think that's the sort of thing that's going to matter the most. There's going to be like a dozen things like that happening in the course of the AGI accelerating R and D and it's less of existential importance to all humanity that those calls be made correctly. But there's another thing which is separate from that which is just holding the companies to their own word, right? Like the companies will make all these public promises about what they're doing to be safe and the procedures they're following and so forth. But there needs to be some mechanism to actually tell whether they're actually upholding their promises. And currently there isn't much of a mechanism, right? There is that deployment safety board thing which is like not actually a deal on the object level. GPT-four wasn't actually dangerous so putting out in India didn't actually hurt anyone but they've totally violated the procedure that they themselves had set up and they didn't tell anyone about it. We want that to not be a thing in the future. We want it to be that they actually uphold their commitments and get called out on it if they don't. You know?

Dean W. Ball: (1:38:40) Yeah. Or at the very least, if they do, that they have some sort of explanation that makes sense. I'm comfortable with the idea that you're putting out these documents, you're putting out these plans and these policy sort of documents, and is a month is a year inside of these places. And Mhmm. After 12 months, something has really changed quite substantially, and your plans made contact with reality in a way that forced you to alter your plans. It's either update the plan or if you have to behave in a way that requires you to differ from it, explain why. Explain why to the extent possible. Like, there's a lot of this that is hard to put into statute. But I guess the only other thing I would say on the whistleblower stuff is I basically agree. I think it should be not protections for public disclosure of company intellectual property. I think it's much more

Daniel Kokotajlo: (1:39:23) Agreed.

Dean W. Ball: (1:39:24) Privately go go to whatever, the Department of Justice or the the AI safety institute, and you talk to the people there. You can provide because I think there's bit I think it's very hard to say, oh, substantiate your whistleblower claim without revealing company intellectual property. I I think that's, like, probably not possible at all. Probably in some situations, it is possible, but certainly not in all. So you can't put that in the statute. And if that's gonna be the reality, then it's gotta be, like, a private government mechanism of some kind. So that's basically, yeah, how I think it's gotta go. And actually just to be substantive, it would be protections from after action lawsuits. It wouldn't Obviously, be

Daniel Kokotajlo: (1:40:07) this is something that I experienced. When I blew the whistle about the equity stuff, it mattered a lot to me that I was legally in the clear. Some people I talked to were like, Just sign the paperwork and then disparage the company anyway because they wouldn't dare sue you. It would look so bad. For me and for a lot of other people, it just actually matters whether what you're doing is legally protected or not partly because you're afraid of getting sued and stuff, but also partly just on the principle of the thing. You want to be the sort of person who obeys the law. So I think it just actually matters quite a lot. There's a lot of people who would totally raise concerns to the government if it was allowed but wouldn't do it if it was not allowed. Know? Don't have to say anything like and then they won't be retaliated against and then they won't be fired or whatever. Whatever. They can still be fired. That doesn't matter. The the important thing is just are they actually breaking the law or not, you know?

Nathan Labenz: (1:41:01) Yeah. So I was gonna ask about what other adjustments internally might we be concerned about? Earlier toward toward the beginning of the conversation, you said something about o 1 where you were presumably this is what OpenAI was doing with o 1. And this got me thinking, you not know or or you're

Daniel Kokotajlo: (1:41:18) kind of

Nathan Labenz: (1:41:19) speaking speaking generally? Yeah. But I do wonder, like, my sense, the the sort of vibe is like that all of the internal information sharing is also being ratcheted down. Like, what I'm hearing now is that Yep. People at the companies themselves are, like, being updated relatively late in the game when, like Yep. Certain research agendas have come to mostly fruition and are, like, on the verge of launching. I have no idea what the ratio is of who's in the know and who's not in the know, but that seems to be another dial that leadership, if inclined to take the path that, like if if they're maximizing for their individual chance of creating AGI that benefits all humanity and being, like, the number 1 hero of of all human history, then that's like a dial that they could turn. And I feel like there are a lot of those as I think this through, I'm like, on the training spec also, what if a company starts to say, you know what? We are just going to maximize user scores. We're not actually gonna have a constitution. We think that's and you, of course, you could dress this up. Right? That's heavy handed of us, and who are we to say? Let's actually just let the people vote with their thumbs ups and thumbs downs, and we'll maximize that. And that's in some sense more fair, but then it gets them maybe out of a lot of these things. I guess I ultimately think we need some actual rules too, to put my cards on the table. But how much do you worry about that that sort of, like, counter move in response to these kinds of requirements?

Dean W. Ball: (1:42:49) So, like, something that I've said on this podcast several times before in the SB 10 47 debate is that there is, like, a Shakespearean relationship between the intention of public policy and then what actually happens. And there is an extent to which, like, any rules you create are very likely to make the thing that you're trying to fix worse in some important way. That happens all the time, and to some extent it is unavoidable. I think the task of someone who designs public policy is to realize that and try to do your best to mitigate that kind of basic fact about reality. So I would say a couple of things. Like the transparency First of all, yes, the companies are I think my impression is that they're becoming more and more compartmentalized. Part of that, I think, is we're going to see more and more on the cybersecurity front from these places. At some point, the world isn't ready for this conversation quite yet, but we're probably going to start talking about physical security of not the data centers, but the people, right? The people at the companies, their physical safety, right? I think all that will come in due time. And as that happens, it's okay, then things are going to get even more compartmentalized. And so to some extent it's already happening, so it concerns me somewhat less because that's a very big part of them, you know, becoming closer with the US government, right? You had Paul Nacasone, former NSA official, join OpenAI's board. They just hired a very well regarded chief information security officer. Chief information security officers do not facilitate multi stakeholder collaboration. That's not what they're in the org chart to do. So, you know, like, I think the best you can hope for is that the balance of having some of these technical documents being public really helps you in terms of getting feedback from the broader world. Because these people will, for the time being, they're gonna be in San Francisco, and they're gonna be inside of a social community where like they do hear things from the outside world and Twitter and all that stuff, right? So like to the extent that these, that people are having these discussions in public, I think that really helps get ideas from the outside world into the companies, even if the companies themselves are compartmentalized. So it's a little bit of the transparency hopefully just gets you some the benefits that counteract against the compartmentalization, which is like happening anyway. I also think that there is something to, and I I don't quite know how to think about this, but there is definitely something to the idea that things about like research practices, to the extent that those documents are being like made public, that like you might incentivize certain kinds of research practices if those are the ones that you have have public, and those are like essentially setting company policy to which you are held through whistleblower protections, right? There's the question of like, would that bias companies toward picking certain kinds of research practices over another? I've thought quite a bit about that, and I just can't think about I I don't think that this would push them too much in the direction of the sort of stuff that you're describing, but I'm I'm happy to be told I'm wrong about that and to have to be persuaded otherwise. But it's just I think it's a minor enough change in the overall regime because at the end of the day, the thing is that alignment is not it's not some cost center for these companies. Some of the safety stuff is. Some of the safety and the security stuff is like an annoying cost center that slows things down for these companies, but alignment is not. Alignment is capabilities. And I know that you have differences of opinion there, but like in some sense, alignment is gonna The actual plan for how all this stuff works is gonna be pretty fundamental to the product strategy. So I don't worry too much about it. But again, happy to hear principal arguments that I'm wrong about that.

Daniel Kokotajlo: (1:46:36) I think I mostly agree. I would say that this is 1 of those things that comes back to the 2 dimensions I was mentioning earlier. So I think currently at the current levels of investments, they're spending sub 10% of their compute and labor on alignment stuff. And they're doing this sort of fail fast, move fast, break things type strategy with respect to it. So as a result, all their alignment stuff to date has really not slowed them down much at all and possibly has just actually accelerated them because it turns out to be useful for making better products and stuff like that. So that's all great. However, I foresee a time when things change. I foresee a time around AGI level where the systems are getting substantially powerful enough that the risks are starting to feel very real and that you like really want to make sure that this actual system isn't just pretending. Because if it is just pretending, you're fucked because you're about to put it in charge of a bunch of important stuff and you might not be able to recover from. In those sorts of situations, then I think that things are gonna start to bite. And then the trade off between how much effort do we put into making it aligned versus not. It'll start to actually cause a slowdown if you're being responsible I think. Similarly in terms of the quantity of investments, you might end up in a situation where it's like, Good news, our researchers have discovered a faithful chain of thought technique that when implemented at scale in our latest models would allow us to tell basically what they're thinking and then that would allow us to maintain control of them and iterate on their goals and values over time so that they get gradually more and more what we wanted them to be. Great. Good news. Problem. To maintain these faithful chain of thought techniques, need to be having this paraphraser model embedded in all the chains of thought. We also have to be doing this regularizer to keep it in English instead of having it be in some sort of pidgin language. Also, we're not allowed to use recurrency because that would mess everything up. Because of all these different limitations, the resulting system is going to cost 3 times as much compute to reach the same level of performance as if we abandoned the faithfulness property. And then it's like, Oh, that sucks. Because now we have to make a choice between the thing that we have good reason to believe would be totally safe and the thing that costs 1 third as much. And I expect there to be at least a few decisions like this that come up at some point. So far there's been no conflict here I would say, but I expect there to be a conflict in the future unfortunately.

Nathan Labenz: (1:49:03) We know what you agree on. We've got these 4 proposals. We know a little bit of like where there's disagreement. I kind of feel like all this stuff is pretty high stakes and it doesn't seem like self policing given all these incentives is a great bet. Why not take 1 more step and say somebody from outside the company, possibly a government agency, possibly accredited third parties or whatever, has to actually get in there and look at your shit in the training process and we'll let them be the ones to be potentially a bit differently incentivized and really wanna know what's going on as opposed to wanting to maybe tell themselves a story and have that extra layer of

Daniel Kokotajlo: (1:49:52) I mean, 1 100 I think yes, but I expect Dean thinks no. So I'll let

Dean W. Ball: (1:49:56) him Not necessarily. Not necessarily. So I I what I would say is this, self governance does not mean the companies exclusively self govern themselves or or govern themselves. You might the way to put it more might be, like, private governance. Think about the fact that there was a time period when insurance didn't exist. Right? And there were people with cargo ships that were, like, going out and sailing across the oceans and trying to trade stuff that was valuable. And it was like, what if we lose the stuff? What if the ship sinks, etcetera, etcetera? And, like, how do we deal with that? And there were financial products that evolved over centuries, and eventually, it turned into insurance. And it turned out that insurance was an extremely important part for the epoch of modern market capitalism. There is not a single capitalist country in the world today that does not have robust insurance markets, but you would never have intuited that you need insurance. Like if you read Adam Smith, it's not like Adam Smith is like, Oh yeah, this is all about insurance. You have to have insurance, right? You would have never intuited that, but it was an institutional technology that we needed to develop, And it was a private 1, inherently public act in some sense, this like weird risk management of all these things that oftentimes did involve like geopolitical competition and yet also like highly technical and specialized knowledge. It was a new type of institution that got built, a private governance mechanism. I think that is what we need to build. That is like where the I think that is where AI governance should go. And I think a big chunk of that is in essence a technical evaluation problem for all kinds of things, by the way, not just major risks, but also can doctors trust this not to do medical malpractice? Can blah blah blah? Like, a lawyers trust this, whatever. A lot of I think we use evals for 2% of what we could be at this point. And so I have been thinking quite a bit in recent months about how it is that the problem is I think for very principled reasons, I think the government is not going to do a good job at the things that you're describing. But that's not to say that I don't think that there should be a check and a balance there. So what I've been thinking about is, like, how do you foster an ecosystem of, like, evaluation institutions and other kinds of private governance mechanisms that can make a lot of this go better? I think METER is a really good example. The model evaluation and threat response research. I forget the acronym. It's a somewhat funny acronym. But I think the guys at METER are really smart, and they're thinking about this in a serious way. It's already happening. Hayes Labs is another example. There's a few groups like this. The AI safety institute of the US government is also somewhat like this and I think has a role to play. So I've been trying to think about how that ecosystem might look, and I think you should expect to see stuff from me on this front in the next couple of weeks, because I do think it's important. I just think if you have the government do it, it's gonna do a really, really bad job. And AGI will definitely not say the n word, but it might kill us all, right? That's the problem that you're gonna run into, right? Like, you'll have a very, very, like, politically correct, nice AGI. It won't say anything mean, but but it might kill us all. Like, I don't necessarily think it's AGI is gonna kill us all, but I'm just saying that's the risk you run with government.

Daniel Kokotajlo: (1:53:04) Government's not

Dean W. Ball: (1:53:04) gonna focus on the serious risks.

Daniel Kokotajlo: (1:53:06) I totally agree that government has historically botched so many things, and I'm only advocating for this due to lack of better options. But I would love to see a better option presented to me. Yeah. So I'm excited to read your stuff, Dean.

Nathan Labenz: (1:53:19) It does sound like an insurance requirement could be a kernel of an interesting idea. Daniel, last question for you, and I'll let you go. Anything that you're looking at right now that could, in a major way, change your outlook? For me, distributed training is a big 1. If I start to see breakthroughs there, I'll kind of have to reevaluate a number of my outlooks on the next couple of years. That 1 or others that you're keeping a close eye on?

Daniel Kokotajlo: (1:53:47) So whenever there's a new big release, it's like Christmas for me and for my team. We all drop what we're doing and read the paper and stuff. It's unfortunate from our perspective that things are so closed because it makes it so much harder to forecast the future. I think that these big questions like when AGI and how fast the takeoff and things like that, I could have substantially better answers to those questions if I was super up to speed at exactly what Anthropic did to make their compute using agent and exactly what OpenAI did with their latest O1 and stuff because then we could look at it quantitatively and be like, They've like trained in this type of environment for this long with this type of algorithm and then I can extrapolate and be like, what if we had a 100 times more compute and we had a much bigger richer environment and we trained for that much longer and so forth? What would the scores on the benchmarks be then? That's the sort of thing I could do if I had all that info, but I don't. Instead, I have to get what info I can whenever there's a big release. But yeah. Thank you and thanks Dean. This was fun.

Nathan Labenz: (1:54:44) Yeah. I appreciate you both, Daniel. As I said earlier, you've made an important contribution, I think. And I really also appreciate you both coming together and hashing out a zone of agreement. And this is definitely something that a model that many more people in the discourse should, take inspiration from. So I appreciate it, and, hopefully, this won't be the last time. But for now, I will say, Daniel Kokotajlo and Dean W. Ball, thank you both for being part of the Cognitive Revolution.

Dean W. Ball: (1:55:11) Thanks very much.

Nathan Labenz: (1:55:13) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to The Cognitive Revolution.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.