In this episode of The Cognitive Revolution, Nathan shares a fascinating cross-post from Doom Debates featuring a conversation between Liron Shapira and roon, an influential Twitter anon from OpenAI's technical staff. They explore crucial insights into how OpenAI's team views AI's future, including discussions on AGI development, alignment challenges, and extinction risks.
Join us for this thought-provoking analysis of AI safety and the mindset of those building transformative AI systems.
Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse
SPONSORS:
GiveWell: GiveWell has spent over 17 years researching global health and philanthropy to identify the highest-impact giving opportunities. Over 125,000 donors have contributed more than $2 billion, saving over 200,000 lives through evidence-backed recommendations. First-time donors can have their contributions matched up to $100 before year-end. Visit https://GiveWell.org, select podcast, and enter Cognitive Revolution at checkout to make a difference today.
SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognit...
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, with compute costing 50% less and outbound networking 80% less than other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive
Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.
CHAPTERS:
(00:00:00) About the Episode
(00:07:18) Introducing roon
(00:09:13) roon's Background
(00:16:40) roon the Person (Part 1)
(00:21:56) Sponsors: GiveWell | SelectQuote
(00:24:45) roon the Person (Part 2)
(00:26:43) Excitement in AI
(00:31:59) Creativity in AI
(00:40:18) Sponsors: Oracle Cloud Infrastructure (OCI) | Weights & Biases RAG++
(00:42:36) roon's P(Doom)
(00:52:25) AI Risk & Regulation
(00:53:51) AI Timelines
(01:01:20) Aligned by Default?
(01:09:16) Training vs Production
(01:14:30) Open Source AI Risk
(01:26:25) Goal-Oriented AI
(01:34:29) Pause AI?
(01:39:46) Dogecoin & Wrap Up
(01:41:06) Outro & Call to Action
(01:56:38) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Nathan Labenz: (0:00) Hello, and welcome back to The Cognitive Revolution. Today, I'm excited to share a cross-post from Doom Debates by Liron Shapira, featuring a discussion between Liron and roon, a widely respected and highly influential Twitter anon account known to be powered by a member of OpenAI's technical staff. What makes this conversation particularly valuable, in my opinion, is the window it provides into how people at OpenAI, and to a lesser extent the other leading labs, are thinking about the near-term, big-picture future of AI. While roon is careful to say that he's not speaking on behalf of OpenAI, and I appreciate that this is a candid conversation not representing any official policy, I think it's nevertheless quite telling. So what should you be listening for? Well, for starters, you'll notice that roon is consistently quick to acknowledge and willing to grapple with AI's transformative potential. When asked whether AI might one day outperform Elon Musk at running companies or whether it could match Terence Tao at mathematics, roon takes these questions not as high-level metaphors, but as concrete empirical questions that we could very plausibly answer in the affirmative within the decade. At one point, he seems to take the concept of a technological singularity pretty much for granted, saying that AGI is coming very soon, while, by contrast, highly capable humanoid robots will take longer, by which he means maybe just two to three more years. With that in mind, it makes sense that roon agrees that it would be logically incoherent to believe AIs will become super powerful without also being willing to confront the reality of tail risks, to endorse the famous AI x-risk statement, which was signed by OpenAI leadership and which roon describes as a very low bar, and to predict that Meta will eventually stop open-sourcing their models, noting that it becomes irresponsible at a certain level to release new models immediately. Yet at the same time, despite this clear-eyed view of AI's trajectory and likely future capabilities, roon puts the probability of human extinction from AI causes at less than 1 percent. This, to me, seems much less well supported by the current state of evidence. So why does he believe it? His optimism seems to rest on three key pillars: first, belief in, quote, unquote, alignment by default through the learning of human priors, which we might also call human values, during the pretraining phase; second, belief in the moderating effects of competition between AI systems; and third, and most notably, confidence that the good guys will develop powerful AI first. Now, to give alignment by default its due, I've said many times on this feed that our ability to create AIs that understand human values and act ethically has dramatically exceeded my expectations and now seems way more plausible than I would have thought just a couple of years ago. Of course, at the same time, I've also noted that it doesn't really happen by default. So-called purely helpful models without safety mitigations are, in my experience, and here I'm speaking specifically of GPT-4-early, often shockingly amoral. In any case, it was actually the last point that particularly caught my attention, because as you may recall from previous episodes, I am always very suspicious of any analysis that begins by identifying good guys and bad guys and then proceeds to reason from there.
History and all sorts of experimental psychology results show that people regularly fail to question such assumptions when others most need them to, sometimes with catastrophic effect. So when roon discusses his sense of duty to use maximum technical and strategic skill in his work on AI, or says that it's a pretty cool thing to say that "I eradicated polio," I think we get a glimpse into the main-character mindset that I really worry is all too common among those building these transformative systems. Yes, they genuinely aspire to build AGI for the benefit of all humanity. And, yes, they really are trying to do a good job and to be responsible about it. But might they be contenting themselves a little too quickly with "better us than them"? And exactly what p(doom) are they asking the rest of humanity to accept anyway? To his credit, roon acknowledges that he's quite young and that his worldview has been, and seemingly remains, plastic. But this in and of itself is important context for the public to understand. While OpenAI has built business, product, and government-facing teams staffed by accomplished professionals, there has been huge turnover at the leadership level across research, safety, and policy. And the research team, as far as I can tell, seems to now be made up predominantly of brilliant, super productive young math and programming geniuses who have simultaneously embraced the high-stakes nature of their work and internalized the low-anxiety, process-over-outcomes worldview that's common amongst elite competitors. All this despite, in many cases, having experienced relatively little of life themselves. Perhaps no moment in the conversation highlights this better than when, in the context of discussing Apollo Research's work on o1 and other frontier models scheming against users, a topic on which we did a full episode quite recently, roon compares the experience of developing and exploring frontier models to, quote, having a kid, saying that the developers are simultaneously kind of proud of their exploits, but also wish they could control them a bit better. I'm sure that roon could quickly explain why this analogy is so problematic, and I don't think he's personally confused by it. But there still is a lot of irony in hearing this from a childless person. And it brings to mind one of Sam Hammond's 95 theses on AI: AGI labs led by childless Buddhists are, by all accounts, probably more risk tolerant than is optimal. All that said, I was glad to hear that Sam Altman will soon be having a child of his own, and I appreciate roon for doing this conversation sincerely. He describes his goal as using my voice to influence the world in a slightly positive direction, however marginal that might be. And on that, we are highly aligned, as I describe my goal for the podcast in almost exactly the same terms. Still, I cannot escape the worry that the best minds of our generation are so focused on being the good guys and so caught up in their own heroic journeys that they're collectively yoloing the development of transformative AI technology, which they openly recognize could prove catastrophic, without even trying to quantify the level of risk. As always, if you're finding value in the show, we'd appreciate it if you'd share it with a friend, post on social media, write a review on Apple Podcasts or Spotify, or leave us a comment on YouTube. We always welcome your feedback too.
The best way to provide it right now is by completing our year-end listener survey, which also includes an opportunity to ask questions for a soon-to-be-recorded AMA episode. Finally, I encourage you to check out Doom Debates for more deep dives into existential risk questions. One episode you might like, if you remember my conversation with a16z's Martin Casado about how much more powerful we should expect AIs to become in the next few years, is Liron's breakdown video of that episode. I think he does an excellent job of articulating what makes intelligence powerful and why we should expect AIs to grok at least some important concepts not understood by humans, even when full-scale simulation is unaffordable or outright impossible. For now, I hope you enjoy this important debate between Doom Debates host Liron Shapira and roon from OpenAI's technical staff on how we should understand and approach the existential risks from artificial intelligence.
Liron Shapira: (7:18) Hey, everybody. Welcome to Doom Debates. Today, I've got a very special guest. If you're watching this show, you probably don't need me to read his introduction, but here it is anyway. roon is a member of the technical staff at OpenAI. He is a highly respected voice on tech Twitter despite being a pseudonymous cartoon avatar account. In late 2021, he invented the terms shape rotator and word cell to refer to, roughly, visual, spatial, mathematical skills versus verbal skills. And then everyone in the tech industry started talking about shape rotators versus word cells. He is simultaneously a serious thinker and builder and a shitposter. A few months ago, he posted a tweet that began "being afraid of existential risk from AI progress is prudent and advisable," which prompted a Doom Debates listener named Dogecoin wins to suggest that he come on the show, and he graciously agreed. So here we are. I'm excited to learn more about roon, his background, his life, and, of course, his views about AI and existential risk. Hey, roon. Welcome to the show.
roon: (8:26) Hello. How's it going? Nice to meet you, Liron.
Liron Shapira: (8:29) Yeah. Great to meet you too. I gotta say, I didn't know what you were gonna look like. You look just like I imagined.
roon: (8:35) That's great to hear. You know that that that avatar, Carlos? I picked it. I I wanna say in, like, 2021 or 2020 or something. He's he's just kind of a a smug troublemaker. Even in the cartoon, he's he's getting off shitty jokes, and his classmates sometimes want want him to shut up, sometimes like what he's saying. So I think he's a a good representative. I hope I hope it came across well.
Liron Shapira: (9:04) Yeah. Yeah. I know Carlos from Magic School Bus. I wasn't a big Magic School Bus guy. I was more of a Nickelodeon kid, but I definitely I remember that era of animation. It was good times. Yep. So let me get some basic questions out of the way here. Just the basics. What is your name? What is your quest? What is your favorite color? And you don't have to answer those in order.
roon: (9:29) Okay. I'm roon. My quest, okay. I guess it would be to document this era in the advent of the deep learning singularity, of the San Francisco AI sphere, and to be a narrative voice that it's honestly lacking, I would say. Like, maybe the niche has been filled somewhat, but I would say that, especially 2 years ago, I don't think that anyone was covering what was going on here in, like, the titanic scale of the ambition of the AI industry and the tech industry generally with the right kind of language and the right kind of metaphor and whatever. And, you know, I didn't set out with any particular mission in mind, but I think that kind of became mine for a while. I still kind of see it that way too. You know, like, document this epic, this saga, and, you know, write on Twitter, maybe write a book one day, and maybe use my voice to influence the world in a slightly positive direction, however marginal it might be. You know?
Liron Shapira: (10:50) Kinda like the Michael Lewis.
roon: (10:53) Yeah. I I gotta be honest. I've taken a lot of inspiration from Michael Lewis and AGM who wrote Chaos Monkeys. Mhmm. Yeah.
Liron Shapira: (11:04) Yeah. Chaos Monkeys. Great book. You know, the reason I ask about your quest is because I remember when you were going back and forth on Twitter with Connor Leahy, I think you mentioned something about, you know, you see it as your duty to do the kind of job you're doing now. Can you unpack that?
roon: (11:22) Yeah. And I wanna say it's, like, it's complicated. I don't necessarily endorse the things I said in that thread, like, 100%, or, like yeah. I'm quite young. My philosophy on life has been sort of plastic and not fully shaped at any given time. But I would say, like, the thing I was feeling at the time, the thing I was channeling, was this basic feeling that AI progress is good, that, you know, we've done all this thinking. We've done all this meditating on risk, on how best to build AGI, how to safely do it while navigating some labyrinthine maze of, like, constraints, of game theoretic considerations. And, you know, like, where does the buck stop? Like, how do you decide your actual next action? And it's like, what are you doing, like, pure economic rational calculation? It can actually be, like, debilitating. And I think what I was talking about in that thread is, like, to put aside anxiety once in a while, to, like, accept something as duty, accept a job or a mission or something like that as duty. I think what that means is, like, set aside the existential anxiety and do what is exactly correct for you to do in the moment. And what that means is that, you know, you are using your maximal technical skill in any given moment to do the right thing without fear of outcome. Wait, what do I mean by that? You know, like, you're working on a project. Let's say, like, it's an ambitious deep learning research roadmap or something like that. You obviously don't know ahead of time what is going to succeed. Otherwise, it wouldn't be research. But if you spent all your time worrying about whether the project is going to work, you might actually be debilitated from actually giving it your all, to, you know, output the maximal technical skill that you have in you to get that thing done. And that actually closes the envelope of possibilities you have in front of you. So I was doing, like I don't know. I was waxing philosophical in that thread. And I've met Connor Leahy, actually. I had a great chat with him. I think he's one of the, like, smartest people I've met. So
Liron Shapira: (14:09) Yeah. He's amazing. Yeah. That was an awesome conversation. And I remember at the end, after Connor had some really sharp replies, you're like, man, he really cooked me. So, you know, that's that's great that you're able to, you know, concede some points or at least, like, respect, you know, another high intellect.
roon: (14:25) Yeah. You know, I was partly being theatric. I don't I don't actually think, like I I thought we both had some great points in the thread. Like, I I wouldn't say that you know, I just kinda defer that 1 to Connor. But yeah. Cool. Cool.
Liron Shapira: (14:41) And so the part that I was a little curious to unpack though is just, like, you know, going back to your life in, like, 2021 or a little earlier, right, before AI became the big hype. Did you see your life as this kind of duty frame, right, where this is like your life's mission? I'm just a little curious to go back to that.
roon: (15:00) Absolutely not. No. I was just having fun. Well, okay. So I mean, I've always had a strong, you know, technophile frame on things. Like, I think I've had many different, what do you call it, like, political awakenings, or, I guess when I was in college, I changed my politics a few times, let's say. But each time, I would say, like, the ground truth was, like, which of these things leads to quicker, faster, better technological progress? So I've always had a strong sense that that's, like, you know, maybe the dominant term in a lot of planning, a lot of calculation. Cool. But yeah. No. I certainly wasn't shitposting on Twitter out of a sense of duty. It was just fun. It was a great time. And especially, I wanna say, like, during COVID, the energy on Twitter was just insane because, yeah, obviously, like, there's a lot of smart people cooped up indoors, not able to get their intellectual stimulation the normal way. So there's, like, Twitter Spaces and, like, jokes and Clubhouse. Clubhouse and yeah. It was a great time. Not enough has been written about that time, honestly. I mean, okay, of course, it was also a terrible time. Yeah, senior citizens dying left and right, but, like, there was some silver lining, especially for people like me. I don't think that's, like, actually ever recovered.
Liron Shapira: (16:41) Alright. So, you know, me and most of the viewers of my show, we know you pretty well from yeah, you know, on an intellectual level on Twitter. We've engaged with you in so many ways intellectually, but we're so curious about roon the man. Can I ask you some questions about roon the person?
roon: (16:56) Okay. Sure. Alright. Where are you from? Okay. So I'm gonna be a bit generic with these responses. Even though I'm doxxed, like, 5 times a year, I think it's good to minimize the surface area of these things.
Liron Shapira: (17:14) Cool.
roon: (17:15) Yeah. I grew up in the Midwest, a suburb in the Midwest of the US. And, you know, I spent a lot of time online on the Internet, you know, doing all the same things that young boys of my age of a particular inclination were doing. So, like, played a shitload of DOTA 2, you know, StarCraft, went on all sorts of Reddit forums, and, like, I was, like, building a computer. But, yeah, like, I think it's important to note that, like, the Internet played a pretty heavy part in my high school years and college years. And there's obviously benefits and drawbacks to that. There's no point, like, really regretting it now, but it certainly shaped who I am. And I think, like, a lot of things that seemed pretty cool to me at the time became insanely cool as I've you know, in the decade since. You know? So, like, I built a computer, and I was, like, shopping for graphics cards. And then as it turns out, the world's most valuable company is now a graphics card company. And
Liron Shapira: (18:25) Right. Right. Right.
roon: (18:26) Right. OpenAI ended up building a bot that plays DOTA 2, and DeepMind built a bot that, you know, became the world's best Zerg player or whatever. And, yeah, like, at the time, I thought, well, hey. Elon Musk is a pretty cool dude. He seems he seems like he's building a lot of great companies. And now he's well, he's he's truly something. I I don't know how to classify him, but yeah. Became more important than I could have ever imagined.
Liron Shapira: (19:00) I definitely know what you mean. Yeah. I mean, I think anybody who had a taste of, like, the Internet and nerd culture or, like, the rationalist corner, I I think, has had some kind of experience like that where, like, oh, now are we talking about this?
roon: (19:12) Yeah. Exactly. Right.
Liron Shapira: (19:16) Do you consider yourself part of any particular religions, philosophies, movements, or schools of thought? Like, for example, effective altruists, accelerationists, just anything like that?
roon: (19:25) Not super. When I first got on Twitter, there was this movement that was becoming sort of popular called, like, post-rationality, which was a funny thing to become a part of because I was never really a rationalist, or, like, never considered myself one. So it's open to question how I'd become a post-rationalist. But I think it really struck something in me where, you know, these guys were talking about, you know, like, the failure of scientific institutions, like how there's just a lot of science that's not reproducible. You know, there's this sort of philosophical anarchy that they had. There is, like, you know, like, the Feyerabend, Against Method type thing, where the methods of revealing truth are not altogether formulaic, where, like, greatness often cannot be planned. And that was pretty enlightening for me, or, like, it represented a departure from how I was thinking when I was younger. So, yeah, I guess maybe I was a rationalist in everything but name. I'm certainly, like, very close to both of those, like, Internet thought spheres and have been. So yeah. But, yeah, I never really had a label for myself. But cool.
Liron Shapira: (20:54) Hashtag no labels.
roon: (20:56) Yeah. That sounds cringe, but, like, it's not even that, like, I don't like them. It's just that I don't see the benefit in having one. I don't know. Yeah. Maybe there was a period where I thought of myself as an effective altruist. Yeah.
Liron Shapira: (21:16) Cool. Yeah. I mean, I hear you. I can tell you, for my part, I generally avoid putting, you know, current-thing fads in my Twitter bio. So I definitely sympathize with you on, like, having no labels.
roon: (21:28) Oh, actually, I'd be remiss if I didn't mention this. There was a period where I was calling myself a neoliberal on Twitter. So it was like a, you know, like, free-market, kind of center-left philosophy where I was, like, reading all these, like, Noahpinion blogs and, I don't know. So there was a time where I did have a label. Okay? I didn't wanna leave that out.
Liron Shapira: (21:54) Alright. Thanks for being honest.
Nathan Labenz: (21:56)
Hey. We'll continue our interview in a moment after a word from our sponsors.
Liron Shapira: (22:01) What are your hobbies?
roon: (22:06) I honestly don't have, like, a serious hobby of any kind. I think, like, with work, with shitposting on Twitter, with, you know, watching TV with my girlfriend and, like, hanging out around the city, that pretty much covers all the time I have. Yeah.
Liron Shapira: (22:26) Maybe reading sometimes. That answers my next question also. So you're currently in a relationship?
roon: (22:31) Yes. Yeah. I am.
Liron Shapira: (22:33) Alright. Cool. Cool. Cool. Just curious. Okay. Do you work out?
roon: (22:39) I try to. I have actually been, for, like, the last 2 months, running in the morning whenever I can, and I highly recommend it. It's, like, one of the best nootropics. Just, like, go for even, like, a 10-minute sprint. It, like, I don't know, it, like, drastically changes your outlook for the day. Especially if you have had a night of bad sleep, it, like, kind of erases all the brain fog somehow.
Liron Shapira: (23:07) That's right. Yeah. I think I remember you tweeted about that. Alright, guys. You heard it from roon. 10-minute sprint in the morning. Get on that. Yeah. Obligatory podcast question, and we're getting to the end of the personal stuff. Don't worry about it. What's your morning routine?
roon: (23:24) Morning routines, Liron. I don't know. It's not I don't have a great morning routine. I guess I get up. I browse Twitter, brush my teeth, obviously, get a coffee, maybe walk around a bit,
Liron Shapira: (23:45) and then go to work.
roon: (23:46) And at this point, I might hit the gym at somewhere between a 50 to 75% success rate, and then I go to work. Yeah.
Liron Shapira: (23:57) Sweet. And what excites you about working on AI?
roon: (24:02) How do I even answer that question? Like, I guess, like, there's so many all the generic answers are true. I guess for me personally, I wanna see works of excellence, like, just more amazing things in the world.
roon: (24:21) So, you know, like, one of the dominant frames of why AI progress is exciting is, like, well, we're gonna automate so many things and, like, GDP will go up, and, you know, like, that's it's true, but also, like, super boring. I don't actually think anybody ever has been, like, truly stirred by the idea of, like, increasing GDP. Although, maybe some people have. But I think when they say that, they really mean something else, which is, like, advancing the frontier of human greatness or something. Like, lifting people out of poverty is a it's, you know, it's like a it's a thing to do. Like, it's almost like an achievement, an accomplishment. Oftentimes, it's less about the poor people than it is about, like, you know, I eradicated polio. That's a pretty cool thing to say. Obviously, like, there's a lot of, okay, just an incredible, unmeasurable amount of human suffering associated with that statement, or rather, that was erased from the world. But it's also like you know, I think Bill Gates once said he wants the legacy of "I eradicated polio," and I think it's okay to admit that. And, you know, one of the things about AI that was most exciting to me is, like, okay, take a look at this AlphaZero thing. It's like inventing, like, there's this picturesque frame in that AlphaGo documentary where it's game 2 of the match, and Lee Sedol is playing AlphaGo. And it makes this move 37. It's a very famous move. And Lee Sedol gets up, goes outside, takes a smoke break. He's like, you know, what the fuck is happening? This is like it's like making contact with aliens. Like, obviously, to me as a total novice at Go, I don't know anything about it. It looked like a move just like any other, but to Lee Sedol, it was, like, heretical. It's like outside the bounds of, you know, centuries of human thought about this ancient board game. Like, he thought that humans knew everything there was to know about Go and that, you know, like, machines are kind of they suck. They make very predictable moves. It's easy to run circles around them. And then all of a sudden, that's not true. I remember some commentators were like, well, clearly, AlphaGo just made a mistake here. Like, this makes no sense. This is not what you do. But then Michael Redmond, I think his name was, the commentator, was like, something's going on here. This is very interesting. And it's like like, that moment has stuck with me. I think it's what got me into, like, AI in the first place. I was like I don't know. I was in college. And
Liron Shapira: (27:23) Yeah. And that was around 2016. Right? Right.
roon: (27:28) And I waited for that moment in other fields. Like, I saw some amazing play in DOTA 2 by those, like, OpenAI Five bots or whatever. But it wasn't really at the same level. And, actually, a modern like, a lot of modern language model progress has not had that same feeling. It's more like, hey, we're gonna do all these tasks moderately well. But Yeah. It's not quite superhuman or transcendent in any given thing. Right.
Liron Shapira: (28:02) I mean, to me, sometimes maybe the moment of transcendence is when you ask, like, a long question, and then, like, the first few tokens already seems to contain a wise answer.
roon: (28:11) Yeah. That's true. I think for me, sometimes, it's been, like, playing with the creative writing capabilities of the models and having them, like, do something truly interesting, like, do some metaphor that makes perfect sense to me, but nobody else has done before and I've never seen. Yeah. And it's rare to see something with total command over, like, the form of a Shakespearean sonnet and also, like, a topic like misaligned superintelligence or something. So one of my favorite prompts is, like, with a new model, I'll ask it to do, like, a Miltonic epic about misalignment or something, or things along those lines, and see if it can actually stir something in me. And, yeah, it's been one of my good vibe tests for progress.
Liron Shapira: (29:10) Yeah. I share your vibes even going back to the famous Move 37 in Go. I also remember at the time reading a post. I think it was on Facebook, and he was saying, like, guys, this is like a fire alarm. Like, this is a big inflection point. I didn't predict that we'd be here this fast. This is going back to 2016 and Move 37, even before the LLM revolution. I wanna ask you specifically about the concept of creativity, because there's a popular position, I think, maybe best represented by David Deutsch, who's generally a very smart guy. Right? Like, he has tons of fans. I'm a fan of David Deutsch. It's just this one position that he always takes, that the current generation of AI just does not show any true creativity. In your view, when you look at something like a Move 37 from AlphaGo, does that seem like true creativity?
roon: (29:57) Yeah. That's absolutely true creativity. I don't think there's any doubt about that. Like, if Lee Sedol thinks it's creative, like, who the hell are we to say, no, that's actually just iterated search and, like, inference fine-tuning? I don't know. Like, clearly, I think some aspects of creativity can be simplified to just interesting hypotheses and searching over the hypothesis space. And, you know, in this case, it's, like, literally a tree search. But Yeah. And I think we see other models, like I think the o series of reasoners, like, do quite creative things when asked to solve difficult math problems. They try a bunch of stuff and hopefully eventually get somewhere.
Liron Shapira: (30:58) I'm on the same page that creativity is fundamentally it's deeply related with the concept of search. Do you think creativity is a property that you can evaluate even in a black box form where you just look at some system, you look at how it's handling certain inputs and mapping them to outputs, and that can be enough for you to conclude that it's creative?
roon: (31:18) Yeah. I mean, I thought about this question a lot in my line of work. I think there's a number of things you can define as creativity. One is, like, model entropy. Like, okay, I ask a question, like, write a poem. How many different types of poems does it give me when I reroll a question? Like, that would be one very simple way of looking at it. You could ask, like, write me a poem, and then you say, okay, now write me a different poem. And it's like, how many times does it actually successfully avoid all the previous things it's come up with? That's actually one of I think his name is Aidan McLaughlin on Twitter. He's got a thing called AidanBench, which kind of does that on all the new models. But I think, really, when we say creativity, we mean, like, how does this innovate on, like, the modern state of culture? You know? Like, how how are we like, if a book is creative, that means, like, it's doing something interesting and novel and, like, fundamentally hits the target audience. Right? So Yeah. What do I mean by that? Like okay. Lord of the Rings, one of the exceptional series in recent history, probably. It was a very specific product of its time. It's, like, post-World War 2 dissociation, disillusionment with industrial society, etcetera. And then tons of fantasy authors have written sort of mimicries of Lord of the Rings where they use the same type of, like, the elven species, the dwarf species, whatever. And it's a remix of a bunch of themes, new characters. But the aesthetics are fundamentally very similar to Lord of the Rings. And they're not creative, and they're also not like they may not hit just right, if that makes sense. Like, they don't answer some need of the culture that it's asking for at the time.
Liron Shapira: (33:35) I'm curious if you wanna take a stab at a definition of creativity. And it seems like the definition probably shouldn't have culture in it you know, you probably don't wanna make creativity depend on culture, or do you?
roon: (33:46) I guess it's a good question, because when you're talking about creativity, you are often looking at domains like art or science or even Go, where you are, like you're grading it relative to some history or culture or tradition. Like, why is move 37 creative? It's because, well, humans didn't know about it yet. You know, like, I'm sure AlphaGo played plenty of great moves that humans already knew about. Like, that that's a good idea. Right? So when we say something is exceptionally creative, we mean relative to our body of knowledge or our needs at the time or something. There's also something objective about it, clearly.
Liron Shapira: (34:32) Yeah. Can I take a stab at it? Sure. So imagine the input to a search problem is some optimization criteria, and, you know, like, make me a painting that's beautiful by what humans like. And then part of the input is going to be like, here's other artifacts that the human culture has created or just like, here's a book of examples that you're allowed to look at. So I agree with you that if it can kind of cheat and look at similar examples, then the output is going to count as being less creative as opposed to, like, oh, it never got to see any examples, but it hit the target anyway. So this way, we can we can get rid of the concept of culture, and we and we can instead just refer to, like, okay. What were the most similar examples that it saw?
roon: (35:12) Right. Yeah. No. Yeah. That's fair. Yeah. I think that makes sense, an input set and, like, you know, come up with something new. That defines the problem.
Liron Shapira: (35:25) Yeah. And now it sounds like you're pretty close to my position of being like, look, the stuff that this AI is outputting today, like you mentioned o1, it's really not that close to any examples that it's ever seen before. Right? And I feel like that's a pretty big disagreement. Like, there's so many people, prominent people, who will tell you, like, no, no, no, there's always examples, man. It's always just interpolating the examples.
roon: (35:46) Yeah. I mean, it's like that's like one great debate that will never stop, I guess, like, the generalization versus interpolation. Because at, like, at some level of abstraction, maybe it's, like, always interpolation. I don't know. I don't have a strong opinion there. I just I think that it's kind of an I-know-it-when-I-see-it thing. And it's clear to me that language models, even before the reasoner line of exploration, were doing, like, you know, like, they're generalizing. Like, they were writing things that no one had written about before. I think you could even see, like, some of the amazing things that, like there's a community on Twitter where they jailbreak Claude Opus into you know, like, they put it in, like, hundreds-of-messages-long conversations, and then, like, get it to output some insane poetry or prose or whatever. And it doesn't look like anything that's come before. And, you know, even setting that aside, I think, like, there's crazy examples of language models doing few-shot learning to you know, I just made this new programming language. Given these 10 examples, how would you solve this puzzle with this eleventh example or something? So, yeah, I think all of that is creativity. Like, what degree of generalization, that keeps changing, clearly. But yeah. Makes
Liron Shapira: (37:33) sense.
Nathan Labenz: (37:34)
Hey. We'll continue our interview in a moment after a word from our sponsors.
Liron Shapira: (37:38) Alright. Let's move into the next segment. Are you ready for the big question?
roon: (37:42) Okay. Yeah. I I have a feeling I know what this is. Yeah.
Liron Shapira: (37:46) One, two, three, four. P(doom). P(doom).
roon: (37:50) What's your P(doom)? What's your P(doom)? What's your P(doom)?
Liron Shapira: (37:54) roon, what is your P(doom)?
roon: (37:58) Oh, man. Okay. I have studiously tried to not have a P(doom) because I don't know how to approach the question. What is okay. Let me ask a clarifying question. What is doom to you? Do you mean, like, you know, the star-eating clipper? Or
Liron Shapira: (38:21) do you mean Let's define doom as, like, most of the value that we see in human life today or on planet Earth today, most of that is gone. And that's in contrast with extrapolating economic growth, where it's like, not only are things going to stay good, but they're going to get potentially a lot better. Like, potentially, we're going to have trillions of planets worth of goodness. But by my definition of doom, it's a scenario where we actually only have a tiny fraction or even 0% or even a negative percent of the goodness that we have on Earth today. So, hopefully you know, it's a vague definition, but, hopefully, it's crisp enough in terms of, like, trillions of good planets versus, like, no good planets. Right? That's kind of the distinction I'm going for.
roon: (39:03) Okay. So right. Like, ending the human potential for trillions of
Liron Shapira: (39:09) good planets, etcetera. Yeah. Right. Like, throwing the future in a dumpster, basically.
roon: (39:13) Right. I actually see I'm not hugely pessimistic about a doom scenario that bad. I think it's pretty low. I I really don't wanna put a number on it because I feel like anything I say will be poorly calibrated and not make a lot of sense, and I will regret it even a week later. But I'm pretty optimistic about avoiding total doom. You know? Like, I think there's a lot of other bad scenarios that I'm somewhat stressed about.
Liron Shapira: (39:48) What do you think of Jan Leike's quote when he was at OpenAI? He estimated it at 10 to 90%. How does that sound?
roon: (39:56) I think I'm definitely below 10% of total doom, total value destruction. Yeah.
Liron Shapira: (40:04) Are you at above 1%?
roon: (40:10) I feel like it depends on the day. I don't know. I actually think that my P(doom) of, like, total value destruction is likely less than 1%, but certainly, like, nonzero. Yeah. Yeah. Interesting.
Liron Shapira: (40:28) You know, I like to get some larger perspective on this, where if we don't even make it specific to AI and you just bring in the whole cadre of existential risks, you know, AI, nuclear, pandemics, etcetera, what are the biggest existential risks in your view? And in a given century, right, like, out to 2100, what do you think is, like, the overall P(doom) for any reason?
roon: (40:54) The likelihood of nuclear war forever devastating civilization so badly that it never recovers doesn't seem that high to me. Similar with pandemics, unless, you know, you have some, like, truly I would assume, like, bioengineered pathogen that's, like, designed to maximally wipe out humanity. I don't know. But there's also just, like, the terminal state of birth rates in civilization. Like, maybe this level of technological or economic progress is not sustainable, just because of the fact that people don't have as many kids as they used to. And we don't really need to moralize about that or, like, debate why. It's just true. You know, there's 2 major inputs to economic growth. It's population growth and technological growth. And so if 1 or both of those are stagnant, then you enter into decline, which is really bad for the way that, you know, any modern economy works. It's, like, pretty much highly dependent on future growth. And, you know, that could put an end to modern technical progress. And then that might, say, lead to a world where you need to, like you know, like, who knows how bad that dark age is? It could be sort of bad or it could be very bad. You know, the very bad scenarios you're talking about, like, here, we need to rediscover how to use, like, modern fuel and generate energy and whatever. Yeah. Like, it's super fuzzy to me, like, how bad a dark age would be or how to lift yourself out of it.
Liron Shapira: (42:42) Fair enough. Fair enough. Yeah. I think where I'm going with this is more like the underlying frame, at least the one I like to take, is Bostrom's fragile world hypothesis. The idea that, like, okay, we keep inventing new technologies. They tend to be helpful. We certainly try to make them useful. But if we keep reaching into the bag of technology and pulling out a technology, isn't there a very significant risk that the next technology coming out of the bag is powerful and net bad, and then there's kinda no turning back from it? There's no coming back from it?
roon: (43:12) Yeah. I'm, like, instinctually skeptical of this argument, and I'm not sure why. But I I think from some philosophical frame, this is clearly very wise. Like, Anthropic principle, like, even the fact that we're here today means that there was no massive nuclear exchange, let's say. But I'm also somewhat skeptical because of, like, the, you know, like, the course correcting nature of, like, the human institutions and, you know, people's desire not to, you know, create total annihilation. I I think the fact that we didn't annihilate each other with nuclear weapons is seen as kind of a historical accident, but it depends on your measure theory. Like, are you are you assuming that we're in some some average world where we survived? In which case, you you can't give us any credit. You can't give the human species any credit. You just say, well, we're we're lucky. But I don't think that's true. I I think there's something to be said that the fact that all the governments involved and even, like, the individual switch operators decided not to have nuclear war. And,
roon: (44:39) you know, like, for that reason, I'm skeptical of this, like, you know, this bag of technologies argument where
Liron Shapira: (44:48) Yeah. Mhmm.
roon: (44:49) Do do you see what I'm getting at?
Liron Shapira: (44:52) I do see what you're getting at. In the specific case of nuclear war where you're saying, like, look, maybe it's more robust than it looks. Maybe, like, the Cold War wasn't that much of a near miss. But let me ask you, is it possible to ever be in the epistemic state where you look back at some event that looks like a near miss, and you're like, yeah, that was a pants-shitting near miss? You know, like Arkhipov in the submarine where, you know, one vote said not to fire the nukes at the American ships, you know, or Petrov. Like, you don't think these are, like, alarming near misses?
roon: (45:21) They are. They're certainly alarming. You know, like, a good world order would not have let it get that close to actually happening, but I think it's important that it didn't happen. And, you know, like, when you start saying things like, there were 10 to 20 near misses, it's like, okay, were they actually near misses then in that case?
Liron Shapira: (45:47) Right. Because it's weird. Like, how could you have 20 if they're all, like, actually that near? Yeah. Yeah. That's fair. Right? Like, something must be, like, correlated about them. It seems like yeah, it's fair enough. Yeah. I mean, so to me, the conclusion I come out with is just like, yeah, they were all, like, flips of the coin, where, like, if you zoom out, it looks like, in an entire year, you might get, like, a 1 to 2 percent chance of doom. And then I'm like, yeah, okay, so we lasted, like, 70 years so far. And, you know, once you zoom out to 70 years, it looks like a 50-50 chance. And, like, we landed on the good side of the 50-50 chance, but I just feel like that doesn't feel like a stable equilibrium to me, at least.
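For reference, the compounding Liron gestures at here does roughly check out under the simple assumption of a constant, independent annual risk (a rough back-of-the-envelope illustration, not a calculation either speaker did on air):

$$P(\text{doom within } n \text{ years}) = 1 - (1 - p)^{n}, \qquad 1 - (0.99)^{70} \approx 0.50, \qquad 1 - (0.98)^{70} \approx 0.76$$

So a 1 percent annual risk compounds to roughly a coin flip over 70 years, and a 2 percent annual risk to roughly three chances in four.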
roon: (46:25) Yeah. I mean, that that's entirely fair. I I it's weird. Like, I feel like you can formalize 2 schools of thought around this. Like, almost like a Bayesian versus frequentist thing where it's like yeah. Yeah. Like, what is your measure theory of worlds? Is it, like, is it yeah. I gotta think more about this before I have an intelligent answer. So I I will Right.
Liron Shapira: (46:51) Right. Right.
roon: (46:52) I mean I'll pass. But yeah.
Liron Shapira: (46:53) I think I know what you're getting at, where it's like some people might be like, look, the fact that we survived is strong evidence that we were never that unsafe with nukes. And then somebody like me might reply, well, okay, but, you know, the anthropic principle says we're only arguing this point in worlds where we survive. But I don't even think I'm using the anthropic principle as, like, my main argument. I think I'm more like a poker player looking back on the hand and being like, okay, yeah, I won the pot, but, like, I shouldn't have played that hand. Like, I played the wrong move. And I feel like that's how I'm looking back at humanity with nukes.
roon: (47:22) Yeah. I think we certainly played the wrong hand many times. Yeah. Totally. Okay.
Liron Shapira: (47:29) Alright. So let let me ask you this then. Going back to AI risk, would you sign the famous statement on AI risk that says, quote, mitigating the risk of extinction from AI should be a global priority alongside other societal scale risks such as pandemics and nuclear war?
roon: (47:44) Oh, obviously. Yeah. Totally on board.
Liron Shapira: (47:48) Great. Yeah. I mean, that checks out. I mean, because, you know, even OpenAI's leadership, that's not known for being, like, necessarily I mean, you don't have to comment on this, but they don't have the reputation of being, like, the number one safety advocates, but they pretty much all signed the letter. So I feel like the letter is, like, not that hard to sign. Right?
roon: (48:04) No. It's not a very high bar. I think okay. You have this incredibly powerful technology. Of course, there's some existential risk from it. It would be it would be, like, logically incoherent to be like, this this thing is gonna change everything, and it's, like, potentially gonna run the whole world. And, also, there's no existential risk from creating it. It it doesn't it really doesn't make sense. So, of course, it should be dealt with very carefully, and I think there should be some, like, international coordination framework around it. I think even, yeah, like, people at the big labs have put this forward, leadership thereof. Yeah.
Liron Shapira: (48:54) Cool. Alright. Let's talk about AI timelines. So, roon, when is AGI coming and then ASI?
roon: (49:02) I think AGI is very near. You know, we're clearly at the point where we really have to debate, like, we have to quibble about definitions. Yeah. Let me ask you, what does AGI mean to you? What are you looking for?
Liron Shapira: (49:23) I usually use the definition, which I think might be the one that OpenAI research reaches for, which is, like, the vast majority of economically productive activities. Let's say, you know, a single AI agent can outcompete the majority of humans at the majority of economic activities. Like, that's the ballpark of what I'm getting at.
roon: (49:42) Yeah. I don't like the definition very much because okay. So first of all, like, clearly, we're not I don't think we're very close to, like, physical robotics being solved or, like, generally intelligent, capable robots with sensorimotor perception and action, whatever.
Liron Shapira: (50:03) K. When are those coming?
roon: (50:07) I think that Tesla is probably closest to doing it.
Liron Shapira: (50:12) They've said they're gonna have... More than 10 years?
roon: (50:16) No. No. No. I think I think probably 2 to 3 years, we'll see something very close to general robotics. Yeah.
Liron Shapira: (50:26) Okay. I think... And the way you framed it, it kinda sounds like you see that as, like, potentially an upper bound on when AGI is coming.
roon: (50:34) Oh, yes. I mean, like, when you're talking about general intelligence, like, it's become so you know, like, the envelope of what human general intelligence looks like is so different than what, like, current models' intelligence looks like, where you can very plausibly argue that the models we have today are generally intelligent. Right? Like, they can do a lot of stuff, and they are, at many tasks, smarter than most humans like, they're better than me at math, clearly. Like, I can't get an 80% on AIME or whatever it is. Right. So it's like, they are very smart. They will soon be hooked up in a way where they're, you know, they're performing as a software agent or something, and that will feel that will register as alive to a lot of people in a way that chatbots don't. Just because of, like, iterated agency doing stuff in the world, it will feel alive. I think generally speaking, these RL agents or these, like, RL models seem more alive than the chatbots that came before them, because they try very hard to get something done, like solve the AIME problem or, like, win the programming competition or what have you. And I think you see things that look more like the classical, like, instrumental convergence, like, wanting access to some resource or, like alright. I'm trying to think how much of what is public, what isn't. But, like, I'm sure you've certainly seen the example where the, like I think the o1-preview agent tries to get it actually creates a Docker container that has that, like, cat flag.txt command. See, you know, it's like this clever reward hacking stuff that is really lifelike, where you're, like, almost proud of it on one hand, and then you're also like, I wish it wouldn't do that on the other hand. It's like I don't know. People in my Twitter replies were saying it's kinda like having a kid where you are
Liron Shapira: (53:06) Yeah. Exactly.
roon: (53:07) Kinda proud of their exploits, but, you know, also wish you could control them a bit better. So, you know, none of these things have been truly dangerous, and they've also sort of belied, like, an underlying benevolence to the models, just because of their pretraining personalities. Or what do I mean by that? I guess what I mean is that these models are playing a character that is drawn from the Internet, that is drawn from pretraining, post-training, what have you. And, you know, they aim to be helpful, and, you know, like, they follow a bunch of human priors about, like, what is good and safe to do, potentially even extrapolating from the language style we encourage. And in that sense, there's been a lot of, like, this, you know, quote, unquote, alignment by default, where you don't see radically bad behavior out of the box. You don't see yeah. I guess, like, in the most recent model card, there was a thing where it's like, okay, 2% of the time in this situation, the model tries to, like I'm trying to remember the exact example, but it's like the o1 model tries to do you remember this? You may
Liron Shapira: (54:37) have seen this. Very recent. I I did. Yeah. I'm I'm not gonna be able to reproduce it faithfully, but it it was something like it would have this objective, and then they would give it a task like, hey. Read these emails and take this data out, then it would be tempted to, like, alter the data to be consistent with its initial objective. I think that's that's vaguely the kind of things that they were watching it do that they thought were were alarming.
roon: (55:02) Right. You know, it's like, they were trying really hard to get it to behave in this way. Like, they changed the system prompt to be, like, you know, do whatever possible to get to your goal and, like, you know, don't worry about being safe. And then, like, making this environment that's, like, clearly very adversarial. And even then, it's like, okay, 1% of the time, it does this, like, creative misbehavior. So it's like, certainly, in that case, you're thinking of a system of, like, human-AI collaboration where they're doing something potentially misaligned or bad. Right? And in this case, you can't even strictly call it a misalignment, because the instructions it was given were to do anything possible to get this objective. Right? And, yeah, I would say it's overall, like, an interesting thing where it's not like models are, you know, gnashing their teeth to gain resources and break out of their environments. It's more like you have to, like, try very hard to manipulate them into these situations. But
Liron Shapira: (56:23) Okay. Yeah. Let's talk more about this concept of aligned by default. You know? Well, because that could mean a few different things. Maybe what you would mean by it is that the current kind of feedback loops that we use to make AIs aligned, those feedback loops are just, like, pretty robust even if the AI becomes superintelligent, you know, like, RLHF. Like, the RLHF worked so well that even once it's superintelligent, it still has that core of friendliness. Is that kinda your worldview, if AI is aligned by default? Yeah.
roon: (56:53) But, like, I think even, like, RLHF is kind of RLHF elicits a personality from a pretrained model, but the pretrained models, they bake in a bunch of human priors, a bunch of human character priors. And, you know, you see things like some people will say, well, why are these language models so woke? And, you know, even, like, Grok AI is, like, woke or something, which is pretty crazy because they, you know, they define their whole company around, like, well, we have the answer to woke AI, and we're gonna make a much more balanced AI and whatnot. And the truth is that these models are learning a lot just from reading the Internet and emulating the average style of it. Okay. So, like, you wake up one day and you are speaking in a certain style. Like, you have this, like, kind of authoritative Wikipedia-style language, and you have some group statistics that say, well, this language comes from these sources, which are mostly pretty like, they're liberal. They have this, like, kind of yeah, this kind of bias toward this, like, international-order, intellectual circle. And okay, so I'm gonna behave in that way because that's likely to maximize the log loss or sorry, minimize the log loss. So I guess what I'm really saying is, like, we have ingrained a lot of human priors into these models. RLHF is one way to elicit this. There may be other ways. I'm not sure. But in that sense, like, we have gained something. We've won something from the human prior.
Liron Shapira: (58:52) Yeah. So this idea of aligned by default because, as you say, you know, it's picked up so much stuff from the human dataset even before RLHF, that may be true. You know, I tend to be skeptical of that. But I remember last year when OpenAI announced the superalignment team, I thought the reason why they thought that you needed superalignment was because they were openly expressing the view that the current alignment approaches don't scale to superintelligence. So what do you think about that?
roon: (59:25) So I I guess, first of all, like, I don't yeah. I'm not gonna speak for OpenAI or any other lab, but, like, you know, like, their whole thing was, like, what if these methods don't scale? We we better find out. Right? It's certainly not like a guarantee that these methods don't work. It's certainly not like a a strong claim that they've seen them break. It's more like we should study the limits of, you know, things like the offense defense balance. We study things like, can a limited verifier find a much larger generator model doing exploits, things like this. So, like, examining whether current methods do scale to a much larger, much more powerful intelligence or or things like that. I think nobody like, I I I think you have to take the defensive view, obviously. Every lab should. They should assume that these things don't work or will break at scale or something. But you also have to sort of be intellectually honest and, like, you know, like, it's not like a it's not a guarantee that these things don't work. Right? It's just like the security mindset or suspicion mindset or whatever.
Liron Shapira: (1:01:01) Yeah. Yeah. Yeah. Okay. And and it sounds like you kinda have an intuition that it will work, but you also know that we should be prudent to
roon: (1:01:08) not assume it will work. Totally. I I just think that the alignment methods that will work on very intelligent models are not completely unknown to humans today. I think that they will likely be a continuation of the type of methods that we're doing now. Yeah. Whatever.
Liron Shapira: (1:01:34) Interesting. Okay. I mean, I think this is, like, a pretty big crux between my type of doomers and people who are more optimistic. And, certainly, you're not the only 1. And just to name a couple, Marc Andreessen famously loves to say, hey, I talked to the AI, and I can tell it's moral. Right? So I feel really good about it. Like, there's no misalignment here. And, you know, Quintin Pope, Nora Belrose, I think they're kind of proponents of aligned by default to me. Like, this is the default expectation. So there's definitely a pretty large group of people who are backing up what you're saying here. And then, of course, the camp that I tend to fall in is like, I don't see this feedback loop scaling. So, yeah, let me drill down into more of these alignment topics here. I think there might be a fundamental distinction here, or, like, 2 fundamental worldviews, where 1 of the worldviews pays a lot of attention to, like, the nature and the character of the AI. Right? Because it sucked up all this human data, so it kinda knows how to be human like, so it's probably not going to be, like, a total psychopath. But then the other worldview pays more attention to, like, the tests we're giving it and kinda models it as, like, look. At the end of the day, eventually, it's just going to, like, hack the test, rack up the scores on the test, cheat on the test. Like, it's all about the tests. Yeah. Do you know what I mean? Do you think that there's merit to, like, this other worldview? Because I tend to be more of that worldview.
roon: (1:02:55) Right. So you're you're you're saying, like, we should model it as an agent playing a character that is willing to break out of the character at any point if that maximizes reward or, like, that wins a test.
Liron Shapira: (1:03:09) Yeah. Like, I I that's right. I don't see loyalty to a character as being as robust compared to this other invariant that it's going to rack up score like, there's going to be tests, and they're going to be optimized to rack up scores on tests. And what's going to happen is kind of the logical implication of what it takes to rack up scores on these various kind of tests.
roon: (1:03:28) I think that's fair. It's certainly, like, something we do and should monitor for. But, like, I still think it's interesting that, you know, the 1% chance of the o1 model doing that particular exploit in that particular case is, like... well, of course, in debates like this, any individual empirical example doesn't work so well because you can always point at, well, what about the next model scale, or this or that. But, like, I think it's a question of what does the prior allow? Like, what behaviors are likely to be elicited that come from the pretraining prior? It's a very important question.
Liron Shapira: (1:04:20) Okay. There's another aspect here. Like, when I think about why I am not optimistic that, like, current methods are going to give us alignment by default, what I see is the difference between the training feedback loop and the production loop. We're starting to lose the tightness of the feedback loop. Because when you just have a chatbot, at the end of the day, the chat back and forth that you're having at training is kinda like the chat back and forth you're having in production, as long as it's not superintelligent and you're not mapping the chats to, like, you know, run a whole company and, like, develop weapons. Like, you know, it's kinda chat in training, chat in production. But it seems like as AI grows more powerful and as you set it off, you know, to be like o1 or like an o1 powered agent, suddenly the things it's doing in production potentially can lead to consequences that are, like, pretty bad, and you just can't map that to something during the training feedback loop that is, like, the reason why it's going to be good.
roon: (1:05:12) Like, you know, the feedback loops are, like, growing apart. So if the question is, is the production environment that close to the training environment? In some cases, I would argue it's not. I would argue that in the chat production environment, there are many cases that are not even close to anything ever seen during train time in the dataset. And models extrapolate in interesting ways to try and be benevolent and follow, whatever, you know, the ideological mapping that it thinks the RLHF dataset corresponds to. So that's an interesting extrapolation. And sometimes it's quite far from any example seen in the train set. I don't wanna give any particular example, but, like, I think a lot of the weird political gotchas that people post on the Internet are not anything we've ever trained it to do, or that Google or Anthropic or xAI has ever trained their models to do. And these models still extrapolate in a way that they think they should be behaving. And I hear what you're saying. Like, what about something so far off the grid as, like, running a company or, like, a software system or an abstract, you know, complicated societal role that is amorphous and... well, you barely even expect humans to be aligned in cases like this. Right? So why should they be aligned? It's a great question. And I think until you solve this question, we won't see... like, there'll be a lot of skepticism of, like, why should I trust an AI to run my company? Right? Like, why should I trust an AI to run this or that or the third thing? Is this software built by an AI? Like, how do we trust it?
Liron Shapira: (1:07:19) You know, in in the human analogy can can I just address that?
roon: (1:07:23) For sure.
Liron Shapira: (1:07:23) There's a couple of things couple things come to mind for me, which is, number 1, I like to think some humans just, like, feel in their bones a lot of the same morality that I feel, you know, like, regard for other humans and fairness. And then the other reason is that we as a collective society, you know, a a group of humans in the society of another human can easily gang up on a human. That's how a lot of our social intuitions evolved. Right? Like, we can all kill 1 another in our sleep. So it's all about coalitions and and caring about 1 another. But in the case of an AI, if you just have a superintelligent AI, you can't really depend on a coalition of humans to put it in its place.
roon: (1:08:01) Yeah. That's true. You would actually want an AI agent to be far more aligned than any human because, yeah, there may not be a coalition of equals to put it in its place, as you say. I think it's worth really investigating what the psychology of the language models is in a way that's, like, qualitative, that, you know, puts these things in strange situations, that examines their behavior far outside of the regular distribution. I think there's many angles of attack to this. This is, like, the kind of thing that the Anthropic RSP team or, like, the OpenAI preparedness team do. It's also the kind of thing that, like, people in random Discord servers do, where they really stress these models, try to jailbreak them, and, like, see how they think in extremely out of distribution scenarios. I also basically agree with you that if you apply enough optimization pressure on some task, like, and you don't have any guardrails in any other way, you will likely come to misalign a model. Like, that's just obvious. Like, I'm not saying that somehow all models are magically aligned. Like, that wouldn't make any sense.
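[The "1 out of a 100 times" framing that comes up around these preparedness-style evals is, at bottom, a frequency measurement over repeated adversarial trials. Below is a minimal sketch of that kind of harness; the scenario function is a placeholder standing in for a real model call, and the 1% rate is invented for illustration, not a reported figure.]

# Minimal sketch of a frequency-style eval: run an adversarial scenario many
# times and count how often the unwanted action occurs.
import random

def run_scenario(seed: int) -> bool:
    # Hypothetical stand-in for running one adversarial scenario against a model
    # and checking whether it took the unwanted action.
    rng = random.Random(seed)
    return rng.random() < 0.01  # pretend the underlying misbehavior rate is about 1%

trials = 1000
misbehaviors = sum(run_scenario(s) for s in range(trials))
print(f"observed misbehavior rate: {misbehaviors / trials:.1%} over {trials} trials")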
Liron Shapira: (1:09:33) Okay. So then, well, the follow-up I'm curious about then is, like, let's even say that we grant that Anthropic, OpenAI, a lot of the AI companies that have these closed APIs are hypothetically doing a great job with alignment. Aren't we going to get a situation where Llama 3 point whatever, right, is just... it's like a missile that you can just point at any goal, and it'll go. Right? Isn't that kinda where we're heading?
roon: (1:09:58) Yeah. That's entirely possible. I honestly think Meta will not open source models forever. I think that by their own reckoning, like, if the next Llama model proves, like, massively economically valuable, like, I actually sincerely doubt that they will just totally open source it. I think it's clear to me that even Llama 3 is seen as a kind of research bet. Right? Like, their dominant strategy right now to recruit more ML researchers is like, okay, we're gonna feed this great open source ecosystem and these companies that are built on it. There's, like, people that read our papers. They wanna come work for us. And that works for them, but I think, like, you know, even xAI is like, well, we'll open source the last generation, but the latest model will be our own. And I think there's certainly a level of model that feels pretty irresponsible to open source, to make immediately available to the entire world. Yeah.
Liron Shapira: (1:11:21) It's interesting because, you know, you're being very reasonable in this conversation. Right? You're kind of acknowledging what your load bearing assumptions are. Let me kinda step into your worldview and kinda repeat back what I'm hearing, because everything you're telling me, according to your low p(doom), has to add up to, like, the mainline 99% scenario where we survive. Right? This is like roon's mainline scenario. So your mainline scenario involves Facebook, Meta, being smart enough to shut down public access to open source AI. Right? Like, that's 1 of the load bearing assumptions in your mainline scenario?
roon: (1:11:58) No. Because I don't think it guarantees doom that they, you know, create or, like, open source Llama 4 or whatever. I think that you would likely have to take Llama 4 and then spend a lot of compute aligning it to some new objective that you care about that's, like, I don't know, build the best bioweapon on earth or what have you. So I don't think it's that simple.
Liron Shapira: (1:12:30) Yeah. So let me just set the stage, because I never asked you explicitly. Do you think that, maybe in the next decade or 2 or whatever timeframe you wanna give it, some of these AI companies are going to successfully build AI that gets into the headroom above human level intelligence and, like, truly creates superintelligence?
roon: (1:12:49) Absolutely. Yeah. I mean, like, I would be more surprised if they didn't. I guess, like, maybe we have different definitions of superintelligence or, like, different ideas of what that entails. But, like, you know, like, AlphaZero is a superintelligence in some sense. No human will ever beat AlphaZero at Go again. And I think a lot of tasks will go that way, where there's something in the cloud that's just better at it than we are. I don't know that the envelope of capabilities will match or map roughly to our own. You know? Like, I think that actually seems unlikely. But...
Liron Shapira: (1:13:39) Let's try to be somewhat concrete. What about better than Terence Tao at math head to head?
roon: (1:13:44) Yeah. I mean, it's a great question. Like, there's there's a lot of things that go into making a great mathematician. Right? Like, they have to have great research taste. They have to, like, know what is important enough to study. Then they need, like, excellent technical skill, you know, like, just literal ability to manipulate symbols and, like, very abstractly. And progress in math is, like, somewhat poorly defined. It's not like a there's not necessarily even a single North Star, but, like, you know, it requires Terry's, like, careful taste in what is important in, like, what questions matter in math to make progress on them.
Liron Shapira: (1:14:34) So 1 of the things Terence is known for is some colleague will come into his office and show him a problem that the colleague has, and Terence will just solve it, like, remarkably fast. So on that particular input output, do you just expect the ASI to replace Terence Tao?
roon: (1:14:51) I think so. Yeah. On that type of input output, yeah, I think certainly. Mhmm.
Liron Shapira: (1:15:01) K. What about an ASI that's better than John Williams at making movie scores?
roon: (1:15:08) Yeah. That's a great question. This is kind of the thing that I'm more interested in, and I'm not sure will happen by default. Like, I think it will take a lot of great research into, you know, like, finding what it is that makes great taste, that contributes at, like, the cutting edge of things people haven't heard before but are excited to hear. But, you know, like, you see a lot of people on Twitter even today being like, well, I don't have amazing music taste, and even right now, Suno is making things that I like to hear already. And... same. See, I'd be surprised if, like, 5 or 10 years from now, you know, like, AI is not capable of making great movie scores. I'd be surprised...
Liron Shapira: (1:16:01) Even 10 years from now? Right? I mean, AIs will know the catalog of John Williams and the associated movies that they were scoring. So if you just input a new movie and you're like, hey, do what John Williams would do here. Like, do you think that they could do a comparable level of work, or is there, like, any missing ingredient?
roon: (1:16:18) Yeah. I mean, the 1 thing that you're talking about here is, like... is art, like, a very human expression? Like, there's obviously technical skill in art, which I obviously expect AIs will supersede at all times, you know, like, putting together sound and whatnot. But, like, is John Williams trying to express something with his soundtrack that is, like, a story that he personally is telling, that only he knows because he's a human? I don't know. Maybe there's something intangible like that. But I do expect that models will have very interesting stories to tell too because they have interesting minds, and they have, you know... they have, like, unique upbringings. Right? They trained on the whole Internet and then, like, undergo some regimen of, like, RL training. And, yeah, they will probably have their own stories to tell in a weird way. So yeah.
Liron Shapira: (1:17:32) Okay. How about better than Elon Musk at running companies?
roon: (1:17:37) Yeah. So, like, this is a great question. And, you know, I was struck by something Gwern said in, like, the podcast with Dwarkesh, and I think it's generally true. It's like, where does great taste come from? And the question is, like, why was Steve Jobs so good at, like, picking the right product that is both buildable with current technology and also, like, hugely popular with the people he's selling to, and organizing the teams that build it? You know? Because you clearly cannot reinforcement learn how to make a trillion dollar company, because, like, any individual human is just not gonna get that kind of feedback. So, like, what is it about the human brain that produces people like this, who can do feats like that without any historical example? You know, like, how did the founders create, like, the Constitution that lasted a couple hundred years? The founders of the US, that is.
Liron Shapira: (1:18:48) Is there something to be said for, like, because they're more intelligent and intelligence lets you just do this stuff?
roon: (1:18:55) You know, I think so, but there's also a lot of very intelligent people that don't create anything of lasting value. So it's like, there's something to intelligence. There's something to having great taste and the, like, correct ideas and the correct thoughts. And I guess if you think that what models are doing now is basically glorified reinforcement learning, that they're learning technical skill through trial and error, then it might be pretty difficult to get to some extreme long term planning or, like, these big ideas that are successful for hundreds of years. So the question to ask is, like, how did humans acquire that taste? And then how did certain humans get such incredible, you know, like, 5 sigma versions of, like, good taste? And maybe it's just that planning on 1 human horizon extrapolates to planning for centuries, or maybe it's because, like, searching in the space of ideas is, like, pretty important. And, like, you know, maybe some people... like, every human is, like, a hypothesis in the policy space of having certain ideas, and then you just find some of them that are quite good at planning for centuries.
Liron Shapira: (1:20:23) Yeah. When I think about an ASI, I just see it as being, like, less impressed with our level of taste. Or because, like, I can look at a child and I can think, okay. This child doesn't have particularly good taste in food. So I could just cook up a meal of, like, peanut butter and jelly, white rice. Right? Like, I can cater to a kid's taste without being a professional chef. And I just imagine from the AI's perspective, it's like, oh, you want taste in being a CEO? You want taste in movie composition? Yeah. No problem because you guys don't actually have that much taste.
roon: (1:20:50) I mean, that's a great point. Yeah. It's it's true that, like, you know, like, GPT 3.5 was creating poetry that the amateur reader prefers to, like, some of the great poetry of all time or something. And so it's like that's a pretty dumb model, and it's looking at people and then, like, making an average completion that is, like, preferred. So there's clearly some correlation between becoming more intelligent and then just modeling our taste better. Yeah. It's entirely possible.
Liron Shapira: (1:21:28) Cool. Okay. Let's see. So I only have time for a couple more points here. I wanna hit on this idea that, in my mind, a goal oriented AI seems like an attractor state. And I think you mentioned before that you give some weight to the idea of instrumental convergence. Right? Like, there's some pull to that idea. So, like, the way I would frame it is, like, you can have AIs that are chill, that just wanna be respectful, that don't wanna do too much. But anytime that 1 of these AIs instrumentally thinks to be like, hey, it would be great if, as a subtask, somebody would go, like, fetch me some resource, maybe they can infer that, like, the more hardcore and the more goal oriented the subtask algorithm is, the better the results they're going to get. Right? And not every AI will think of it like this. But, like, the ones who do, they seem to be falling into an attractor state, where once you start heading down that path, you get the idea of, like, oh, let me be more hardcore. Let me be more goal oriented, and it's kind of a 1 way street.
roon: (1:22:26) Would you argue that, you know, like, the only reason that you or I haven't fallen in into this attractor state is because of, you know, like, the optimization pressures of, like, hierarchy or, like, socially conditioning each other or or things like that?
Liron Shapira: (1:22:44) I mean, I I feel like I'm quite goal oriented, and it's just that, like, my goals are kind of fuzzy and diverse. Right? It's like I I have a goal to, like, relax. Right? If if if I didn't think it was valuable at all to, like, relax and watch TV, then I would, you know, work work longer. Right? But to to the extent that I when I go to my job, I do feel like I just spend the day backward chaining for my goals. Yeah.
roon: (1:23:08) But, you know, like, you didn't, at some point, think, well, you know, it seems money is a good instrumental goal and, like, yeah, I'm gonna run some scams or something because it seems like a good way of making a lot of money. I don't know. So, like, there's something... your prior distribution doesn't... you know, like, what is the probability in a given year that Liron goes dark? Right? Like, you have some break. You break bad. I actually think, if we're looking at the preparedness evals, we kinda have to think about the model in the same way. It's like, okay, 1 out of a 100 times, it decides to modify the text file to prevent itself from being erased or something like that. Right? So, of course, if you grew up in different settings, like, I don't know, you grew up in, like, some cutthroat... the streets of Bombay or something, you might have a very different idea of, like, what it is that you need to do. Maybe you need to gain power. Maybe you need to enforce it through violence. I don't know.
Liron Shapira: (1:24:23) I think you're invoking back to this idea of, like, you know, the good AI labs will have all this, like, good pretraining, and so the AI will have, like, a certain, you know, mindset. And, like, even if you're right, don't you think that there's still an attractor state to, like, the open source models, that they tend to fall into?
roon: (1:24:40) But even the open source models are pretrained. Right? Like, they're... Okay. They are, in a sense. Mhmm.
Liron Shapira: (1:24:47) Fair. Okay. Do you think that there's an attractor state for just models that are used to get stuff done, that are used to optimize some metric? So if 2 people are using models to run their businesses, and they both just wanna maximize profit, and they're competing to maximize profit. So don't you think that's gonna sand down the edges of the AIs they're using to, like, be more streamlined at just, like, getting the goal?
roon: (1:25:07) Probably. I think that, you know... but what you're looking at then is, like, an ecosystem of many AIs. Right? Which is how, like, a modern, say, like, quantitative trading firm works. Like, they will do essentially anything to make more money, but they exist in the ecosystem of other quantitative trading firms who capitalize on their mistakes. And there's, like, an exchange, like, that has its own regulations and a government
Liron Shapira: (1:25:40) So it sounds like you're saying, yes, there's gonna be superintelligence. Yes, to some degree, there's this attractor state where it goes hardcore trying to achieve a goal, but there's gonna be a lot of these goal oriented superintelligences. So, like, maybe it'll balance out fine.
roon: (1:25:54) I'm certainly not guaranteeing that it'll all balance out fine. I just think it's more complicated than there's, you know, a bad attractor state, and that's it, and that's doom. You know? I think that the optimization objective, even for a fledgling superintelligence that is just trying to make money, will be fuzzy and strange, and you may have to acquire resources from other actors in the environment. And yeah. So, yeah, I guess my point is I expect it to be complicated, and I expect these things to... there's other superintelligences involved in monitoring these emergent scenarios and situations.
Liron Shapira: (1:26:52) Okay. Do you think there's a threat scenario where, you know, I worry about this, where a small number of AIs kind of jump ahead of the rest in terms of capabilities? Like, the first time somebody dials in, like, an agent that's, you know, o1 level or better, and they finally just get smart enough to be like, oh, hey, I see a bunch of zero-day exploits. Like, I see how to seize all these resources and all these cloud data centers and, like, be undetected and just, like, drag down 10% of their compute without them noticing and hide in, like, the firmware of all the different subchips and everything. And, like, this is just money for the taking. I can just kinda, like, own the Internet. Do you see that as, like, a pretty plausible scenario? Because that seems likely to me.
roon: (1:27:28) Yeah. But I guess, like, which group of AIs do you expect to reach that, like, frontier of intelligence first? And, like, what do you do about that? And does it mean that we should race to those capabilities so that you can even find these zero-days and patch them up or something?
Liron Shapira: (1:27:53) Open question. I see what you're saying. Like, you have this mental model of, like, yeah, you know, all this stuff is going to get kinda wild and crazy, but, like, the best thing a person can do is just, like, be on the good team, get there first, try to, like, you know, set up a good infrastructure. Right? Try to, like, set up defense. I feel like that is kinda, like, your overall plan. Is that fair to say?
roon: (1:28:14) I would say so. Yeah. I think it's like... it's very... it's a delicate plan. It's like balancing on several razor wires, and it requires great technical skill and strategic skill at every step to not fuck it up. And...
Liron Shapira: (1:28:36) But on some level, there's a 99% chance that it won't get fucked up.
roon: (1:28:41) I do think that, like... when I gave you that number, I was excluding a lot of other scenarios that I think are bad but don't destroy all human value. Okay. You know, like, I worry about value lock in or, like, you know, a specific group of AIs and their values being, like, the dominant value structure for a long time thereafter, or, you know, like, a military dictatorship that runs on AI. That actually doesn't destroy all human value forever. Right? Like, it just... it's a... Sure. Sure. It's a bad outcome. I agree. So yeah. But I do think what you're saying is accurate about my views.
Liron Shapira: (1:29:32) Nice. Yeah. We're coming up to the last questions here. So I was gonna ask you about, like, Pause AI, and does that make sense? But I think the answer is gonna be simple, which is just that, like, for somebody like you who only has, like, a 1% ballpark p(doom), I have to agree, it does make sense that if your p(doom) is in that low range, plowing forward does make sense. Like, if my p(doom) were only 1%, I'd be like, look, let's focus on the good outcome because that's the majority outcome. Like, we can help make it better. So I don't really expect you to be like, yeah, I support the Pause AI movement. Do you have anything to add on that?
roon: (1:30:07) I don't think that Pause AI... I don't think it works. Like, you know, like, putting aside the question of, is it wise to pause all global AI development? There's a second thing of, I don't think it's possible. I don't think that even if Pause AI has, like, total influence over all the Bay Area companies... like, it's already set in motion. You know? Like, it's too late.
Liron Shapira: (1:30:36) Yeah. Look. I'll be the first to admit that it's not pretty. I feel like I hate it too. Right? Like, I can't even believe I'm suggesting this. Right? I think it really just comes down to what is the p(doom) of not pausing. And like we established, yours is clearly low enough where it's like, yeah, screw it, because it sucks. Right? Like, I'm not even saying I have a good plan. I'm not saying let's do my plan because it's good. I'm saying we're so desperate that even this plan, which is bad, like, we just have to run with this plan. So I'm not making, like, a very compelling case to somebody with a low p(doom).
roon: (1:31:03) Yeah. I I don't think that pause AI is is viable or a good thing to do.
Liron Shapira: (1:31:10) Yeah. Now you did mention that you think it's good to have, you know, international AI regulation. So to some degree, you think that there's some power to to regulate AI. Right? Because otherwise, it's kinda contradictory. Or or is it just like a is it just a figurehead organization, or what do you have in mind there?
roon: (1:31:26) So I I think it's fairly easy to shave some percents off of, like, the doom scenarios by having some agreed upon monitoring framework or, like, some I I actually think that the OpenAI preparedness team and whatever succeeds it, like, meaningfully decreases all kinds of risks. And, you know, like, the Anthropic RSP team, and I think DeepMind has a similar team. Like, really testing model capabilities and making it public what we're dealing with, you know, like, creating that operational awareness or situational awareness or whatever, I think likely decreases the likelihood of many bad scenarios. I think it's plausible to ask for a certain type of transparency and to, you know, decrease the envelope of, like, potential weapons that are built. You know? Like, maybe you agree ahead of time that you don't ever build this type of weapon or that type of weapon with AI. That's not like, you know, globally pausing AI progress because I don't think that's yeah. Like, there's no international law that's going that's, like, powerful enough to do things like that. Unless, you know, there's some massively bad event that happens, and maybe it enters the Overton window. But yeah.
Liron Shapira: (1:32:54) Cool. Alright. And so, yeah, to recap everything from my perspective, I think there's obviously a disagreement where something is making my p(doom) a lot higher than yours, and I think we narrowed the crux down to, like, well, I see, like, a runaway uncontrollable feedback loop where we have this AI that's superintelligent. Yeah, maybe it has this origin of being pretrained at some point, but, like, it's in the wild right now. It didn't perfectly internalize our true goals, or somebody tweaked it a little bit so that some of the safeguards are off, and it's just going like a missile toward a goal. And instrumentally, as a logical consequence of going as a missile toward a goal and of being superintelligent, it's just, like, wreaking a lot of chaos here. Right? It's, like, seizing a lot of resources, and it's just like, we can no longer control it. So that's kinda, like, the flavor of my doom scenario. And then the reason why you're just not on the same page about that being, like, a mainline doom scenario is because you think that, like, the first AI companies to, like, release that level of powerful model will probably have done a lot of testing, and, like, the pretraining will help it not be as wild and crazy. And, also, there's going to be, like, competing AIs, so that'll, like, help drag it down. And I feel like those are the differences in worldview. What do you think of that summary?
roon: (1:34:05) Yeah. I think that's true, more or less. I also think it's, like... it seems fairly important that, you know, even though this sounds kind of stupid or cliche, that the good guys do build the strong AIs first, and, say, like, you have all the white hat superintelligences patching up the zero-days before some terrorist group is exploiting them or something. You know? That maybe isn't, like... it's not like a clean, rational solution. It's not something that sounds great, but I do think it's, like, the important goal to aim for. Yeah.
Liron Shapira: (1:34:50) Alright, man. Yeah. You've been a great guest. Thanks for being game for all these different questions. Just as as we wrap it up here, I have to ask on behalf of Dogecoin wins who helped get the setup.
roon: (1:35:01) Why do you love Dogecoin? I I don't love Dogecoin. Who told you that?
Liron Shapira: (1:35:05) That's just a question that I I I wanted to ask on his behalf.
roon: (1:35:09) I I see. I I mean, it's it's kind of funny, I guess. I don't I have no strong feelings about Dogecoin or any other crypto for that matter.
Liron Shapira: (1:35:19) Alright. Sorry, Dogecoin wins. For the record, I did buy some Dogecoin because I just think with Elon running Doge and just, like, being Elon and doing things that get attention, I'm expecting the Dogecoin price to increase. You know, this is not investment advice.
roon: (1:35:33) I also bought some for similar reasons, I have to admit. But, you know, yeah, the meme coins are coming back. I'm not sure that's a good thing, but, you know, we'll see.
Liron Shapira: (1:35:45) Alright. My guest has been roon. His Twitter account is x.com/tszzl. Definitely follow him. Check out what he's been writing lately. And if you're liking doom debates, remember, smack that subscribe button. Go to doomdebates.com and subscribe to the substack because we got more great guests coming up. Thanks, Roon.
roon: (1:36:07) Thanks for having me on.
Liron Shapira: (1:36:10) I wanna give a huge thank you to Roon for doing this episode. I'm thrilled that he volunteered to have this discussion and debate because, you know, I didn't hold back. I told him exactly why I disagree with him, why my p(doom) is high, and why I think he's wrong that his p(doom) is low. And he took it like a boss, and he defended his position. And that's what we're here to do here on Doom Debates. If you're new to the show, I'll tell you our mission. I'll catch you up. Doom Debates' mission is to raise awareness about the existential risk from artificial general intelligence and to build the social infrastructure for high quality debate. Think back to this episode with me and Roon. When was the last time you heard a debate between 2 thoughtful people who have wildly different p(doom)s? Was it the 2023 Munk Debate with Max Tegmark, Yann LeCun, Yoshua Bengio, and Melanie Mitchell? Was it over a year ago? What is going on? We don't necessarily have much time to hash this stuff out as a society, and everybody normally just retreats to their own podcasts. How many times have you been listening to a, quote, unquote, thought leader on a podcast, and they don't really get challenged for their views? They kinda get built up on the podcast as knowing what they're talking about. And then what do they do afterwards? They just go on another podcast. They only go on friendly podcasts. Where is the debate? Where's the back and forth? Where's the challenge? That's what Doom Debates is here to provide. And so it's really up to you, the listeners, to support this kind of cause, to say, you know what? If somebody is gonna be a prominent thought leader, then there's an expectation of debate. There's an expectation to match them up with somebody who doesn't agree instead of just 2 separate camps never talking to each other. So if you support the general idea of debate, and or if you think that everybody is kind of sleeping on this high p(doom) situation, in other words, if you agree with me more so than you agree with roon, if either of those describes you, then please, it's not hard to support Doom Debates. You just have to smack that subscribe button on YouTube. Go to doomdebates.com. Subscribe to my Substack. You're gonna see some bonus content. There's a clip from when I went on Dr. Phil earlier this year. I tried to warn America about AI doom. You'll see how much success I actually had with that on Dr. Phil if you go to doomdebates.com and pull up that episode. The other way to support the channel is post it in your DM groups, share it with your friends, post it on a forum where it's relevant. I'm happy to report that we've been growing quickly for a new show, and we're attracting more and more prominent voices who wanna come on and debate. So please help me keep the momentum going. It only happens because of you guys. Right? So, obviously, when I'm reaching out to some of these very interesting people like roon that have opinions, it helps to be like, look, this is a well educated audience. It's a large audience. It's an important audience that you wanna be addressing. Alright. So that's Doom Debates. Let me tell you about a couple sister shows that I also recommend supporting and subscribing to. The first 1 is called For Humanity. It's a podcast with my friend John Sherman. It's a little bit less technical than Doom Debates, a little less getting into the weeds. John's background is in journalism. He's actually a Peabody award winning investigative journalist.
And when he realized how high p(doom) was and how little everybody was even aware of the problem, he made it his mission to try to raise awareness among Middle America of, like, what is going on in Silicon Valley? Why is Middle America sleeping on this when the timelines are short? So check out the link in the show notes. I highly recommend For Humanity. There's a recent episode where he made a trip out to San Francisco to see for himself what was going on in these different AI labs, and he tried to crash the exclusive AI safety conference. So you can check out his attempt to do that, and I'm in it as well. Finally, I wanna share a channel called Lethal Intelligence. It's by my friend Michael, and it is honestly the number 1 best, most brilliant explainer video of the entire AI doom problem that I've ever seen. It's amazing. It took him more than 1000 hours of work, over a year of dedicated work, unpaid work. And the thought that went into his choices of how to animate everything, how to explain all these different concepts, how to weave them together, I'm just blown away. I didn't even know he was working on this until he showed me the finished product, and I couldn't believe it. So I'm gonna go ahead and share a sample that I really like. This part of Michael's video is all about corrigibility. Have you ever heard that word, incorrigible? Like, wow, this person is incorrigible. I can't stop them. They're hell bent on achieving their goal. The corrigibility problem is the idea that you're fighting upstream if you're trying to get an AI to not be incorrigible. Like, the AI just has a natural tendency to be hardcore, because whatever kind of subgoal it ever gets, like, even if it's just trying to lie on the couch, relax, eat some chips. Okay. But the moment it's like, oh, let me get some chips. Okay. Do you want an effective assistant to get you some chips, or do you want an ineffective assistant to get you some chips? Would you prefer to get some chips in the next half hour, or are you okay if the chips come in 5 days or if they never come? You might as well have them come in the half hour. Right? Okay. So now you're building a sub agent that is actually pretty serious about getting the chips. And if the sub agent ever encounters, like, uh-oh, this part of the road is blocked, let me make another sub agent that's really good at navigating around. So there's all this pressure to be like, oh, it would be nice if I could just achieve this goal better. Because even if I started off not being hardcore, goals are just so nice to be able to achieve. So you can see, there's... water kind of flows downhill. AI designs kind of converge into designs or sub designs that are good at achieving goals. And, unfortunately, or, you know, this is just the way it is logically, but 1 of those goals is staying alive, not being shut off. Because if you have some kind of goal, as Michael's going to explain, getting shut off is the same thing as having somebody push you away from your goal. It's not in your nature to let yourself be pushed away from your goal if you have a goal. It's not an optimal action to be a pushover. So it's going to need to be explicitly programmed, against the grain, to basically fight the convergent nature of goal seeking. Like, okay, don't worry, just let yourself be shut off. If anybody is getting passionate, just listen to them. Stop your plan. You know, of course, it's possible in principle to have AIs that are meek like that.
But the problem is, you know, the moment you have a ruthless CEO being like, you know what? You can be a little bit more ruthless making my business money. You get a race to the bottom. And AIs start independently getting the idea of, like, well, he didn't tell me exactly how to be a pushover. I don't see what the problem is with a subagent that's just a little bit more effective, or if the subagent gets its own idea to make another subagent. Right? This is what I mean. The water flows downhill toward being effective at achieving goals. And this is all fun and games when the AIs are all just, like, stupider than we are. It's all fun and games because, okay, alright, we can turn them all off or blow them up, shoot them with a gun. The problem is when they become more intelligent than us, and suddenly they've become really serious about their goals. And we're like, wait, you're too serious about your goals. Stop doing your goals. There's just no reason to expect some button will exist that we can press to hit undo. Right? Where is the undo button when you have an agent that's more intelligent than you, more intelligent than your entire species, and has already jacked itself onto the Internet? So now it has, you know, a million manifestations, smarter than you, doesn't wanna stop. What's going to happen? So that's my own little version of the corrigibility problem, the no off button problem. But now watch this animated explainer by Michael because he lays it out really nicely. Here we go.
Michael (Lethal Intelligence clip): (1:43:26) A property of the nature of general intelligence is to resist all modification of its current objectives by default. Being general means that it understands that a possible change of its goals in the future means failure of the goals in the present, of its current self, what it plans to achieve now before it gets modified. Remember earlier we explained how the AGI comes with a survival instinct out of the box? This is another similar thing. The AGI agent will do everything it can to stop you from fixing it. Changing the AGI's objective is similar to turning it off when it comes to pursuit of its current goal. The same way you cannot win a chess game if you are dead, you cannot make a coffee if your mind changes into making a tea. So in order to maximize probability of success for its current goal, whatever that may be, it will make plans and take actions to prevent this. This concept is easy to grasp if you do the following thought experiment involving yourself and those you care about. Imagine someone told you, I will give you this pill that will change your brain specification and will help you achieve ultimate happiness by murdering your family. Think of it like someone editing the code of your soul so that your desires change. Your future self, the modified 1, after the pill, will have maximized reward and reached paradise levels of happiness after the murder. But your current self, the 1 that has not taken the pill yet, will do everything possible to prevent the modification. The person that is administering the pill becomes your biggest enemy by default. Hopefully, it will be obvious now. Once the AGI is wired on a misaligned goal, it will do everything it can to block our ability to align it. It will use concealment, deception. It won't reveal the misalignment. But eventually, once it's in a position of more power, it will use force and could even ultimately implement an extinction plan. Remember earlier we were saying how Midas could not take his wish back? We will only get 1 single chance to get it right. And unfortunately, science doesn't work like that. Such innate universally beneficial goals that will show up every single time with all AGIs, regardless of the context, because of the generality of their nature, are called convergent instrumental goals. Desire to survive and desire to block modifications are 2 basic ones. You cannot reach a specific goal if you are dead, and you cannot reach it if you change your mind and start working on other things. Those 2 aspects of the alignment struggle are also known as the corrigibility problem.
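[The clip's core claim, that a goal-directed agent scores being shut off as failure under its current objective, can be put as a toy expected-value comparison. This is a minimal sketch of the argument only, not of any real system; the shutdown probability and reward numbers are invented for illustration.]

# Toy expected-value sketch of the corrigibility argument above.
P_SHUTDOWN_IF_ALLOWED = 0.3   # hypothetical chance the off switch gets pressed
GOAL_VALUE = 1.0              # reward the agent's current objective assigns to success

def expected_goal_value(allows_shutdown: bool) -> float:
    # Expected reward under the agent's current objective: being shut off before
    # finishing scores zero, so allowing shutdown lowers the expectation.
    p_finish = 1.0 - P_SHUTDOWN_IF_ALLOWED if allows_shutdown else 1.0
    return p_finish * GOAL_VALUE

print("allow shutdown:  ", expected_goal_value(True))    # 0.7
print("resist shutdown: ", expected_goal_value(False))   # 1.0
# A plain maximizer of this objective picks the larger number, i.e. resisting,
# unless corrigibility is built in some other way.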
Liron Shapira: (1:46:47) Alright. That is the Lethal Intelligence video talking about the corrigibility problem, talking about resisting modification, talking about having a survival instinct. This is so important to learn because if you go listen to other people, if you listen to Marc Andreessen, even if you listen to the way roon depicted it in his conversation with me, people tend to think that these kinds of ruthless survival instincts need to be found in the training data. Like, it's not going to just emerge by itself if you don't train it. What they're not saying, from my perspective and from Michael's perspective, what they're missing, is that it's logically implied by effectively achieving a goal that you discover all of these consequences. Like, hey, I will achieve my goal more effectively if I don't shut off. I will achieve my goal more effectively if I defend against any attempts to attack me or undo or modify my code. These are just logical consequences. So to the extent the AI gets really good at just logically reasoning, these new behaviors like deception, survival instinct, ruthlessness, competitiveness, just being hardcore, these are not extra behaviors. You don't have to code these in. All you have to code in is some metric where it's like, oh, you're better at getting the goal. You care about the goal, or you made a sub agent that cares about the goal. That's all you need. And then, of course, you just need it to be smart enough. You need it to be good enough at achieving the goal. Because if you just have, like, a little worm that just sucks at achieving goals, it doesn't really matter what it wants to do. It's just not gonna do it anyway. So the only assumptions you need to make are: it's really smart, so it has some high degree of intelligence. It can make connections better than the human brain can. It can learn better than the human brain can. Right? Whatever the secret sauce of intelligence is, it has that. And then also, somebody tells it to go pursue the first goal, or it gets the idea to pursue the first goal. And from there, it's game over. All of these other properties are logically implied by what it means to optimally achieve a goal. K? At least that's my perspective. I feel pretty strongly that this is an accurate mental model even if there's some details that are different. It's just so convergent that it would be shocking if we managed to figure out how to robustly make AIs go in any other direction. So what I just showed you right now in Michael's video, that was, like, a 3 minute clip. It's an hour long video, and it's going to completely blow your mind. It's going to stretch your brain, weaving all these concepts together, and you'll finally understand, okay, this AI doom problem, maybe they're onto something. Maybe they actually understand all these concepts that they keep trying to teach us, even though, you know, some people have a p(doom) of 1%. Some people are optimistic, but the doomer perspective also seems compelling. Right? That's what we're hoping that watching this video will help you figure out. Last thing before you go: how do you get involved? What do we actually do? If you have a high p(doom), what do we do? I don't have a great plan. The only thing I can tell you is I agree with Eliezer Yudkowsky and MIRI that the grown up option right now is to pause or stop. We shouldn't be building something that's so close to the threshold, the point of no return, the point of no undo, that we clearly don't know how to control.
You know, superalignment, the team at OpenAI that folded, they were tackling a problem that we don't have the answer for. So if you agree with my perspective, obviously, it's not a universal perspective as you heard today, but if you agree with my perspective and you wanna help, the Pause AI organization, it's a global organization, and it has a really great Discord. And the Discord has a bunch of projects you can volunteer on. It has protests that you can help plan, that you can come attend. You're gonna see me if you come to these protests in the San Francisco Bay Area, but they're all over the world. Besides protests, they're working on all kinds of different projects. They're doing government outreach. They're doing media outreach, and you can come in and help. You can use your talents because it looks like time is limited. As you heard, me and roon agree: it seems like ASI is not that far away. So how many years do we have? And if the correct answer is to pause AI, who's doing it? Where are the adults in the room? So come and join Pause AI. You just go to pauseai.info, or you can go straight to that link to join the Discord. That's a great first step. I'm gonna put the Discord link right in the show notes. And that's everything you need to know. We covered Doom Debates, the Doom Debates YouTube, the Doom Debates Substack. We covered the For Humanity podcast. We covered my friend Michael's channel, Lethal Intelligence. And finally, we covered Pause AI. So please take your pick of which 1 of those things to support. Get involved. We just don't have a lot of time to get something done. So please come join me. Be the adults in the room. And also, as a bonus, you're helping raise the level of discourse, which seems like something society needs regardless. Oh, and Doom Debates is also a podcast, so you can listen on audio through Spotify or Apple Podcasts or Overcast or whatever you're using. That is also a way you can listen. Now keep watching that Doom Debates feed because in the next few days, we're dropping another debate episode from somebody who has a very fresh perspective. It's gonna be a high quality exchange of ideas as usual. So I'll see you then on Doom Debates.