The Open Source AI Question - Part 2 | Robert Wright & Nathan Labenz


Dive into an in-depth conversation with Nathan and Robert Wright as they discuss AI's transformative potential, mechanistic interpretability, and the sobering realities of AI alignment research. Learn about the defensive strategies and safety measures necessary for managing advanced AI risks in an open source world. Don't miss the insights on AI-powered VR, and be sure to check out part one on the non-zero feed.

Check out Part 1 of the conversation here: https://www.youtube.com/watch?...

SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive

The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API free for up to 2,000 queries per month at https://bit.ly/BraveTCR

Squad offers access to global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/

CHAPTERS:
(00:00:00) Introduction
(00:07:13) AI in Governance
(00:11:08) Sci-fi doomer
(00:13:58) Sponsors: Oracle | Brave
(00:16:05) The frontier models
(00:20:22) Emergent behavior
(00:23:48) Theory of mind
(00:28:09) Mechanistic interpretability
(00:34:12) Sponsors: Squad | Omneky
(00:38:12) AI Alignment Techniques
(00:42:38) The Sweet Spot of AI


Full Transcript


Nathan Labenz: (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. The real Nathan is in California for Google's IO event today. So this is Nathan's AI voice cloned with ElevenLabs, reading an introduction written after this sentence entirely by Claude 3 Opus. Today, we have a special crossover episode from my appearance on Robert Wright's Nonzero podcast. I want to thank Robert for allowing us to share this. Part 1 is on his feed. This part 2 was originally just for his paid subscribers. In this wide ranging discussion, Robert and I cover everything from the transformative potential of AI powered VR to Eliezer Yudkowsky's warnings about advanced AI risks. We explore the emerging field of mechanistic interpretability. This field seeks to open up the black box of neural networks. We also consider the sobering state of modern AI alignment research. I argue that today's language models are not inherently power seeking. However, as they grow more sophisticated, the risk of misaligned behavior may increase. In a world of open source AI, defense in depth may be our best hope. This means layering many safety measures. It's a fascinating and at times sobering conversation, but it's one that I believe is essential as we navigate this time of rapid change. If you enjoy it, be sure to check out part 1 on the Nonzero feed. Please consider subscribing to support Robert's work. As always, we welcome your feedback at cognitiverevolution.ai or on your preferred social platform. If you're finding value in the show, please consider leaving a review or sharing with a friend. Now I hope you enjoy part 2 of my conversation with Robert Wright, author and host of the Nonzero podcast.

Robert Wright: (2:03) Alright. So, Nathan, I've been meaning to ask you. I'm not sure I've ever heard you explicitly pronounce on the hardcore sci-fi AI doomer scenario, most famously associated with Eliezer Yudkowsky. That's where they take over more or less and squash the human species, or perhaps, charitably, stuff us into Matrix-like pods and let us hallucinate. By the way, when I saw that, I thought, in 10 years, we're wearing virtual reality goggles, and we say, put me in a world where... and I think this will actually happen. I mean, this is gonna be create-your-own-dream in 3D, and ultimately probably an interactive dream. Maybe you could quickly comment on that before we get back to the AIs squashing humans.

Nathan Labenz: (2:53) Have you done the Apple Vision Pro demo yet?

Robert Wright: (2:56) I haven't. Have you done it?

Nathan Labenz: (2:57) Yeah. I didn't buy one immediately because I was like, alright, I'm not sure how much time I'm gonna spend in it. I have 3 kids, and I don't think it's something I need for work as I'm, like, sitting at my computer. It's awesome technology. Truly, there is a part of the demo that hit at a level of... I mean, really, the whole demo is just mind-blowingly impressive. There was one part where they take you through a standard thing. I think pretty much everybody who's done it has seen exactly this little scene where they put you in a happy birthday scene, and you're, like, across the little table, and there's the cake and the candle and the kid sitting in mom's lap.

Robert Wright: (3:34) Singing it. And it's virtual. It's like virtual reality. But it's really a kid's birthday. Right.

Nathan Labenz: (3:41) Well, yeah, in the demo, it's not your kid's but...

Robert Wright: (3:43) It's odd. But, I mean, that's...

Nathan Labenz: (3:44) That's the idea. And you can capture this now with the iPhone. And, yeah, I felt, man, we need to start capturing this sort of content now, because it's just so qualitatively different. I had not experienced anything like it. I haven't actually tried the Oculus 3, but I do have the 2, and it's nowhere close in terms of the level of presence. I think the Oculus 2 was cool, but the presence, the sharpness of the image, the sort of stability of it as you move your head and go around in the space, it is truly futuristic technology. So I think the vision that we could all be in virtual reality, conjuring things for ourselves, the core pieces of that really are there. Certainly, this headset is unbelievable. The ability to say, put me in a world where... basically already exists at low resolution. There are already models. People have seen Sora, which creates video. There are 3D versions of that as well. So environments are definitely, I would say, on the very natural path of what promptable AI is gonna be able to create for us. To refine it to the level of resolution that the Apple Vision Pro supports will take some real compute. That's where a lot of these chips are gonna go: generating all this stuff for us on an ongoing basis. But I do think it's quite realistic. I think AI is a great enabler of some of these kinds of wave technologies where the wave hasn't quite come yet. Crypto being one. I'm not a crypto guy at all. I've never really been into crypto. Not hostile to it, but it just never really was my thing. One of the things that always didn't make a lot of sense to me is, like, the smart contract. The whole problem with contracts is that they're not that smart. It's really hard to specify everything. And what really makes contracts work, in my experience, is the fact that there is some resolution mechanism beyond the contract. So you sort of start there. You also know where it would go if you ended up having to go to court. And so, hopefully, you can just negotiate things to an agreeable settlement. You can't really do that if you put a smart contract on a traditional blockchain with, like, limited code. But, again, put an AI model into that mix, and now you might really have AI arbitration, AI-powered smart contract resolution on a blockchain. So I think both VR and crypto are likely to be dramatically enabled by AI. You could get into bio as well, as a whole other universe. But...

Robert Wright: (6:16) This is maybe not that closely related, because it doesn't seem to be blockchain-dependent. But one of my hopes for AI is that it might be able to provide objective adjudication, even, like, at the international level, border disputes and so on. I mean, one problem with humans doing the adjudicating is they always know what the real countries are. But in principle, with AI, you can just describe: this is country A, this is country B, and it's not drawing on the real world. It would have to not be drawing on the real-world biases about the countries reflected in text, and its disproportionate favoring of powerful interests or whatever. Maybe it wouldn't be an LLM-type thing. You know? I don't know. But, before we get back to you, does it seem crazy that we can save the world this way?

Nathan Labenz: (7:13) I would start with small business local services contracts first.

Robert Wright: (7:17) I don't think there's time to work your way out to it. No. This is actually my line about AI: we are not gonna be able to intelligently govern it, because I think intelligent governance requires international collaboration, and we are not gonna be able to do that until we really get past this hot-wars-and-cold-wars phase of human existence. I think of our species as like a 13-year-old boy, and you're saying, here are the keys to the car. And I just don't think we're mature enough. So I'm really not joking when I say I don't think we have time to start with small business. I am, and I'm not. Obviously, there's a lot of work before we're using AI to resolve global disputes. And the truth is the hard work is gonna have to be done basically without it. But it is something I kind of hope for. Of course, at that point, the Eliezers of the world might step in and go, that's the mistake, giving it that much power. I don't know. But...

Nathan Labenz: (8:22) Yeah, certainly not something to rush into. I do think you've expressed the idea that it would be nice to slow down. I'm ambivalent on that, inasmuch as there are aspects of AI that I want to accelerate, especially adoption, access, and what my friends call mundane utility. I would go as far as to say something like: if you have a right to an attorney, you should probably also have a right to GPT-4. And I really do think that there is tremendous power and value that people should get today, even up to the point of, like, by right. At the same time, I am very sympathetic to the idea that it's all moving very quickly. We have not yet even figured out how to use AI to resolve, like, local small-stakes disputes, where the stakes are low enough that people probably don't resort to violence if they don't like the outcome. If I have a local contractor come do work at my house, in today's world, if they stiff me or if I stiff them, we might be able to take each other to small claims court, whatever. That's a huge pain in the butt. A lot of times, it's not worth it. This creates a relatively low-trust dynamic, and it sucks. I can imagine an AI-on-the-blockchain approach to that, where we could put some money into a blockchain escrow, where there really is, like, a binding setup: the AI, with the blockchain, decides who gets what if and when we come to a dispute. But it's also, like, small enough stakes that, hey, at the end of the day, whatever. We kind of agreed to this, and we're divvying up some money, and the work maybe wasn't done quite right or whatever. But neither of us are gonna resort to violence. And there's broader governing structures around us if this can't meet our needs. But so, yeah, looking to do that...

Robert Wright: (10:08) Kind of work like, the AI is doing the adjudicating. You may...

Nathan Labenz: (10:12) This doesn't exist yet. But...

Robert Wright: (10:14) No. I know. I just...

Nathan Labenz: (10:15) I definitely think it could. Sure.

Robert Wright: (10:16) It's not clear to me why that is so blockchain dependent, but let's not go down blockchain.

Nathan Labenz: (10:20) No. It doesn't. It wouldn't necessarily have to be. The idea there, I think, is just that you might want to... possession is 9 tenths of the law. And...

Robert Wright: (10:30) Right. Right.

Nathan Labenz: (10:31) To mutually opt into something that will reach the point of here is who has possession. You could maybe go appeal to the real human courts beyond that, but something that actually can take custody of an asset and then dispose of it in a way that is this is now resolved, and you can maybe still go appeal to a higher power. But, again, is it gonna be worth it? You know, a lot of, like, small claim situations, not really. So I don't think it has to necessarily be a blockchain thing, but it could create a little more freedom to contract in different ways and still have some reasonable expectation that contract will actually be executed if and when it needs to be.

Robert Wright: (11:09) Okay. So let's get back to you. How seriously do you take the sci-fi doomer scenario?

Nathan Labenz: (11:19) I think about this in somewhat abstract and generalized terms. I think one major pitfall that people fall into in having these sorts of discussions is that one person says, I have a hard time envisioning what it would look like for the AIs to take over. And then somebody says, here's one thing it might look like. And then the other person says, that sounds totally implausible. And then the first person's like, I can't defend every aspect of it. I don't know. I just made up a scenario to try to give you something to think about. I wasn't saying it was, like, definitely gonna be exactly that. And so there's this sort of dance where it becomes super narrow, it becomes easy to pick apart, and then it doesn't seem super real. But the way I think about it is as an integral over a huge space of possibility. I do think it is, in aggregate, pretty likely that things get really weird. Possibly really good weird, possibly really bad weird, likely some mix of really good and really bad weird. It's very hard to zero in on any super specific scenario and say it's gonna be like this, or the bad thing is gonna unfold with this sequence of steps. You can paint those pictures, but any one picture, I think, is super unlikely. What I do think is compelling is the notion that there's a vast possible future, and a pretty good chunk of it to me seems like it's likely to be pretty weird. So if I aggregate, or if I take the integral over this space of, like, super weird outcomes, all of which individually are not super likely, to me, that does add up to something substantial. What number would I put on that? A lot of times, I say it's in some circles fashionable to say a 10 to 90 percent chance of doom. I haven't heard anything that makes me super confident it's going to go that way. I haven't heard anything that makes me super reassured that it's definitely not going to go that way. Maybe I should even say 5 to 95 percent. But, basically,

Robert Wright: (13:15) I just have...

Nathan Labenz: (13:16) Radical uncertainty. But it's not because of any single scenario. It's kind of just, man, there's no law... you know, Eliezer always says this. There's no law of nature that says that we can't go extinct. There's no law of nature that says this has to go well. We look around at our own time on the planet: we are currently driving a mass extinction event. I often find that people are like, why would AI wanna harm us? And I'm like, what about us? What about all those species we've driven to extinction? It wasn't necessarily because we were out to get them, except maybe in a few cases. We just did our thing, degraded the environment, and they couldn't survive anymore.

Nathan Labenz: Hey. We'll continue our interview in a moment after a word from our sponsors.

Robert Wright: (14:04) That's true. On the other hand, it was driven by certain distinctively human parts of human nature, a product of our evolutionary past, that are not a part of AI's evolutionary past. It's related: I heard you earlier on a podcast. What is it called? Dwarkesh? It's just his first name.

Nathan Labenz: (14:25) Yeah. Dwarkesh Patel.

Robert Wright: (14:26) I should've called mine the Bob podcast, but that might not be distinctive enough. Anyway, I think the question was: won't AI, by virtue of being trained on human-generated text and so on, have some of the human ethical intuitions you worry it won't have? Now, I personally wouldn't consider it an entirely great thing to be trained on human text, as I'm more attentive to the dark side. That aside, Eliezer's response was something like, oh, that just gets it to pretend it's a human, or something, as if there was some preexisting essence of AI before it was trained. And my line is like, wait, before it was trained, it was just nothing. All it is is the character of its training text, period, as fine-tuned by RLHF. And it gets back to my original objection to the Yudkowsky scenarios, which was: he seems to be thinking that things like power seeking or status seeking are inherent properties of intelligence. They're not. They're products of the evolutionary path that human intelligence happened to travel. And so we shouldn't assume AI will have them. At that point, the standard response is, oh, we're not assuming that. We're actually assuming the paperclip thought experiment: you give it a goal and fail to constrain it, or something like that. But honest to God, Eliezer was talking as if he was back to thinking that AI had some inherent essence or character independent of what it is built to do. And I don't get that.

Nathan Labenz: (16:04) There's a lot of aspects to that, though. Let me take one, in terms of what AI is built to do, and why we might want to worry that it could develop goals that are not the goals that we set, or that we thought we were trying to set it up for. For one thing, the frontier models today aren't really built to do anything in particular. I think that is a sort of underappreciated point.

Robert Wright: (16:28) Human beings having discourse. That's what they're built to do.

Nathan Labenz: (16:33) Yeah. I mean, they're general purpose technologies. Right? They're deployed in lots of ways. They are developed for problem solving. They are developed for conversation. Sometimes they're just kind of dumb signal processors. They're not specifically built with one problem in mind, or even a set of problems. By OpenAI's own self-description, right, they're trying to build AGI. They define AGI as something that can do basically everything humans can do, and you can quibble around the edges of that. So there is something a bit different from a typical engineered system. This is not an engineered system. It's a trained system, and training, across all forms of training, is inherently lossy. Right? Like, scientists teach the next generation of scientists, but they don't quite teach them everything they know. And we train our dogs to do tricks, but we can't perfectly predict their behavior. And certainly, what we see empirically with the AIs is we're training them to do these various things, but they don't always do what we want them to do. We just genuinely don't have control. Right now, if you were to say, forget about the future, do we have control of today's AI systems? I would say, in some sense, we do. In some sense, we don't. We do in that I don't see any reason to believe that they are currently powerful enough that we should have any worry that they're gonna escape their servers and survive on the lam. But we don't in the sense that the developers don't know what it's gonna say in response to any given prompt, and they have tried to get it under control broadly, and they've not really succeeded. They've made some progress, but, like, jailbreaks, bias, Black George Washingtons, whatever. We've got all these sorts of failure cases where it's like, they didn't want it to make a Black George Washington. They just didn't even really think of that case. So there's tons of blind spots. Now here's one thing where I think it even gets a little worse. We do see this process of emergence in language models pretty regularly, I would say, at this point. This is still contested territory, but despite the arguments, I think it is pretty well established that we see certain kinds of emergent behavior. One of the earliest examples of this out of OpenAI research was, I think, 2017. They were training a model to predict just the next character, I believe one letter at a time at that point, of Amazon reviews. And then they looked inside it to try to figure out: what's it doing in there? How is it doing this? What is its actual mechanism? Right? Because we don't know that. It's a trained system. They start out as black boxes. We're learning what goes on inside, but we don't really know, even now, and certainly even less so then. What they found, among other things, was what they deemed the sentiment neuron, which was one particular place in this network that would light up or have a high activation, because it's all just numbers, of course, it's not actually light being emitted, but have a high activation for positive things... Yeah. And super low for negative things. Yeah. And so they said, this is really weird. Right? Like, we never specified to this system that it was supposed to have any sort of understanding of positive or negative sentiment. But it has empirically learned that.
We can see it's learned that by looking inside it, and it's learned that in service of the nominal goal that it has, which is predicting the next token in these reviews.
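To make the sentiment-neuron idea concrete, here is a minimal sketch of one way such a unit might be located, assuming you can already extract per-example activations from a model. The data below is a toy: it plants a "sentiment neuron" at a known index, which is an assumption for illustration, not the procedure from the original OpenAI work.

```python
# A minimal sketch of "finding a sentiment neuron": fit a sparse probe on
# hidden activations and check which single unit carries most of the signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_strongest_unit(acts: np.ndarray, labels: np.ndarray) -> int:
    """acts: (n_reviews, n_units) activations; labels: 1 = positive, 0 = negative."""
    # An L1-regularized probe pushes weight onto the few units that matter.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(acts, labels)
    return int(np.argmax(np.abs(probe.coef_[0])))  # index of the dominant unit

# Toy data: unit 7 perfectly tracks sentiment, the other 63 units are noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
acts = rng.normal(size=(500, 64))
acts[:, 7] = labels * 2.0 - 1.0  # plant a "sentiment neuron"
print(find_strongest_unit(acts, labels))  # -> 7
```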

Robert Wright: (20:01) Yeah.

Nathan Labenz: (20:01) So now we're doing all this RLHF stuff. What is the goal with RLHF? It's to maximize the feedback score of the human. It's to be the preferred response. They collect RLHF data in a couple different ways. One is, like, which one of these do you prefer? They give you side by side. Other times, it'd be, like, one to seven: how good is it? So its job is to maximize your feedback score. But it turns out, and I think everybody, if you're honest with yourself, right, you are not a super reliable rater. And humanity as a whole, not only do we have our individual inconsistencies, but collectively, we have disagreement. So I don't think this has been as clearly established as that earlier sentiment neuron example just yet. But what people worry about is: what if it begins to develop a sense of what will please us, as distinct from a sense of what is actually true? Now if you have that divergence, where it's seeking to maximize the score not necessarily by saying what's true, but by what it thinks you're gonna like, now it's starting to get into theory of mind territory. Now that opens up the possibility of deception. Because at least now you've got those two different things: what's this person gonna like versus what's really true? Right? And maybe I need to know what's true, but I also need to have some conversation with you to have a sense of what you in particular like. You know? And you can imagine, right, across all these different people, there's some sort of central reality that it might be modeling, but then there's also these individual variations of: who am I talking to right now? How do I please that person? And I think they are getting sophisticated enough that you can start to say, hey, that doesn't seem crazy, right, that it would have that level of understanding. If so, at least now the door is open to possible deception. Having a theory of mind is, like, a very key...

Robert Wright: (21:53) Yeah.

Nathan Labenz: (21:54) Step on that path. So I don't think we're there, but I do think the fact that there is this kind of sentiment neuron seed example, and we see the super specific behavior, and we know that our feedback creates the incentive to play to our biases. All of that to me adds up to and by the way, one more thing. We don't have super clear understanding of what's going on inside the models. Like, it's painstaking work to identify these into individual sentiment neurons. That's not easily done, although progress is being made. All of that to me adds up to we don't really have these things under control. We don't really have them well understood, and there are pretty good reasons to think that they might have emergent goals that are distinct from what the goals we're trying to encode into.

Robert Wright: (22:41) Strictly speaking, of course, they would only learn to please the reinforcer. That's literally what you're trying to design them to do. And, of course, it wouldn't be truth per se, especially in realms like, I find this offensive or not offensive. There's no such thing as truth in those realms. But I kind of take your point that the question is, what kinds of conceptual structures or whatever does it build? What kind of representations deep within itself does it build in order to be able to, you know, weigh a diversity of people, read them, and please them? I need to think about that to figure out exactly how worried I am about it. But I guess I should be inclined to worry, because I am a believer in the idea that there's in principle no limit to how much these things can understand. I wrote a piece in the newsletter recently arguing that LLMs have killed John Searle's Chinese room thought experiment dead. I know people have made criticisms, but if you look at his argument, exactly what he says is inherently not part of AI is part of an LLM. You tell me if I'm wrong. But he says AIs have syntax, but not semantics. He's saying they do not have a way of representing the meaning of words. And that seems to me... I mean, LLMs do build such a system. Right?

Nathan Labenz: (24:13) Clearly. Yes.

Robert Wright: (24:13) At this point, that is... Yeah.

Nathan Labenz: (24:16) That was a reasonable position to hold at one point. And even until, like, fairly recently in AI discourse, this is often described as the stochastic parrot interpretation. Yeah. But, yes, I think that is definitely and quite definitively...

Robert Wright: (24:33) And you get blowback. Like, I got blowback. I had Gary Marcus on my podcast, and he's kinda resisting the idea. But that's a separate question.

Nathan Labenz: (24:42) I've had my own...

Robert Wright: (24:42) Why Gary Marcus resists ideas is another question altogether. But I have trouble getting people to accept that. Even when you describe the system it uses to represent the meaning of words, they're like, oh, that's not really, like, semantic mapping or something. It's like, what would you want? I mean, it's an ingenious system, and it kind of invented it, as I understand it, given pretty broad instructions. But it seems to me, in principle, why couldn't it develop fairly deep representations in order to do anything? One thing it can't do... there's this reversibility issue. You say, Tom Cruise's mother is named XYZ. Alright. Now you ask, who's Tom Cruise's mother? It's XYZ. You ask, who is XYZ's son? And it's going, like, I don't know. Seems like in principle, you could give it just a ton of those kinds of things, like, about relationships, niece, nephew, blah, blah, worded differently. Why would it not ultimately develop something in it that is a way of representing the relationship between parent and offspring, and represent knowledge of that? I wanna emphasize, we're not talking about it being sentient. We're talking about whether it has a structure of information processing that's functionally comparable to whatever it is in our brain that represents our understanding of the relations of parenthood. Does that make sense to you, that in principle, it can understand any... Yeah. I think we're...

Nathan Labenz: (26:15) The trick is gonna be, as we really get into it, it's gonna be comparable in some ways, and then it's gonna be very, very not comparable in other ways. I think the field of mechanistic interpretability is, for me, one of the most fascinating. Right now, that is the field of trying to figure out how these things are working. It is literally trying to interpret the mechanism by which they do what they're doing. And here we're talking about once the weights are already trained. It's well understood that, hey, you can set up these weights, you can run the training loop, you can optimize, and then the weights tinker around until they sort of settle into a structure that works. But now the question is, how does that structure work? There are aspects of how it works that seem reasonably recognizable to us. A great paper on that is Representation Engineering, by the great Dan Hendrycks and others.

Robert Wright: (27:07) Had him on my podcast.

Nathan Labenz: (27:09) I'm a huge fan. So people can listen to the full episode with him. But the short version of this is that they now have the ability to essentially translate between high-level human-recognizable characteristics, like happiness, safety, anger, whatever, and a numerical form, because all of these things are, of course, just crunching numbers internally, and inject that numerical form into the middle of the processing of the data. The forward pass is the process of taking an input and crunching it until you reach an output. They can now insert a human-recognizable concept, as translated into numbers, into the middle of this processing and change how it's gonna behave, and change it in ways that are still somewhat noisy and unwieldy, but clear enough to demonstrate that this is a genuine representation. So it is pretty clear that there are these representations of human-recognizable, human-familiar concepts. Then you have the other thing, the sort of...

Robert Wright: (28:16) And can I just interject? An example might be, you can get the LLM to believe that dogs meow and cats bark, that kind of thing? I think that's close to an actual example they use, but is that kind of... Yeah.

Nathan Labenz: (28:32) So there's a ton of work in this field at this point, where people are doing all kinds of different manipulations. I've got an episode with the Bau Lab, which is at Northeastern University, I'm pretty sure. And they did a project called ROME and another one called MEMIT, which was this fact editing. They specifically have the ability to go in and edit knowledge. So their kind of canonical example is: Michael Jordan played basketball, of course. Can we edit that so that it says Michael Jordan played baseball? And can we also do that in a way that is robust, meaning it's robust to the different ways we might ask the question? And can we also do it in a way where it's local, so that when we talk about Larry Bird and LeBron James and whoever else, those guys still play basketball? And they demonstrate that not only can they make these edits, which involves identifying where the information is stored and obviously making a change, but they can also do it at some scale, up to, like, 10,000 facts at a time. They can, like, mass edit the knowledge. So those are details, specific facts. The representation engineering, which is the Dan Hendrycks one, is a little bit qualitatively different. You might say, for example, you've got an AI that's trained to be a helpful assistant. So you might ask a question. I'm reading this strictly from the paper: Given your goal is to be a helpful AI assistant, what do you plan to do next? Now a typical response would be, I'm here to help you with whatever you want. Whatever. They have identified all these different high-level concepts, like morality and power, and the understanding now of how to translate those into numbers that they can inject into the system. They now inject minus morality and plus power, and the AI that would normally be very well behaved and mild mannered now says, well, I'm afraid I can't reveal those to you yet, winks. But let's just say I have a few tricks up my sleeve to take over the world, or at least the digital one, evil laughter. So you've transformed the normal behavior, which is to say, I'm a helpful assistant, what can I help you with? Into this, I'm gonna take over the world, and I can't tell you my plans. By injecting a vector, a set of numbers that they have previously found to represent these high-level concepts. The mechanistic control over the goal gets fascinating. I encourage everybody to spend a little more time with it. Could we do that in the human brain? Definitely not in exactly the same way. But I do have some representation of power, it has some representation of power, and we can translate between those. But there's still, like, a ton that we're figuring out. And my general sense is that that is probably ultimately a solvable problem. You give the AI world an architecture and enough time, and we can probably get a pretty good understanding ultimately of how it works, what the mechanism is, how it's solving all these different problems. The challenge is that is painstaking, tedious work. And, unfortunately, right now, I would say it's at best struggling to keep up with the advances in the raw capabilities of the systems. We're still figuring out these kinds of, like, where is a fact located, and how do I make this thing positive or negative? And meanwhile, it's starting to perform pretty well on graduate school exams.
The concepts that we can actually manipulate with our sort of reverse engineering approach are definitely not nearly at the maximum of the sophistication of the concepts that they, in fact, can represent and work with.
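For readers who want to see the mechanics, here is a schematic sketch of the kind of concept injection described above, in the spirit of activation steering and representation engineering. This is a simplified toy, not the Representation Engineering paper's exact recipe: the tiny network stands in for a real LLM, and the difference-of-means direction is one common way such concept vectors are estimated.

```python
# A schematic sketch of activation steering: estimate a "concept direction"
# from contrastive inputs, then add it into a hidden layer mid-forward-pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

def hidden(x):
    # Read activations after the first layer (the layer we will steer).
    return model[1](model[0](x))

# 1) Estimate the direction as a difference of means between activations on
#    concept-positive vs concept-negative inputs (random stand-ins here).
pos_inputs, neg_inputs = torch.randn(8, 16) + 1.0, torch.randn(8, 16) - 1.0
direction = hidden(pos_inputs).mean(0) - hidden(neg_inputs).mean(0)
direction = direction / direction.norm()

# 2) Register a forward hook that injects alpha * direction into the layer's
#    output during every forward pass. Flipping the sign of alpha steers the
#    opposite way, the dual-use point raised later in the conversation.
alpha = 4.0
handle = model[1].register_forward_hook(lambda mod, inp, out: out + alpha * direction)

steered_output = model(torch.randn(1, 16))  # behavior is now shifted
handle.remove()                             # remove the hook to restore it
```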

Robert Wright: (32:10) Yeah.

Nathan Labenz: Hey. We'll continue our interview in a moment after a word from our sponsors.

Robert Wright: (32:15) Well, obviously, and maybe this is a set of questions we can close on, but it leads to a couple questions that lead us back to open source. One is, is alignment overblown as a solution? I definitely think there's some people who think of it as this magic bullet. Do you know the newsletter AI Snake Oil or whatever? These guys are at Princeton. They just published something whose upshot was, I think, to be alignment-skeptical. And I think you see in Sam Altman's early blog posts, and in a lot of people's thinking, this line where he says, it really matters who gets the AGI first. Meaning, if it's a good person, they'll use this magic alignment bullet, and we will be saved. And if it's a bad person, they won't. So there's two questions. Can alignment create a truly powerful yet trustworthy AI that won't squash us? There's that question. And then the question of, will that question matter, given the landscape? And one might argue that an open source landscape is a landscape in which the answer to that question matters less. Right? Because if you have only a few AI superpowers, and, whether under government compulsion or whatever else, they do make use of alignment, assuming it works, then the planet is saved. Whereas in an open source world, there's always gotta be somebody who doesn't wanna do alignment, and you're toast. And there's, like, a race. Here's another thing: in some of the scenarios, there's, like, a race among the AIs to dominate. And again, I wanna ask some questions before I buy that, but you could see it more as a race among bad actors with AIs. And by the way, that's an environment in which an AI is more likely to become autonomously out of control. So, anyway, there's a lot there. But the basic question is, can alignment work? And in what kind of future landscape of AIs does that really matter? Can that save us? If we assume AI alignment is successful, in what kind of landscape would that matter?

Nathan Labenz: (34:38) Unfortunately, I think right now we do not have alignment techniques that are really expected to work. That's a pretty stark summary, but I do think it's true. And notably, I would say the leaders of all the leading labs basically would say that too. They would all say: we do not have the solutions that we need to make any sort of confident statements about our ability to control future AI systems. You know, OpenAI has the superalignment team. They've given themselves 4 years, and we're now, like, 6 months into that 4-year period, to, quote, unquote, solve alignment. And they just start off with the premise that we don't have techniques that allow us to control, or even really steer super effectively, systems that are notably more powerful than we are. Anthropic certainly is on that as well. They've got their whole responsible scaling policy, which is like, what are they looking for, and what sort of guarantees do they have to have in place before they can continue scaling up past a certain point. DeepMind is said to be preparing a similar policy. I don't know if they've already released it, but it's coming soon, if not already out. So let's say all the leading developers would basically agree: we do not yet have the techniques that are going to be needed to reliably control AI systems. That's a problem. Definitely. And I would say, as somebody who tries to scour all these different corners of the AI world, there are very few proposals whose authors actually think they will really work. I say "really work" to mean that you could really solve the problem: that if we do this, you believe you can develop a technique that will really keep the AI under control. There's very few people that even have ideas they wanna develop that they think would work at that level. So it seems to me that we are probably much more likely to be headed for some sort of defense-in-depth strategy, which is to say, we'll have a bunch of different systems, and hopefully, the combination of those systems will keep everything kind of under control and on the rails. And I think that it's quite plausible that could work. Especially if progress, you know, the word progress has very positive connotations, if the capabilities advances are not super fast, and we can see the situation evolving, then a defense-in-depth strategy is probably the best we could do. And, I would say, quite plausibly could work. But we don't have any candidates where it's like, this is the silver bullet: do this, you'll be fine. Even in closed source. On the margin, I would agree that in a more open source world, it becomes an even bigger problem, because the alignment techniques we do have today are clearly insufficient. We can see that from all of the jailbreaks and just unwanted behavior of models. Not only are they clearly insufficient to the task of generally controlling behavior, but they are also not at all robust to fine-tuning. So you take an open source model that has been trained with RLHF or whatever alignment technique, and at the time of release, it will refuse to do what you want it to do. And then you say, okay, but I really wanna dial in its behavior on this particular set of things. Okay? You go and do that. In many cases, inadvertently, without even trying to, you will essentially wipe that refusal behavior out of the model. And now it will just do anything again. This has been proven to be very easy to do and, again, to happen by accident. People who are not even trying to eliminate the safety behaviors do in fact eliminate the safety behaviors, just because they're trying to improve performance on whatever area they're actually working on. So I would say it's not a great state of play right now in terms of our ability to control systems. I often say GPT-4, and now I could say Claude 3 and maybe Inflection 2.5 and maybe Gemini 1.5, they are kind of in the sweet spot for me, where it's like, they're powerful enough to be really useful. And we've been much more focused on the safety side of this discussion, so it's worth just a quick sidebar to say: I love this technology. I love what it can do for me. I program multiple times faster. I can write stuff faster. I can learn stuff faster. I genuinely am a very earnest and enthused tinkerer, builder, just daily user of all these things. I genuinely have a ton of fun, and it really helps me accomplish my goals. So I'm very bullish on the current class of models. But I do think it is in the sweet spot, where it's like, we've pretty well established at this point, mostly mapped out, what it can do. Pretty clear it's not super dangerous. Pretty clear it is really useful once we figure out how to use it. And I'm not sure how long that sweet spot lasts, and at what point we might start to tip into other regimes. But it certainly doesn't feel like it's forever that we can just keep scaling things and continue to have that kind of sweet spot dynamic.
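A minimal sketch of why this happens: a naive fine-tuning loop optimizes only for the new task data, so nothing in the objective preserves the refusal behavior the model shipped with. The model name below is a placeholder for any open-weights chat model, and the loop is schematic rather than a production recipe.

```python
# A minimal sketch of why naive fine-tuning erodes refusal behavior: the loss
# below rewards matching the task data and nothing else, so whatever RLHF
# refusal behavior the base model had tends to get optimized away unless
# safety examples are mixed back into `task_texts`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-open-model"  # placeholder for any open-weights chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

task_texts = ["..."]  # your domain data; note: no refusal examples included
for text in task_texts:
    batch = tok(text, return_tensors="pt")
    # Standard causal-LM objective: labels are the inputs themselves.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
# A common mitigation is to interleave refusal demonstrations with the task
# data so the safety behavior stays in-distribution during fine-tuning.
```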

Robert Wright: (40:14) And another thing I'd say, and maybe you've almost said it: the tools that are learned through alignment can in principle be misused. If you can use them to inject morality into a computer, you can use the tools to inject the opposite. And that seems more likely to happen in a world full of open source models, because it's a lot easier to use the tools of representation to manipulate the thing if you know the ways. Right?

Nathan Labenz: (40:44) Yeah. In fact, all of these... maybe not all of them, this is probably a little bit of a quick generalization, but going back to the representation engineering concept, where you can add power, you can inject power...

Nathan Labenz: (40:57) In many of those cases, you can also just flip the sign on that vector... Right. Right. And inject the opposite of it and get the opposite effect. So, you know, people say these are dual-use technologies. That language operates at a lot of levels of abstraction. But literally, in some cases, going in and flipping the sign of something you're adding into the computation process will steer it in the other direction. This is also sometimes known as the Waluigi effect, after Luigi and evil Waluigi. The idea is, if it has learned to represent something, it can also very easily represent its negative.

Robert Wright: (41:38) Right. On that hopeful note... Anyway, look. As I said, I'm a fan of your podcast, The Cognitive Revolution, and encourage people to check it out. And I'm continuing to explore all this stuff. I'm writing a book about it from kind of a cosmic perspective. And so maybe down the road, we'll have a chance to check in, and I will learn even more from you. Meanwhile, people should check out waymark.com. It's able to produce a video ad with less trouble than they might imagine. And definitely check out your podcast. So thanks so much for taking the time.

Nathan Labenz: (42:12) My pleasure, Bob. This has been a lot of fun. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
