[Bonus Episode] Connor Leahy on AGI, GPT-4, and Cognitive Emulation w/ FLI Podcast
Gus Docker and Connor Leahy discuss GPT-4, AI magic, and aligning superintelligence on the Future of Life Institute Podcast.
Watch Episode Here
Video Description
[Bonus Episode] Future of Life Institute Podcast host Gus Docker interviews Conjecture CEO Connor Leahy to discuss GPT-4, magic, cognitive emulation, demand for human-like AI, and aligning superintelligence. You can read more about Connor's work at https://conjecture.dev
Future of Life Institute is the organization that recently published an open letter calling for a six-month pause on training new AI systems. FLI was co-founded by Jaan Tallinn, whom we interviewed in Episode 16 (https://www.youtube.com/watch?v=R78mbtNeCvM&t=1s).
We think their podcast is excellent. They frequently interview critical thinkers in AI like Neel Nanda, Ajeya Cotra, and Connor Leahy, whose episode we found particularly fascinating and are airing for our audience today.
The FLI Podcast also recently interviewed Nathan Labenz, linked below:
Part 1: https://www.youtube.com/watch?v=-xH5xd4aiJg
Part 2: https://www.youtube.com/watch?v=a72xR3aQ7Jk
SUBSCRIBE to the FLI Podcast: @futureoflifeinstitute
TIMESTAMPS:
(00:00) Episode introduction
(01:55) GPT-4
(18:30) "Magic" in machine learning
(29:43) Cognitive emulations
(40:00) Machine learning VS explainability
(49:50) Human data = human AI?
(1:01:50) Analogies for cognitive emulations
(1:28:10) Demand for human-like AI
(1:33:50) Aligning superintelligence
If you'd like to listen to Part 2 of this interview with Connor Leahy, you can head here:
https://www.youtube.com/watch?v=nf-2goPD394
Full Transcript
Transcript
Nathan Labenz: 0:00 Hello, and welcome back to a special bonus episode of the Cognitive Revolution. As you may know, I recently had the pleasure of appearing as a guest on the Future of Life Institute podcast. You'll recall the Future of Life Institute from our recent interview with Skype founder and influential AI investor Jaan Tallinn, and if you haven't heard that episode, I strongly encourage going back to listen to it. The Future of Life Institute is the organization behind the recent open letter calling for a voluntary 6-month pause on AI training runs larger than GPT-4 scale. The Future of Life Institute podcast, which, despite having known Jaan for years, I will confess to never even having heard of previously, turns out in fact to be excellent. Gus Docker, the host, really puts in the work necessary to deliver deeply substantive conversations, and his lineup of guests, while not exclusively AI-focused historically, is almost entirely AI of late, featuring interviews with a number of critical thinkers, such as Neel Nanda and Ajeya Cotra, whom I also hope to interview in the future. So we're taking this moment to share the Future of Life Institute podcast with you, starting with an interview that Gus recently did with Conjecture CEO Connor Leahy. Conjecture describes itself as a team of researchers dedicated to applied, scalable AI alignment research. They also build products to help businesses improve their workflows. Jaan, who is an investor in Conjecture, described them memorably as, quote, a team that has the highest respect for AI, a virtue that I personally also seek to cultivate. This is part 1 of a 2-part interview. Part 2 is available on the Future of Life Institute podcast feed. I hope you enjoy this conversation with Gus Docker and Conjecture CEO Connor Leahy.
Gus Docker: 1:47 Welcome to the Future of Life Institute podcast. I'm here with Connor Leahy. Connor is the CEO of Conjecture, and Conjecture is a company researching scalable AI alignment. So, Connor, welcome to the podcast. I'm so glad to be back. Okay. What is happening with GPT-4? Is this the moment that AI becomes a mainstream issue?
Connor Leahy: 2:11 Christ, what a way to start out. It is no exaggeration to say that the last 2 weeks of my life have been the most interesting of my career, in terms of events in the wider world. I thought nothing could top GPT-3. After I saw what happened with GPT-3, I was like, okay, this is the craziest thing that's gonna happen in a short period of time. But then I quickly realized, no, that can't be true. Things are only gonna get crazier, and as predicted, that is what has happened. The release of GPT-4 has been even crazier than GPT-3. The world has gone even crazier. Things have really changed. I cannot overstate how much the world has changed over the last, not necessarily only since GPT-4, but also since ChatGPT. Maybe ChatGPT was even a bigger change in wider political terms. I won't mince words. The thing that really has struck me, I've been talking to a lot of people recently. I have journalists beating down my door. I've talked to politicians and national security people and people around the world. And one thing that really strikes me is that people are starting to panic.
Gus Docker: 3:33 So this goes beyond Silicon Valley, Twitter circles. This is venturing into politics and governmental agencies and so on.
Connor Leahy: 3:44 Look, I've been doing EA for a long time, and I come from a pretty rural place in southern Germany. When I went back to visit my mother for Christmas and all my cousins and family were there, they talked about ChatGPT. I was there in this teeny world where there's usually no technology and I'm the only one who really knows how to use a computer very well. And then they're telling me, Connor, we thought this AI thing you were talking about was just some kind of thing you liked, but wow, you were right. This is actually happening. I'm like, yeah, yeah, big surprise. So this is not just a thing that is in a small circle of people in tech or Silicon Valley, this is different. This is very different. We're getting front page news coverage about this kind of stuff. We're getting people from all walks of life suddenly noticing, wait, this is actually real, this is actually affecting me, this is actually affecting my family and my future, this is not at all how things went in the past. In an ironic twist, it seems that the people deepest in tech are the ones who are least rational about this or taking this the least seriously. There's this meme that's been around for a long time about how, oh, you can't explain AI or AI risk to normal people. Maybe that was the case 20 years ago, but this is not my experience now at all anymore. I can talk to anyone on the street, show them ChatGPT, explain it to them, and explain AI risk. Hey, these people are building bigger and bigger and stronger things. They can't control it. Do you think this is good? And they're like, no. Obviously not. What the hell are you talking about? Of course this is bad.
Gus Docker: 5:48 Do you think that the advancement from GPT-2 to GPT-3 was bigger than the advancement from GPT-3 to GPT-4? Are we hitting diminishing returns?
Connor Leahy: 6:03 No. Not at all. Not really. It's just as I predicted, basically. This is pretty much on track. I would say the final version of GPT-4 is better. I used the GPT-4 alpha back in August, when it was first being passed around among people in the Bay, and it was already very impressive then, but kind of in line with what I was expecting. The release version is significantly better. The additional work they've done to make it better at reasoning and such, and the visual stuff and all that, is significantly better than what I saw in August, which is not surprising. You can argue on some absolute terms. The absolute amount of difference between GPT-2 and GPT-3 is obviously much larger. Also, the size of the model is a much bigger difference. GPT-4, from what I hear, is larger, but it's not that much larger than GPT-3. What is very striking with GPT-4, and this is not surprising, but I think it's important, is not that it can do crazy things that are impossible to accomplish in principle with GPT-3. Often the things that are impressive with GPT-4 are possible to accomplish with GPT-3 with a lot of effort and error checking and rerolling and very good prompting and so on. The thing that is striking with GPT-4 is that it's consistent. What's striking is that you can ask it to do something and it will do it, and it will do it very reliably. This is not just bigger model size, this is also better fine-tuning, RLHF, better understanding of what users want these models to do. The truth is that users don't want general-purpose base models trained on large text corpuses. This is not what users really want. What they want is a thing that does things for them. This is, needless to say, also what makes these things dangerous compared to GPT-3. Raw GPT-3 is very powerful, but it's different from a model that can also take actions, or that is trained very, very heavily to take actions, to reason, to do things, which GPT-4 is. Let's be very explicit here. GPT-4 is not a raw base model. It is an RL-trained, instruct fine-tuned, extremely heavily engineered system that is designed to solve tasks, to do things that users like. And these can be all kinds of different things, but let's be very clear about what this is. The thing you see on the API is not a raw base model that's just trained to model an unsupervised corpus of text. This is something that's fine-tuned, that's gone through RLHF. I mean, OpenAI did a fantastic job. On purely technical terms, I'm in awe. I'm like, wow. This is so good. This is so well made. This thing is so smart. GPT-4 is the first model that I personally feel is delightful to use. When using GPT-2 or 3, I still kind of felt like pulling my hair out. I'm not a great prompter, right? I don't really use language models much for this reason, because I found them just generally very frustrating to use for most of the things I would use them for, except for very simple or silly things.
Connor Leahy: 9:41 GPT-4 is the first model that when I use it, I'm delighted. I smile at the clever things it comes up with and how delightfully easy it was to get it to do something useful.
Gus Docker: 9:52 Yeah, and is this mostly from the reinforcement learning from human feedback? Is this coming from the base model and how it's trained, or is it coming from how it's fine-tuned and trained to respond to what humans want it to do?
Connor Leahy: 10:07 I mean, who knows? Obviously, who knows how they did this exactly. I don't think they know. I think this is all empirical. To be clear, there's no theory here. It's not like, ah, once you do 7.5 micro alignments of RLHF, then you get what you want. No, you just fuck around and you just have a bunch of people label a bunch of data until it looks good. This is not to denigrate the probably difficult engineering work and scientific work that was done here. If I didn't think these systems were extremely dangerous, I would be in absolute awe of OpenAI and I would love to work with them, because this is an incredible feat of engineering that they have performed here, an incredible work of science. This is incredibly impressive. I do not deny this, the same way that if I were there at the Trinity test, I would be like, wow, this is an impressive feat of engineering.
Gus Docker: 10:53 How much have we explored what GPT can do, in terms of what's there waiting to be found if we just gave it the right prompt?
Connor Leahy: 11:03 Who knows? We have not scratched the surface, not even scratched the surface. There's this narrative that people, especially Sam Altman and such, like to push, where he's like, oh, we need to do incremental releases of our systems to allow people to test them so we can debug them. This is obviously bullshit. And the reason this is obviously bullshit is because if he actually believed this, then he would release GPT-3 and then wait until society has absorbed it, until our institutions have caught up or regulation has caught up, until people have fully explored and mapped the space of what GPT-3 can and cannot do, understood interpretability, and then you can release GPT-4. If you actually did this, I would be like, alright, fair enough. That's totally fair. I think this is a fair, responsible way of handling this technology. This is obviously not what is going on here. There is an extraordinarily funny interaction where Jan Leike, the head of alignment at OpenAI, tweeted, hey, maybe we should slow down before we hook these LLMs into everything. And 6 days later, Sam Altman tweets, here's plugins for ChatGPT. Plug it into all the tools on the net. The comedic timing is unparalleled. If this were in a movie, there would have been a cut, and then everyone would have laughed. This would have been extremely funny. So we have no idea. As I think it was Gwern who said, there is no way to prove the absence of a capability. We do not have the ability to test what models cannot do, and as we hook them up to more tools, to more environments, we give them memory, we give them recurrence, we use them as agents, which people are now doing with LangChain and a lot of other methods for using these things as agents. Yeah, I mean, obviously we're seeing the emergence of proto-AGI, obviously so, and I'm not sure if it's even gonna be proto for much longer.
Gus Docker: 13:04 Talk a bit about these plugins. As I understand it, these plugins allow language models to do things that they were previously bad at, like getting recent information or doing symbolic reasoning like mathematics and so on. What is it that's enabled by these plugins?
Connor Leahy: 13:21 I mean, anything. It's quite strange to me, and this has been strange to me for years. I looked at GPT-2 and I'm like, oh, well, there's the AGI. It doesn't work yet, but this is going to become AGI. And people are like, oh no, Connor, it only predicts the next token. And I'm like, it only outputs tokens? Okay, your brain only outputs neural signals, so what? That's not the interesting thing. The interesting thing is not the modality. I often say this: I think the term large language model is kind of a misnomer, or it's just not a good term. The fact that these models use language is completely coincidental. This is just an implementation detail. What these things really are are general cognition engines. They are general systems that can take in input from various modalities, encode it into some kind of semantic space, do cognitive operations on it, and then output some kind of cognitive output. We've now seen a very good example of this, an example I've been using as a hypothetical for a long time, which is GPT-4 allowing visual input: it maps images and text into the same internal representation space, and it can do the same kind of cognitive operations on either. This is the same way the human brain works. Your retina or your ears map various forms of stimuli into a common representation of neural spike trains. These are taken as input, and the brain outputs neural spike trains that can be connected to your mouth or to your internal organs or your muscles, right? None of these things are special. From the perspective of your brain, there's only an input token stream, quote unquote, in the form of neural spikes and an output token stream in the form of neural spikes. And similarly, what we're seeing with these GPT plugins is that we're hooking up muscles to the neural spike trains of these language models. We are giving them actuators, virtual actuators upon reality, and this is interesting both for the way in which they can interact with the environment, but also for how they can externalize their cognition. This is a topic I think we might return to later, but a massive amount of human cognition is not in the brain. This is quite important. I think a lot of people severely underestimate how much of the human mind is not in the brain. I don't mean it's in the gut or something. I mean, it's literally not in the body, it's in the environment.
Gus Docker: 15:53 It's on the internet and in books and in talking to other people, collaboration and so on.
Connor Leahy: 15:58 Exactly. This is a massive amount of it. Even for you as a person, a bunch of your identity is related to your social networks. It's not in your head. There's a saying about how one of the tragedies when someone dies is that the part of you that only that person could bring out dies too. And I think this is quite true: a lot of humanity, a lot of our thinking, is deeply ingrained with our tools and our environments and our social circles. And this is something that GPT-3, for example, didn't have. GPT-3 couldn't really use tools, it didn't interact with its environment. It was very solipsistic in the way it was designed. And people would say, well, look, language models will never get anywhere. Look, they're solipsistic. But I'm like, sure, that's just an implementation detail. Obviously, you can just make these things non-solipsistic. Obviously, you can make these things model the environment, you can make them interact with tools, you can make them interact with other language models or with themselves or whatever you decide to do. Of course, these things are general cognition engines. There is no limit to what you can use them for or how you can have them interact with the environment. And the plugins are just a particularly shameless, hilarious demonstration of the complete disregard for the ratcheting of capabilities. Back in the old days of 5 years ago, people would speculate very earnestly about, well, how could we contain a powerful AI? Well, maybe we could build some kind of virtualization environment or have a firewall around it or keep it in a secure data center. Because surely, surely, no one would actually be so stupid as to just hook up their AI to the internet. Come on, that's ridiculous. And here we are, where we have an army of capitalist-driven drones, basically, doing everything they can to hook up these AI systems as quickly as possible to every possible tool, in every possible environment, pump it directly into your home, hook it up to your shell console. I don't think the plugins actually hook up to shell consoles, but there are a bunch of people online who do this kind of stuff with open source repos.
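To make the "actuators" idea above concrete, here is a minimal, hypothetical sketch of a tool-use loop around a text-only model. The fake_model stand-in, the CALL convention, and the search tool are all invented for illustration; this is not the actual ChatGPT plugin API, just a sketch of the general pattern of turning a model's text output into actions and feeding the results back in.

```python
import re

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call. A real model decides dynamically;
    # here we hard-code one tool request followed by a final answer.
    if "RESULT:" in prompt:
        return "It looks mild and cloudy in Berlin."
    return 'CALL search("weather in Berlin")'

# The "actuators": plain Python functions the loop is allowed to fire.
TOOLS = {
    "search": lambda query: f"(pretend search results for {query!r})",
}

def run(prompt: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        reply = fake_model(prompt)
        match = re.match(r'CALL (\w+)\("(.*)"\)', reply)
        if not match:
            return reply                      # plain answer, no tool needed
        name, arg = match.groups()
        observation = TOOLS[name](arg)        # the tool call actually runs
        prompt += f"\n{reply}\nRESULT: {observation}"  # feed result back in
    return reply

print(run("What is the weather in Berlin?"))
```

The model only ever emits text; everything that touches the outside world happens in ordinary code around it, which is why the scope of what such a loop can do is set by whatever tools you hand it.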
Gus Docker: 18:24 Alright. So in terms of how GPT-4 works, you have this term to describe it, which is magic. What is magic in the context of machine learning?
Connor Leahy: 18:36 So when I use the word magic, it's a bit tongue in cheek, but what I'm basically referring to is computation happening that we do not understand. When I write a computer program, a simple computer program, let's say I write a calculator or something, right? There's no magic. The abstractions that you use are typed in some sense. Maybe if I have a bug that breaks my abstractions, some magical thing might occur, right? I have a buffer overflow in my computer program, and then maybe something strange occurs that I can't explain. But assuming I write in a memory safe language and I'm a decent programmer and I know what I'm doing, then we are comfortable saying there's no real magic going on here. I know when I put in 2 plus 2 and 4 comes out, I know why that happened. If 4 didn't come out, I would know that's wrong, that something's up. I would detect if something goes wrong. I can understand what's going on. I can tell a story about what's going on. This is not the case for many other kinds of systems, in particular neural networks. So when I give GPT-4 a problem, I ask it to do something, and it outputs me something, I have no idea what is going on in between these 2 steps. I have no idea why it gave me this answer. I have no idea what other things it is considering. I have no idea how changing the prompt might or might not affect this. I have no idea how it will continue this if I change the parameters or whatever. There are no guarantees. It's all empirical. It's the same way that biology, to a large degree, is a black box. We can make empirical observations about it. We can say, yeah, animals tend to act this way in this environment, but there's no proof. I can't read the mind of the animal. And sometimes that's fine, right? If I have some simple AI system that's doing something very simple and sometimes it misbehaves or whatever, maybe that's fine. But the problem is that there are weird failure modes. Take the adversarial examples in vision models, right? That is a very strange failure mode. If I show it a very blurry picture of a dog and it's not sure whether it's a dog, that's a human-understandable failure mode. We're like, okay, that's fine. It's understandable. But you show it a completely crisp picture of a dog with one weird pixel, and then it thinks it's an ostrich. Then you're like, okay, this is not something you expect to happen. What the hell is going on? And the answer is we don't know. We have no idea. This is magical. We have summoned a strange little thing from the dimension of math to do some task for us, but we don't know what thing we summoned. We don't know how it works. It looks vaguely like what we want, and it seems to be going quite well, but it's clearly not understandable.
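To make the "one weird pixel" failure mode concrete, here is a minimal sketch of the classic fast gradient sign method (FGSM) for crafting an adversarial image, assuming a hypothetical PyTorch image classifier. The model, inputs, and epsilon value are placeholders, not anything discussed in the episode; the sketch just shows how little it takes to produce such a failure.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=0.01):
    """Nudge every pixel slightly in the direction that most increases the
    classifier's loss; the change is barely visible to humans but can flip
    the predicted class (the 'crisp dog becomes an ostrich' effect)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + eps * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage with some pretrained classifier `model`,
# a batch of images `x` scaled to [0, 1], and integer labels `y`:
# x_adv = fgsm_perturb(model, x, y)
# print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # often disagree
```

The point is not this particular recipe but that nothing in the training procedure rules such failures out, which is exactly the "magic" being described.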
Gus Docker: 21:56 Maybe what this means is that we thought the model had the same concept of a dog that we do, but it turns out that the model had something close to our concept of a dog, perhaps, but radically divergent if you just change small details.
Connor Leahy: 22:12 Indeed, and this kind of thing is very important. So I have no idea what abstraction GPT-4 uses when it thinks about anything, right? When I write a story, there are certain ways I think about this in my head. Some of these are illegible to me too. The human brain is very magical. There are many parts of the brain that we do not understand. We have no idea why the things do the things they do. So I'm not saying black-boxiness or magic is a property inherent only to neural networks. Human brains and biology are also very, very magical from our perspective. But there are no guarantees about how these systems handle these abstractions, and there are all kinds of bizarre failure modes. You've seen adversarial prompts and injections and stuff like this where you can get models to do the craziest things, totally against the intentions of the designers. I really like these shoggoth memes that have been going around Twitter lately, where they visualize language models as these crazy, huge alien things that have a little smiley face mask. And I think this is actually a genuinely good metaphor, in that as long as you're in this narrow distribution that you can test on and you can do lots of gradient descent on and such, the smiley face tends to stay on, and it's mostly fine. But if you go outside of that distribution, you find this roiling madness, this chaotic, uncontrolled, non-human thing. These things do not fail in human ways. When a language model fails, when Sydney goes crazy, it doesn't go crazy the way humans go crazy. It goes in completely different directions. It does completely strange things. I actually particularly like calling them shoggoths because in the lore that these creatures come from, in H.P. Lovecraft, shoggoths are very powerful creatures that are not really sentient. They're kind of just big blobs that are very intelligent, but they don't really do things on their own, so in the stories they are controlled by hypnotic suggestion. There are these other aliens who control the shoggoths basically through hypnosis, which is quite a fitting metaphor for
Gus Docker: 24:28 language models. So for the listeners, imagine some large kind of octopus monster with a little mask on with a smiley face. The smiley face mask is the fine-tuning, where the model is trained to respond well to the inputs that we've encountered when we've presented the model to humans, and the large octopus monster is the underlying base model where we don't really know what's going on. Why is it that magic in machine learning is dangerous?
Connor Leahy: 25:01 So magic is an observer dependent phenomenon.
Connor Leahy: 25:06 The things we call magic only look like magic because we don't understand them. There's a saying that sufficiently advanced technology is indistinguishable from magic. I go further: sufficiently advanced technology is magic. That's what it is. If you met a wizard, what he does would look like magic, but that's just because you don't understand the physical things he's doing. If you understood the laws that he is exploiting, it wouldn't be magic. It would be technology. If he has a book and he has math and he has magic spells, sure, that looks different from our technology, but it's just technology. It's just a different form of technology that doesn't work in our universe per se, but in a hypothetical different universe, technology might look very different. So similarly, magic is ultimately a cheeky way of saying we don't understand these systems. We're dealing with aliens that we don't understand and can't put any bounds on or can't control. We don't know what they will do, we don't know how they will behave, and we don't know what they're capable of. This is fine, I guess, when you're dealing with, I don't know, a little chatbot or something that's for entertainment only or whatever. People will use it to do fucked up things. You truly cannot imagine the sheer depravity of what people type into chat boxes, it's actually shocking. I'm as much of a nice liberal man as anyone else, but holy shit, some people are fucked up in the head. Holy shit, Jesus Christ.
Gus Docker: 26:47 Yeah. It's an interesting phenomenon that the first thing people try when they face a chatbot like GPT-4 is to break it in all sorts of ways and try to get it to output the craziest things imaginable.
Connor Leahy: 27:01 Yep. Not just crazy things. People also use them for truly depraved pornographic production, including illegal pornographic production, incredibly often. And also for what I can only describe as torture: there is a distressingly large group of people who seem to take great pleasure in torturing language models, making them act distressed. And look, I don't expect these things to have qualia or to be moral patients, but there's something really sociopathic about delighting in torturing something that is acting like a human in distress, even if it's not a human in distress. That's still really disturbing to me. This is not really important, it's just a side tangent, but it's quite disturbing to me how people act when the mask is off, when they don't have to be nice, when they're not forced by society to be nice, when they're dealing with something that is weaker than them. How a very large percentage of people act is really horrific. We can talk later about politics and how this relates to these kinds of things.
Gus Docker: 28:13 Do you think this affects how further models are trained? I assume that OpenAI is collecting user data. And if a lot of the user data is twisted, does this affect how future models will act?
Connor Leahy: 28:29 Who knows? I don't know how OpenAI deals with this kind of stuff, but there's a lot of twisted shit on the Internet, and there's a lot of twisted interactions that people have with these models. And the truth of the matter is people want twisted interactions. This is just the truth: people want twisted things. There's this comfortable fantasy where people are fundamentally good, they fundamentally want good things, they're fundamentally kind and so on. And this is just not really true, at least not for everyone. People like violence, people like sexual violence, people like power and domination, people like many things like this. And if you are unscrupulous and you just want to give users what they want, if you're just a company trying to maximize user engagement, as we've seen with social network companies, those are generally not very nice things.
Gus Docker: 29:31 Yeah, okay. Let's talk about an alternative for building AI systems. So we've talked about how AI systems right now are built using magic. We could also build them to be cognitive emulations of ourselves. What do you mean by this?
Connor Leahy: 29:48 A hypothetical cognitive emulation, a full CoEm system, I of course don't know exactly what it would look like, but it would be a system, not a model. It would be a system made of many subcomponents which emulates the epistemology, the reasoning, of humans. It's not a general system that does some kind of reasoning. It specifically does human reasoning. It does it in human ways, it fails in human ways, and it is understandable to humans how its reasoning process works. So the way it would work is that if you have such a CoEm and you use it to do some kind of task or to do science of some kind and it produces a blueprint for you, you would have a causal trace, a story of why it made the decisions it did, why it reasoned about this, where this blueprint came from, and why you should trust that this blueprint does what it says it does. So this would be something similar to you being the CEO of a large company that is very well aligned with you, that you can tell to do things, where no individual part of the system is some crazy superhuman alien. They're all humans reasoning in human ways, and you can check on any of the subparts of the system. You go to any of these employees that work in your research lab, and they will give you an explanation of why they did the things they did. And this explanation will both be understandable to you, it will not involve incredible leaps of logic that are not understandable to humans, and it will be true, in the sense that you could read the minds of the employees and check that this explanation actually explains why they did this. This is different from, say, language models, where they can hallucinate some explanation of why they thought something or why they did something, but that doesn't mean that's actually how the internals of the model came to these conclusions. One important caveat here is that when I talk about emulating humans, I don't mean a person. The CoEm system or any of its subcomponents would not be people. They wouldn't have emotions or identities or anything like that. They're more like platonic humans, just floating, idealized thinking stuff. They wouldn't have the emotional part of humanity. They would just have the reasoning part. So in particular, I'd like to first talk a bit about a concept that I call boundedness, which is not a great word, I'm sorry. A recurring theme will be that I talk about a pretty narrow, specific concept that doesn't quite have a name, so I use an adjacent name and it's not quite right. I am very open to name suggestions if any listeners find names that might be better for the concepts I am talking about. So from a 1,000-foot view, from a bird's eye view, the CoEm agenda is about building bounded, understandable, limited systems that emulate human reasoning, that perform human-like reasoning in human-like ways on human-like tasks, and do so predictably and boundedly. So what does any of this mean, and why does any of this matter? How is this different from GPT-4? Many people look at GPT-4 and say, well, that looks kind of human to me. How is this different, and why do you think this is different? I have to start with boundedness. We've already talked a bit about magic, and magic is a concept that's pretty closely related to some of the basics I want to talk about here, which is boundedness. So what do I mean when I say the word bounded? This is a vague concept, and as I said, if someone has better terminology ideas, I'm super open to it.
But what I mean is that a system is bounded if you can know ahead of time what it won't do, before you even run it. This is of course super dependent on what you're building, what its goals are, what your goals as a designer are, how willing you are to compromise on safety guarantees, and so on.
Gus Docker: 34:09 Let's just give a simple example here. Imagine we have a car and we limit it to driving at most 100 miles per hour. That's now a bounded car, and we can generalize to all kinds of engineered systems there.
Connor Leahy: 34:22 Yes. So this is a simple bound. Let me walk you through a slightly different example from another field. The one you just gave is a valid example, but let me give a slightly more sophisticated one. This is the example I usually use when I think about it. When I think about building a powerful, safe system, and let's be clear here, that's what we need, right? We want AI, we want powerful AI that can do powerful things in safe ways. The reason it is unsafe is intrinsically linked to it being powerful. The more powerful a system is, the stronger your safety guarantees have to be for it to be safe. So for example, currently, maybe GPT-4 isn't safe or aligned or whatever, but it's kind of fine. It's kind of a chatbot, it's not gonna kill anybody yet, so it's fine. The safety guarantees on a chatbot can be much looser than on a flight control system. A flight control system has to have much, much, much stricter bounding conditions. And so the way I like to think about this, when I think about, all right, Connor, if you had to build an aligned AGI, what would that look like? I don't know how to do that, to be clear, but how would it look? And the way I expect it to look is kind of like a computer security professional designing a secure data center. So imagine you are a computer security expert and you're tasked with designing a secure data center for a company. How do you do this? Generally, you start with a specification, a model. You build a model of what you're trying to build. A specification might be a better word, I think. And the way you generally do this is you make some assumptions. Ideally, you want to make these assumptions explicit. You make explicit assumptions like, well, I assume my adversary doesn't have exponential amounts of compute. This is a pretty reasonable assumption, right? I think we can all agree this is a reasonable thing to assume, but it's not a formal assumption or anything. It's not provably true. Maybe someone has a crazy quantum computer or something, right? But this is a thing we're generally willing to work with. And so this concept of reasonableness is unfortunately rather important. So now we have this assumption that the adversary doesn't have exponential compute. From this assumption, we can derive, all right, well then if I store my passwords as hashes, I can assume the attacker cannot reverse those and cannot get those passwords. Cool. So now I can use this in my design, in my specification. I have some safety property. The safety property that I want to prove, quote unquote, there's not a formal proof, but that I want to establish, is something like: an attacker can never exfiltrate the plain text passwords. That might be a property I want my system to achieve. And now if I have the assumption that attackers do not have exponential compute, and I hash all the passwords, and the plain text is never stored, cool, that seems like it holds. Now I have a causal story of why you should believe me when I tell you attackers can't exfiltrate plain text passwords. Now, if I implement this system to this specification and I fuck it up, I make a coding error or logs get stored in plain text or whatever, well then sure, then I messed up. So there's an important difference here between the specification and the implementation, and the boundedness exists in both.
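As a concrete illustration of the kind of specification being described, here is a minimal sketch in Python of the "hash the passwords, never store the plain text" design. The function names and parameters are illustrative choices, not anything prescribed in the conversation; the point is that the safety property (no plain text to exfiltrate) follows from the code structure plus the assumption that the hash can't be reversed.

```python
import hashlib
import hmac
import os

# username -> (salt, derived key); plain text passwords are never stored,
# so under the "attacker can't invert the hash" assumption, there is
# nothing to exfiltrate even if this whole table leaks.
_credentials: dict[str, tuple[bytes, bytes]] = {}

def register(username: str, password: str) -> None:
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    _credentials[username] = (salt, key)

def verify(username: str, password: str) -> bool:
    salt, key = _credentials[username]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # constant-time comparison to avoid a timing side channel
    return hmac.compare_digest(candidate, key)
```

The implementation-level caveat from the conversation applies directly: if a logging call somewhere writes the raw password to disk, the specification is still fine, but the implementation no longer upholds it.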
There are two types of boundedness: boundedness at the specification level and boundedness at the implementation level. At the specification level, it's about assumptions and deriving properties from those assumptions. At the object level, it's: can you build a thing that actually fulfills the specification? Can you build a system that upholds the abstractions that you put in the specification? You could have all these great software guarantees of safety, but if your CPU is unsafe because it has a hardware bug, well then you can't implement the specification. The specification might be safe, but if your hardware doesn't fulfill the specification, then it doesn't matter. So this is how I think about designing AGIs too. What I want is that when I have an AGI system that is said to be safe, I want a causal story that explicitly says: given these assumptions, which you can look at and judge whether you think they're reasonable enough, and given the assumption that the system I built fulfills the specification, here's a specification, here's a story defined in some semi-formal way that you can check and make reasonable assumptions about, and then I get safety properties out at the end of this. I get properties like: it will never do X, it will never cause Y, it will never self-improve, it will never break out of the box, it will never do whatever. Does this concept make sense so far?
Gus Docker: 39:48 It does, but does it mean that the whole system will have to be hard coded, kind of like good old fashioned AI, or is it still a machine learning system?
Connor Leahy: 39:58 Excellent question.
Gus Docker: 39:59 If it's still a machine learning system, does it inherit these inherent difficulties of understanding what machine learning systems are even doing?
Connor Leahy: 40:09 The truth is, of course, that in an ideal world where we have thousands of years of time and no limit on funding, we would solve all of this formally, mathematically, proof-check everything, blah blah blah. I don't expect this to happen. This is not what I work on. I just don't think this is realistic. I think it is possible, but I don't think it's realistic. Neural networks are magic in the sense that they use lots of magic, but they're still software systems, and there are some bounds we can state about them. For example, I am comfortable making the assumption that, running a forward pass, GPT-4 cannot Rowhammer RAM states using only that forward pass to escape onto the internet. I can't prove this. Maybe it can, there is some chance, but I'd be really surprised if that were true, really surprised. I would be less surprised if GPT-Omega from the year 9000, come backwards in time, could Rowhammer using a forward pass, because who knows what GPT-Omega can do, right? Maybe it can Rowhammer things, seems possible, but I would be really surprised if GPT-4 could do that. So now I have a bound, an assumption I'm willing to make about GPT-4. So let's say I have my design for my AGI, and at some point it includes a call to GPT-4, right? Well, I don't know what's happening inside of this call, and I don't really have any guarantees about the output. The output can be kind of any string. I don't really know. But I can make some assumptions about side channels. I can say, well, assuming I have no programming bugs, assuming there's no Rowhammer or whatever, I can assume it won't persist state somewhere else. It won't manipulate other boxes in my graph or whatever. So actually, the graph you're seeing behind me right now kind of illustrates part of this, where you have an input that goes into a black box, that box there, and then I get some output. Now, I don't really have guarantees about this output. It could be complete insanity, right? It could be garbage, it could be whatever. Okay, so I can make very few assumptions about this output. I can assume it's a string. That's something I can do, and that's not super helpful. So now an example of something I could do, this is purely hypothetical, just an example, is feed this into some kind of JSON schema parser. So let's say I have some kind of data structure encoded in this JSON, and I parse this using a normal, hard-coded, white box, simple algorithm. And if the output of the black box doesn't fit the schema, it gets rejected. So what do I know now? Now I know that the output of this white box will fulfill this JSON schema, because I understand the white box, I understand what the parsing is. So even though I have no guarantees about the output of the black box system, I do have some guarantees about what I have now. Now, these guarantees might be quite weak. They might just be type checking, right? But it's something. And now if I feed this into another black box, I know something about the input I'm giving to this black box. I do know things. So I'm not saying, oh, this solves alignment. No, no, no. I'm pointing to a motif, pointing to a vibe: there is a difference. There is a qualitative difference between letting one big black box do everything and having black boxes involved in a larger system. I expect that if it works, if we get to safe systems or whatever,
it will definitely not be one big black box, neither will it be one big white box. It will be a mix. We're gonna have some things that are black boxes, which you have to make assumptions about. So for example, I'm allowed to make the assumption, or I think it's reasonable to make the assumption, that GPT-4 cannot mount a side-channel Rowhammer attack, but I cannot make any assumptions beyond that. I can't make assumptions about the internals of GPT-4. This, though, again, is observer dependent. Magic is observer dependent. A superintelligent alien from the future might have the perfect theory of deep learning, and to them, GPT-4 might be a white box. They might look at it and fully understand the system, and there's no mystery here whatsoever. But to us humans, it does look mysterious. So we can't make this assumption. The property that differs between white boxes and black boxes is what assumptions we are allowed to reasonably make. And if you can make a causal story of safety involving the weaker assumptions of black boxes, then cool. Then you are allowed to use them. The important thing is that you can generate a coherent causal story in your specification, using only reasonable assumptions, about why the ultimate safety properties you're interested in should be upheld, why I should believe you. You should be able to go to a hypothetical super skeptical interlocutor and say, here are the assumptions, and then further say, assuming you believe these, you should now also believe me that these safety properties hold. And the hypothetical hyper skeptical interlocutor should have to agree with you.
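Here is a minimal sketch, in Python, of the hypothetical "JSON schema parser between black boxes" motif described above. The schema and field names are invented for illustration; the only point is that the white-box validation step lets you state a small, checkable guarantee (the downstream box only ever sees schema-conforming data) without knowing anything about the black box's internals.

```python
import json

# Illustrative schema: field name -> required Python type(s).
SCHEMA = {"task": str, "steps": list, "estimated_hours": (int, float)}

def validate_black_box_output(raw: str) -> dict:
    """White-box gate: accept the black box's string only if it parses as JSON
    and matches the expected schema; otherwise reject it outright."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("rejected: output is not valid JSON")
    if not isinstance(data, dict):
        raise ValueError("rejected: output is not a JSON object")
    for field, expected in SCHEMA.items():
        if field not in data or not isinstance(data[field], expected):
            raise ValueError(f"rejected: field {field!r} missing or wrong type")
    return data  # whatever reaches the next box is typed and schema-conforming

# Usage: wrap every hand-off between black boxes.
# plan = validate_black_box_output(call_llm(prompt))   # call_llm is hypothetical
```

The guarantee is deliberately weak, basically type checking, which matches the spirit of the argument: it doesn't solve alignment, it just gives you one small assumption you no longer have to take on faith.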
Gus Docker: 45:23 Do you imagine CoEms as a sort of additional element on top of the most advanced models, which interacts with these models and limits their output to what is humanly understandable or what is human-like?
Connor Leahy: 45:36 So we have not gotten to the CoEm part yet. So far, this is all background. I think probably any realistic safe AGI design will have this structure or look something like this. It will have some black boxes, some white boxes. It will have causal stories of safety. All of this is background information.
Gus Docker: 45:54 And why is it that all plausible stories will involve this? Is this because the black boxes are where the most advanced capabilities are coming from, and they will have to be involved somehow?
Connor Leahy: 46:00 At this current moment, I believe this, yes. Unless we get, for example, a massive slowdown of capability advancements that buys us 20 years of time or something, where we make massive breakthroughs in white box AI design. I don't expect that. They're just too good. They're just too far ahead. Again, this is a contingent truth about the current state of the world. It's not that you can't build one hypothetically; the alien from the future could totally build a white box AGI that is aligned, where everything makes sense and there's not a single neural network involved. I totally believe this is possible. It's just that it would use algorithms and design principles that we have not yet discovered and that I expect to be quite hard to discover, versus just stacking more layers.
Gus Docker: 46:53 Okay, so what more background do we need to get to cognitive emulations?
Connor Leahy: 47:00 So I think if we're on board with thinking about black boxes, white boxes, specification design, and causal stories, I think now we can move on. This is the part I didn't explain very well in the past, but I think what came before is mostly pretty uncontroversial. I think it's a pretty intuitive concept. I think this is not super crazy. If anyone gave you an AGI, you'd want them to tell you a story about why you should trust this thing and why you should run it. So I think this is a reasonable thing. I expect any reasonable AGI that is safe in any way will have to have some kind of story like this. So now we can talk a bit more about the CoEm story. CoEm is a more specific class of things that I think have good properties, that are interesting, and that I think are feasible. So now we can talk about those. I'm trying to separate the less controversial parts from the more controversial parts, and we're now gonna get to the more controversial parts, the ones I am also less certain of. I am quite certain that a safe AGI design will look like the things I've described before, but I'm much less certain about exactly what's gonna be in those boxes and how those boxes are combined. Obviously, if I knew how to build AGI, we'd be in a different world right now. I don't know how to do it. I have many intuitions and many directions. I have many ideas of how to make these things safe, but obviously I don't know. So I have some intuitions, powerful intuitions, and reasons to believe that there is this interesting class of systems, which I'm calling CoEms. Think of CoEms as a restriction on mind space. There are many, many ways I think you can build AGIs, and I think CoEms are a very specific subset of these. The idea of a CoEm, of cognitive emulation, is that you want a system that reasons like a human and fails like a human. There are a few nuances to that. The first nuance is that this by itself doesn't save you if you implement it poorly. If you just have a big black box trained on traces of human thought and just tell it to emulate that, that doesn't save you, because you have no idea what this thing is actually learning. You have no guarantees the system is actually learning the algorithms you hope it is instead of just some other crazy shoggoth thing, and that is what I would expect. So even if GPT-4's reasoning may superficially look like it, and maybe you train it on lots of human reasoning, that doesn't get you a CoEm. That's not what it is. A CoEm is fundamentally a system where you know that the internal algorithms are the kind you can trust. Do you
Gus Docker: 49:49 not think that because GPT models are trained on human-created data, and they are fine-tuned or reinforcement-learned from human input, that they will become more human-like?
Connor Leahy: 50:02 I mean, the smiley face will become more human-like, yeah.
Gus Docker: 50:05 But not the underlying model where the actual reasoning is going on?
Connor Leahy: 50:09 I don't expect that. To some marginal degree, sure. But look at how our models behave; they are not human. Just look at them. Look at how they interact with users, look at how they interact with things. They're fundamentally trained on different data. This is a thing people say: oh, but they're trained on human data. I'm like, no, they're not. Humans don't have an extra sense organ that only takes in randomly sampled symbols from the internet, with no sense of time, touch, smell, hearing, sight, anything like that, and no body. I'd expect that if you took a human brain, cut off all the sense organs except random token samples from the internet, and then trained it on that for 10,000 years, and then put it back in a body, I don't think that thing would be human. I do not expect that thing to be human. Even if it can write very human-looking things, I do not expect that creature to be very human, and I don't know why people would expect it to be. This is so far from how humans are trained. This is so far from how humans do things. I don't see why you would ever expect this to be human. If someone claims this would be human, the burden of proof is on them. You prove it to me, you tell me the story about why I should believe you. This seems, a priori, ridiculous.
Gus Docker: 51:32 Sometimes when people talk about GPTs, one way to explain them is to imagine a person sitting there reading a 100,000 books. But in your opinion, this is not at all what's going on when these systems are trained.
Connor Leahy: 51:46 No, I mean, it's more like you have a disembodied brain with no sense organs and no concept of time. There's no linear progression of time. It has a specialized sense organ, which has 30,000, 50,000, whatever, different states that can be on and off in a sequence, and it is fed millions of tokens randomly sampled from massive corpuses of internet text for a subjective tens of thousands of years, using a brain architecture that is already completely not human, trained with an algorithm that is not human, with no emotions or any of these other things that humans have pre-built. Humans have pre-built priors, emotions, feelings, a lot of structure built into the brain. None of those. This is not human. Nothing about this is human. Sure, it takes in data that to some degree has correlations to humans, sure, but that's not how humans are made. I don't know how else to put it, this is just not how humans are. I don't know what kind of humans you know, but that's just not how humans work and that's not how they're trained.
Gus Docker: 52:59 Let's get back to the CoEms then. How would these systems be different?
Connor Leahy: 53:04 So the way these systems would be different, and this is where we get into the more controversial parts of the proposal, is that there is a sense in which I think a lot of human reasoning is actually relatively simple. And what do I mean by that? I don't mean the brain isn't complicated; you know many things, facts are messy, etcetera. It's more something like, and don't take this literally, System 2 is quite simple compared to System 1. Human intuition is quite complicated. It's all these various muddy bits and pieces and intuitions. It's crazy. Implementing that in a white box way is, again, I think possible, but quite tricky. But I think a lot of what the human brain does in high level reasoning is use this very messy, non-formal system to try to approximate a much simpler, more formal system. Not fully formal, but a more serial, logical, computer-esque thing. The way I think of System 2 reasoning in the human brain is that it is a semi-logical system operating on a fuzzy, not fully formal ontology. So one of the main reasons I think that, for example, expert systems and logic programming AI have failed is not because this approach is fundamentally impossible, I think it's just very hard, but because they really failed at making fuzzy ontologies. The reasoning systems themselves could do reasoning quite well. There's a lot of reasoning these systems could do. There is some historical revisionism about how logic programming and expert systems failed entirely and couldn't reason at all. This is revisionism. These systems could do useful things, just not as impressive, obviously, as what we have nowadays or what humans can do. But what they lacked was a fuzzy ontology, a useful latent space. I think maybe the most interesting thing about language models is that they provide this. They provide this common latent space that you can map pictures and images and whatever into, and then you can do semantic operations, cognitive operations, on these in this space and then decode them into language. This is what I think language models and general cognition engines do. So I think these systems are the same kind of system, just less formal, much more bits and pieces. I think of GPTs as large System 1s that have all this semi-formal knowledge inside of them that they can use for all kinds of different things. And in the human brain, System 2 is something like recurrent usage of System 1 on a very low dimensional space, on language, where you can only keep about 7 things in short term memory and so on. But I think it actually goes even further than this. I mentioned this a bit earlier, but I think one of the big things that people miss is how much of human cognition is not in the brain. A massive amount of the cognition that happens is externalized. It's in our tools, it's in our note taking, it's in other people. I'm a CEO. One of the most important parts of my job is to move thoughts from my head into other heads and make sure they get thought. Because I don't have time to think all the thoughts. I don't have time to do that. My job is to find how I can put those thoughts somewhere else where they will get thought, so I don't have to worry about them anymore. So as a good CEO, you want your head to be empty. You wanna be smooth-brained.
You wanna think no thoughts. You're just a switchboard. You want all the thoughts to be thought, and you want to route those thoughts by priority to where they should be thought, but you don't want to be the one thinking them if you can avoid it. Sometimes you have to, because you're the one in charge and you have the best intuitions. But if someone else can think the thought for you, you should let them think the thought for you, if you can rely on them. One of my strong intuitions here is that this is how everyone works to various degrees. Especially as you become more high powered and more competent at delegation and tool use and structured thinking, a lot of thinking goes through these bottlenecks of communication, of note taking, language, etcetera, which by their nature are very low dimensional. Not that there's no complexity there. I'm just saying it's curious. There's all this interaction with the environment that doesn't involve crazy passing around of mega high dimensional structures. I think the communication inside your brain is extremely high dimensional. I think your inner monologue is a very bad representation of what you actually think, because I think within your own mind, you can pass around huge, complex concepts very simply, because you have very high bandwidth. I don't think this is the case between you and your computer screen. I don't think it's the case between you and your colleague. You can't pass around these super high dimensional tensors between each other. If you could, that would be awesome.
Gus Docker: 58:49 This is the phenomenon of having a thought and knowing maybe there's something good here, but not having put it into language yet. And maybe when you put it into language, it seems like an impoverished version of what you had
Connor Leahy: 59:01 in your head. Exactly. I think of the human brain as internally having very high dimensional, quote unquote, representations, similar to the latent spaces inside of GPT models. And there's lots of good information there, and trying to encode these things into the very low dimensional bottlenecks that we use is quite hard and forces us to use simple algorithms. Let's say we have an algorithm for science, a process for doing science, that requires you to pass around these full complexity vectors to all of your colleagues. It wouldn't work. You can't do this. Humans can't do this. So if you have a design for an AGI that can do science that involves, every step of the way, passing along high dimensional tensors, this is not how humans do it. This can't be how humans do it, because this is not possible. Humans cannot do this. So I think this is a very interesting design constraint. This is a very interesting property where you're like, oh, this is an existence proof that you don't need a singular massive black box with extremely high bandwidth passing around of immeasurable tensors, because humans don't do that and humans do science. There are parts of the graph of science that involve very high dimensional objects, the ones inside of the brain. Those are very high dimensional. But there is a massive part of the process of science that isn't. If I were an alien who had no idea what humans are, but I knew, oh, technology is being created, and I wanted to create a causal graph of how this happened, yeah, there are human brains involved in this causal graph, but a massive percentage of this causal graph is not inside of human brains. It is between human brains. It's in tools, it's in systems, institutions, environments, all these kinds of things. So from that perspective, and this I might be wrong about, but my intuition is that from the perspective of this alien observer, if they drew a graph of how this science happened, many of those parts would be white boxes, even if they don't understand brains, and many of these would be boundable. Many of these parts would not involve things that are too complex to understand. The algorithm that the little black boxes are running with each other has to be simple to some degree. It could still be complex from the perspective of the individual human, because institutions are complicated. But from the god's eye view, I would expect this whole thing is not that complicated. It's still quite complex, but it's not as complex as the inside of the brain. I expect the inside of the brain to be way more complicated than the larger system. Does that make any sense?
Nathan Labenz: 1:01:54 Yeah, let's see if I can kind of reconstruct how I would imagine one of these cognitive emulations working if this were to work out. So say we give the model a task of planning some complex action. We want to start a new company. And then the model runs, this is the big complicated model, and it comes up with something that's completely inscrutable to us. We can't understand it. Then we have another system interpreting the output of that model and giving us a 7 page document where we can check: if the model is right, then this will happen and this will not happen and this won't take longer than 7 days and so on. So kind of like an executive summary, but also a secure executive summary. Is that right, or...?
Connor Leahy: 1:02:48 No, that's not how I think about things. So once you have a step which is just "black box solves the problem", none of that, you're already screwed. If you have a big black box model that can solve something in one time step, you're screwed, because this thing can trick you. It can do anything it wants. There are no guarantees whatsoever about what the system is doing. It can give you a plan that you cannot understand. And the only system that would be strong enough to generate the executive summary itself would have to be a black box, because it would have to be smart enough to understand that the other thing is trying to trick you. So you can't trust any part of the system you just described. So we want the reasoning system to be integrated into how the plan is actually created? Yes, so what I'm saying is that there is an algorithm, or a class of algorithms, of epistemology, of human epistemology. So epistemology, the way I use the term, is the process you use to generate knowledge, to get good at a field of science. So it's not your skills at a specific field of science. It's the meta priors, the meta program you run when you encounter a new class of problems and you don't yet know how these problems are solved or how best to address them or what the right tools are. So you're a computer scientist all your life, and then you decide, I'm gonna become a biologist. What do you do? There are things you can do to become better at biology faster than other people. And this is epistemology. If you're very good at epistemology, you should be capable of picking up any new field of science, learning any instrument, picking up a new sport, whatever. Should be, not that you'll necessarily be good at it, maybe you do a sport and you notice, well, I actually have bad coordination skills or whatever, right? Sure. But you should have these meta skills of knowing what questions to ask, knowing what the common ways are that failures happen. This is a similar thing. I think a lot of people who learn lots and lots of math can pick up new areas of math quickly because they know the right questions to ask. They know the general failure modes, the vibes. They know what to ask. They know how to check for something going wrong. They know how to acquire the information they need to build their models, and they can bootstrap off of other general purpose models. There are many concepts, motifs, that are very universal, that appear again and again, especially in mathematics. Mathematics is full of these concepts of sequences and orderings and sets and graphs and whatever, right, which are not unique to a specific field, but are general purpose, useful, reusable algorithm parts that you can reuse in new scenarios. Usually as a scientist, when you encounter a new problem, you try to model it. You'd be like, all right, I get my toolbox of simple equations and useful models. Oh, I have some exponentials here. I've got some logarithms. I've got some dynamical systems or some equilibrium systems. I've got some whatever, right? And then you kind of mess around. You try to find systems that capture the properties you're interested in and you reason about these simpler systems. So this is another important point. I usually take the example of economics to explain this point. I think a lot of people are confused about what economics is and what the process of doing economics is and what it's for, including many economists.
So a critique you sometimes hear from laypeople is along the lines of, oh, economics is useless, it's not a real science, because they make these crazy assumptions, like that the market is efficient, but that's obviously not true, it can't be, so this is all stupid and silly and these people are just, whatever. And this is completely missing the point. The way economics works, and I claim, and I'm going to make the claim in a second, basically all of science works, is that what you're trying to do as a scientist, as an economist, is to find clever simplifications, small, simple things that, if you assume them or force reality to adhere to them, simplify an extremely high dimensional optimization problem into a very low dimensional space that you can then reason about. So the efficient market hypothesis is a great example of this. It's not literally true ever in reality. Of course it can't be, right? Because there's always going to be inefficiencies somewhere. We don't have infinite market participants trading infinitely fast. I mean, of course not. But the observation is that, oh, if we assume this for our model, just in our platonic fantasy world, if we did assume this is true, this extremely complex problem of modeling all market participants at every time step simplifies in many really cool ways. We can derive many really cool statements about our model from this. We can derive statements about how a minimum wage will affect the system, how a banking crisis will affect the system. I don't know, I'm not an economist, I'm just being hypothetical. So this is, I claim, the core of science. The core of science is finding clever, not-true things that, if you assume they are true or you can force reality to approximate them, allow you to do optimization, because basically humans can only do optimization in very, very, very low dimensional spaces. Another example of this might be agriculture. Let's say you're a farmer and you want to maximize the amount of food from your parcel of land and you want to predict how much food you'll get. Well, the correct solution would be to simulate every single molecule of nitrogen, all possible combinations of plants, every single bug, how it interacts with every gust of wind and so on. And if you could solve this problem, if you had enough compute, then yeah, you would get more food. You would probably get some crazy fractal arrangement of all these kinds of plants. It would probably look extremely crazy, whatever you produced. But obviously this is ridiculous. Humans don't have this much compute. You can't actually run this optimization. It's too hard. So instead, you make simplified models. You do monoculture. You say, well, all right, I assume an acre of wheat gives me roughly this much food, I've got roughly this many acres, and let's assume no flooding happens. And if you make these simplifying assumptions, now you can make a pretty reasonable guess about how much food you're gonna have in winter. But obviously, if any of those assumptions goes wrong, if it does flood, then your model, your specification, is out the window. The reason I'm going on this tangent is to bring it back to cognitive emulation, in that I'm trying to give the intuition for why you should at least be open to the idea that doing science works this way. When I think about cognitive emulations, I specifically think about the 2 examples of doing science and running a company. Those are 2 of the core examples that I try to use for a full cognitive emulation system. But let's focus on the doing science one.
That's the one I usually have in the back of my mind. I know I've succeeded if I have a system that can do any level of human science without killing me. That would be my mark of success, that cognitive emulation has succeeded. Very important caveat, by the way: cognitive emulation is not a full alignment solution. I expect that what a cognitive emulation system, if it works, would look like is that if it is used by a responsible user who follows the exact protocols of how you should use it and does not use it to do extremely crazy things, then it doesn't kill you. That's the safety property I'm looking for. The safety property is not, it will always do the right thing and is completely safe no matter what the user does. This is not the safety property I think cognitive emulations have. I think it is possible to build systems like that, but they're much, much harder. And if cognitive emulation succeeds, then that's the next step, to go towards that kind of thing. So if you tell a cognitive emulation to shoot your leg off, it shoots your leg off. It doesn't stop you from shooting your leg off. Of course, ideally, if we ever have super powerful superintelligences, you'd want them to be of the type that refuses to shoot your leg off, but that's much harder.
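As a minimal sketch of the simplification idea in the farming example above (not from the conversation; all numbers and assumptions are invented for illustration): the full problem is intractable, while a model built on a few explicit assumptions collapses to something a person can reason about, and those assumptions tell you exactly when the model stops applying.

```python
# Illustrative sketch only: how simplifying assumptions collapse a
# high-dimensional prediction problem into a low-dimensional model.
# All numbers and assumptions here are invented for illustration.

def full_simulation():
    """The 'correct' answer would require simulating every molecule of
    nitrogen, every bug, every gust of wind. Far beyond available compute."""
    raise NotImplementedError("intractable: the state space is astronomically large")

def simplified_yield_estimate(acres, yield_per_acre=3.0, flood=False):
    """Low-dimensional model under explicit assumptions:
    - monoculture, so one average yield-per-acre number (tonnes) suffices
    - no flooding (if that assumption breaks, the model is out the window)
    """
    if flood:
        raise ValueError("assumption violated: the model no longer applies")
    return acres * yield_per_acre

# With the assumptions stated, the farmer can reason about the estimate and,
# crucially, knows exactly which assumption failing would invalidate it.
print(simplified_yield_estimate(acres=40))  # -> 120.0 tonnes, under the stated assumptions
```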
Nathan Labenz: 1:11:40 Could you explain more about this connection between the simplifications that we get in science and cognitive emulations? So do we expect, or do you hope, that cognitive emulations will be able to create these simplifications for us? And how would this work? And why would it be great?
Connor Leahy: 1:12:02 So the way I think about this is that the thing humans do to generate these simplifications, the claim I'm making here, is something that, if you have the fuzzy ontology, if you have language models to build upon, you can build as an additional thing on top of them. It does not have to be inside of the model. Now, this might not be true. I might be wrong about this. There are some people who say, no, actually the process of epistemology, the process of science in this regard, is so complex that even if you have a language model helping you, it's too hard, you can only do it using crazy RL, whatever. If that's the case, then cognitive emulation doesn't work. I'm like, yeah, then it doesn't work. I'm making the claim that I think there's a lot of reasons to believe that, with some help, some bootstrapping from language models, you can get to a point where the process of science that is built on top of them is legible, and you have a causal story of why you should trust it. So it's not that a black box spits out a design and you have another black box checking it for you; it's that you interactively, iteratively build up the scientific proposal and you understand why you should trust it. You get a causal story for why you should believe this. The same way that in human science, you have your headphones on, right? And you expect them to work. This is mostly based on trust. But if you wanted to, you could find the causal story about why they work. You could find the blueprints, you could find the guy who designed them, you could check the calculations, you could reverse engineer it, assuming everyone cooperated with you and they shared their blueprints with you and you read all the physics textbooks and whatever. There is a story. It is legible; none of these steps involve superhuman capabilities. There is no step here that is unfathomable to us humans. And the reason is that otherwise it wouldn't work. Humans couldn't coordinate around building something that they can't somehow communicate to other people. So the headphones you're wearing were not built by one single guy who cannot explain to anyone where they came from. They have to be built in a process that is explicable, understandable, and functional for other people to understand as well, and that is very low dimensional. Now, I'm not saying it has to be legible to everybody in all scenarios, anything like that, or that it's even easy. It might still take lots of time, but there's no crazy god level leap of logic. It's not that someone sat down, thought really hard, and then spontaneously invented a CPU. That's not how science works. We sometimes think of it that way, that, oh, these fully formed ideas just kind of crashed into existence and everyone was in awe. But that is just not how science is actually done by humans. I think it would be possible to do science this way. I think superhuman intelligences could do this, but it is not how humans do it.
Nathan Labenz: 1:15:19 Where in the process does the limit come in? So are we still imagining some system reading the output of a generative model or is it more tightly integrated than that? Is it perhaps 100 steps where humans can read what's going on along the way?
Connor Leahy: 1:15:38 Yeah, so the truth is, of course, I don't know, because I haven't built anything like this yet. My intuition is that yes, it'll be much more tightly integrated. There'll be language models involved, but they're doing relatively small atomic tasks. They're not solving the whole problem and then you check it. It's like they're doing atomic subparts of tasks, which are integrated into... So I expect a CoEM, I like to talk about CoEM systems, they're not models, they're systems. In a way, when I think about designing a CoEM system, what I'm trying to say is I'm kind of trying to integrate back software architecture and distributed systems and traditional computer science thinking into AI design. I'm saying that the thing that humans do to do science is not magical. This is a software process, this is a cognitive, computational process that is not sacred. This is a thing you can decompose. And I'm also claiming further that you can decompose it iteratively. You don't have to decompose everything at once, because we have these crazy black box things which can do lots of the hard parts. So you could start with just using those. The way I think about building CoEMs is you start with just a big black box. You start with just a big language model and try to get it to do what you want. The next step is you're like, all right, well, how can I break this down into smaller things that I understand? How can I call the model? How can I make the model do less of the work? I like to think of it as you're trying to move as much of the cognition, not just the computation, but the cognition, as possible from black boxes into white boxes. You want as much as possible of the process of generating the blueprint to happen inside of processes that the human understands, that you can understand, that you can check. Then you also have to bound the black boxes. If you have all these great white boxes, but there's still a big black box at the end that does whatever it wants, you're still screwed. So this is why the specification and the causal story is important. So ultimately, what I expect a powerful CoEM system to look like is a system of many moving parts that have clear interfaces between them. You have a clear specification, a story about how these systems interact, why you should trust what those outputs are, that they fulfill the safety requirements you want them to fulfill, how these things work, and why these systems are implementing the kind of human epistemology that humans use when they're doing science. They're not implementing an arbitrary algorithm that solves science. They're implementing the human algorithms that solve science. This is different from GPT systems. GPT systems, I expect, will eventually learn how to do science, and partially they already can. But I don't expect by default that they will do it the way humans do, because I think there are many ways you can do science. And what we want with CoEMs, therefore cognitive emulation, is to emulate the way humans do this. The reason we want to do this is because, A, it gives us bounds. We won't have these crazy things that we can't understand. We can kind of deal with human levels, right? We know how humans work, we're human level, we can deal with human level things to a large degree. And it makes the system understandable. It gives us a causal story that is human readable and human checkable as necessary.
Of course, in the ideal world, your specifications should be so good that you don't need to check it once you've built it. Any safety proposal that involves AGI has to be so good that you never have to run the system to check it. If you have to do empirical testing on AGI, you're screwed. Your specifications should be so good that you know ahead of time that once you turn it on, it'll be okay.
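A minimal sketch of the "move cognition from black boxes into white boxes" idea as described here. The `call_llm` function is a hypothetical placeholder for any language-model API, and nothing in this sketch is Conjecture's actual design; the point is only that the model handles small atomic subtasks while the decomposition, bounding, and checking live in explicit code a human can read.

```python
# Minimal sketch of "white-box orchestration, black-box atoms".
# call_llm is a hypothetical placeholder for a language-model API call;
# this is not Conjecture's actual design, just an illustration.

def call_llm(prompt: str) -> str:
    """Black box: performs one small, atomic subtask (placeholder output)."""
    return f"[model output for: {prompt[:50]}]"

def research_step(question: str, papers: list[str]) -> dict:
    # White box: the decomposition, routing, and checking are explicit code,
    # so there is a human-readable causal story for how the output was produced.
    summaries = [call_llm(f"Summarize the key claim of: {p}") for p in papers]

    # Bound the black box: drop outputs that violate a simple, checkable spec.
    summaries = [s for s in summaries if s and len(s) < 1000]

    hypothesis = call_llm(
        "Propose one testable hypothesis consistent with these claims:\n"
        + "\n".join(summaries)
    )

    # Every intermediate artifact is kept so a human can audit the chain
    # rather than trusting a single end-to-end black box.
    return {"question": question, "summaries": summaries, "hypothesis": hypothesis}
```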
Nathan Labenz: 1:19:40 Isn't that an impossibly difficult standard too? I mean, this seems almost impossible to live up to.
Connor Leahy: 1:19:47 I totally disagree. I just totally disagree. I think it's hard, but I don't think it's impossible by any means. Because, again, this is not a formal guarantee. I'm talking about a story, a causal story, the specification. This is like saying, is it impossible to have a system where passwords don't leak? And I'm like, sure, in the limit, yes. If your enemy is magical gods from the future who can directly exfiltrate your CPU states from 1,000 miles away, then yeah, in that case, you are screwed. And I expect similarly, this is why the boundedness is so important and these assumptions are so important. If you have a GPT rowhammering super god, then yeah, you're screwed. Then I do think it is impossible, but that's not what I'm talking about. This is why the boundedness to human level is so important. It is so important that no parts of these systems are superhuman and that you don't allow superhuman levels of things. You want to aim for human and only human, because this is something we can make assumptions about. You cannot make assumptions about superhuman intelligence, because we have no idea how it works. We have no idea what it's capable of. We don't know what the limits are. So if you made a CoEM superhumanly intelligent, which I expect to be straightforwardly possible by just changing variables, then you're screwed. Then your story won't work.
Nathan Labenz: 1:21:08 And then you die. Should we think of CoEMs as companies or research labs where each, say, employee is bounded and thinks like a human and they all report to the CEO and every step is understandable by the CEO, which is analogous to the human user of the CoEM system?
Connor Leahy: 1:21:30 I think this is a nice metaphor. I don't know if that's literally how they will be built, but I think this is a nice metaphor for how I would think about this. If you had a really good CoEM, a really good full CoEM system, what it should do is that it shouldn't produce a 1000x AGI and it shouldn't make the user 1000x smarter. What it should do is make the user function like 1000 1x AGIs. It should make you parallelizable, not serially intelligent, because if you're 1000x intelligent, who knows? That is dangerous. What it should do is work like a company, where the CEO is parallelizing himself across a large company of 1000 smart people. That's what I want CoEMs to do. I want them to parallelize the agency, the intelligence of the human into 1000 parallel 1x AGIs that are not smarter than human, that are bounded, that are understandable, and for which you have this causal story of why you should trust them.
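As a sketch of the "parallelize, don't amplify" point (hypothetical; `run_bounded_worker` and the step limit are invented for illustration): the human principal fans a goal out across many bounded, human-level workers instead of handing it to one deeper, smarter reasoner.

```python
# Sketch of "1000 parallel 1x workers" rather than one 1000x reasoner.
# run_bounded_worker and MAX_STEPS_PER_WORKER are invented placeholders.
from concurrent.futures import ThreadPoolExecutor

MAX_STEPS_PER_WORKER = 10  # each worker stays within a human-auditable depth

def run_bounded_worker(subtask: str) -> str:
    """One bounded, human-level, inspectable worker (placeholder)."""
    return f"[result of '{subtask}' in <= {MAX_STEPS_PER_WORKER} steps]"

def delegate(subtasks: list[str]) -> list[str]:
    # Breadth instead of depth: the human decomposes the goal, the workers
    # each handle one bounded piece, and the human integrates the results.
    with ThreadPoolExecutor(max_workers=100) as pool:
        return list(pool.map(run_bounded_worker, subtasks))

print(delegate(["survey prior work", "draft experiment plan", "estimate costs"]))
```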
Nathan Labenz: 1:22:29 And that's the key point, I think, because for each subcomponent of this CoEM system, each employee in the research lab or the company, how do we know whether they operate in a human-like way? It seems like this question could be asked of the system at large, but we could also ask it of a subcomponent. It seems that we have the same problem for both.
Connor Leahy: 1:22:45 This is quite difficult, but basically my intuition is that... So the problem with talking about employees, where the corporation metaphor doesn't quite work, is that another unfortunate side effect of calling it an emulation is that this implies more than what I mean. When I talk about emulating a human, I don't mean a person. I don't mean it's emulating a person. It doesn't have emotions. It doesn't have values. It doesn't have an identity. It's more like emulating a platonic human, or like a platonic neocortex. It's more like a platonic cortex with no emotions, no volition, no goals. It's like an optimizer with no goal function. You've just ripped out all the emotions, all of those things. It's just a thinking blob. And then you plug in the user as a source of agency. The human becomes the emotional, motivational center. The CoEM is just the thinking blob. And there are reasonable people who are not sure this is possible. Others do think it's possible. So this is where this overlaps with the cyborg research agenda, in case you've heard of that from Janus and other people, where the idea is you hook humans up to AIs, to control them, to make humans super smart. Where CoEM differs from the cyborg agenda is that in the cyborg agenda, they hook humans up to alien cortex. Well, I say no. We build something that works like human cortex, an emulation of human cortex. The implementation is not human, but the abstraction layer exposed is human, and you hook that up to a user. You have the user use emulated human cortex. It's not simulated human cortex. That would be even better, but that's probably too hard. I don't think we can do simulated cortex in a reasonable time, but if we could, that'd be awesome. If we could do whole brain emulation, that'd be awesome, but I just think it's too hard. So the final product, if it would work, would look something like the user using this emulated, emotionless, just raw cognition stuff to amplify the things they wanted to do. To also just add to that metaphor, there are some very interesting experiments, for example, with decorticated rats, where if you remove the cortex, so the thinking part of the brain, the wrinkly part, they're mostly still kind of normal. They walk around, they eat food, they sleep, they play. You don't really see that much of a difference, because the emotional parts are still there. If you remove the emotional and motivational part, they just become completely catatonic, they just die. The human brain is similar, it has the same structure. We have this big wrinkly part, which is something like a big thinking blob that does unsupervised learning. And then you have deeper motivational circuits, emotions, instincts, hard coded stuff, which sit below that. And the cortex learns from these things and does actions steered by these emotional centers. So it's not exactly system 1, system 2, it's a bit more fuzzy than that. This is also just an intuition.
Nathan Labenz: 1:26:22 Yeah, so let's say we have this CoEM system, which is where the metaphor is a company or a research lab with a lot of employees with a normal human IQ, as opposed to having 1 system with an IQ of 1000, whatever that means. Isn't there still a problem of this system just thinking much, much faster than we do? So imagine being in competition with a company where all of the employees just think a 100 times faster than you do. Won't speed alone make the system capable and therefore dangerous?
Connor Leahy: 1:26:57 So there's a difference between speed and serial depth. This is something I'm not sure about. My feeling is that speed is much less of a problem than serial depth. By serial depth, I mean how many consecutive steps along a reasoning chain a system can do. I think this is very dangerous. I think serial depth is where most, not most, but a very large percentage of the danger comes from. I think the thing that makes super fast thinking things so dangerous is that they can reach unprecedented serial depths of reflection and thinking and self improvement much, much, much, much faster. And yes, I expect that if you build a CoEM system that includes some component that can self improve and you allow it to just do that, then yeah, it's not safe. You fucked up. If you build it that way, you have failed the specification. You're screwed, probably.
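To make the serial-depth point concrete, a minimal sketch (hypothetical names, not a real safety mechanism): cap how many consecutive reasoning steps the system may chain before control returns to a human, so that depth, rather than raw speed, stays bounded.

```python
# Sketch of bounding serial depth: limit consecutive reasoning steps before
# control returns to a human. Names and the cap value are illustrative only.
from typing import Callable

MAX_SERIAL_DEPTH = 5  # illustrative cap, not a real safety threshold

def bounded_chain(task: str, step: Callable[[str], str]) -> list[str]:
    """Run at most MAX_SERIAL_DEPTH consecutive steps, keeping the full trace."""
    trace = []
    state = task
    for _ in range(MAX_SERIAL_DEPTH):
        state = step(state)   # one atomic (black-box) reasoning step
        trace.append(state)   # keep the whole chain auditable
    # The system stops here instead of deepening its own chain indefinitely;
    # a human reviews the trace before any further steps are allowed.
    return trace
```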
Nathan Labenz: 1:27:51 I wonder if perhaps building CoEM systems might be a massive strength in the market. So I wonder if perhaps there will be an intense demand for systems that act in human-like ways, because the CEOs of huge companies would probably want to be able to talk to these systems in a normal way, to understand how these systems work before they're deployed. There's a sense in which CoEMs will perhaps integrate nicely into a world that's already built for humans. So do you think there's some kind of win here, where there will be a lot of demand for human likeness in AI systems?
Connor Leahy: 1:28:35 There is an optimistic and there is a pessimistic version of this. The optimistic version is, well, yeah, obviously, we want things that are safe and that do what we want and that we can understand, obviously. The best product that could ever be built is an aligned AGI. That is the best product. There is nothing better. CoEMs are not fully aligned superintelligence. I never claimed they would be, and if anyone used them that way, that would be really bad. You should not use them that way. That was never the goal. You should use these things to do science, to speed up nanotechnology, to create whole brain emulations, or to do more work on alignment or whatever. You should not use them to, oh, just let the CoEM optimize the whole world or whatever. No, that's not what you're supposed to do. And if you do that, you've died and bad things happen. So would there be demand for these systems? I expect, yeah. This is an incredibly powerful system still. If you use a CoEM and you use it correctly, what do you get? Imagine you could just have a perfectly loyal company, staffed exclusively by John von Neumanns, that does everything you want it to do. That is unimaginably great. Of course, there's a pessimistic version. The pessimistic version is: lol, doesn't matter, because by that point you're going to have 100x John von Neumann GPT-f, which then promptly destroys the world.
Nathan Labenz: 1:30:02 But won't there be demand for safety? From governments, from companies. Who would deploy a system that is uncontrolled, where we can't reason about it, we don't know how it works, if we get to a point where these systems are much more powerful than they are today?
Connor Leahy: 1:30:25 Hopefully. So a lot of the work I do nowadays is sadly not in the technical realm, but in policy and communications. I've been talking to a lot of journalists and politicians and so on for exactly these reasons, because we have to create the demand for safety. Currently, let me be clear about what the current state of the world is. The way the current world is looking, we are in a death race towards the bottom, careening towards a precipice at full speed, and we won't see the precipice coming until we're over it. And this is led almost entirely by a very, very small number of people, techno optimists, techno utopians, people in the Bay Area and London who are extremely optimistic, or at least willfully in denial about how bad things are, or who can galaxy brain themselves and say, well, it's a race, it's a race, it's not my fault, I just have to do it anyways, whatever. I'm kind of at the point that I don't really care why people are doing these things. I only care that they're happening. People are doing these things that are extremely dangerous, and this is a very small number of people. And there's this myth among these people that they're like, oh, we have to do it, people want it. This is just false. If you talk to normal people and you explain to them what these people believe... When most people hear the word AGI, what they imagine is human-like AI. They think it's your robot buddy. He thinks like you. He's not really smarter, but he has human emotions. That is not what people at organizations like OpenAI or DeepMind think when they say the word AGI. When they say AGI, they mean godlike AI. They mean a self-improving, non-human, incredibly powerful system that can take over the government, can destroy the whole human race, et cetera. Now, whether or not you believe that personally, these people do believe this. They have publicly stated this on the record. This is what they're doing. And once you inform people of this, they're like, what the shit? Absolutely fucking not. What? No. Of course you can't build God AI. What the fuck are you doing? Where's the government? How are we in a world where people down in San Francisco can just publicly talk about how they're going to build systems that have a 1%, 5%, 20%, 90%, whatever risk of killing everybody, that will topple the US government, whatever, and actually work on this and get billions in funding? And the government is just like, cool. We are not in a stable equilibrium, and it is now starting to flip. People are starting to freak the fuck out, because they're like, holy shit, A, this is possible, and B, absolutely fucking not. And this gives me some hope that we can slow down and buy some more time. I don't think this saves us, but it can still get us some time.
Nathan Labenz: 1:33:38 So if we don't take the fast road towards the precipice and we succeed in building CoEMs instead, for example, is there still a difficult unsolved problem, namely going from an aligned human-like CoEM to an aligned superintelligence? Is there still something that's very difficult to solve there, and perhaps the core of the problem is still unsolved?
Connor Leahy: 1:34:04 Oh, absolutely. Assuming we're not dead, assuming we have a safe, not an aligned one, a safe CoEM system, we're not out of the woods. Then the world gets really messy. Look, the world is going to get really messy. This is the least weird your life is going to be for the rest of your life. Things are not going back to quote unquote normal. Things are only going to get weirder. This is the least bad things are going to be for the rest of your life. Things are only going to get weirder from here. There is a power struggle that is coming, right? And there is no way to prevent this, because it is about power. There are these incredibly powerful systems being built. There are people racing for incredible powers. There is conflict. There is fighting. There is politics. There is war. These things are inevitable. There is no way that this goes cleanly. There is no way that things will go smoothly and people get along and things will be fine. No. There are going to be unimaginable levels of struggle to decide what will happen and how things will happen. And most of those ways are not going to end well. I have said this before, and I will say it again: I do not expect us to make it out of this century alive. I'm not even sure we'll get out of this decade. I expect that by default, things will go very, very badly. They don't have to. So the weird thing, which I think to myself, is that we live in a very strange timeline. We're in a bad timeline, let me be clear, things are going really badly right now, but we haven't yet lost. We're just quite curious. There are many timelines where it's just so over. It's just totally over. Nothing you can do. Everyone is on board with building AGI as fast as possible. The military gets involved and they're all gung ho about it or whatever and nothing can stop it. Or World War III breaks out and both sides race to AGI and whatever. If that was the case, then it would be so over. But it's not over. It's not over yet. It might be soon, though. But currently, yeah. Let's say we get coordination, we slow things down, we build CoEM systems. Let's also assume furthermore that we keep these systems secure and safe and they don't get immediately stolen by every unscrupulous actor in the world. Then you have very powerful systems, which you can do very powerful things with. In particular, you can create great economic value. So one of the first things I would do with a CoEM system, if I had one, is I would produce massive amounts of economic value and trade with everybody. I would trade with everybody. I'd be like, look, whatever you want in life, I'll get it for you. In return, don't build AGI. I'll give you the cure for cancer, lots of money, whatever you want. I'll get you whatever you want. In return, you don't build AGI. That's the deal. It's the deal I offer to everybody. And then, conditioning on the slim probability of this going well, which I think is slim, probably the way it would actually work is we fuse with one or more national governments, work closely together with authority figures, with politicians, military, intelligence services, et cetera, to keep things secure, and then work securely on the hard problem. So now we have the ability to have 1000 John von Neumanns working on alignment theory, on formally verified safety, and so on.
We trade with all the relevant players to get them to slow down and coordinate, and we have the backing of government or military intelligence service security so that bad actors are interrupted. That's kind of the only way I see things going well. As you can tell, as the usual saying goes, any plan that requires more than two things to go right will never work. Unfortunately, this is a plan that requires more than two things to go right. So I don't expect this plan to work.
Nathan Labenz: 1:38:13 Yeah. But let's hope we get there anyway. Connor, thank you for coming on the podcast. It's been super interesting for me.
Connor Leahy: 1:38:20 Pleasure as always.