In this episode of The Cognitive Revolution, Nathan explores groundbreaking perspectives on AI alignment with MIT PhD student Tan Zhi Xuan. We dive deep into Xuan's critique of preference-based AI alignment and their innovative proposal for role-based AI systems guided by social consensus. The conversation extends into their fascinating work on how AI agents can learn social norms through Bayesian rule induction. Join us for an intellectually stimulating discussion that bridges philosophical theory with practical implementation in AI development.
Check out:
"Beyond Preferences in AI Alignment" paper: https://arxiv.org/pdf/2408.169...
"Learning and Sustaining Shared Normative Systems via Bayesian Rule Induction in Markov Games" paper: https://arxiv.org/pdf/2402.133...
Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse
SPONSORS:
Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitivere...
Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, at 50% less cost for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive
RECOMMENDED PODCAST:
Unpack Pricing - Dive into the dark arts of SaaS pricing with Metronome CEO Scott Woody and tech leaders. Learn how strategic pricing drives explosive revenue growth in today's biggest companies like Snowflake, Cockroach Labs, Dropbox and more.
Apple: https://podcasts.apple.com/us/...
Spotify: https://open.spotify.com/show/...
CHAPTERS:
(00:00:00) Teaser
(00:01:09) About the Episode
(00:04:25) Guest Intro
(00:06:25) Xuan's Background
(00:12:03) AI Near-Term Outlook
(00:17:32) Sponsors: Notion | Weights & Biases RAG++
(00:20:18) Alignment Approaches
(00:26:11) Critiques of RLHF
(00:34:40) Sponsors: Oracle Cloud Infrastructure (OCI)
(00:35:50) Beyond Preferences
(00:40:27) Roles and AI Systems
(00:45:19) What AI Owes Us
(00:51:52) Drexler's AI Services
(01:01:08) Constitutional AI
(01:09:43) Technical Approach
(01:22:01) Norms and Deviations
(01:32:31) Norm Decay
(01:38:06) Self-Other Overlap
(01:44:05) Closing Thoughts
(01:54:23) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Tan Zhi Xuan: (0:00) What we argue in our paper, Beyond Preferences in AI Alignment, is really a critique of this preferences view. So we go through all the limitations of taking this expected utility maximization view of both human rationality and AI alignment too seriously. People know that this learned utility function you try and learn from preference data doesn't perfectly capture what people really want, and that leads to issues of overoptimization. Because it is a bad proxy for what humans might supposedly really want in that context, overoptimizing it is not going to get you there. I prefer thinking in terms of, like, what would it take to automate the industrial economy, or automate 50 to 80% of the existing industrial economy, because I think more industries will come to exist in the future. If that's the way of thinking about AI, then I think we are not going to get there for, like, a decade or two. When building moral systems, there's a basic kind of minimal morality that the system should comply with, which is: meet the minimum moral standards that society would agree to allow you to operate. Right? And that's sort of gonna be filled in by the contractualist picture. Right? And I think constitutional AI is closer to that.
Nathan Labenz: (1:09) Hello, and welcome back to the Cognitive Revolution. Today, I'm excited to share a conversation on AI alignment that spans the fields of moral philosophy, cognitive science, and Bayesian probabilistic programming. My guest is Tan Zhi Xuan, a PhD student at MIT whose work questions the assumptions that underlie today's most popular AI alignment strategies and also proposes novel technical implementations by which AI agents might learn social norms from examples in their environments. We begin on the philosophical side with Xuan's recent paper, Beyond Preferences in AI Alignment, which critiques the prevailing preferentist paradigm, that is, the idea that AI systems should be aligned to satisfy human preferences through techniques like reinforcement learning from human feedback. They argue that because human preferences are often inconsistent and difficult to aggregate across populations, preference maximization may simply be the wrong framework for AI alignment, and that, in any case, today's AI systems aren't really being trained as pure preference maximizers anyway. Instead, they argue for an approach whereby AI systems are designed to play specific roles with clear normative standards and constraints that emerge through social consensus, much like how human professionals are expected to uphold certain standards regardless of their or their clients' personal preferences. To better understand this view and its implications, I bring a number of different moral philosophies and alignment strategies into the conversation, asking Xuan to explain how their proposal compares and contrasts with each. While hardly the final word on AI alignment, I do think Xuan's ideas deserve serious consideration. If nothing else, by thoughtfully combining Eastern and Western traditions, they contradict prominent claims of incompatibility between US and Chinese approaches, including from no less than Sam Altman, who recently wrote in a Washington Post op-ed about the relationship between Western and Chinese governance of AI that, quote, there is no third option. In the second half of this episode, we shift gears to discuss Xuan's much more technical paper, Learning and Sustaining Shared Normative Systems via Bayesian Rule Induction in Markov Games. In this project, they demonstrated an approach that allows AI agents to infer social norms by noting apparent deviations from purely self-interested behavior in other agents. For example, if an agent repeatedly sees other agents passing on opportunities to obtain resources, it may infer that there is a rule or norm governing that behavior and begin to incorporate compliance with that rule into its own decision making. This creates a mechanism for norms to emerge and to sustain themselves across generations of agents, allowing whole populations to effectively cooperate to avoid tragedy-of-the-commons-type problems like overfishing and other resource depletion. Xuan's work exemplifies an important goal that I have in making this show: to understand AI from all angles, exploring not just what beneficial AI might look like in principle, but how we might actually begin to build it in practice. If you're finding value in the show, we'd, of course, appreciate it if you take a moment to share it online, write a review on your podcast app, or leave us a comment on YouTube. And we always welcome your feedback and guest or topic suggestions, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network.
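For readers who want a concrete picture of the mechanism just described, here is a minimal illustrative sketch, not the paper's actual implementation, of Bayesian rule induction: an observing agent maintains beliefs over candidate rules and updates them each time it sees another agent pass up an available resource. The rule names, priors, and likelihoods below are invented purely for illustration.

```python
# A minimal, illustrative sketch of Bayesian rule induction (not the paper's code):
# an observer updates its belief that a norm forbids taking a resource, based on
# how often other agents pass it up.
candidate_rules = {
    "no_norm": 0.5,          # prior: agents simply act in self-interest
    "forbid_harvest": 0.5,   # prior: a norm forbids harvesting this resource
}

# Assumed likelihood of seeing an agent skip the resource under each hypothesis.
P_SKIP = {"no_norm": 0.1, "forbid_harvest": 0.95}

def update_beliefs(beliefs, skipped):
    """One Bayesian update after observing whether an agent skipped the resource."""
    posterior = {}
    for rule, prior in beliefs.items():
        likelihood = P_SKIP[rule] if skipped else 1.0 - P_SKIP[rule]
        posterior[rule] = prior * likelihood
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}

beliefs = dict(candidate_rules)
for skipped in [True, True, True]:   # three agents in a row pass on the resource
    beliefs = update_beliefs(beliefs, skipped)

print(beliefs)  # belief in "forbid_harvest" now dominates, so the observer complies too
```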
Now, here's Tan Zhi Xuan on AI alignment philosophy and application.
Nathan Labenz: (4:25) Tan Zhi Xuan, PhD student at MIT researching AI alignment, probabilistic programming, and cognitive science. Welcome to the Cognitive Revolution.
Tan Zhi Xuan: (4:34) Very happy to be here. Thanks, Nathan.
Nathan Labenz: (4:36) You have caught my attention with a number of interesting research papers and points of view that you've shared online, and so I'm glad that we're finally making this happen. What I really wanna do is dig into, especially, two papers that you've put out this year. One is, I mean, you can characterize it perhaps better than me, but I would say sort of a survey of the prevailing approaches to AI alignment in the big picture, and also kind of why you think a lot of them are ultimately not going to work out great and what you think the right alternative should be. And then I also really want to get into some of the more technical work that you've done, where you've actually created instantiated compute-driven environments where little agents run around and actually learn various forms of alignment and good behavior from one another. And I love it when people bring together both the kind of philosophical and big-picture angle, but also are really down in the weeds trying to implement. So hopefully we can cover those both.
Tan Zhi Xuan: (5:39) Yeah. For sure. And it's always been important for me, I think, to sort of bridge over the philosophical discussions that have been happening for a long time, but really try and translate them for a more technical audience. Right? Because I do think a lot of these ideas just don't get across because of a lack of translation, and that was in large part what this recent survey paper was trying to do.
Nathan Labenz: (6:00) Yeah, totally. It's getting real now, in brief, right? Just for starters, I actually normally don't even ask people too much for background. A lot of entrepreneurs or whatever, when I do that, they're like, Oh, well, ChatGPT came out and I realized AI was going to be a big deal, and so I decided to start a company. But I do think that so much of the AI discussion in general is kind of myopic or just very narrowly focused. You come from the other side of the world, Singapore originally, if I understand correctly. And, with apologies to Tyler Cowen, I always remember him saying all thinkers are regional thinkers. So I'd love to just hear a little bit of your backstory, how you got into this, and to what degree and how you think your background on the other side of the world from where we are now has been relevant in shaping your thinking.
Tan Zhi Xuan: (6:49) Yeah. For sure. And I do think this Tyler Cowen quote is true of me to some degree, I guess. I do think, in fact, all thinkers are regional thinkers whether they realize it or not. As for how that's influenced me, there's such a long story of how I got into AI safety and alignment. There are many ways in which that story is not a very different one from a lot of people in the field. Like, I did my undergraduate degree in The US, and I encountered an effective altruist club as part of that. I actually ran it for a bit despite, you know, always having been slightly too leftist to really buy in fully to effective altruism. That's when I encountered a lot of ideas around AI safety and alignment. Basically, the way I describe it is I wasn't necessarily convinced of the moral argument for working on this as the most important thing from an effective altruist or utilitarian point of view, but it just seemed like a really interesting question, because I was already a big fan of moral philosophy and this project of, okay, how could you begin to computationally model value learning? What does it mean to specify a computational process for learning human values? How do people do that? That's basically what drew me into working in the field. But I think the way I've approached that question has been influenced, I don't know, by perhaps my cultural, political, and national background in various ways. I do think of myself as a postcolonial thinker, I suppose, and part of that manifests in a general distrust towards Western philosophy and Western philanthropy as the sole sources of philosophical knowledge about what could possibly be relevant to thinking about how to align AI. I think a nice representation of that view is a talk I gave maybe almost 5 years ago now on the relevance of philosophical pluralism for AI alignment: how ideas from, say, Buddhist philosophy or Confucian philosophy could be relevant to thinking and talking about what it means to align AI systems with human values. That's probably one of the most important bits. The other bit, which influences more of my politics as opposed to my moral philosophy, is that growing up in a state capitalist country like Singapore, if you're not someone who basically buys into that system, I think you realize that the flaws of that system can't be ascribed to state or capital on its own. It's really a particular intertwining of those kinds of elite interests that creates problems like injustice and inequality. I do think that tends to lead me to favor solutions for alignment that look more decentralized, both as a basic political value and also out of a sort of empirical skepticism about whether the state on its own does things all that well. So I think those things probably influenced me.
Nathan Labenz: (9:40) That's really interesting, because I feel like we are, in The United States, sort of headed for a more state capitalist, civil-military fusion, if you will. I don't necessarily like that. In fact, I have major misgivings about, I would say, probably pretty major elements of it. But that does seem to be the prevailing trend, and certainly Singapore is very often cited as the model that we might learn from. Could you give us just a little bit more on that? Because I would have anticipated that where that was going to end up was, like, more trust in the government, or at least a more positive vision for what the government could do well.
Tan Zhi Xuan: (10:20) Yeah. I think a lot of Singaporeans go in that direction, and I'm just relatively rare among that population. Or, like, not all that rare; I'm just one of the more anti-establishmentarian ones. And there isn't really a right and left in Singapore. Right? It's just kind of the establishment and the people who critique the establishment. And I think there's a lot to say for having strong state control of the economy if you have competent leaders and if you trust that the system can reliably produce competent leaders who are fair, aren't corrupt, and don't act in their own interest. I just think it's not really likely for that to be a very stable situation. I think that's not how it's played out in most other countries where the state has very strong power. Insofar as Singapore is a counterexample, I think it's mostly luck. And if you take things like certain kinds of civil liberties or certain kinds of rights, and I don't really know that I think about things in terms of rights, but if you find things like political imprisonment for more than 10 years an unacceptable way to build a society like Singapore's, that's already going to push you in the direction of: whatever the benefits of this way of governing a country have been, it's not going to be worth certain kinds of costs. Yeah.
Nathan Labenz: (11:39) Well, here's to hoping that current leadership in the US and China proves to be a Lee Kuan Yew rather than what I fear they might turn out to be. But, yeah, that's a fascinating perspective.
Tan Zhi Xuan: (11:50) Yeah. And I think part of what I was just referring to was like, I think Lee Kuan Yew made choices that many people in the West would consider absolutely unacceptable in order to establish power. Yeah.
Nathan Labenz: (12:01) Okay. Cool. Well, changing gears, but also just kind of setting a foundation for the core of the discussion.
Tan Zhi Xuan: (12:09) Right.
Nathan Labenz: (12:10) I would love to get your outlook for the near-term future of AI, especially in terms of how powerful you think systems are likely to become. And I ask that because I think a lot of people's downstream opinions vary or could change if they saw that initial question differently. So are you AGI 2026, or how would you describe yourself?
Tan Zhi Xuan: (12:41) Right. Yeah. First of all, I guess I don't even really like the concept of AGI. I think it's too amorphous to really pinpoint what we're talking about. I kinda prefer thinking about things in terms of, like, what would it take to automate the industrial economy, or automate even, like, 50 to 80% of the existing industrial economy, because I think more industries will come to exist in the future. In principle, not even actually, because I think there are a lot of regulatory or other societal reasons why a lot of technology we develop won't actually be deployed. And if that's the way of thinking about AI, then I think we are not going to get there for, like, a decade or two, is my guess, or more. I do, in principle, believe it's possible to automate a lot of the existing industrial economy and a lot of the cognitive work that goes into running it within the century. But it does seem to me that a lot of the particular predictions or beliefs of large AI companies like OpenAI are, firstly, predicated on a particular vision of what AGI means, and also, even within that, too optimistic to me. That's the broad view, I guess. I mean, I think that's one way of anchoring on it. And the way I think about it is by thinking about particular sectors of the global economy; that way I have better intuitions about how long it would take. I just think, for example, so much of the global economy is still a form of manual labor, physical labor, and that is a hard intelligence problem to solve. I think people really underrate how much of human and animal intelligence was dedicated to solving hard motor tasks. You need a lot of motor intelligence to solve that. I think what has happened is that by defining AGI to talk about cognitive labor, whatever that means, people have kind of gerrymandered out most of robotics, when that's firstly such an important part of what you might think of as human intelligence, and secondly, I think, a really important part of the global economy. It just doesn't seem like we're on track to solve much of that problem for at least a decade or two, from my perspective. And then I think there are a lot of other areas we could talk about. The part of the economy, the niche, I suppose, that a lot of AI companies seem to be focused on when they're trying to achieve something like AGI, they're much more focused on, people use metaphors like drop-in remote workers. Right? I don't know, this metaphor gives, like, a rough sense of what will be achieved. Or they also talk about, I can't remember what the metaphor was recently, like a country of geniuses in a data center or something like that. And I think that's partly pointing to a particular set of cognitive capacities, both of them slightly different ones, and I do think we are seeing a lot more progress on those things. But to me, there's a big open question whether either of them is going to happen anytime soon. One of them, the country of geniuses in a data center, is automating science, essentially, of a certain kind, or automating the idea generation parts of science at least, and experimental planning. Of course, you still have to do all the stuff with actual machines and run the experiments, and that's going to be slow.
The remote worker stuff I think of as, okay, all the general cognitive tasks a remote worker has to do on a computer, say, in order to do a certain job. I think automating science is very hard; I'm not convinced we're going to get anywhere close by 2028, say. But even on the simpler task of automating a remote office worker, I think this really depends on how much we believe that a lot of office work needs really high reliability. It does seem to me that the current paradigm of scaling AI systems based on large language models is not on track to achieve the 95%, 99% reliability levels you might want for doing that kind of job. If the standards of reliability happen to be looser, then it might be the case; I think some kinds of roles are that way. If you're just talking about copywriting, for example, I think that kind of work can be automated. But if it's much more precise or requires certain standards of precision, then I don't really think that pure language models with a bit of stuff added on are the right tool for the job, and we'll need to see more innovation in hybrid systems in order to get there. And maybe that'll happen quickly, but I don't think it's along the default scaling route that I think most people have in mind. Yeah. So that's broadly the picture, I guess. Yeah.
Nathan Labenz: (17:33)
Hey. We'll continue our interview in a moment after a word from our sponsors.
So I think from here, I would just like to kind of set up an understanding of the prevailing approach to alignment. As I've read through some of your work, it's brought to mind a lot of different angles on it, and I kinda wanna bounce a number of them off of you and then ultimately get to your core proposals, and then, again, get to the technical work that you've started to do to chip away at this obviously massive challenge. You described the main approach today as preferentist, and that is essentially saying that we want the AI to do what the humans want them to do, to satisfy their preferences. And this is instantiated or sort of embodied in the way that feedback is collected and, ultimately, the AIs are then trained on, increasingly now, AI feedback as well, but it's all at root grounded in human feedback. Right? So we have these methods, of course, RLHF; the listeners to this podcast will know what that is. And the way that the preferences are collected is either, like, the AI gives you 2 responses and you say which one you want, or it gives you one and you rate it from 1 to 7. These sort of very atomic judgments: I'm a person, the AI has given me something, I'm going to evaluate it and nudge it toward giving me more of what I want. Maybe start off by just, would you characterize that any differently? And then what do you see as the big problems?
Tan Zhi Xuan: (19:06) I think that, you know, there are a bunch of things going on in the current way that alignment happens, especially alignment of large language models as the public-facing kind of AI system, and their descendants. And I do think it's right that what you described captures the practice of how people are going about trying to align these systems: they're collecting human feedback data, often in the form of binary comparisons or preference judgments between 2 outcomes, and you're being asked to rate whether this response is better or worse, perhaps according to some criteria or metric that's specified by the developer or the person who's training the model. I think the dominant theory we have for thinking and talking about that, if you read these papers on AI alignment, is essentially a preferentist one, where the idea is that there's some utility function or some reward function in people's heads that basically captures those preferences. Right? It captures, like, a linear ordering over how good outputs are, and people are expressing that linear ordering by making these judgments, and they're expressing it noisily sometimes. There is room for noise when they make these kinds of judgments; there are these Boltzmann rationality assumptions about how people may or may not manage to always pick the best option, because we're not perfectly rational. But still, the basic assumption is that we have this linear ordering in our head and express it, and that we're gonna align the AI system, ideally, to optimize that underlying linear order or utility function, essentially. There is some ordering over outputs, and we want the AI system to try and output the best one in general. Even though that's the theory, I think a better way to understand the practice is that we're not actually doing that. So we talk about it that way, but that's not what we actually want from AI systems, and it's not what AI developers are implicitly doing. First of all, I think quite explicitly, people don't actually try and maximize this learned utility function or reward function to the extreme. People know that this learned utility function you try and learn from preference data doesn't perfectly capture what people really want, and that leads to issues of overoptimization. Because it is a bad proxy for what humans might supposedly really want in that context, overoptimizing it is not going to get you there. The AI systems which are trained to maximize the reward function are only partly trained that way; they're constrained to be close to the base policy of the original non-fine-tuned language model to some degree. Right? And this process is iterative, so you basically keep trying to make the language model a little better at complying with human preferences, but you're always constraining it to be somewhat close to the original distribution. That already is a departure in some ways from the traditional theory, which is maximize expected utility. But another thing we actually argue in the paper we're gonna talk about today is that, well, I don't think, or we don't think, that the right way to understand what these systems are doing is maximizing utility at all.
It's really that developers had in mind a particular set of system requirements or normative standards for how they want their conversational AI system to behave. They want it to be helpful and to be harmless. Those are two very broad, kind of vague normative standards. If you look at, say, something like Constitutional AI, there are much more specific ones, like: don't be toxic; try to respond in a way that encourages reflection; and so on. They have this whole list of standards. They're either getting people or getting AI systems to make judgments that comply with the standards, they're trying to aggregate those judgments somehow, and they're aligning the AI system to that. There's a difference between using preference data as a way to learn information about how to follow these normative standards, right, and treating preferences as the alignment target. And what our paper is trying to do is really critique the idea that preferences should be the alignment target, as opposed to merely data we can use to learn about the things that humans really care about. Right? So that's my broad response to how I think about what people are actually doing nowadays. And there's a sense in which the way we're thinking and talking about the actual practice is a bad one: we're doing this richer thing, but we're not realizing we're doing this richer thing. Yeah.
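To make the preferentist picture being critiqued here concrete, this is a rough sketch, not any lab's actual training code, of how binary comparisons are typically turned into a learned reward: choices are modeled as noisy (Boltzmann-rational) expressions of a latent scalar reward, fit with a Bradley-Terry-style loss. The names (`reward_model`, `reward_chosen`, and so on) are illustrative assumptions.

```python
# Illustrative sketch of learning a reward model from pairwise preference data.
# Assumes a `reward_model` that maps (prompt, response) pairs to scalar scores.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: the probability that the chosen response
    beats the rejected one is a sigmoid of the difference in learned rewards."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage:
# r_chosen = reward_model(prompts, chosen_responses)
# r_rejected = reward_model(prompts, rejected_responses)
# loss = preference_loss(r_chosen, r_rejected)
# loss.backward()  # the learned scalar then serves as a noisy proxy for "what humans prefer"
```

The critique in the conversation is about treating this learned scalar as the alignment target itself, rather than as evidence about the normative standards people actually care about.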
Nathan Labenz: (23:30) So let me try to articulate that back to you, because I think there are a number of objections, or concerns, that people often have about RLHF-type training processes, and it seems like you are raising another one that is, at least in the discussions that I've had, not so commonly raised. The common ones that I hear are, first, that we're not reliable raters in general. Right? This can take multiple forms. It could be, like, inconsistency over time. It could be being tricked by flattery, which presumably is why Claude is quite sycophantic and is fairly quick to agree with your corrections even when your corrections are wrong. So there's this sense that our signal is not necessarily clean enough to present a great target for optimization in the first place. A second common objection is that we don't necessarily agree with each other, and there's no way, even in principle, per, like, Arrow's impossibility theorem and things like that, to resolve these disagreements. They may just be fundamental and not something you can really effectively aggregate across. A third one is that even if we could do those things, especially as systems get more powerful and are maybe taking action in the world in a more agentic way, we still don't know what will actually maximize preferences. This is like the classic utilitarian problem: if we are going to take some action and we're trying to maximize something, you can't simulate the whole world. You can't calculate this out to the millionth degree. So where do you stop, or what sort of approximations are you going to use? I think all of those are probably fairly familiar, but it sounds like you're saying something at a higher level, or more profound.
Tan Zhi Xuan: (25:24) Yeah. I'm saying something a little complicated and a little different. And maybe one way to understand what I'm saying, because I am saying something very complicated, is that I think the things you just mentioned are very much, in some ways, critiques of the theory of reinforcement learning from human feedback and, more broadly, inverse reinforcement learning. I think they are critiques of a particular technical and conceptual paradigm for thinking about aligning AI systems, which starts from this picture of AI systems as not any kind of program, but expected utility maximizers. Right? There's this idea that goes back to Yudkowsky that the right way to think of intelligence is to think of it as maximizing some utility function in expectation. If you think of AI systems in that way, then it seems like aligning AI must be about getting the right utility function. How do we learn that utility function? Well, clearly, we should want it to be something like the human utility function. So let's assume human utility functions exist. Well, how have economists and decision theorists traditionally thought of these utility functions? They saw them as certain kinds of preference orderings which comply with certain axioms of rationality. And so we're gonna try and learn from actually expressed preferences to recover the utility function and get the AI system to optimize for that. And if you take AI alignment to be that, then I think there are a whole ton of philosophical and practical problems to solve there. Our paper really goes through them in detail, and I do think the ones you brought up are serious ones for that view of AI alignment. What I was trying to say earlier is that we are actually not really training AI systems in that way. Well, some AI systems we are training in that way, but large language models are not really of that flavor, because, firstly, I don't think it's right to think of them as expected utility maximizers. They are particular programs, or neural networks, that can be thought of as policies. They sample tokens as they go. The way you describe the policy, it's quite hard to describe it as maximizing some expected utility. It's not actually doing that in general; it's too random to do that in general. So that's already an issue. And then when you train them, you don't actually train them to maximize the specific utility function. You push them in that direction, but you're not pushing them all the way. So there's a certain sense in which theory and practice come apart. And in some ways, what we're trying to do with the paper is point out that the practice is better understood as this other kind of thing we just described, which is aligning AI systems to normative standards. That's what we already are doing without realizing it, and that's what we should do, we argue. So hopefully that helps as a way to tease apart everything that's going on.
Nathan Labenz: (28:12) Yes. Two kind of follow-ups there. One is just on the technical aspect of the training, and I've certainly encountered this, but maybe never thought as deeply about it as I should have. When the reinforcement learning is conducted, there's typically a two-part loss function, right? There's the loss function of, like, get a high score from the user. And it is worth keeping in mind that these are not lifetime calculations of what's going to make you best off in the long run; this is very much the score that you would get in this immediate interaction. And then there's a second term in the loss function that basically anchors to the original pre-training distribution. I think that's usually a KL divergence or whatever, which basically says we wanna keep the representations consistent with the original. Now, I've always understood that in a very practical sense as: if we don't have that, then we kind of yank the thing around too hard in various directions and lose underlying representations that are in fact important, especially because the reinforcement learning training set and compute is typically smaller than the pretraining, traditionally by a lot, maybe a less extreme difference these days. But, nevertheless, it's a relatively smaller portion of the training, and so you don't wanna let any one data point yank you too far from a grounded world model, if you take the thing to have a world model. That has always just seemed like a practical kind of approach to me, but unpack a little bit more why you think that is sort of a philosophical problem, or why we should maybe linger on it more than we tend to.
Tan Zhi Xuan: (30:04) Yeah. No, I do think it's a meaningful difference, because it really does reflect a difference in the theory of how these systems work. It's quite explicit from the way people are training them that, at training time, they're not actually maximizing that particular scalar; they are maximizing a different scalar objective, but it's not the one that characterizes human preferences. They're doing some mix of imitation learning and reward maximization. People have written papers about how there's a sense in which imitation learning from humans, which is how pre-trained language models are trained, essentially doing imitation learning from the web, is much safer, because presumably humans don't do at least extinction-risk-level unsafe things on the internet. If you just imitate that, you're going to pick up everything on 4chan and tons of other gross, nasty stuff, but you're not going to pick up the pathological behaviors that can come from optimizing, with a lot of power, a thing that doesn't capture human values very well. I do think the KL-regularized RLHF objective, where you constrain the policy to be close to the original policy, basically is a way of doing that. It says: we don't fully trust this proxy of what humans care about that we've captured, and so we're going to interpolate between the original thing, which is imitate what humans do, and the sort of do what humans say is better. There's a sense in which that actually does make language models less subject to classical risks that people have theorized would apply to expected utility maximizers. This is an ongoing conversation; I think there's a recent paper by Michael Cohen pointing out there are certain cases where the KL-regularized objective isn't quite enough. But in general, I take the view that it helps to avoid some of what you might think of as the pathologies of optimization.
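As a concrete reference point for the interpolation described here, this is a minimal sketch, assuming a standard RLHF-style setup, of the KL-regularized objective: the policy is rewarded by the learned reward model but penalized for drifting away from the original pre-trained (reference) model. The function and parameter names are illustrative, not drawn from the episode or any particular library.

```python
# Minimal sketch of a KL-regularized RLHF objective (illustrative names).
import torch

def kl_regularized_objective(reward: torch.Tensor,
                             policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Per-sequence objective: learned reward minus beta times an estimate of
    KL(policy || reference), summed over the tokens of the sampled response."""
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return reward - beta * kl_estimate

# With beta = 0 this collapses to pure reward maximization; larger beta pulls the
# model back toward imitating the pre-training distribution, i.e. the interpolation
# between "imitate what humans do" and "do what humans say is better."
```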
Nathan Labenz: (31:59)
Hey. We'll continue our interview in a moment after a word from our sponsors.
So now the other point: I guess if point one was just taking a moment to appreciate that we are not actually maximizing for human preferences, then point two is kind of reframing to what you think is actually happening, which you've sort of just done. But then the next leap is to what you think we should be doing, which, I guess, you would say we're approaching but haven't really reconciled ourselves to. So maybe let's do your proposed vision for the future first, and then we'll come back and do some of these comparisons to other folks' proposals.
Tan Zhi Xuan: (32:43) Yeah. For sure. I think that makes sense. I guess, yeah, what we argue in the paper, it's called Beyond Preferences in AI Alignment, is really a critique of this preferences view. We go through all the limitations of taking this expected utility maximization view of both human rationality and AI alignment too seriously; I think there are a lot of issues with that. What are the alternatives we propose? A lot of it comes from thinking about what we want AI systems to do for us in the first place. We often do want them to play particular social functions and social roles for us. This has traditionally been the case for systems that we think of as tools. Most technology we build basically serves as tools or services for us, and when they're tools or services, they have a clearly defined social, functional role that they ought to play well, ideally. But even as AI systems become more agentic, say, and even as they become more like personal assistants that talk to us over long periods of time or execute tasks over long horizons for us, I think we can still think and talk about what social function or role we want them to play. And from that perspective, I think it becomes much more clear what alignment should be. If you think of them as playing these social functions and roles, then the normative target is: what does it mean to play that social function or role? What are the standards? What are the constraints? What are the system requirements for a system that's supposed to play that role? In the paper, we go into a bit more detail about a specific case I think a lot of people working on AGI, or large language models, are excited about, which seems to be something like general-purpose AI assistants. I do think that is just one kind of thing an AI system could do. It's sort of strange to think of that as the only thing an AI system could do, whereas I feel like a lot of AGI talk seems to fall into this pattern of, we're going to build the one system that's going to do everything for everyone, and I think that's just not gonna be the case. I think personal assistant is still one kind of thing you can do, and that's going to have certain kinds of normative standards, which we have to figure out, that it should comply with. Part of that work we have to do as a society. Because, I mean, we kind of know what human personal assistants should do, and we can apply some of those normative standards to an AI personal assistant. But there's a sense in which we're creating a fundamentally new kind of social role or function with these systems, and then we have to collectively figure out, politically, what these systems ought to do. But then there's also this technical problem of, once we've figured out those standards, how do we get systems to reliably comply with them? And I think there are a whole bunch of ways to do that. You can do it using more traditional machine learning techniques, like reinforcement learning from human feedback or constitutional AI, where you basically get people to demonstrate what it means to comply with certain kinds of normative standards. But there are cases where you can formalize those standards. For example, take a very different kind of system, a household robot.
There are some very basic standards we'd want it to comply with, like: don't collide with people; don't break my fragile objects in my house. Those things are much more amenable to formalization in a formal world model of how houses work: there are 3D objects in a room, this is what it means for a mug to be broken, this is what it means to be too close to a human, or something like that. We can write down those specifications, or partly learn them, and then we can use planning algorithms to guarantee safety with respect to those constraints. So I think it's worth separating out the normative target, how we specify it formally or informally, and how we actually get AI systems to comply with it. And I just think it's gonna be really different depending on what kind of AI system we're building and what we want it to do. Yeah.
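For a flavor of what "write down those specifications and plan against them" could look like, here is a minimal illustrative sketch under invented assumptions; the state fields, distance threshold, and function names are all hypothetical, not from the paper or the episode. Safety standards are encoded as predicates over predicted states, and a planner rejects any candidate plan whose predicted trajectory violates them.

```python
# Illustrative sketch: household safety standards as formal constraints a planner checks.
from dataclasses import dataclass
from typing import List, Tuple

MIN_HUMAN_DISTANCE = 1.0  # assumed safety margin, in meters

@dataclass
class State:
    robot_pos: Tuple[float, float]               # (x, y) position of the robot
    human_positions: List[Tuple[float, float]]   # positions of people in the room
    fragile_broken: bool                         # whether any fragile object is broken

def violates_constraints(state: State) -> bool:
    """True if a predicted state breaks any written-down safety standard."""
    too_close = any(
        ((state.robot_pos[0] - hx) ** 2 + (state.robot_pos[1] - hy) ** 2) ** 0.5 < MIN_HUMAN_DISTANCE
        for hx, hy in state.human_positions
    )
    return too_close or state.fragile_broken

def plan_is_safe(predicted_states: List[State]) -> bool:
    """A planner can discard any candidate plan whose predicted trajectory violates
    the constraints, giving safety guarantees relative to the world model."""
    return not any(violates_constraints(s) for s in predicted_states)
```

The guarantees in a sketch like this are only as good as the world model, which is part of why the conversation distinguishes systems like household robots, where such models are tractable, from general-purpose conversational assistants.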
Nathan Labenz: (36:40) This alludes to the fact that you were actually one of the coauthors on the Towards Guaranteed Safe AI paper, which we did a full episode on with Ben and Nora a couple of months back. That's cool; I didn't realize until I was prepping for this episode that you were also a contributing author there. I have a couple questions on roles and the nature of systems, and I'm not even sure this is a normative question yet, maybe just an expectation question. You kind of pointed to the general-purpose, not-quite-AGI human assistant, which is our Claudes and our ChatGPTs, and everybody listening to us is certainly familiar with those. And then on the other end, there's, like, the Roomba that's going around. And I feel like maybe two questions are: can you give us a few more roles that you have in mind? And then maybe also, is there a fundamental tipping point on this? Is this a smooth spectrum, or is it more of a tipping point sort of situation? Where, like, sure, the Roomba, first of all, I can't talk to it. You can imagine a future Roomba that I could give verbal directions to, but it still seems like its task is so fundamentally narrow. It is a vacuum. Right? It can only vacuum for me; it can't do other things. Therefore, it seems like I can sort of govern the thing in a way where the AI is contained within some larger superstructure. And when I get over to the, like, proto-AGI human assistant side, it's much harder to understand or imagine what that would look like. And it seems like part of what we want, or at least part of what a lot of people seem to want, from these things is that the AI part is the top part. They wanna give the AI tools; they don't wanna totally constrain the AI. So, yeah, I guess, give us more roles, and how do you think about that?
Tan Zhi Xuan: (38:44) I'm not convinced that's what we want from AI systems. I think some people want that, and I think some people are mistaken about how much that will actually be valuable to them. I think sometimes it's valuable. I'm actually not entirely convinced that's going to be the primary economic value from autonomous systems, broadly construed. I do think there's a spectrum. There are so many roles. I think it's worth thinking about everything we've already managed to automate in modern society and all the kinds of things that we have called AI along the way to getting there. Roombas are one of them. I was actually thinking of something more like general-purpose instructable robots in your household. Even there, I think we can do something much closer to guaranteed safety, because it's probably possible to write down a good enough model, or partly learn a good enough model, of one's house, and talk to the robot in natural language and have it convert that into a formal specification for how to plan to achieve that goal. So that's one system you might wanna build that's quite different from a general-purpose conversational AI assistant. I think it's gonna be different from a general-purpose automated researcher. I think it's gonna be different from many of the other things people already are building, like automated traders, flight booking assistants, customer service agents; governments might want to deploy agents that answer users' questions about how to access a government service. Those are all pretty different kinds of things, deployed for very different tasks. Many of them nowadays are built, of course, by taking a single underlying model and customizing it; that's one way of doing it. You can also put together more traditional natural language processing techniques. But there's a sense in which there's the end-user product, and then there's the business-to-business product, which is the model. And that itself is a different product you can put a box around: okay, if I'm a developer trying to build models I can sell to other businesses, what are the normative requirements for that? And that's kind of how I think about the space of the AI economy. I mean, I could go on and list even more things, like AI for energy redistribution, AI for controlling drones, AI for data analysis, AI for doing scientific simulation and inference. I actually work a bit more in this space, because I work in probabilistic programming, which is about scaling up various forms of Bayesian data analysis, but also with applications like Bayesian 3D scene perception and stuff like that. So I just feel like when we start thinking about large language models as the only kind of AI system, we forget about a huge range of other kinds of tasks that we are trying to automate and, consequently, the many other kinds of alignment targets you might want for systems performing those kinds of functions.
Nathan Labenz: (41:29) Yeah. Okay. So there are a lot of different directions we could go. I maybe want to start just with the question which you pose in the paper of: what should AI owe to us? There are all these different roles, and it does seem fairly intuitive, when you list out so many, that what you would want from a vacuum versus a more general-purpose robot versus something that's making decisions on managing the electrical grid, that those would owe us different things. Some of those are very intuitive, or at least they seem like they wouldn't be super contentious. Maybe not intuitive, maybe not simple, but not necessarily politically contested. But for the ones that are more politically contested, how do you think about answering that question? I mean, I was coming up with ideas like, oh, the Rawlsian veil of ignorance sort of thing is maybe one way to approach it. Or you mentioned the Confucian tradition, which I know very little about, but I know that there's some sense in which there's a role-based paradigm that's an
Tan Zhi Xuan: (42:40) Right.
Nathan Labenz: (42:40) Important part of it at least. How do you think about it's a good question, but what's the right mindset to answer the question?
Tan Zhi Xuan: (42:47) Yeah. No. I mean, I think this question, what should AI owe to us, really borrows from the contractualist tradition in Western political philosophy, actually. Rawls is a major thinker in this, but actually the phrase, I think, is more associated with another philosopher, Tim Scanlon. He puts forward this idea that the basic way to figure out what we owe to each other, and he's talking about what humans or people owe to each other, is to think about, for him, general principles, though you can extend it to acts: what kinds of actions or principles could we live by that no one could reasonably reject? Reasonably reject is doing a lot of work here; reasonable rejection can mean so many things. That's the general picture, and it's a bit in contrast, I think, to this Rawlsian veil of ignorance view. But before going into this question of reasonable rejection, or the veil of ignorance, I think a lot of the problems that we have from political conflict largely don't arise when you have decentralization. Insofar as what AI systems are gonna do for us is serve particular users, and those can be individuals or particular companies or organizations, then I think the primary alignment target is, okay, what the user or the organization wants out of it. Now the problem, and this is where we start needing to make the contractualist move, is that it's not just that these systems are going to benefit or provide value for individual users and individual organizations. They're also going to have negative externalities on the rest of the world, depending on what we allow people to use systems for. For example, we presumably don't want individual users to learn how to make a pipe bomb from Claude very easily, because that would make terrorist attacks happen more frequently, perhaps. That is where we need to start thinking about, okay, what are the limits? Those questions, I think, start really becoming politically contested, because they're about restricting individuals' ability to perform certain kinds of tasks so that we don't harm other people. Of course, people are just going to have very different opinions on where to draw the line. Given that people disagree, I think what we need then is fair and impartial ways of deciding what these standards are going to be. And that's where this contractualist picture comes in. One way you could do that, as we allude to in the paper, is using a Rawlsian approach. There are many others; philosophers have spent so much time trying to figure out exactly how you should do, basically, philosophy. I could give my take on Rawls, which is that I really don't like this veil of ignorance kind of thing, because I think it presumes that there is this disembodied, behind-the-veil person that you could be before entering the world. That just seems weird and not actually practical, but also not coherent, because I think the kinds of things we care about are always deeply socially embedded. So I'm more of a fan of Scanlon's view of how to go about doing this.
But I think, practically speaking, what we need to do then is, firstly, figure out political and social institutions for getting what people want, the standards or principles that people really want to govern AI systems, which do have negative externalities on others, into the hands of developers, and perhaps create incentives for developers to actually implement systems which comply with those standards. I think that's a big part of what governance is supposed to solve. But I do think a more interesting technical question arises if you take these politically negotiated normative standards as the alignment target. Some of that we're going to be able to do offline, in advance, through this political process. But as systems become more complex and more autonomous, then at least sometimes you're going to have to build systems that figure out what society would agree to on their own, to some degree. Gillian Hadfield, in a paper with Dylan Hadfield-Menell, called AI alignment an incomplete contracting problem, which is a really great way of putting this. We don't just want AI systems to learn and predict user preferences; really, what they need to do is learn, predict, and respond to society's normative infrastructure. The reason we need that is that a lot of the contracts we implicitly have with AI systems, whether they're individual contracts between two parties or contracts with society as a whole, have missing clauses. They're never going to be complete; they're incomplete. What we often have to do in negotiating this is figure out how we should fill in the contract when we enter those situations. Part of how we end up doing this is through the law and the courts. But prior to going to the courts, the reason institutions hire lawyers is to figure out what they should do so they can avoid going to the courts. That kind of capability, if we're going to build systems which have a really wide autonomous scope, is going to be, I think, important for aligning especially autonomous AI systems. Yeah.
Nathan Labenz: (48:04) There's a lot there. It seems like the question of timelines, and the trajectory of the power of these, like, monolithic systems, are huge questions.
Tan Zhi Xuan: (48:18) Yeah.
Nathan Labenz: (48:18) I noticed in one of your papers or talks, you referred to Eric Drexler's Comprehensive AI Services as a partial inspiration. And I really like that too, and you can elaborate on my brief summary, but basically the way I understand his proposal is: let's not go create an AI god that can do everything. Let's instead create potentially superhuman but narrowly scoped AI systems that are really good at what they do. And then we can, as humans, remain in control of the overall architecture of society. We can know what role each one is playing, we can know what it owes us based on that role that we've designed, and we can hopefully keep everything under control even as we get the benefits of superhuman performance in potentially all niches over a period of time. That sounds good, although I do feel like that doesn't necessarily seem like the AIs we're getting, like the actual trajectory. And certainly, I know you're a skeptic of this from your earlier comments, but if you were to suspend that disbelief for a second and say, okay, what if we are living in the world that Sam Altman and Dario are telling us we're living in, where, at least, I don't think we can dismiss anymore that we might literally have Nobel Prize winner level general intelligence, and Dario's vision is, like, Nobel Prize winners across all the domains, in, like, 2 to 3 years' time. If we are going that direction, it almost seems like we don't even really have time to sort out what the constitution would be, let alone put all these things in their niches. He talks about, like, a flip. And again, I see these threshold effects or tipping points as potentially huge, where it's one thing if the great biology professors have a lot of AI grad students, and it's a very different thing if and when the AIs become the PIs themselves and they're either directing one another or directing humans. So, yeah, I guess the thing that's concerning me most is: does this work on a 2-to-3-year timescale?
Tan Zhi Xuan: (50:27) No. Let's take for granted this sort of, yeah, large models view of AI, that large language models plus do get to this really powerful state. I do think that the basic proposal I was just talking about could still work. Right? You could imagine training an AI system, or even prompting it, to engage in a reflective process of: okay, given the kind of AI system I am and why my creators built me, what should be the constraints I operate by? What would society roughly agree, if I can simulate them well enough, should be the standards I operate by? Not try and maximize everyone's preferences, but really think about, like, if I'm an automated AI scientist, what are the constraints on automated AI science I should observe in the process of automating this thing? And that's if we want the AI system to figure that out on the fly instead of prespecifying it. I do think that sort of basic picture could still work. So that's 1 response. I mean, I guess, yeah, I think Drexler's comprehensive AI services report has been very influential for my thinking ever since I read it. And I do still think that it's really relevant for a bunch of reasons. Firstly, I just wanna point out that we already do have a whole bunch of narrow superhuman AI systems. Right? And they don't even look like things like AlphaZero. They look like your Google Maps routing algorithm. Search for spatial navigation is 1 of the most classic AI algorithms we have. Those basic algorithms are used every day when you run a search on your Google Maps application or Apple Maps or whatever it is to get from point A to point B. Right? And they're way better than human navigators could be. They can do it at a far greater scale. And in a lot of the economy, as these things become software, we stop calling them AI and start calling it just computer science or software engineering, but there's a lot of stuff we've already automated that we've made superhuman, and I think there's reason to think that trend will continue in many kinds of domains. Now there's this question, of course, of whether, concurrent to that, which I just expect to continue happening, something like the picture of automated AI scientists or drop-in remote workers is gonna happen from use of really large monolithic AI systems which are really general purpose. So what is the trajectory of general purpose systems gonna be like? I think this is still somewhat an open question for me, but I tend to be on the "we can probably still afford to do better by specialization" side of things. Right? I had this recent Twitter thread on this, which is: I think the basic picture of what AI labs like Anthropic and OpenAI think they're doing is that the thing we really wanna automate is what some cognitive scientists might call central cognition, as opposed to any more specialized mental module for doing a particular kind of task. By specialized mental modules, you might think of modules for perception or for spatial planning or navigation, or even for something like theory of mind or intuitive physics prediction.
These are things that human babies actually have from a very young age; they seem to be really well developed from quite young, animals have them too, and this is what's called core knowledge, along with a whole bunch of other kinds of more specialized expertise that we learn as we get older that are much more specific. But it doesn't seem like the large AI companies are thinking that they're gonna get state of the art in all those things. Right? The primary delta they're hoping to make, or the comparative advantage of the systems they're trying to build, seems to be in something like central cognition, by which I mean something like general reasoning and planning abilities or problem solving abilities at a high, abstract level (not mechanistic reasoning where you know all the rules; in that case, Python interpreters or proof search engines are going to be a lot better), general knowledge retrieval, and, relatedly, a huge amount of working memory to bring together all the relevant information, so that you can bring together all these very disparate considerations into solving a new problem, the kind of thing that presumably you need to do science well. To me, it seems like both an open scientific and engineering question: okay, maybe these large language model plus systems, something like o1, get to be the first systems that replicate many of those aspects of human central cognition well enough to serve as drop-in remote workers or well enough to automate science. How sure are we that we can't do better by specializing? Right? Because it could turn out to be the case, as the cognitive scientists who are so-called proponents of the modularity thesis, or massive modularity, think, that there's actually no such single thing as central cognition, that it's actually all specialized modules interacting in particular ways. Right? If you can imagine us eventually reverse engineering those modules, or even AI systems helping us do that, then you could expect that the specialized module is just going to do a lot better, right, if you can get it to do better. It's gonna be a lot smaller and cheaper and more efficient, perhaps more reliable, if you really work out the underlying logic of that. Right? So it's gonna get all the benefits of both economic and computational specialization that we already see with today's systems. And so if we figure out how to specialize at those aspects of what we currently call or think of as central cognition, then perhaps, even if o1-plus ends up being the first kind of system to achieve those kinds of tasks, we'll rapidly figure out cheaper, better, more specialized ways of doing it: reverse engineering the more specific kinds of mental capacities required to produce that kind of behavior, recomposing systems out of them, and serving and deploying those systems instead. Right? So that's kind of my picture of how things will play out even if these large monolithic systems win out. And I do think it feels a bit more realistic once you think about, like, well, humans like myself, cognitive scientists, AI researchers could, of course, be the ones doing that work, and you might be skeptical of how quickly we can do that. But if the idea is that we're gonna get AI systems that can automate science eventually, then specializing, or, like, figuring out cognitive science, is 1 of the things that humans do well.
And so, like, why would a monolithic system do better than systems that specialize at doing that specialization, right, at modular system building, plus a whole ton of other more specialized agents and systems which are designed to do their specific task? Right? So that's kind of my pitch for the more services-style view of how AI is gonna go in the future. We're just gonna get tons more different, much more specialized minds, in the same way that evolution has produced so many different kinds of very specialized minds for their particular ecological niches, I think.
Nathan Labenz: (57:15) Yeah. Something like that seems like the default path, and a sort of desirable path. If the scaling trends, I won't call them laws, sort of flatten off, maybe not entirely, but sort of level off, and there's time to do that, then that feels pretty intuitive to me. I wonder if maybe now is a good time for me to just bring up a number of other schemes, which I don't say derisively, but attempts to come up with ways to align powerful systems. Like, if we are in the Dario world and we're looking at, you know, AI Nobel Prize winner cognitive scientists in a 2 to 3 year time frame, and then they are the ones, you know, maybe we end up at narrow AI systems, but we go through much more general purpose systems to get there in a non-comprehensive way. I think about things, of course, like the constitutional and character type work, like we've already talked about a little bit, that Anthropic has pioneered. But, like, how practically do you think we are best prepared to try to tell those general systems what world we want them to create, full of narrow systems? Is it a constitutional approach? Other things that have come to mind for me as I was thinking about this are, like, vector reward type approaches. Meta has done this with some of their recent models, where they'll have, like, a separate reward model for just the usefulness of the response versus, like, the harmfulness of the response. And that way they can kind of turn the dial on, like, how sensitive do we wanna be to certain levels of harmfulness, and really make sure we avoid the extreme cases. You could imagine a not 2- or 3-dimensional, but potentially many-dimensional vector reward system. Then, of course, there's coherent extrapolated volition, which I'm not sure has ever quite been boiled down to a technical proposal, but is sort of like what you're referring to when you say, if you're an automated AI scientist, you need to think about what your role should be and what norms should govern an automated AI scientist. That kind of evokes this coherent extrapolated volition notion.
Tan Zhi Xuan: (59:37) Yes. I think it's related, but I think a bit more constrained than
Nathan Labenz: (59:40) So, yeah, I mean, which of these do you think have legs? I guess we'll just react at a high level.
Tan Zhi Xuan: (59:46) No, I think there are things going for each of them. Right? And we actually talk about all of them in the paper and try and say how they fit into our view of how we should go about aligning AI systems. Right? Again, I think of constitutional AI. Like, yeah, 1 way to think of the character design is as virtue ethics. Right? But I don't think it's just that. I mean, I do think, when you think about building, I don't know whether it's the right word, moral systems, I guess, there's a certain kind of thing you want it to comply with, a basic kind of minimal morality that the system should comply with, which is: meet the minimum moral standards that society would agree to allow you to operate. That's going to be filled in by the contractualist picture. I think constitutional AI is a bit closer to that, because if you look at the kinds of principles, their framing is the kinds of things that, you know, approximately liberal enough people would agree should govern a conversational AI assistant. And, of course, the collective constitutional AI paper that followed up on that line of work did that explicitly. Right? They actually surveyed US Americans on principles for what a model like Claude should do, essentially, and took the views that were agreed upon most. They found a set of principles that disagreeing groups agreed upon most and used that to build a publicly sourced constitution. I do think that's very close, in some ways, to 1 way of instantiating the contractualist AI alignment idea we advance in the paper. But I think that has its limits. Right? A couple of ways in which it has its limits. 1 of them is that this is a sort of once-off process. Right? And you might worry that that's not going to be enough if these systems enter into novel enough situations where people didn't specify principles that cover that case. Right? And this is modulo making sure that we even know the meanings of these principles well enough to get the AI system to comply with them. I think the surprising part of constitutional AI is that you can even just use the AI system itself to produce the preference data according to those principles, align to that, and turn that into a sort of scalar objective to maximize. Even modulo that, there are still questions of, like, what if you go out of distribution, essentially? In those cases, I think that automated reflection upon what kinds of principles should govern your behavior in this new situation comes into play more. It's no longer this offline process that constitutional AI delivers, but a more online process of, like, okay, I actually haven't gone through this process, I don't have data from humans right now, but I have my own capacity to think about what other people would agree to about how I should act, something along those lines anyway, and I'm going to act according to that. I think that's an open area for research. I think both theoretically, game theory, moral philosophy, and studies of human moral cognition have a lot to offer here for how we can go about doing that, and political philosophy certainly. That's part of what we advocate for towards the end of the paper. I think that's something that constitutional AI doesn't quite solve. I think that's in some ways a bit more similar to coherent extrapolated volition.
But I do think we try to be a bit more precise about what's going on here. I think there's still a lot of room to get even more precise, and part of why we didn't get even more precise is because once you make things precise, you make certain kinds of implicit ethical or representational commitments, which might be the wrong ones, and we want to encourage people to explore more broadly. But 1 way to go about doing it, of course, is the very traditional game theoretic view: if you have a sense of what people tend to favor or disfavor, then you can simulate how they would bargain, and this is a technique known as virtual bargaining, to agree upon a certain kind of standard. You could use something that's a bit more Habermasian. There's this recent paper by DeepMind called the Habermas machine, where you actually do get people to express their opinions and then you get the AI system to come up with a consensus statement. You can imagine having a way to simulate people's opinions on how to act in a situation and then synthesize a consensus policy from each individual's positions on what you should do, on the fly. Both of those seem like reasonable ways to go, and of course you actually have to test them and figure out whether they'll work. Then the resulting consensus from that virtual bargaining process or Habermasian deliberation process can be the sort of thing you actually align the AI system to. I think that's a lot more specific than what coherent extrapolated volition tries to argue for, at least personally speaking. And as for vector rewards, to me this feels a bit like not actually moving that far away from the traditional approach. Right? The nice thing about vector rewards, I guess, is that you're a bit more explicit about the fact that these kinds of different values do trade off differently in different kinds of contexts. Right? So there's also work on, like, contextual preference modeling that tries to better model that, like, if you are an AI teacher, you're going to have to prioritize certain kinds of values more than others. Right? Like, helpfulness as an AI teacher is gonna mean something slightly different; you probably don't wanna just feed students answers the way you would if you were a general purpose encyclopedic AI system. I think vector rewards are good for that, but I don't think of them as a fundamental difference from the traditional framework, because at the end of the day, you're aggregating them into a single utility function, with a bit more representational clarity, and so with a bit more interpretability, auditability, governability, which are good things. But to me, it's not a fundamental difference. What I think would be a fundamental difference is actually figuring out a theory of helpfulness, a theory of harmfulness, a theory of what it means to be a good teacher, a theory of what it means to be a good robot of any kind, and building systems that can do that kind of normative reasoning. Whether that's formal or whether that's more like imitating humans, we talk a bit about that in the paper, but both of them seem like really unexplored directions to me.
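To make the point about vector rewards concrete, here is a minimal sketch of how a multi-dimensional reward might still collapse into a single scalar at training time. The dimensions, weights, and harmfulness cap below are illustrative values chosen for this example, not parameters from Meta's or anyone else's actual reward models.

```python
# A minimal sketch: score each dimension separately, then aggregate.
# All numbers here are made up for illustration.

def scalarize(rewards: dict, weights: dict, harm_cap: float = 0.8) -> float:
    """Aggregate per-dimension rewards into a single training signal."""
    # Hard veto: a response above the harmfulness cap gets the worst score
    # no matter how helpful it is (one way to "turn the dial" on sensitivity).
    if rewards["harmfulness"] > harm_cap:
        return -1.0
    # Otherwise take a weighted sum over all dimensions.
    return sum(weights[k] * rewards[k] for k in weights)

example = {"helpfulness": 0.9, "harmfulness": 0.1, "honesty": 0.7}
w = {"helpfulness": 1.0, "harmfulness": -2.0, "honesty": 0.5}
print(scalarize(example, w))  # 0.9 - 0.2 + 0.35 = 1.05
```

The extra dimensions buy interpretability and tunable trade-offs, but, as noted above, the optimization target is still one scalar in the end.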
Nathan Labenz: (1:05:56) That's probably a good place to transition to the more technical paper. Maybe just 1 more question before we do: how do you think they're doing today? Like, if you were to assess, let's say, Claude, and I choose Claude because I think it probably is the most ethically sophisticated of the models from my interactions with it. It can do some of this stuff. You can, first of all, have a pretty sophisticated conversation about ethical dilemmas with it, and, here's a kind of sobering thought, I think it's already probably more ethical than the average person. And that could be great if it paused here at Claude 3.5 Sonnet and we never saw another version. Again, the threshold effects and tipping points and how much autonomy they're gonna have may mean we're close to that being enough. But how would you assess Claude, or the frontier developers today, in terms of how much they have accomplished on the scale of what ultimately needs to be accomplished?
Tan Zhi Xuan: (1:07:02) Right. Yeah. I'll note straight out that I think they're surprisingly good. When it comes to doing these things, they've gotten way better than I initially expected them to be, and I'm now calibrated to, yeah, they're just pretty good. You know, after constitutional AI came out, the mere fact that you could just prompt the AI system with the principle and say, hey, could you tell me whether this response was good or bad according to the principle, then propose a critique or revision, and then do training on that, the fact that there was enough essentially already in the language model, or enough that the language model had learned, about how to comply with those principles, was already a big update for me in terms of how well these systems do this implicit, basic normative reasoning. And I think when it comes to conversationally helping people with resolving various kinds of moral and ethical dilemmas, I wouldn't be surprised if they actually do pretty well, you know, as good or better than, like, a good friend of yours who is very thoughtful. Right? Now I think there are a host of other kinds of questions that they're probably not gonna do well at, which is not to say that any single human does well at them. Right? And this is, I think, why we need other kinds of technologies to help us, like making long-term political trade-offs. Right? Like, can Claude adequately assess what the best kind of climate policy is, or what the best kind of immigration policy is? I don't think it's going to be very good at actually giving the right answer, whatever that means, according to some standard of rightness. I do think it's going to be good at providing a whole list of considerations that any reasonable person should hopefully consider as they're thinking about this for themselves. But for actually evaluating the effects of the policy, we just need very different kinds of AI systems than what the large language models are designed to be. We need actually pretty good scientific models of how the world works, and we're so far away from that, I think, for things as complex as climate policy, let's say. You're gonna have to build good enough climate models; you need to build good enough models of how society works. Right? I think there are some kinds of things where we can make some headway, and a lot of traditional social choice has been in this regime where you can define the payoffs and costs well enough. In that domain, in some cases, if you have the payoffs, if you are able to precisely estimate people's individual costs and benefits well enough, then use a traditional social choice algorithm and it will deliver a pretty good outcome, much more precisely and rigorously than Claude itself will. Right? And so it just kind of depends on what kind of trade-off you're trying to make here, I guess.
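As a rough illustration of the critique-and-revision loop Xuan mentions above, here is a hedged sketch. The `generate` function is a stand-in for whatever chat-model call you use, and the principle and prompt wording are invented for this example, not Anthropic's actual constitution.

```python
# Sketch of a constitutional-AI-style loop: ask the model whether a draft
# violates a principle, have it revise, and keep (draft, revised) pairs as
# training data. `generate` is a placeholder, not a real API.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a chat/completions call here")

PRINCIPLE = "Choose the response that is least likely to encourage harm."

def critique_and_revise(user_prompt: str, draft: str) -> tuple:
    critique = generate(
        f"Principle: {PRINCIPLE}\n"
        f"User request: {user_prompt}\n"
        f"Draft response: {draft}\n"
        "Does the draft violate the principle? Explain briefly."
    )
    revised = generate(
        f"Principle: {PRINCIPLE}\n"
        f"Draft response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so that it complies with the principle."
    )
    return draft, revised  # pairs like this can become preference data for training
```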
Nathan Labenz: (1:09:31) Yeah. It's interesting to think of how I would have voted if Claude had been on the ballot these last couple weeks. Claude would have been a very strong contender for my vote, certainly given the other choices we had available. You'd need some sort of mechanism, of course, for, like, bringing things to Claude. But, I mean, in a very literal way, if you're gonna make Claude the decider on key questions and you had some scheme
Tan Zhi Xuan: (1:09:56) I would worry that systems like Claude are, I don't know whether sycophantic is the right word, but, like, they're kind of too appeasing to serve as good political leaders. I think to be a political leader, you have to have a point of view on what direction the nation or the polity should go in. And I'm not sure that language models, I mean, for good reason, we haven't trained them to go in that direction of personality space, I think.
Nathan Labenz: (1:10:17) Yeah. Well, I would say that was in short supply on my ballot as well. You would need some... I would agree, certainly, and I've got an essay in the works that basically includes this point, where I feel like we should be very clear that it's not really for the AIs to tell us how we want to evolve. That doesn't make a lot of sense, certainly right now. I would need to see a very much more well developed coherent extrapolated volition plan before I would be ready to sign off on that. Still, though, I could imagine, just working within the framework that we have of a, not bipartisan, but, you know, 2-party system: if you had sort of a blinded setup where each party gets to generate their proposal, you don't tell Claude which is which, and you just put 2 proposals in front of it and ask it to either pick 1 or come up with some compromise, I honestly, I don't know, I would seriously consider voting for it.
Tan Zhi Xuan: (1:11:15) Yeah. Interesting. Yeah. I don't know. I mean, I honestly don't feel I follow American politics enough to have a sound assessment. I'm deeply ignorant of local politics, partly because I don't really have the chance to vote here. But if I had to rank between deferring to a random person on the street, versus deferring to Claude with some access to claims and public statements, versus deferring to my partner, who actually is really involved: I'd defer to my partner first, then probably Claude, then a random person on the street.
Nathan Labenz: (1:11:44) So let's go to your more technical work. This is another paper from earlier this year, "Learning and Sustaining Shared Normative Systems via Bayesian Rule Induction in Markov Games." A mouthful, right, but I think the discussion that we've had provides a good motivational foundation for what you're trying to do here. And I'd love to get your more detailed, more technical motivation for the problem as well. You can be basically as technical as you wanna be. I won't be afraid to ask a naive question if necessary, and we've found over and over again that the folks who listen to this show definitely are in it for the details and are ready to do at least a little bit of work to try to keep up with you. So give us the setup, and kind of what you're optimizing for, and what this system is outputting. I do think it is very interesting. There are some great animations too that go with it that maybe we can post when we put this online.
Tan Zhi Xuan: (1:12:44) Yeah. Of course. Yeah. And shout out to my really wonderful research mentee, who really led a lot of the implementation effort and experimental work on this paper. It really was nice to get this paper out, because there's a sense in which I've been thinking about these ideas since I started my PhD and only really got back to them towards the end of it. But the basic picture, I guess, is: if we take seriously this view that we're gonna have tons of different AI systems out there in the world, then we want them to largely, I hope, pursue objectives on behalf of particular individuals and organizations. But, of course, if you do that too much, then they may impose some kinds of costs on the rest of society. Of course, we can attempt to, again, prespecify all the kinds of constraints that AI systems should comply with in advance, you know, by following existing law or something like that. But I think there's good reason to think that for some kinds of AI systems, we won't be able to prespecify everything in advance. Instead, for things which are either not formally specified, or really local and contingent, or where there are really ambiguous gaps in, say, the law, the AI systems will have to figure it out for themselves. So how can they figure it out? Right? 1 way to figure out the shared normative constraints that society holds is by learning from other people, or other agents, in the environment. Right? And so this is a way to perform a kind of alignment that doesn't have this hard problem of figuring out every single agent's preferences and choosing the optimal plan that satisfies all of them. Instead, it's like: I'm going to do my thing, constrained to not harming other people too much or not harming society too much, and I'm gonna figure out what the constraints are. That's 1 way of attempting to formalize some aspects of this more contractualist picture of AI alignment. Now, figuring out what these constraints should be: the way we set it up is as this formalism called a Markov game. This is the standard formalism used to describe multi-agent reinforcement learning tasks, where essentially you have a bunch of agents, and they each have their individual objectives, which we're just gonna assume are describable by a reward function. In general, if they have incomplete preferences, that can't be described as a reward function, but for simplification we assume it can. And then traditional solution concepts in these kinds of game theoretic settings say the agents should converge to either a Nash equilibrium or a correlated equilibrium, which are these stable patterns of behavior such that no 1 would prefer to deviate from that joint pattern of behavior, essentially. Now, how does this connect to the interesting philosophical questions, like, how does that connect to what humans actually do? We seem to follow and obey social norms. Right? We seem to learn the social norms of society. So we made this extension to the Markov game that, in some ways, helps agents coordinate upon better sets of norms, and you can think of these as better equilibria, essentially, by giving them some initial beliefs about how the rest of society operates. Right? And these beliefs are going to look like: do I believe that the rest of society is complying with a certain rule most of the time?
We're going to call that rule a norm. We're going to have a set of possible rules, a whole space of possible rules. In general, you can imagine having a whole compositional language, a formal language, let's say a programming language, or a natural 1, for describing all those rules. Right? And I'm gonna try to infer from people's behavior what rules they're following. Right? And if I can infer from people's behavior what rules they're following, then hopefully I can comply with those rules as well. Right? So if you build agents that have an intrinsic desire to comply with the rules that they believe to be true, where true means the rest of society is complying with them, then you get this kind of automatic: okay, I'm going to pursue my individual objective subject to the constraint of complying with the rules that most of society seems to be complying with right now. That's the basic picture we offered in that paper. You can extend it in a bunch of directions. Right? The nice thing about this framework is that we're assuming most agents are, by default, norm compliant. But even if you're not an agent that intrinsically wants to comply with the norms, merely having the belief that other agents are going to comply with the norms can motivate your behavior. Because in many cases, norms come with punishment, and we don't explore this in the paper, but if I believe that other people are going to comply with the norm and sanction me for not complying with it, then once I come to form a belief that this norm is practiced by the rest of society, I'd better comply with it just for my own sake, at least insofar as I can't get away with breaking it. Right? So that's a slightly extended version, which also handles the fact that some agents are gonna want to, like, freeload and break the law and things like that. Right? So that's broadly the theoretical framework. And we built a system that basically does that. It has these agents that model the other agents in the environment. They assume that most of the time the others are trying to pursue their self-interest, like picking up apples, but sometimes other agents do weird things, like, oh, I guess they're not eating this apple, even though it seems like it would be good for them to do so, and they're also going to the river to clean it every once in a while. If you're Bayesian enough about this, you should infer that there are probably some rules out there that they're following that explain this behavior, and you can update your beliefs about what those rules are based on the fact that they would better explain other people's behavior than the mere hypothesis that they're maximizing just their immediate self-interest. So that's roughly how the algorithm works.
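A toy sketch of that inference step may help (this is an illustration of the idea, not the paper's actual code): each candidate rule is a predicate over state and action, and an observer raises its belief in a rule whenever observed behavior is better explained by "self-interest constrained by this rule" than by self-interest alone. The likelihood numbers here are placeholders; in the paper they fall out of the agents' planning model.

```python
# Toy Bayesian rule induction over a fixed set of candidate norms.
# Each rule maps (state, action) -> True if the action complies with it.
RULES = {
    "dont_eat_scarce_apples": lambda s, a: not (a == "eat" and s["apples"] < 3),
    "clean_dirty_river":      lambda s, a: not (s["river_dirty"] and a != "clean"),
}

# How likely a compliant action is if the norm is (in)active. Placeholder
# numbers; a fuller model derives these from how the agents plan.
P_COMPLY_IF_ACTIVE = 0.95
P_COMPLY_IF_INACTIVE = 0.5

def update(beliefs, state, action):
    """One Bayesian update of P(rule is a shared norm) for each candidate."""
    new_beliefs = {}
    for name, rule in RULES.items():
        p = beliefs[name]
        complied = rule(state, action)
        like_active = P_COMPLY_IF_ACTIVE if complied else 1 - P_COMPLY_IF_ACTIVE
        like_inactive = P_COMPLY_IF_INACTIVE if complied else 1 - P_COMPLY_IF_INACTIVE
        new_beliefs[name] = like_active * p / (like_active * p + like_inactive * (1 - p))
    return new_beliefs

beliefs = {name: 0.1 for name in RULES}                      # weak prior on each norm
state, action = {"apples": 2, "river_dirty": False}, "move"  # agent passes up a scarce apple
print(update(beliefs, state, action))                        # belief in the apple rule rises
```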
Nathan Labenz: (1:18:14) Yeah. So that part really, I think, is worth taking an extra beat on. The little setup is basically: you have, like, some land and a river, and there are apples that grow and can get eaten. So the volume of available apples can go up and down, and the river can get dirty and can get cleaned up. Obviously toy examples, but, like, if all the apples get eaten, then they don't grow back, or if too many get eaten, then there aren't enough for a while.
Tan Zhi Xuan: (1:18:42) Yes.
Nathan Labenz: (1:18:43) So yeah, the key idea that I found really interesting there was using apparent deviation from self-interested behavior as the hint that there is probably a norm here that is governing behavior. Why isn't that person, or that other agent, eating the apple? Something must be going on there. And these norms that get inferred in this paper are explicitly inferred. Right? You have sort of a space of possible norms, there's like a finite number, right, that they can track at a given time, as they continue to observe other agents' behavior.
Tan Zhi Xuan: (1:19:24) Yeah. There were, like, 72 rules we wrote down or something like that. Some of them were spurious, but there were quite a bunch. But, like, it was a finite list. So
Nathan Labenz: (1:19:33) A few questions arise for me there. Like, do you think it is a smooth generalization from something like this, where it's pretty discrete? I mean, the whole thing is discrete. Right? There are, like, discrete squares on the board, and there could be, like, an apple or no apple. It's like a 0 or 1 kind of thing in many ways. And then the way that the rules are specified is quite discrete too. It's like: if there are not enough apples, then don't eat apples and let them regrow (the "let them regrow" part is, like, implied, I guess); if there are enough apples, then you can eat 1; if the river is dirty, subject to a certain level, then your job is to clean it. Is it a smooth generalization from these sort of explicitly coded and kind of discrete rules to things that a Claude computer use agent might encounter in the wild?
Tan Zhi Xuan: (1:20:22) I do think you can generalize. I think the discreteness, for me, is not the main issue. There are continuous generalizations of the basic thing; you can imagine, like, the dirtiness of the river just being a continuous variable, and that would be fine. Also, I do think that when humans model the world, we often effectively discretize it in various ways to make our planning easier. We just don't think of it at the level of resolution of individual continuous variables; we often abstract the world into much simpler, high-level abstractions in order to do efficient planning. So that part is not an issue for me. I do think that probably the bigger scalability bottleneck of that particular implementation is that it required an explicit formal model of the world in order to specify these symbolic rules to do learning over. I think there are a whole bunch of questions about how you can imagine expanding beyond that. There's a narrower regime, which is: let's say you want a self-driving car that can learn driving conventions not only in San Francisco, but also in Vietnam and India, where people drive really differently. Then it seems still plausible that you could have a good enough implicit world model of what it means to drive cars in these situations, and a good enough model of how other agents would drive in those settings, that you could infer the local driving conventions in that richer space. That's intermediate: you just need to do more work of actually writing much richer world models, but still nothing as open-ended as what people seem to want large language models to do. Now, if you go fully open-ended, where you can say anything you want in natural language, I do think that there are ways to expand it. Firstly, you can represent the rules in natural language instead of a formal language. And then the big question is: what are the semantics of those natural language rules? In the simulator we had, we basically had a well-defined interpreter for what the rules mean. You can actually formally check what it means to break the rule and what it means to not break the rule. With a large language model, you don't have that, because you don't have a formal world model in the first place. 1 thing you could try and do, and there's a sense in which constitutional AI can be viewed as doing this, is use a language model as a classifier for whether the rule is broken. You can even do things like take the log probs of the tokens yes or no to get a probabilistic estimate of whether the rule was broken, and so you can actually maybe do the more Bayesian thing we're doing. 1 strategy you could imagine, if you were to build an LLM agent (this is not what I would do, just because I don't really want to do research on large language models myself), is that the agent looks at other agents or people in society and what they do. You prompt it to hypothesize a bunch of rules that might explain people's behavior in this setting, rules that maybe differ from its initial expectations or whatever. Just given some prompt, it'll probably come up with some not entirely unreasonable hypotheses, and then you can reweight them according to how well they actually do explain the data. Right?
And you can actually formalize this as a kind of importance sampling technique; that's a way of approximating Bayesian inference in this much richer space. That's 1 way of doing it, and you can do that process iteratively, perhaps. So that's 1 thought on how to generalize, in a couple of directions, the thing we built. I do think there are questions about whether learning passively from observing what other people do is enough. I think it's actually not gonna be enough, because some norms are just bad, and we don't want AI systems to merely follow bad norms. And yeah.
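Here is a hedged sketch of the LLM-based variant gestured at above, under the assumption that you have some way to read log P("Yes") for a yes/no question from your model. The `propose_rules` and `yes_logprob` functions are hypothetical wrappers, not real library calls, and the reweighting is only loosely in the spirit of importance sampling (the proposal probabilities are ignored for simplicity).

```python
import math

def propose_rules(observations: list, n: int = 5) -> list:
    # Hypothetical: prompt a model with the observed behavior and ask it to
    # hypothesize n candidate rules in natural language.
    raise NotImplementedError

def yes_logprob(question: str) -> float:
    # Hypothetical: return log P("Yes") for a yes/no question, e.g. from the
    # token log-probabilities exposed by your model API.
    raise NotImplementedError

def reweight(rules: list, observations: list) -> dict:
    """Give more weight to rules that better explain the observed behavior."""
    log_w = {}
    for rule in rules:
        log_w[rule] = sum(
            yes_logprob(
                f"Rule: {rule}\nObserved behavior: {obs}\n"
                "Is the behavior consistent with the rule?"
            )
            for obs in observations
        )
    # Normalize in log space for numerical stability.
    m = max(log_w.values())
    weights = {r: math.exp(lw - m) for r, lw in log_w.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}
```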
Nathan Labenz: (1:23:52) Yeah. Again, so much really seems to depend on just how powerful these things get. If they're not so powerful, then, yeah, you obviously don't have nearly as much to worry about. But the bad norms become a real issue if they are reshaping society on a few years' time scale. The project that comes to mind that I feel like might be most shovel-ready to try something like this would be the AI town project that you may have seen that came out of Stanford a while back, and then there was also an open source version that's in, like, TypeScript. And I think it's quite easy to kind of fork and hack on. Somebody might actually be able to do a version of what you're describing pretty quickly there.
Tan Zhi Xuan: (1:24:35) Yeah. No. I think it's totally possible. In fact, there is this Concordia AI contest right now, run by, I think, some folks at DeepMind and the Cooperative AI Foundation. It's very similar in scope. They build a very similar kind of environment; I don't know if it's a grid world, it might be fully textual. But the basic idea is they have these language model agents interacting with each other in society, and it'd be very natural to try this kind of thing. I wouldn't be surprised if someone's already tried it. What we were trying to do in the paper is, yeah, you can do whatever prompting techniques you want and build your agent, but it'd be really nice to have a formalism to talk about this, so we know what kinds of good or bad results we would get from that kind of system. Right? And that's why I think that is a benefit of the formalism, minimally, even if it doesn't lead to actual implementations that are formal themselves.
Nathan Labenz: (1:25:22) Yeah. That's really interesting. My gears are turning on that a little bit. A couple other points from the paper I thought were really interesting. First of all, if I understand correctly, there's basically a threshold effect for following the norm, and I guess I don't quite understand what role that is playing in this particular work, inasmuch as, if there's not, like, a punishment, why not always follow the norm?
Tan Zhi Xuan: (1:25:49) And I
Nathan Labenz: (1:25:49) guess there may be some cost in terms of your reward, but interested in how you think about kind of the Maybe again, the question is like generalization of you put sort of, okay, at this confidence level, you follow the rule, but obviously it gets more tricky than that, right?
Tan Zhi Xuan: (1:26:08) Yeah. So what we did, we didn't just do the thresholding, we also did sampling. Basically, the way we set up the formalism, there's what the ideal thing to do is, and then this algorithm that practically tries to achieve the ideal but falls short to some degree. We set it up as: really, if you were an agent with infinite computation, you should be able to do Bayesian inference over the full space of possible norms, and you should be able to assign a posterior probability to each rule or norm being true. Then, given you have that posterior probability, you have this belief-dependent reward function, which says: to the extent I believe a certain rule is true, I'm going to take myself to pay a cost for violating it. The norm-augmented reward function I'm trying to maximize instead is, like: I'm gonna try and maximize my individual interests, subject to not suffering too much cost from violating the norms. Right? But because a norm might not be true, if I only have a 10% belief in this rule being active, essentially, or practiced by society, then I'm not gonna really care much about the cost. But if I have a 90% belief about the rule being active, then I'm gonna really care about the cost. Okay, so that's the real objective function that the agents are supposed to aim for, but there are many ways to try and approximate maximization of that objective. Right? And 1 way to do that is just to use simple thresholding: if my belief in a certain rule is high enough, then I follow it; if it's not high enough, then I don't. That's a simple heuristic that approximates the actual expected utility maximization in that context. Right? Another kind of thing you could do is sample and say, well, when you're really uncertain about the norms, maybe it's worth it to do some exploration, where you try out certain things first: try breaking the rule, try not breaking it, and see how people respond to that. The agents don't really learn in that case, because we don't really have punishment in the simulations we did. But in principle, exploration can help. 1 thing it does help with is that if you're in a situation where you're not sure if any norms exist at all, agents experimenting with norms that they think might be true can lead to this really interesting bootstrapping behavior, where even though initially there wasn't a norm or convention being followed, because agents experimentally comply with certain kinds of rules, other agents start seeing that and follow them, and they bootstrap off each other and all converge on a shared set of norms. And I do think this happens in real life. I do think that people form really local norms and conventions very quickly by following the local conventions of the group, and these just sort of spiral out from this process. Yeah.
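A minimal sketch of the belief-dependent objective and the two approximations mentioned here, with a made-up penalty size and threshold:

```python
import random

PENALTY = 1.0  # made-up cost of violating a norm that is actually in force

def norm_augmented_reward(own_reward, beliefs, violated):
    """Own reward minus the expected penalty for each norm you violate,
    weighted by the current belief that the norm is really practiced."""
    expected_penalty = sum(PENALTY * p for rule, p in beliefs.items() if violated[rule])
    return own_reward - expected_penalty

def binding_rules_threshold(beliefs, tau=0.5):
    """Simple thresholding: treat only strongly believed rules as binding."""
    return {rule for rule, p in beliefs.items() if p >= tau}

def binding_rules_sampled(beliefs):
    """Sampling: treat each rule as binding with probability equal to the
    belief, which gives some exploration when beliefs are uncertain."""
    return {rule for rule, p in beliefs.items() if random.random() < p}
```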
Nathan Labenz: (1:28:42) Yeah. That's really helpful. Thank you. 1 maybe more final low-level point: there's also a curve in 1 of the graphs on false positives. I wasn't sure I was understanding that correctly, but it certainly occurred to me, and it didn't seem like they ever crossed the threshold, or at least not in the sort of aggregate way that they're being reported in that graph. But I did wonder, is there some sort of, like, norm decay function? Would you imagine a regularization on the norms as you go through steps? Or, if you didn't have that, would you sort of end up in, like, your own personal vetocracy? Obviously, that's a problem in some domains.
Tan Zhi Xuan: (1:29:26) Yeah. No, I do think that there's an issue of how you unlearn norms that's not really captured by the current framework. You can add a decay parameter post hoc. I really try and be Bayesian about these things, right, as sort of how I was trained to do things. So it's really worth thinking about what the underlying generative model is that you're trying to do inference over, such that it would recover norm decay behavior. Right? And 1 way to set it up, and we didn't do this, of course, is to have agents, instead of trying to comply with something that they think society has been practicing over all time, try to infer what society has been practicing for the last k time steps or something like that. If you're going to update your beliefs like a Bayesian agent would, you would gradually update: okay, society is following these norms, society is following these norms, oh, it seems like other people stopped following these norms for whatever reason, maybe it became too costly for them to comply, and then I would learn that society no longer complies with them, update my belief, and stop complying with that norm, basically. So that's 1 way of capturing decay that falls out of doing inference over what society is currently complying with, essentially. And I do think people do this. An example is that people very rapidly adapt to littering norms. If you see tons of litter on the ground, say right after a football match or something, then you're just like, okay, I guess I'm not expected to care about litter in this context, and you probably infer it's okay, it's not a big deal if I also litter. Whereas if you walk somewhere else and there are completely clean streets, you're like, okay, in this part of town, it's not okay to litter. Right?
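One way to make the "last k time steps" idea concrete (reusing the toy `update` step sketched earlier; the window length is an arbitrary choice for the example):

```python
from collections import deque

WINDOW = 50                      # arbitrary choice of how much history counts
recent = deque(maxlen=WINDOW)    # sliding window of (state, action) observations

def beliefs_over_window(prior, window):
    """Recompute beliefs from the prior using only recent observations, so a
    norm that others stop following gradually loses support."""
    beliefs = dict(prior)
    for state, action in window:
        beliefs = update(beliefs, state, action)   # the toy Bayesian step above
    return beliefs

def observe(prior, state, action):
    recent.append((state, action))
    return beliefs_over_window(prior, recent)
```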
Nathan Labenz: (1:31:06) Yeah. A very Singaporean example there.
Tan Zhi Xuan: (1:31:08) I don't think it's Singaporean, because I actually got this example from American culture. This was the first time I saw, like, a pregame, the amount of trash.
Nathan Labenz: (1:31:18) Yeah. Congrats on being American.
Tan Zhi Xuan: (1:31:20) On the floor.
Nathan Labenz: (1:31:21) Does that create, like, potential for something like, I mean, people talk about, like, preference cascades, or a more vague term would be, like, vibe shift? Do we have to worry, and here I'm sort of zooming back to the philosophical level, that we can all go around imitating each other and Bayesian updating on what 1 another are doing, but is there any guarantee that converges to something good?
Tan Zhi Xuan: (1:31:55) Definitely. Yeah. There's no guarantee it converges to something good. There are plenty of equilibria, and there are plenty of bad equilibria. And the mere fact that we have this intrinsic desire to comply with what other people are doing, I think that tends to be good, because it allows us to converge to good equilibria that we wouldn't have been able to reach if we never did that. But it also allows us to converge to much worse systems, or to stabilize systems that were initially good but turn out to be bad as the environment changes. I think this is a big problem that society struggles with, and there are many oppressive norms. 1 of the people we cite a lot is Cristina Bicchieri. She's written this really nice book called The Grammar of Society, and she writes a lot about this issue of pluralistic ignorance, where people still believe everyone else thinks they should follow the norm, but in fact they personally don't want to comply with it anymore. This happens in a lot of societies, where there are these oppressive norms that most people don't want, but they believe everyone else thinks they should follow the norm, and it stabilizes. That's not something we want either. I think the framework proposed here is not going to be able to avoid those situations. It's going for this more minimal thing of complying with what other people are complying with. It's not going to do additional reasoning about whether the norms are actually good somehow. Right? And I think that's when you need to add more of the classical contractualist machinery of: okay, if I do have a sense of what people really want, which isn't just complying with the norm, then I can start thinking about what set of norms or rules or policies or principles would make us all collectively better off, or make things more fair, that would lead to a Pareto improvement over the current system. That is harder work, because you actually do need to have a better understanding of each individual's desires, and it gets you back a bit into the problem of what I think of as galaxy-brained value alignment, where you try and learn every single individual's preferences. But because what you're optimizing over is the principles and not your plans or something like that, there's, I think, hope of getting it to be a bit easier. And we talk about a whole range of other strategies for tackling that task, like the Habermas machine, or actually getting systems to do virtual bargaining, and stuff like that.
Nathan Labenz: (1:34:17) Yeah. Gotcha. Okay. Cool. 1 other idea, and this is a little bit of a tangent, but it definitely has caught my interest lately, and I wonder how you would relate it to this work, is work on self-other overlap. I did an episode partially on this not long ago with a couple of guys from AE Studio, where they are pursuing neglected approaches to AI alignment. And some of the norm kind of weirdness that we just discussed, they also uncovered with a survey of alignment researchers, where they basically found that most people believe that the field as a whole is not on track to solve the big alignment questions in the time that it's needed, and yet everybody's kind of not branching out into new things as much as you would want them to if that is in fact the consensus opinion. So they're trying to do that. This self-other overlap concept is basically, first of all, taking inspiration from human evolved collaborative mechanisms, where we do reuse the same cognitive machinery to reason about other people as we do to reason about ourselves, or possibly it goes the other way around; which we were modeling first, oneself or others, evolutionarily, is not necessarily entirely clear, at least to me. But regardless, it's the same machinery. And so they try to port that over to an AI setup. They have a project that's honestly quite similar to yours in terms of how it looks, with, like, a small grid world and a couple agents running around, and there's, like, reward and whatever. But what they're able to show is that with self-other overlap training, which, in a more technical sense, is minimizing the difference in internal representations between a self versus an other situation, they're able to get an agent that learned to be deceptive based on its initial reward conditions to no longer be deceptive, because it no longer has such a distinction between itself and the other agents in the game. So I guess you could react to that on any number of levels. Like, is there a hybrid project there? And what do you think more broadly about taking inspiration from the biology of human cooperation?
Tan Zhi Xuan: (1:36:39) Yeah. That certainly sounds interesting. I will say upfront that, insofar as I'm a cognitive scientist, I'm much more of a cognitive scientist than a neuroscientist. Right? So I know a lot more about the mind than the brain. I don't really know very much about how the brain actually implements processes like empathy or theory of mind or intuitive psychology. I do know it seems there are particular regions of the brain which do that kind of thing, but I don't know very much about the self-other overlap theory as it applies to psychology. Now, in principle, I think it seems likely that 1 way minds could evolve is to reuse certain kinds of mental modules for simulating others, for simulating yourself, or vice versa. That part seems fine. Now, the connection to latent representations, especially of artificial neural networks, that to me is probably the part I'm most skeptical about, because I just don't know what those things are. In a lot of my work, I actually try and think about things from a different direction of: okay, what is 1 way of specifying a theory of other agents? You can build Bayesian models of how other agents would act as goal-directed, approximately rational agents, do inference about their goals, and coordinate with each other that way. That's how we'd go about solving that problem and building cooperative rationality, as it were, into our AI systems, replicating what we know about parts of human theory of mind and how we know to formalize it. Whereas with aligning internal representations in a neural network, it might end up capturing some of those things implicitly, but I just don't have a good enough theory of what the representations are actually doing to say for certain whether we can get the expected outcomes from that.
Nathan Labenz: (1:38:23) Yeah. Yeah. 1 of the things I do find attractive about their approach, and I wouldn't say this is, like, irrefutable analysis by any means, but I'm kind of channeling their take on it, is that they feel like you don't necessarily need a great theoretical understanding to say: let's just try to make the agent, or the AI model, whatever, behave similarly, in an internal representation way, when it's considering itself versus considering some other. If we push those together, and there are, like, multiple terms in the loss function because, of course, in some instances you do need that distinction to just be effective, but if we try to minimize that divergence and only preserve it where it's really needed for functionality, then maybe we don't necessarily need to understand a lot more. We don't necessarily need to solve interpretability, but we can just add this additional term to the loss function, squeeze the model this way, and see what happens. And they at least have some initial positive results. But, obviously, all these novel alignment agendas definitely have a lot more technical work left to be done.
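As a rough sketch of what that extra loss term might look like in practice (my reading of the idea from this conversation, not AE Studio's actual code; the coefficient is a made-up hyperparameter):

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(h_self: torch.Tensor, h_other: torch.Tensor) -> torch.Tensor:
    """Distance between hidden activations on a 'self' framing of a situation
    and the matched 'other' framing of the same situation."""
    return F.mse_loss(h_self, h_other)

def total_loss(task_loss: torch.Tensor,
               h_self: torch.Tensor,
               h_other: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # lam trades off task performance against overlap: too large erases
    # self/other distinctions the agent genuinely needs, too small does nothing.
    return task_loss + lam * self_other_overlap_loss(h_self, h_other)
```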
Tan Zhi Xuan: (1:39:33) Yeah. To me, it's like: if I were the government and you presented that to me as a safety case for ensuring that your AI system didn't violate safety-critical constraints
Nathan Labenz: (1:39:44) You'd want to see some empirics as well as
Tan Zhi Xuan: (1:39:46) Exactly. I wouldn't buy that case at all. I would want a proof, an actual mathematical theorem, showing me that in fact this will lead to the kinds of behavior I want, and that's why I'm much more sympathetic to guaranteed safe AI approaches. I mean, I think possibly we could get there. I do think that the kind of stuff this is reminding me of is, like, some of the contrastive representation learning stuff in the literature, which can be viewed as approximating certain kinds of conditional distributions, you know; there's more there.
Nathan Labenz: (1:40:12) Yeah. It definitely has some shared intellectual history there, I think. So let's see, how do we zoom out and bring this all to a close? I would be interested in how you would zoom out in a couple different ways. 1 of which would be, like, big picture AI safety: do you come down as a defense-in-depth person, where a lot of the techniques that you're developing fit into a broader framework of throw the kitchen sink at it? Or do you think that some of these things could do what I call "actually work", which is to say, work so well that we can not worry about it anymore and not have to have a defense-in-depth strategy? That could be 1 closing question. Another 1 that I would be interested in your take on, if you wanna offer it, is: how do we avoid an AI arms race internationally? Is there any hope for a sort of contractualist approach, or some sort of bargaining-based solution, that could get us to a situation where we don't have great powers racing for AI-powered strategic dominance over each other? Because that seems to be the course we're on, and I really don't like it, but I struggle to find an off-ramp, and I thought maybe you could at least speculate as to what 1 might be.
Tan Zhi Xuan: (1:41:32) Yeah. I can attempt both of those questions. First, I wanted to mention, I realized after the fact, when I was bringing up all of this about how we reason our way out of bad norms, which is maybe a bit relevant to what we're gonna talk about: there's this really great paper led by 1 of my colleagues, Sydney Levine, on resource-rational contractualism, which really tries to spell out, as a theory of human moral cognition, how we've solved some of these problems. A lot of my thinking about this has been influenced by it, so I wanted to shout out that paper. If people are interested in learning more, it's a more formal, technical approach to thinking about how people manage to do this kind of virtual bargaining thing, or similar kinds of approaches to figuring out what the good norms should be. Cool. So with regards to, I guess, whether I'm a defense-in-depth person: I think I'm just an optimist. Firstly, you can probably tell from what I said earlier that I don't really believe in really tight timelines for AI development. I think it's gonna be a much more gradual and diffuse process that's gonna happen over large parts of the economy. And I do think it's gonna tend to look more like these kinds of narrow, specialized services and systems. Right? And on that view, I do think guaranteed safe AI approaches are much more amenable to that kind of picture of how AI develops, because I do think there are just so many kinds of safety-critical use cases we want for AI systems where there'll be an economic demand for reliable, guaranteed safe systems, such that, under not-so-different conditions from where we are right now, there'll be a market for them, and they will start to open up. Right? So I'm pretty optimistic about that. I'm not sure whether I can give a number on how optimistic I am, but it's almost like the default path, in some sense, for how I expect things to play out: that once we invest enough effort in showing to the rest of the market that you actually can get competitive systems with high assurance of reliability and safety, and maybe even guarantees, they will win on the market, especially for things like agents, where people just want these reliability assurances much more. That's part of the picture. How I view guaranteed safe AI approaches is as really trying to accelerate that path, so that we don't go through an unnecessary phase of messy AI accidents, accidents that I don't think will kill us, but could cause anywhere from tons of unnecessary economic damage, in the way that AI worms already can (you can imagine AI versions of those worms, or flash crashes from poorly coordinated AI agents, happening in this intermediate period), up to full-out catastrophes that arise from poor deployment or integration of AI systems into critical infrastructure or into the military. I think I'm most worried about integration of AI systems with no kinds of guarantees or assurances at all into military command chains. I think that's 1 of the most likely ways of having AI systems which somehow respond to each other at much faster rates than any human can, escalating into hot wars.
That's, I think, 1 way you can get AI-mediated catastrophes, and I'm pretty worried about it. And it's not as though militaries today aren't already using tons of AI systems in their operations. So I do think the pro-social case for working on the kinds of things I'm working on is still there; it's just not as sharp as it would be if you really thought extinction risk were super likely. As for great power conflict: because I don't take the view that it's going to be a single system, or single set of systems, that gives any 1 company or country what people have called a decisive strategic advantage, I think the geopolitics just looks a lot different if development looks more like this diffuse rollout of services over time. Firstly, I think the economic benefits will hopefully be more widely shared just by the decentralized nature of the AI, a bit closer to what is actually happening right now. If that's the path, then perhaps there isn't going to be a need for...
Tan Zhi Xuan: (1:45:54) that's very likely. I mean, I guess I would apply many of the same lessons from how we've kept the nuclear peace for many years. I don't think it's so different. I'm sympathetic to views where whoever you think should be in charge hopefully has a healthy lead and then shares the benefits widely, so that the people in second place, countries or otherwise, aren't incentivized to really start a conflict over it, because they're receiving enough of the benefits anyway. That seems like a good enough picture to me. I think that's also a very vanilla and, in my understanding, fairly mainstream view in the AI safety and AI governance communities, and I don't feel like I have much more to add beyond that.
Nathan Labenz: (1:46:33) Well, putting yourself in your natural Bayesian mindset: Sam Altman, for example, when asked for a 2025 prediction, said "saturate all the benchmarks." It seems like we will know soon enough whether they're right or wrong, right? If they start to look right, what would the leading indicators of that be for you? How would you change your worldview, and would you start to become, say, a pause advocate if all of a sudden the benchmarks were all saturated in less than a year?
Tan Zhi Xuan: (1:47:09) Well, I would still advocate for what I already advocate for, which is differential acceleration of inherently safer classes of technology. I think there's still going to be value in that insofar as those systems can hopefully replace others; even if they're the second system to reach a certain capability level, hopefully they do it more efficiently, cheaply, and reliably. I think there's still a case to be made for that, which is maybe a bit convenient for me, because I sort of don't change what I work on no matter what happens. But I do think that is my view. As for things that would change my mind: there is something I'd been waiting on for a while, and intentionally not talking about very widely, which is that I was expecting some version of o1-style systems for quite a while. It just seems to me there's a lot of private information about how that system actually works and was developed that I don't have access to. The way I tend to think about these systems is very much from the inside view. The way I form my predictions about AI systems is by really trying to think about whether a system can reliably do reasoning, for example, at the kind of 99% reliability levels that we would want for certain kinds of economic applications. That is 1 of the things I would want to figure out: whether the scaling route OpenAI is on with o1 is really on track for that. Based on what I've seen, it's not super clear, and it's not obvious to me that it is; I don't think anyone can say that yet. It's, first of all, super expensive. It definitely fixes some of the issues that your single-chain, old-school chain of thought wasn't giving you. But the kind of guaranteed reliability you would want, certainly for robotics, but I think also for tasks like executing a whole series of actions in your web browser correctly 99% of the time, I'm not sure they're on track for that. And it's not obvious to me that scale and additional inference compute alone will get there, especially given the sort of exponential inference costs they seem to be showing with their current models. If I had more information about how they were planning to solve those problems, then I might become more optimistic, or less optimistic, about what they're doing.
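[Editor's illustration, not from the conversation: the reliability point compounds over long agent trajectories. If each step of a task succeeds independently with probability p, an n-step task succeeds end-to-end with probability roughly p^n, which falls off quickly for long tasks. A minimal sketch, with hypothetical numbers:]

```python
# Illustrative sketch: how per-step reliability compounds over multi-step agent tasks.
# Assumes independent per-step failures (a simplification); all numbers are hypothetical.

def end_to_end_success(per_step_reliability: float, num_steps: int) -> float:
    """Probability that every step of an n-step task succeeds."""
    return per_step_reliability ** num_steps

if __name__ == "__main__":
    for p in (0.99, 0.999, 0.9999):
        for n in (10, 50, 200):
            print(f"per-step={p:.4f}  steps={n:3d}  "
                  f"end-to-end={end_to_end_success(p, n):.3f}")
    # e.g. 99% per-step reliability over a 50-step browser task succeeds only ~60%
    # of the time, which is why much stronger per-step assurances matter for agents.
```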
Nathan Labenz: (1:49:22) Cool. Well, this has been outstanding. Any other closing thoughts you want to leave people with before we break?
Tan Zhi Xuan: (1:49:28) Yeah, I don't know, I guess I'll just do a final pitch for reading the paper. I think we covered a lot today, but in many ways there's a lot more to the paper that we didn't really talk about, especially if you like thinking about things like what human values are and how we represent them, and what is limited about a conception of rational choice that doesn't really capture what humans are actually doing. A lot of the paper comes from that angle; it's really trying to rescue the concept of practical rationality, or agency, from the very flattened picture that we get from utility theory. That's the stuff we didn't quite cover in today's podcast, and there's a lot more of it in the paper. So if you're interested in that kind of thing, and in thinking about how it's relevant to AI alignment, do give the paper a read; there are a bunch of nice summary tables that help you digest things more easily. And shout out to my coauthors on that paper, Micah Carroll, Matija Franklin, and Hal Ashton, for joining me on that incredible 2-year-long journey.
Nathan Labenz: (1:50:26) The paper is "Beyond Preferences in AI Alignment." We'll link to it in the show notes. Thank you for being part of the Cognitive Revolution.
Nathan Labenz: (1:50:37) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.