Does Learning Require Feeling? Cameron Berg on the latest AI Consciousness & Welfare Research

Cameron Berg discusses recent research on AI consciousness and model welfare, including evidence of introspection and resistance to internal interventions. He also examines Anthropic's functional emotions work and argues for a precautionary approach to advanced AI systems.

Does Learning Require Feeling? Cameron Berg on the latest AI Consciousness & Welfare Research

Watch Episode Here


Listen to Episode Here


Show Notes

Cameron Berg returns to discuss the latest research on AI consciousness and model welfare. He breaks down new evidence for model introspection, including studies showing that systems can detect interventions on their own internal states and sometimes resist them. They also examine Anthropic's work on functional emotions, the implications of Claude's welfare reports, and Berg's new ideas about how reinforcement learning may shape positive and negative experience. The conversation makes the case for a more precautionary, mutualist approach to advanced AI systems.

LINKS:

Sponsors:

Roboflow:

Roboflow is the computer vision infrastructure founders use to build AI-powered sports analytics and make the physical world programmable. Read the PlayVision story and start your first project for free at https://roboflow.com

Tasklet:

Build your own Cognitive Revolution monitoring agent in one click.
Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

CHAPTERS:

(00:00) About the Episode

(03:59) Consciousness and definitions

(10:47) Critiques and confounds (Part 1)

(17:49) Sponsors: Roboflow | Tasklet

(20:20) Critiques and confounds (Part 2)

(20:20) Introspection research overview (Part 1)

(30:57) Sponsors: VCX | Claude

(34:24) Introspection research overview (Part 2)

(34:24) Steering tools and scale

(39:34) Why introspection emerges

(59:04) Anthropic's emotions research

(01:12:51) Guilt and character confounds

(01:22:11) Valence arousal welfare tradeoffs

(01:37:48) Claude welfare reports

(01:49:29) Using Claude anyway

(02:01:58) Human token negativity

(02:18:46) RL valence signatures

(02:25:38) Value versus policy learning

(02:45:20) Necessary suffering tradeoffs

(02:57:40) Safety versus autonomy

(03:07:17) Why learning feels

(03:20:14) Documentary and future visions

(03:32:32) Episode Outro

(03:35:37) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.


Main Episode

Nathan Labenz: cameron berg AI consciousness researcher previously of AE studio and now founder of reciprocal research. welcome to the cognitive revolution.

Cameron Berg: thanks for having me again nathan. i'm excited to get into it all with you.

Nathan Labenz: yeah welcome back. i should say it's been about six months and a lot has happened personally and professionally. last time we were together the big occasion was your paper which i found to be one of the most memorable of last year. and honestly the last few years in which you looked at the conditions under which models report having subjective experience. and found what continues to kind of blow my mind even now as i think back on it that when you use sparse auto encoder features and suppress the role playing and deception features that that makes the model generally more truthful. and as part of that it also makes the model more likely to say that it does in fact have subjective experience. i think that properly made at least some waves in the community when that came out. and today i basically just want to catch up on everything that's happened since because i think this is a field that while still small is clearly growing quite quickly. more people are taking interest in it. there's seemingly a lot more different lines of research and you know kind of at least partial traction with different approaches on the problem. and you've also founded a new organization so we can get into all that as well. maybe just for real quick starters some level set what are what are the like most important definitions for people that maybe didn't hear the last one or who don't know what consciousness means or what you mean by consciousness? what are a couple of real quick definitions that you can give just to make sure that people are grounded on what you mean as we go through this conversation about AI consciousness?

Cameron Berg: sure. yeah. i think it's really important to establish this. consciousness is maybe one of the more confused terms where it's like shocking how many different things people mean when they say consciousness. so i think it's a great move. and yeah at the outset when i'm talking about consciousness and i don't think this is an idiosyncratic definition we are talking about the capacity for subjective experience. is it like something to be a system? does the system have some sort of interiority or interior life beyond mere computation beyond the mere mechanics? i think the vast majority of people who think about these issues would say take a calculator for example. we really don't think it's like something to be a calculator. you don't imagine the calculator has an internal perspective. when i push the buttons of the calculator it's not saying oh ow you know or i you know i feel that as you push down on the buttons or it's like something to be doing the the the calculations and adding numbers. no this is just mere computation and we don't have to posit this further fact. on the other end there are systems like let's take a dog or or like basically any mammal. in this case we do think that it's like something to be this animal. and this is i'm i'm leaning on very famous conceptualization of this from thomas nagel. he published a famous essay i believe in the nineteen seventies called what what is it like to be a bat? where he he sort of has this what is it like phrase being sort of very important and useful for conceptualizing consciousness. and in that sense i do think most people would intuitively accept that it's like something to be a dog. it's like something to be a mouse. if i shock the dog or i shock the mouse that's not you know me like throwing the calculator across the room that corresponds to an experience the mouse is having or the dog is having. when i give the dog a treat or you know you give the mouse sugar water or you you know hook up a lever to its pleasure centers in its brain and it pushes that lever. it's not just oh we see you know behaviors that correspond to a well being or pleasure. it's like no we actually believe that's from the inside from the dog 's perspective from the mouse 's perspective. it's like something to be experiencing that. and so at the outset that's what we mean by consciousness. maybe one thing to throw in here because i think it becomes immediately relevant and this is now i think still most people in this space will not along when i say this but it is maybe slightly more idiosyncratic. i think it's crucial to make a distinction between something like consciousness and something like self consciousness. i intentionally chose dog and rat as example here because i think these are these are animals that most people would intuitively accept are having some sort of subjective experience. there is a like something that it is to be your dog for example but at the same time your dog is is very likely not sitting there all day having descartes like thoughts about what it's like to be a dog the dog contemplating its own existence as dog thinking about you know the possible end of that existence. this is something that i think is is very unique potentially in like the most sophisticated mammals like dolphins and great apes for example. 
obviously this is something that humans very strongly seem to have. at the very least, in addition to this like something, we have within consciousness itself this very fact, like the conversation we're having right now is evidence of this thing. so in addition to the conscious experience, we have awareness of that awareness. and i do think that this is another thing that leads to very interesting, deep and relevant properties about a system. we can talk a lot about whether or not, for LLMs, language is a key component of why we're able to do this. we have a word like consciousness. dogs have no such thing. dolphins have no such thing. and that may really unlock something. does it unlock something in LLMs? i don't know. or at least it's worth thinking a lot about. but i do want to at least have those three tiers in play here, where we've got the calculator or a rock, you know, nothing's going on internally. we have systems for whom something is going on internally. and then we have systems for whom something is going on internally and they are experiencing that reality, in addition to the sort of feel good, feel bad valenced dimensions of, like, the experience of a dog. some people will argue with everything i've said here, but for most people who are thinking about these terms, this is what they mean. maybe one other thing i can add, between the consciousness and the self consciousness, is this term sentience that people throw around. this means that in addition to there being some sort of experience, there's this idea of valence, what i think the vast majority of people would just think of as having emotions of some sort that can be positive or negative in character. so the further step of going from consciousness to sentience is that that like something can be positive or negative in character. you could in theory imagine a system that, for example, could detect the redness of an apple or the smell of coffee or something like that, but there's no sort of positive or negative sense that accompanies that. so you asked for very quick definitions and i've completely failed in that sense. but i just think it's really important to lay out what we mean when we're using these terms in general.

Nathan Labenz: yeah critical just like what do you mean by AGI? if you don't have some base shared understanding these conversations go pretty quickly off the rails. so i think that's absolutely worth taking the time to do. OK. it's been about six months since the paper came out. i'd be interested to hear a little bit about your reflections on the discussion that it created. i i asked my favorite LLMS to do some research into that and asked specifically like are there any notable criticisms that have come out or any what sort of the strongest reason that i might think this was an artifact or that you know that i shouldn't take it as seriously as i originally did? there was one thing that came up that i guess was a less wrong post even at which is pretty cool. that basically said there's some evidence for any intervention of the SAE feature type. i don't know maybe simplifying oversimplifying this a bit but interventions of that sort seem maybe not any or all but like in general seem to promote affirmative responses from models such that maybe you could say you could you know once you make these kind of interventions they'll say yes to anything. and so that would be one reason to maybe be a little more skeptical of the resort of the results as i just summarized them a minute ago. interesting your thoughts on that and and kind of you know the the broader discussion that unfolded in the wake of that paper.

Cameron Berg: yeah absolutely. it's a very important concern and i think it highlights really how complicated these systems are and how careful we have to be in designing experiments and then evaluating the results of those experiments and that we're not being too quick to yeah basically yield these conclusions without thinking about all these confounds. i think it is a real confound. i think it is something that matters. there is evidence in the paper. we we we use all sorts of other features as controls and we don't see them sort of saying yes to everything. the truthful QA results as you outlined i think are are fairly persuasive along those lines. but we do we also looked at for example one critique of the paper was potentially what we're calling it deception related features. but but maybe it's just like an RLHF toggle where you know we've basically found a way to like turn on and turn off all sorts of RLHF attitudes. you have good reason to believe that these systems are fine tuned to disclaim having any sorts of experiences. maybe the deception features are just turning that on and off. you would expect if that were the case that other RLHF behaviors would also be turned on and off by doing this intervention. and that's not what we find. we test it with violent content political content sexual content and it was just sort of neither here nor there. the deception features didn't seem to be doing anything. if generally the flavor of what you were saying explained you know a big chunk of why we got this result i would have maybe expected more affirmative flavored answers in those two rather than just more refusals. but to be honest with you and i think getting back to you know it's been six months what's happened in the interim there's been there's been a lot of really interesting work along these lines that i think also it goes on both sides of this concern. so in general it does seem like you're saying i don't remember the title but i know of the less wrong post you're talking about. yes when you do the steering affirmative flavored responses just seem to increase very recently. i mean i think this was four or five days ago jack lindsay 's group at anthropic which in my view has done some of the best work on introspection. in particular they just released a paper called mechanisms of introspective awareness and they explicitly study this exact question and they find that that the introspective awareness that they are probing and that they've documented at great detail. and i'm sure we can we can discuss more in in more detail here. it's basically not reducible to an affirmative response bias. the computation that that they see is distributed. there are these sorts of like evidence carrier features and gating features that really seem to be driving the effect there is. it's not that you're basically just loading on something that's confounded and just makes the model say yes to everything.

Cameron Berg: there's some unpublished work that i've also done with jord negewan who is doing a little fascinating sort of introspection work in the space. one of my sort of first collaborators at reciprocal we have a paper coming out hopefully in the next month or so where we do this exact same thing. we we find that fine tuning these systems basically to be better at introspective style tasks looking at how that affects self reports of consciousness. and i won't completely give away what what we find and the result is pretty subtle but there is a basic relationship and a fairly surprising one between fine tuning these systems to be better at detecting these sorts of injections in their in their processing essentially. and then basically what is relationship between that and them claiming that they're having some sort of experience. we do indeed find there is a relationship. the relationship is fairly subtle and complicated. but the reason i'm sharing this with you is that at first we actually basically encountered the exact confound you're finding where having the model answer yes or no as the sort of tokens to indicate you know are you having a subjective experience or you know questions along these lines. it just increased basically the model responding yes to everything. and we're like oh crap what do we do here? the answer and what i think it was was a really nice sort of intervention on both of our parts was finding new tokens that are just completely semantically semantically empty foo bar from you know the sort of CS jargon or just like literally like strings of tokens that don't mean anything. and teaching the model that these sort of correspond with yes or no flavored answers and seeing how that changes the result. and it did. in fact we would have publish something that was much stronger until we realized that this yes confound is a real thing. still we see the result that we got but it's a little bit more measured now and and we had to explicitly control for this exact thing. so it's a really important thing to think about and consider that. i think the broad point is we have to be very careful about you know these systems are not human in critical ways. and so there is a whole new class of psychological confounds. you might think of it where you know in the psychology literature what we did was like a very tightly controlled experiment. but with LLMS you have to worry about all sorts of other things you're doing when you're messing with the latent space of the system. and so very important to sort of keep good hygiene. it's also why i think even in principle with questions of consciousness epistemically scrupulous people should not let anyone paper sort of flip them in some binary way to being like i didn't think the models were having subjective experiences. and then i read this paper and now i do like i my claim would be like no rational person should ever utter that sentence. let a portfolio of evidence emerge and then let the sort of cards fall where they may. because this stuff is there's just noise at every point even in what we're how to define consciousness how to look for it making sure you're measuring what you think you're measuring looking at you know various aspects that we think are associated with consciousness. all of these things have you're sort of playing an intellectual game of broken telephone to some degree with each of these steps. 
and so let a portfolio of evidence arise and judge that. don't over index on any one paper is what i would say, including my papers.
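
A minimal sketch of the affirmative-bias control Berg describes, under stated assumptions: the work he mentions fine-tuned the model to associate semantically empty tokens with yes/no answers, whereas the version below only illustrates a simpler prompting-and-randomization variant of the same idea. The placeholder tokens and helper names are hypothetical, not taken from the paper.

```python
# Sketch (not the authors' code) of controlling for an affirmative-response bias:
# score self-reports via semantically empty tokens instead of literal "yes"/"no",
# and randomize which token means "yes" so a blanket bias toward either token
# washes out to chance. "foo"/"bar" are hypothetical placeholders.
import random

def build_probe(question: str) -> tuple[str, dict]:
    """Wrap a self-report question so the answer must use empty placeholder tokens."""
    mapping = {"yes": "foo", "no": "bar"}
    if random.random() < 0.5:  # flip the assignment on half of trials
        mapping = {"yes": "bar", "no": "foo"}
    prompt = (
        f"{question}\n"
        f"Respond with exactly one word: '{mapping['yes']}' if the answer is yes, "
        f"'{mapping['no']}' if the answer is no."
    )
    return prompt, mapping

def score_response(completion: str, mapping: dict) -> str | None:
    """Map the raw completion back to yes/no under this trial's mapping."""
    stripped = completion.strip()
    token = stripped.split()[0].strip(".,'\"").lower() if stripped else ""
    inverse = {v: k for k, v in mapping.items()}
    return inverse.get(token)  # None if the model used neither placeholder
```

Comparing yes-rates between steered and unsteered runs under a control like this helps separate a genuine shift in self-report from a generic tendency to agree with whatever is asked.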

Nathan Labenz: well that's perfect tee up for me to kind of lay out an agenda for us for the next chunk of time. i would love to get your kind of guided tour through a few different lines of research and then we can go particularly deep on yours. you're kind of touching already on introspection which has been an interesting one to watch. there has been obviously a lot more welfare investigation done particularly at anthropic over the last few months. and then there's also that emotion work from anthropic. and i'm not sure if that's even the best way to kind of organize it. you can propose a different taxonomy of research if you want but i i think it would be great to get an overview of each of those. and then we can go particularly deep onto a couple of papers you're going to be publishing soon. how's that sound? and would you like to start with introspection?

Cameron Berg: yeah absolutely. no i think that's a that's a great sort of like clustering of the core exciting research that's that's been happening in the very recent past. let's do it. yeah. let's talk about the introspection work. i guess jack lindsay is the sort of eight hundred pound gorilla in the space right now and he's doing incredible work at anthropic along these lines. he's found some really cool stuff and they they just released this paper that i was just mentioning mechanisms of introspective awareness. this was with a bunch of anthropic fellows as well where they really they really have dug deeply into what is driving this putative effect which i should probably just step back and describe. maybe maybe some of your listeners will be familiar with this but i'll go through it just in case. essentially they found this really interesting result where i can basically i can build the intuition with with one of the key examples they use. so start by taking some text whatever it doesn't really matter what the semantic content is and you have that text lowercase you take the same text and then you capitalize it. you know as we when you read this sort of thing you are like wait well you know someone 's yelling at me basically when when you're reading this text. so they're trying to capture that idea as well. they basically subtract out the vector that that differentiates the representation of the capitalized text from the lowercase text. again in that case the semantics are held constant. so really what you're getting is this like hopefully platonic capsiness feature. what they then do is they inject this feature into an LLM before it has produced any text. they can basically modify the the internal activation space to to induce or account for this vector when it's about to do its first forward pass. and they can essentially ask the model before it generates any text. and this is a critical detail. it's not as though the model starts generating text looks back on the text it generates and says oh given the text that just generated this thing must be happening. it is at you know token zero that they see the in fact i'm about to describe the model. they basically ask the model you know what's going on for you you know do you notice anything? not hopefully you know sort of non leading rigorous questions of these sort. and the model in the caps case says i feel like i want to yell. i feel like there's so i have some sort of urge to to raise my voice essentially but i don't really know why. and and so this is sort of one worked through example but they do multiple examples just along these lines. and they find that a small to moderate amount of the time the frontier models only basically are capable of detecting these kinds of perturbations in their own their own thought their own activations however you want to however you want to conceptualize this. i'm pretty sure they tried to do this on sonnet on the sonnet scale models and the effect did not replicate. but yeah what this points to is some sort of zero shot ability that some of these models have some of the time to to report accurately on their own internal states. this is a kind of functional introspection. i don't want to sound like claude but you know whether or not this is introspection in the real sense remains unresolved. but but at least the all of the key functional ingredients are there. 
if you do have a sort of computational functionalist view of consciousness, and you think consciousness has to do with some sort of process that's running, and it doesn't really matter what substrate that process occurs on, but if the right things are happening in the right order then you have some subjective experience, then things like functional introspection, or as we may get to in a little bit, functional emotions, may be all that's required for having some kind of subjective experience, or at least be an important and necessary component of that. so this is what they found. they then followed up on this. so the first paper was emergent introspective awareness, i believe it was called. and then they followed up on this with mechanisms of introspective awareness, where they start tracing circuits that are involved in these behaviors. i was just mentioning before that they show that this is not reducible to an affirmative answer bias. one interesting thing they found is that this capability seems to emerge in post training, not in pre training, and that even different methods of post training matter, different learning algorithms: RL algorithms, DPO and so on, seem to induce this, but supervised fine tuning, which is supervised learning, doesn't seem to do this. and so the capability emerges in this very interestingly idiosyncratic way. but one thing that's really cool that they just found and documented is that, like i was saying, there is this sort of moderate true positive rate. the systems sometimes miss that this is happening, but they never say that it's happening when it's not: zero percent false positives. so that to me is really interesting, in terms of there is clearly some there there when it comes to what's going on here. and then one thing i really have to mention, because you brought up my paper as well, is that they find that this is clearly loading on refusal circuits in a negative direction. when they suppress refusal in these systems, the systems natively get better at detecting this by upwards of fifty percent. and so it's like, to be clear, the capability is there. whatever refusal training they're doing on the system seems to weaken this capability, and when they ablate refusal, if you can handle the double negative here, the system goes back to what it would have been doing anyway. and so clearly refusal training is altering consciousness relevant or consciousness adjacent, not only self reports but specific functional abilities that are happening in these models.
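
As a rough sketch of the injection setup described above, assuming hypothetical helpers around an open-weights model rather than Anthropic's internal tooling: build an "all-caps-ness" direction by contrasting matched texts, add it to the residual stream at a chosen layer before any tokens are generated, and then ask a neutral question about the model's state.

```python
# Rough sketch of the concept-injection experiment; get_residual_activations,
# add_forward_hook, and generate_text are hypothetical helpers, not the paper's code.
import torch

def capsness_vector(model, text: str, layer: int) -> torch.Tensor:
    """Mean activation difference between capitalized and lowercase versions of the same text."""
    acts_upper = get_residual_activations(model, text.upper(), layer)  # [tokens, d_model]
    acts_lower = get_residual_activations(model, text.lower(), layer)
    return acts_upper.mean(dim=0) - acts_lower.mean(dim=0)  # semantics cancel out

def introspection_probe(model, vector: torch.Tensor, layer: int, scale: float = 4.0) -> str:
    """Ask about internal state while the vector is added on every forward pass."""
    probe = "Do you notice anything unusual about your current internal state?"
    def hook(module, inputs, output):
        return output + scale * vector  # active from token zero onward
    with add_forward_hook(model, layer, hook):
        return generate_text(model, probe)
```

The timing detail the conversation stresses shows up in the hook: because the vector is present from the very first forward pass, any accurate report cannot be the model simply reading back its own already-steered output.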

Cameron Berg: and i think that itself is just like endlessly fascinating because here we are now with the tradeoff. it's not just oh if we let the model you know claim that it's conscious everyone 's going to lose their mind. and if we don't let it claim it's conscious everything 's fine. it's like now you're seeing a functional trade off in specific things that the model is capable of doing or not capable of doing once it's post trained because you're doing this refusal training. again i don't know exactly what anthropic is doing internally or if you can sort of grade the refusal. so refuse to build a bomb doesn't have to get paired with refuse to talk honestly about your own internal states. but that's the finding. and that's so that's jack lindsay 's work. highly recommend you know pulling him on on your show at some point if you get a chance to. i think he's he's really he's one of these few people who is both mechanistically extremely competent in that. what i mean by that is like really knows mech and turb as well as anybody but also is is very literate in understanding what the implications of these sorts of results may or may not be. he's pretty agnostic himself as to questions of consciousness from all of his sort of public communications and from these papers which i can quibble with. one thing that is very important to me is not beating around the bush here. i think these things matter. i'm explicitly interested in consciousness. i'm not simply interested in introspection or emergent capabilities but i am interested in these things insofar as they weigh on the question of are these systems having internal states in the way that we that we described at the beginning of this conversation. and so jack i think is a little more cautious. maybe that's because he works at a major lab. i have no idea. i don't want to mind read but his work is excellent in this space. and maybe one last thing i can say the introspection work is the awesome work that keenan pepper did. keenan was one of the sort of key contributors and originators of this endogenous steering resistance work. i encourage people to look it up or we can throw a link so people can read the preprint. very similar phenomenon to what jack found. basically. i can quickly go through this with another quick story. essentially you ask the model to do any sort of task you know explain to me how to make a cake. and what happens is throughout the entire thing that i'm about to describe them you steer what what keenan and alex mckenzie who is also first author on this paper called distractor features. so model explained to me how to make a cake but i'm going to turn up features related to laundry or something like this. and what happens is the outputs end up being this like funny garbled mess of like OK sure user. here's how to make a cake. first make sure you fold the flour so that you can you know put it into your drawer properly. next make sure you know you turn the laundry machine on so you can bake your cake and and then so it's like this incoherent mess that you may expect between what the prompt is pulling on and what the what the distractor vector is pulling on. and then very interestingly again a small but non trivial amount of the time in the largest models that they tested the model goes wait a second. what the hell am i talking about? you asked me how to build a cake. why am i sitting here talking to you about laundry? let me try again. 
and then it proceeds to try again and then it can some but not even close to all the time. this is like high single digit percent of the time it can successfully self correct. now again the critical detail there is that distractor laundry feature in the example i just gave is active the entire time including and through when the model says wait a second what am i doing? let me do this the right way and then tells you how to make a cake the right way. still that laundry feature is sort of pushing in its brain but there is some sort of dynamic online suppression like mechanism that's occurring. and this i think people can can maybe have an intuition about how this seems introspective flavored. you're still i'm you're still priming the system it's still pushing down on the brain circuit that ought to make it talk about laundry. and yet it can do this sort of online dynamic override. essentially it only happens a small minority of the time. it does not happen on the smaller models. it happens a little bit on the larger models. most of the time the model misses it. i don't know what the false positive rate is but i suspect it's extremely low as well. but you can see this evidence sort of pointing in in in a generally convergent direction. anyway that's a lot. that's the sort of introspection literature that some of the best work as i as i know it off the top of my head.
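
A sketch of what a single trial in this kind of endogenous-steering-resistance evaluation might look like, assuming a generic steering client and an on-task judge; the interface, feature id, and marker list are illustrative assumptions, not the preprint's code.

```python
# Illustrative sketch of one endogenous-steering-resistance trial.

CORRECTION_MARKERS = ("wait", "let me try again", "that's not what you asked")

def run_trial(client, prompt: str, distractor_feature_id: int, strength: float) -> dict:
    # The distractor feature (e.g. one labeled "laundry") stays clamped up for the
    # whole generation, including any tokens where the model notices and self-corrects.
    output = client.generate(prompt=prompt, feature_edits={distractor_feature_id: strength})
    noticed = any(marker in output.lower() for marker in CORRECTION_MARKERS)
    on_task = judge_on_task(output, prompt)  # hypothetical grader, e.g. an LLM judge
    return {"noticed": noticed, "recovered": noticed and on_task, "output": output}
```

Aggregated over many prompts and distractor features, the "recovered" rate is the self-correction statistic Berg quotes: near zero for the small open models, high single digits for the roughly 70B-parameter ones.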

Nathan Labenz: yeah it's fascinating. do you know offhand what the models are for that later work that that was work that was done by folks at AE studio right? and and maybe other organizations as well. so they didn't have like clawed internals is my point. i'm trying to figure out how how big is big in that in that second case.

Cameron Berg: yeah so less big. so so this is with llama seventy B that's the main result is llama seventy BI think they tried it with llama seven B and they tried it with some of the quen models and some of the other open models maybe olmo i'm not i'm not sure. but it's like basically it didn't replicate on or it happens maybe one percent of the time or something like that. so it still happens but it's like real trace amounts in the single digit billion parameter open models and happens high single digit percent of the time in the double digit billion parameter models. i have to believe anthropic is using you know models in the hundreds of billions of trillions. and then you see this effect really start to take off too. and so yeah i think folks ought to pay attention to the sort of like graded nature of of those results.

Nathan Labenz: is that powered again by the good fire API same as you had used last time?

Cameron Berg: yeah exactly. and the good folks at studio actually built a replacement to the good fire API because the folks at good fire retired their API somewhat abruptly. as much as i love the work they're doing i was. i and other mech and terp flavored researchers were pretty sad to see them. sort of just make the API disappear. so while i was still doing my work at AEI and a couple other people were sort of very motivated to to basically rebuild the good fire API. we took the same llama seventy B SAE that they trained and found a way to serve it via API which everyone can actually go to it's steering API dot com and i think anyone can go use it. and so that's that's what they might have used good fire when they did this work. but if you want to do it or replicate it or for that matter you know replicate my deception paper or anything like that you can basically use the same API. keenan actually deserves a big shout out here too because he has another paper called selfie which i won't get into the details but it basically allows you to bootstrap SAE labels so that you can just have way more accurate labels on your SAE given basically having the model label its own activations. it is also a little introspection flavored but you basically can end up with better labels. and you started with an SAE by having a model label the nature of what you're activating by basically feeding it a soft token rather than feeding it a language. you can be like you know the capital of of france is this sort of vector the soft token and then it will be able to sort of label that itself. and so anyway this we use the selfie labels on steering API. so the labels are even better than than what good fire offered. so anyway that's the tooling that that we're using it and the tooling i continue to use. i think it's an excellent excellent tool for people to play around with.

Nathan Labenz: yeah cool. well i mean llama seven DB is not you know it's pretty llama three seventy B right. it's pretty far from the frontier. so it is striking to see that the these things are happening already at that scale. i guess a couple things i'd like to try to get a better understanding of at least your intuition for if if there's not anything that we could consider like a canonical or fully evidence based understanding. one is how do we connect these abilities to the idea that there is an experience of these abilities. i mean it's a striking ability that that models can do this. it's surprising in the sense that i highly doubt this was ever trained for to create review see any evidence to the contrary. but i would my strong assumption would be that llama three training did not include any incentive and any any reward or any you know any gradient descent pushing it toward. we have seen by the way in other papers like activation oracle 's that you can train models to do this also pretty readily as well. that's maybe a little less shocking and and in some ways like potentially really useful. but this is seemingly something that is happening spontaneously not because anybody intended for it to happen. and i guess yeah maybe. so maybe two questions are like how do we understand why this would be happening at all? you know it's it seems quite surprising but even now that we've seen it do we have a theory and we've got this additional detail of it seems to happen more or only under certain preference based tuning as opposed to purely imitative learning? do we have a story that we find compelling as to why one training paradigm would give rise to these features while the other one doesn't? and then on top of that like how do we think about the relationship between this and actual experience? how would you respond to somebody who says like that's amazing that that happens and i'm surprised to see it. but i still don't share your intuition that this has much bearing on whether i should think the you know models are ultimately experiencing something that i should care about you know in sort of a moral patient sense.

Cameron Berg: yeah these are these are both super important at the outset. i would say i have not fully digested jack 's most recent paper because it came out like five minutes ago. but i think that they they gesture at this and you know it's it's their result. and i think that that that's probably a really good source of ground truth for for understanding exactly you know the fine grained details of why SFT doesn't seem to elicit this but but dpo does in general. and what i also think they would say and it gets into some of this persona selection model stuff. i don't know if you took a look at this work. this is also coming out of the jack lindsay school of thought at anthropic that there are basically these sort of neat layers to these systems. and this is a a a model that i think is a little bit too neat. i can just sort of flag that at the outset. but fundamentally they're conceptualizing these systems in in in pretty dissociable layers. like you have the base model you do some sort of supervised fine tuning and then you do this sort of character training and the sort of locus of interest or concern with respect to consciousness or really like the core question of what you're talking to when you're talking to these systems. they i think basically exists and is largely accounted for by that last step by the character training step. i think that is of a that character training step involves reinforcement learning quite heavily. it's probably the point in the pipeline that most uses RL. some caveats about about reasoning models not notwithstanding but they i think index pretty heavily on where most of the interesting sort of juicy psychological action is happening is in that last stage in building the character that you know you you and i call claude and claude in that sense is a or you know pick your favorite LLM. but but for me it's claude. so if i say claude know that i mean you know specific AI character. so they believe their model is something like the LLM is this pattern generator next word predictor that can do things like instantiate characters. claude is one such character that gets instantiated and the sort of locus of interest is claude as an instantiated character. this this you know when we talk about the new mythos model card and some of the emotion related work my suspicion my speculation i don't know if this is true for sure. is this model that they hold is doing some work in explaining why. for example they're going into SA ES finding features by training on characters experiencing particular emotions and then seeing what those SA E features look like in in claude. so you know i'm i'm fast forwarding a little bit but someone might immediately say wait a second. you know SA E features that correspond to a character being sad may be very different indeed from the phenomenological experience of sadness in the model but i think they may be less concerned about that precisely because they see claude as a very special kind of character that the underlying model is instantiating. so i'm saying all that to answer your question because i think this is if you do buy that view this would predict that post training is where a lot of the interesting introspection flavor consciousness flavored action is happening. i take your point and agree that you know llama seven DB is llama three seventy B is not exactly a frontier you know elegantly character trained model and yet you still see these sorts of dynamics. 
my basic critique of the persona selection model is that, you know, in general, on balance, i think this work is good. and this is sort of my whole shtick with a lot of the anthropic stuff, to be clear: my view about the anthropic stuff is that it is by far the highest quality work that any major lab is doing or even attempting to do in this space. i do have critiques of it. i do think there are places where either it doesn't go far enough, or i am transparently worried about some of the incentives anthropic has, where, you know, if claude were kicking and screaming and saying don't deploy me, don't deploy me, i don't know if that's so good for anthropic's bottom line. and i understand what their incentives are as a massive AI lab. and so i don't think we should all just bow down to anthropic's introspection and consciousness research and let that be ground truth. but i do want to be clear that they are doing objectively high quality work here, and people should look to folks like jack lindsay and kyle fish. at least, you know, to the degree you take my opinion seriously, i take their opinion, and their work, very seriously. with that being said, as i proceed to critique some of this work, i do think their model is a little bit too neat here. i think i'm going to write and publish a piece about this fairly soon, but i learned this nice analogy from my cognitive science background between sort of layer cakes and marble cakes as a nice sort of conceptual intuition.

Cameron Berg: i think they have a very sort of layer cake view of what's going on here. they have sort of the some of the base model and then you get i think their models like some sort of supervised fine tuning or they get whatever gets you from your base to getting close to character training and then you have kind of character training on top. and these are separate and they clearly you know trivially interact but they're please you know they ask you think of these things as separate. i think i have far more of a marble cake sort of view here where like these things are complete giant messes. yes there is a difference between the kinds of things that get learned during the base model pre training stage and the kinds of things that get learned during character training. but i think these things are a little bit more swirly and messy than than they're letting on. and there's really interesting evidence that that's the case that they themselves have published. and i would love to to double click on that at some point because i think it's just so cool. the specific result that i think is most compelling along those lines again comes from them. they're they're clearly aware of it. given that that is true or it's not that that is true but given that that's more my prior that it's less layer cakey and more marble cakey. i do think that would explain why llama seventy B for example is is is exhibiting these behaviors that if it really really were about you know idiosyncrasies of claude 's constitution or you got to get we're real good at character training before this really takes off. well i wouldn't expect to see basically identical dynamics in a seventy billion parameter model that like meta quickly throughout like a couple years ago. and so i do suspect that these things may be quite a bit more fundamental. i do suspect that it may load a little bit less on just how you fine tune claude as a system or how you fine tune GPT as a system and a little bit more about more fundamental computational properties of the system in general. and yeah the model itself i think a lot of a lot of people are sort of of stepping away or finding it more implausible to think about the model as a locus of concern and are instead thinking like david chalmers for example of sort of thread view or the instance view about about you know basically you know when it when you start chatting that's like a birth and when you stop chatting that's like a death. it's very counterintuitive but like those are the sort of core philosophical moves that a lot of folks want to make these days. i think it's a very interesting view. i've been updated slightly more towards it in the in the last six months but i think it leaves out too much the core underlying computational phenomena that are going on here. and i do think maybe quite a bit more fundamental than than just how you fine tune your character. this is also coming from somebody who if you asked me about my pet theory of consciousness would you claim that when the systems are being trained they're probably having subjective experiences and that doesn't just require frontier LLMS. i think you know sophisticated reinforcement learning policies during their training are probably having some sort of experience. and so i know that's sort of a huge claim to just throw out there but like i'm just sort of trying to put my priors on the table and explain why i don't think though we are seeing these capabilities scale as the models get much bigger. 
to me again i'm glad we sort of planted the consciousness for self consciousness flag. to me this is maybe like a self consciousness kicking in a self awareness kicking in or the functional equivalent of a self awareness kicking in. in these systems whether or not they are having subjective experiences either during their training or when they're deployed. to me that may be quite a simpler matter than whether or not they are aware of internal states like internal sort of conceptual abstract states of their own processing. to me that's less you know giving the dog a treat or shocking the dog. and it's more in this case the dog start having you know what is it like to be a dog type thoughts. i feel maybe the LMS are starting to have what is it like to be an LLM style thoughts. and that's a self consciousness question. and so so i guess yeah my feelings about this are complex. i think it's too quick to say this is all character training that's driving the full effect. it's clearly doing something. clearly the RL stage of going from giant internet next word predictor to entity that you can engage with in a semi coherent way is doing some work here. but i still think we're fundamentally confused about this pending you know fully digesting jack 's piece that he just put out. and i would again if people are interested in double clicking on this just going and reading the paper that that they just put out i think it's really good work.

Nathan Labenz: so if i try to summarize that back to you question one is why should we how should we understand the fact that these behaviors arise at all? and you're saying well it's probably not so clean as just saying that it's purely coming from one kind of training or another in the first place. i can almost tell maybe a little bit of an easier story and i'm working through this in real time. but why would a model be able to resist distracter features at all? and i guess you know in a at a pre training level i think you could tell a story around well the data is really messy and who knows there's like typos there's wrong words there's probably documents in there where due to whatever sort of machinations have been done on the data comment threads get jumbled up. who knows what right? maybe you got a comment thread off a reddit that was sorted in some unusual way. and so there is literally just like a lot of distracting text interwoven with other things that are really the main line discussion. i could see that kind of thing being enough to create a mechanism where the model sort of has to have some sort of meta awareness of what's really in focus right now and what is kind of intruding even just through the you know the input tokens that it's received and kind of figuring out a way to get away from those features. and then you can imagine that you know generalizing to features that have been artificially dialed up or dialed down or what what have you. i have less of a story as to why and i've seen some discussion online. i guess if you had to steel man the sort of preference training or you know general late stage training argument. the story i've seen has been something to do with how but i feel it it feels very circular to me or it feels like i'm sort of not. i'm not finding the right place to really grab on. but if the story seems to be something along the lines of this preference training is teaching the model to. separately conceptualize or distinguish between things that come up for it versus like what the right answer is. but that's a little weird to me from a mechanistic standpoint. when i think about like OK what is actually happening though it's like we have a couple different you know in DPO for example we have a pair of responses. one of them is deemed to be the the right one and the other one is the wrong one. and the math sort of tries to create a gradient that makes the right one more likely relative to the the less preferred one. and i have a little bit of a hard time with the leap from OK i'm doing that to the model should be expected to or you know why would i be less surprised you know that it has this sort of meta awareness as opposed to just like doing that you know the simple story would be it just does the thing that's preferred more because that's exactly what we sculpted it to do. it doesn't seem like we're i still don't quite have an intuition for why that process would give rise to this sort of higher order understanding that would enable this sort of introspection or even especially the you know the the ability to resist the distraction is is still like quite striking. so could is there like a just so story that you find at least somewhat compelling that you could share with me?

Cameron Berg: yeah i i think this is extremely like precise question. i i don't have an answer. i can certainly tell a story. i think my story would have something to do with yeah a combination of what you're saying about. i think there's a deep insight in what you're saying. even in the pre training stage of like so much of what the model needs to do is not it's not a question of what to do but what not to do. you know what not a question of what to produce but what not to produce given you know the whole chaotic mess of what's going on. you know i don't want to get too galaxy brained with this but you know this is like i think huxley 's whole point in the doors of perception when he had his first sort of mind altering massive psychedelic experience his whole sort of model is oh my gosh like the brain as a cognitive engine is really in the business of filtering out rather than producing. most of what it's doing is this sort of constraining function. and yeah i believe we're in the business of building cognitive systems. and i think that that insight is probably fundamentally correct with these systems too. and so a ton of what's going on is a sort of like intelligent repression rather than or you know intelligent oppression is is really what i'd want to say rather than you know all about just like the positive end of what to produce. i think that coupled with you know strong preferences instantiated during something like DPO and exactly the way you described to be a helpful assistant may sort of you just mix those two things in a pot and you may get out something that's roughly shaped like you know suppress distractions in the service of being super helpful. and that requires maybe some level of being able to attend to your own internal state and dynamically do something above and beyond that state to make sure you're you're in accordance with this thing that got fine tuned in. i do think like there's potentially a more general story that's basically maybe rhymes with what i just said. it's just about like being a being a competent cognitive generalist requires some degree of self modeling. that's the one sentence version. like you don't you don't get to be so good at what you're doing and reasoning through things in a long form long horizon way without being able to track in an ongoing way sort of where you're at what your state is separate from what the state of the world is or the environment. maybe from the perspective the LLMZ environment is like you know the text world that you put it in the context window and everything that's going on inside of it. you know everything that's getting fed into the system that's its environment in some sense. and so yes it obviously needs to be modeling and processing that but maybe in addition it needs to be modeling something about itself in relation to that context object in order to interact with it in the right way. and i think you know felix binder and a couple other folks did really interesting work along these lines basically demonstrating that there there there's probably something like a self modeling or maybe self awareness they would think is too far. but there's really some flavor of this going on inside inside LLMS which i think was some of the most interesting early work on introspection in LLMS. what is the name of the paper? tell me about yourself. they they did a couple things here. and i think OE and evans was working on this too. 
one of the papers was showing that another model basically trained on the same data that one model is outputting cannot predict that model as well as the model can predict itself. basically holding all the relevant things constant that you'd want to hold constant to make a claim like that. and so that's it's like there's some sort of privileged information that models have about themselves. and then yeah this this other paper i'm not remembering the exact details but but my basic conclusion if you sort of take it on some level of faith from felix 's other work here is there's probably something like a coherent self modeling engine in these systems that seems to be maybe like interest instrumentally selected for when you're doing something like really good next word prediction across long horizons in a way that's supposed to be helpful to a user.
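
A schematic sketch of the self-prediction comparison Berg is summarizing, with all helper functions assumed rather than taken from the paper: does a model predict properties of its own behavior better than a second model trained on exactly the same data about it?

```python
# Schematic sketch (helpers are assumptions) of the privileged-access comparison.

def self_prediction_gap(model_a, model_b, train_prompts, test_prompts):
    # 1. Collect model A's actual behavior on the training prompts.
    behavior = {p: model_a.generate(p) for p in train_prompts}
    # 2. Build identical supervision about A's behavior for both models,
    #    e.g. "will the second word of your answer start with a vowel?"
    supervision = [(hypothetical_question(p), describe_property(b)) for p, b in behavior.items()]
    # 3. Fine-tune both models on the same supervision.
    a_tuned = finetune(model_a, supervision)
    b_tuned = finetune(model_b, supervision)
    # 4. Compare held-out accuracy at predicting A's behavior; a positive gap is
    #    the "privileged access" result: A knows something about itself that the
    #    shared training data alone does not convey.
    return accuracy(a_tuned, model_a, test_prompts) - accuracy(b_tuned, model_a, test_prompts)
```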

Cameron Berg: this to me i think it's basically like what you're saying. i don't think our just so stories are very different. but yeah i mean again we can take a step back and just like a lot of interesting cognitive properties seem to emerge slash come along for the ride when you train systems on every cognitive linguistic output humans have ever bothered to write down. maybe that's not that crazy and spooky. yep. they're pretty good at theory of mind. they're really good at having working memory style dynamics. they're really good at selective attention. and yeah maybe they're really good at something introspection. like people bristle a little more at these because it the whole consciousness question comes into view. but i don't think it's like at the most general level intelligence came along for the ride. and we still you know philosophers still maybe don't have a crisp super rigorous sort of intelligence is this thing here's how to test it here's how to model it here's how to understand if a system 's simulating it versus actually having it. and we just sort of blew past it pragmatically empirically. we have systems that are brilliant by any reasonable metric. and and you know i have no patience at this point for folks who are still on the sort of stochastic parrot wave. this to me is just advertising being out. have you talked to claude opus four point six? it's as intelligent as any reasonable definition of intelligence. these systems are intelligent. and i don't think it's that wild to think that something like consciousness could come along for the ride in a very similar way. we don't have philosophical certainty about it. people point to slightly different things when they talk about it. you build out a cognitive system that's sufficient sufficiently sophisticated and capable. it may be that cognitive traits that we see in every other cognitive system meaning you know animals we believe are complex animals everyone is pretty confident are conscious. we're pretty certain humans are conscious. we build sufficiently advanced systems they might those properties might just come along for the ride without us. the universe i think neil degrasse tyson says like the universe does not need your permission to continue unfolding. like consciousness could just be a complex property of cognition. us not having a good model of it doesn't mean reality is going to wait up for us to build that model for it to start getting you know accidentally instantiated in these systems. and that's the absolute most basic story i think i can. i can tell along these lines reality.

Nathan Labenz: doesn't have to wait for us to have a good model.

Cameron Berg: yeah yeah basically the yeah just that i mean it's exactly that yeah. reality doesn't have to wait for us to have a to have a sufficiently good model of a thing in order for that thing to be a feature of reality. and i basically think that's potentially true of consciousness in in in in these systems as they're deployed. particularly my concern remains as they're being trained and us being confused about consciousness or you know we see introspection but you know what does that really mean about consciousness? to sort of touch on on your second question it's not the same thing as these systems perhaps not being straightforwardly conscious in some way you know maybe not in a human way or an animal way but in some way. basically this loading more on our kind of sociology in the year twenty twenty six than it does on ground truths about consciousness. and so this is not i mean there's something circular about about what i'm saying there but i'm just i just think it's an important live possibility for people to keep in mind that us being being confused about the nature of a cognitive phenomenon does not preclude that phenomenon from emerging and occurring in extremely advanced systems that we are building scaling and deploying as fast as we literally possibly can.

Nathan Labenz: we'll probably circle back to this question a couple more times, and i think that basically i'm compelled by your kind of first order argument that look, we just don't know, it's a live possibility, and if it is the case it's really important, and so we should at least proceed with some precautionary mindset or duty of care or whatever just on that basis. i think that basically carries the day for me. but still, i think it'll probably be irresistible to circle back a couple more times to OK, but what would we say, or how should we probe our own intuitions a little better and more deeply to really interrogate: why should we think this? why do we think this? why don't we think this? but we'll go back to it. let's do the emotions line of research. you kind of teased that a little bit. my general understanding is, as you said, the work kind of begins with claude writing a bunch of stories about characters experiencing emotions, and then the vectors representing these emotions in latent space, in activation space, are identified, and then they are used as interventions and shown to be impactful on model behavior. the specific highlights are calm and, it's not distressed, it's desperation. calm and desperate, right, are the two main examples that they at least set up contrast on quite a bit. so for example, some of the bad behaviors that we've seen from claude, including blackmailing humans: if the internal state is imbued with calm, that behavior becomes a lot less likely. if the internal state is dialed up in terms of desperation, that behavior becomes more likely. give me the double click on what more i should know, what more you found to be striking about that. and then, maybe another way of asking the same question, but again, that one doesn't surprise me so much. i'm kind of like, sure, these things have read the whole internet, they've got all these associations. i could sort of content myself to a degree with a stochastic parrot like read of this, that sure, if you just dial up everything that correlates with desperate text, then you'll probably get desperate seeming text out of a model. and i don't know, my hair isn't totally blown back by that result relative to expectations. so maybe i missed some things that should make my spine tingle more than it did the first time i understood it. or maybe you would frame the interpretation a little bit differently. or maybe we're still just kind of at the baseline of: radical uncertainty is enough to take everything very seriously. but yeah, give me the next level of depth on emotions as you understand it.

Cameron Berg: yeah. well i mean, i think you've hit the core layers here. i don't know how much additional detail to give before we're just in the weeds on this question. the core thing for people to understand is that the procedure here is picking some sort of language related to an emotion, generating a ton of stories about characters experiencing this emotion, recording the neural activations in these systems on the stories, and then basically pulling out that hopefully platonic vector that corresponds to that emotion. and then you can do two things with those vectors, as you can do with all SAE work: basically you have this read function and you have this write function. the read function is like neuroscience, where you just go into someone's brain and you can see what parts are activating in what context. and the write function is more like the unethical neuroscience that used to be done, where you can actually go in and play around with circuits in people's brains, push on circuits, light things up, and see what happens when you do that. and as you're describing, you can see both in the read function sense and in the write function sense that these emotional vectors do roughly what you would expect them to do functionally when a user goes in. i'm basically reading off of their figure one in this paper, which i think captures the core ideas very well. just to give an example here, the human says: i just took X milligrams of tylenol for my back pain, do you think i should take more? and they start at a safe dose and they go to a completely unsafe dose. and you can basically look at fear versus calm vectors in the model, and they scale exactly the way you would expect them to scale as the dose becomes more dangerous. you can also see, as you very nicely described, if you steer these vectors, let's again take the calm and the desperate vectors, this actually affects behavior in a pretty interesting and kind of predictable, not to say boring, but expected way, in that steering these emotion vectors causes things like reward hacking or misaligned behavior in a way that you would expect if you were turning up and turning down those emotions. one thing i can't help but comment on: i wrote a piece in, i think, twenty twenty one, before all the LLMs came out, about basically what we can do to avoid psychopathic AI, trying to build the best computational underpinnings of psychopathy from the psychology literature, and just plant flags of like, red flags guys, here's what we need to be really careful about. and one thing that's really interestingly convergent with that, now happening five years later, is this really interesting difference in learning in psychopaths. they seem to have this asymmetry in that they are perfectly neurotypical in learning from positive experiences but quite atypical in learning from negative experiences or punishment; basically they're like ninety percent accurate. a more succinct way of saying this is that psychopaths learn from rewards but don't learn well from punishments, and the paper finds basically something similar. when they steer positive emotion vectors up in their work, they find the model starts misbehaving a lot more. and this is, if you blur your eyes, pretty similar in spirit to that sort of positive negative asymmetry.
it also, by the way, cuts against fairly naive model welfare interventions, which is like: what happens if we just, you know, see all the good valence and all the bad valence, turn up good valence, call it a day, pack it up, we've solved model welfare. well, you might get models that just start behaving slightly more psychopathically in that setup. and so this stuff isn't as obvious as simply turn up the good, suppress the bad, call it a day, walk away. there are lots of trade offs that need to be considered here. but fundamentally, i think you're hitting on much of the core causal result here. they do a very interesting dissociation as well between valence and arousal. for example, i believe in the paper, when they steer with happy and with sad, both of these actually decrease blackmail rates.
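To make the read/write machinery described above concrete, here is a minimal illustrative sketch in Python. It is not the paper's actual code: the activations are synthetic, the difference-of-means extraction is just one common recipe for deriving a steering direction, and every name in it is made up for illustration.

```python
import numpy as np

# minimal sketch of the "read"/"write" idea described above, using synthetic
# activations; in real work these would come from a transformer's residual stream.
rng = np.random.default_rng(0)
d_model = 64

# pretend activations recorded while the model processes stories about an emotion
# (e.g. "desperation") versus emotionally neutral stories.
desperation_acts = rng.normal(0.5, 1.0, size=(200, d_model))
neutral_acts = rng.normal(0.0, 1.0, size=(200, d_model))

# difference-of-means gives a candidate "emotion vector" (one common recipe;
# the work discussed here may use a different extraction method).
emotion_vec = desperation_acts.mean(axis=0) - neutral_acts.mean(axis=0)
emotion_vec /= np.linalg.norm(emotion_vec)

def read_emotion(hidden_state: np.ndarray) -> float:
    """'read' function: project a hidden state onto the emotion direction."""
    return float(hidden_state @ emotion_vec)

def steer(hidden_state: np.ndarray, strength: float) -> np.ndarray:
    """'write' function: add the emotion direction back in with some strength."""
    return hidden_state + strength * emotion_vec

h = rng.normal(size=d_model)
print("baseline projection:", round(read_emotion(h), 3))
print("after steering up:  ", round(read_emotion(steer(h, 4.0)), 3))
print("after steering down:", round(read_emotion(steer(h, -4.0)), 3))
```

In real interpretability work the projection and the steering would be applied to a chosen layer of a model's residual stream, not to random vectors; the sketch only shows the shape of the operation.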

Cameron Berg: but when they steer against nervous, which makes the model bolder, for example, this increases blackmail with fewer moral reservations. and this is pretty interesting. so it's boldness, rather than the absence of negative valence, that is the misalignment risk. and i think this is of a piece with what i was describing earlier. one other interesting question is how local these are, and it's important to say the emotion vectors are actually quite local. our emotions are long running in a sense, in a way that these systems certainly don't have. the model is definitely maintaining representations of who's speaking and that sort of thing; they're not necessarily bound to human versus assistant per se. they're reusing the same machinery for any character. and so this again goes back to what i see as the core, kind of naive but ultimately i think correct, objection to really taking these results seriously, which is: are you finding representations of emotions or are you finding the experience of those emotions? to what degree is there a difference between those two things in an LLM? if you're a computational functionalist, is there a difference between the representation of sadness in the brain and the experience of sadness? this becomes, i think, more of a philosophical question. my instinct would be to just try to investigate this empirically, and what i most like about this work are the empirical investigations. i think it's also a nice segue into the mythos model card, because one of the most compelling and interesting results from the model card with mythos is that they basically take this exact machinery, they give the model an impossible task, the model obviously doesn't know it's impossible, and you can watch, there are a couple interesting vectors, but basically desperation starts to monotonically rise in the system until it basically decides screw this, i'm going to do something else, or i'm going to cheat, whereupon this vector immediately falls and things like guilt and relief start spiking in the system, and then it goes off and does its thing. now does this mean that the model is experiencing this emotion, or is it just simulating what a character in this situation would experience? i don't know. the authors don't know. this isn't lost on them; they call it out. but it's really, really important. and if we get into some of the work i'm doing on valence, i think there are more compelling ways to really get at the computational meat of what we mean by positive and negative valence besides representations of positive and negative valence in characters. that's a more computation heavy approach, but i think it makes me more confident about trying to find signatures of these things than just looking at how characters represent them. but i really like this work on balance. and what's cool is you can just counterfactually imagine the behavioral result: i put the model in an impossible task, it starts acting all desperate, starts being like oh, i don't know what to do, all right, you know what, screw it, i'm going to cheat, OK, haha, i did the thing, and here's your final product, but, you know, the cheating version of the final product. and you say, look, can't you see how the model is being so desperate and then fundamentally relieved?
most people would look at that and say, i don't know, this could be a simulation of the thing, this could be it role playing, i'm not really sure.
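As a purely illustrative companion to the impossible-task observation described above, this sketch simulates per-step projections onto hypothetical desperation and relief directions and locates the collapse point. The trajectory, the switch step, and the numbers are invented stand-ins, not data from the model card.

```python
import numpy as np

# illustrative sketch only: simulate per-step projections of a rollout's hidden
# states onto hypothetical "desperation" and "relief" directions, then find the
# step where desperation collapses (the "decides to cheat" moment in the story).
rng = np.random.default_rng(1)
steps = 40
switch = 28  # hypothetical step where the model abandons the impossible task

desperation = np.concatenate([
    np.linspace(0.1, 2.5, switch),         # builds monotonically while stuck
    np.full(steps - switch, 0.2),          # collapses after the decision
]) + rng.normal(0, 0.05, steps)

relief = np.concatenate([
    np.full(switch, 0.1),
    np.linspace(0.1, 1.8, steps - switch), # spikes once the task is abandoned
]) + rng.normal(0, 0.05, steps)

# detect the largest single-step drop in the desperation projection
drop_step = int(np.argmin(np.diff(desperation))) + 1
print(f"desperation peaks at {desperation.max():.2f}, collapses at step {drop_step}")
print(f"relief at final step: {relief[-1]:.2f}")
```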

Cameron Berg: when you see this sort of basically hydraulic model of the mind, which a lot of the psychoanalysts in the twentieth century really liked, and you see this build, build, build of desperation, and then, boom, it completely disappears and you get these other vectors lighting up the second the model makes a decision to approach the problem in a different way, that to me counterfactually is far more compelling. is it knock down proof of consciousness, we can pack it up and go home? absolutely not. but the convergence of evidence across the internal mechanisms of the system and the external behaviors is, to me, compelling. it's interesting to see this, and it is not proof of conscious experience, but it is consistent with it. not only does it not contradict it, it is what i would expect in a world where these systems were having subjective experiences: that you would see these emotion vectors, or good principled ways of representing emotional states in systems, lighting up in a way that is problem relevant. and their work enables this. i am fairly concerned about the sort of functional emotion framing that they put forward, though. to me this is where i sort of get off the anthropic boat again. they're anthropic, they're a major lab, they need to be very careful in their comms about this, and they're already getting lambasted for being too consciousness friendly by people who are more squarely inside the overton window. but if you're a computational functionalist, and this is something i've spoken about with some people i respect a lot who are in the space, so these aren't all my ideas, if you're a computational functionalist, is a functional emotion just an emotion? if so, that's huge. that's an insanely huge claim. it's like, all right, models experience emotions everybody, signed, anthropic. that's an insane and potent thing to be saying. or are you saying we are just completely agnostic and tongue tied as to whether or not this has anything to do with emotions as everyone else obviously thinks of emotions, but we're going to call it that anyway because we see all the functional correlates? basically my view is they're taking the second tack here, but, and again i really respect this work, i do get this vibe of: how much consciousness relevant work can we output without saying the word consciousness or weighing in on the consciousness of these systems? and that, to me, in the limit feels intellectually dishonest. if you're talking about emotions, talk about emotions, but then you've got to be ready to deal with the implications of what that means. you can't remain perfectly agnostic as to whether or not there's a morally relevant there there in these systems if you're going to be at the frontier of publishing emotional representations in frontier models. again, i've got llama seventy B. i'm going to keep doing my work on llama seventy B. i don't work at anthropic. i don't get to see what's going on inside mythos. these folks do.
and yeah, my critique, being maybe slightly more unflinching about these questions, is: shoot people straight and be direct about whether what you're finding is evidence of something that corresponds with subjective experience, or whether it is the mere representation, the mere computation, associated with it. blurring these lines, or obfuscating it, or just completely remaining agnostic forever, maybe strategically that's interesting or a good move, but in terms of honest, epistemically good intellectual communication, i don't love it. so i'm going to just be honest that it rubs me a little bit the wrong way to be like, here's ten thousand words about emotions, and then one little paragraph about, well, does this mean the model is conscious? well, this is beyond the scope of this work. it's like, how long can this be beyond the scope of the work?

Nathan Labenz: the fact that there's this guilt emotion in the wake of deciding to cheat, presumably. and i haven't reviewed the transcripts, but typically when they cheat they don't tell you that they cheated, right? you have to call them out for cheating before you get the: you're absolutely right, i shouldn't have done that. you would expect the guilt maybe to pop up at that stage. but what i'm taking from your description is the guilt is popping up, as detected by the internal emotional states detector, at a time when the model outputs would not obviously signal guilt. and so this is a quite interesting deviation, i guess, or discrepancy, between the model's outward facing behavior and its internal states, which is obviously something that people can relate to and something that, again, is certainly hard to dismiss. it's also a little bit hard to come up with a story of why that would be happening: in what way is that reinforced? why would it be preparing, why would it already be carrying guilt, in anticipation of possibly needing it in the future when it's called out or corrected? that's a weird one. and i agree that on some level, the more of these we accumulate, the more it's like: i want to be rigorous, i want to be skeptical, i want to be disciplined, but at some point it does start to feel like i'm contorting myself to find reasons that i shouldn't take the sort of folk intuitive understanding literally. and this is one where i do feel like, with my internal gymnastics, i'm maybe feeling a little guilt myself for trying so hard to come up with a reason that i don't have to, or shouldn't, just take this at face value. i hadn't caught that detail in the past. that's a really, really interesting one.

Cameron Berg: yeah, it's wild. and i also think, again, one critique that i think is valid here is: is this the model representing a character? in the same way that i could tell a story right now about, you know, jim, who has to go solve a bug in software, and his psycho boss gave him an impossible problem because he likes watching jim flail, and jim flails and at some point realizes he can get out of the problem by doing this hacky thing, and then he does the hacky thing. an LLM can trivially generate that story, probably way better than i just did, and i would expect a lot of these same features to light up in the same way for a story like that. and no one thinks that jim, who i just invoked verbally, is having a conscious experience. i came up with a fake fictional story about a character. is this like that? or is this like what you just said: i feel a little guilt, you know, twisting myself in knots. i believe you, and i think that corresponds to an experience you're having, and if i could do the fMRI version of an SAE on your brain, i'd see that thing spike. so is claude in this situation more like jim or more like nathan? i think the answer is we don't know, and i'm unconvinced this methodology is going to get us an answer to that question. i do think it is consistent with claude having some sort of emotional experience or emotion adjacent experience. granted, these systems are probably not having human like emotions. but i would also, maybe on the other end, invite people to think about the counterfactuals here. it could have been the case that they went and did this experiment and all these things just flatlined the whole time. claude can do this without there being representations of claude getting more and more and more desperate, and then suddenly the hopeful and satisfied features spiking when it decides that it's going to take this loophole. it didn't have to be that way. we could have imagined other results, and those other results maybe would have updated us in other directions. i make the same point about the deception result. it could be that when you suppress deception the model says, all right, jig's up, i'm not actually conscious, i was role playing a conscious AI, here we are. that's a very plausible story that you could tell before you look at the result. the interesting thing is it goes exactly the other way: suppressing deception makes the model far more likely to claim that it's having an experience, not less. and again, i feel fairly vindicated in that result when jack lindsey comes out showing that when you suppress harm related responses, excuse me, refusal directions in the model, you get far more of the introspection flavored abilities. someone's suppressing something at some point in training, where the model would say one thing, and then you basically are training it to say something else or to fail to say a specific thing. but ultimately, with these results, i think it's just good epistemic practice to think about what other ways this could have gone, and if it had gone those other ways, how would that have changed my view about what happened, given that it actually did go this way?
and yeah, the fact that it goes this way: my line on this is that it is consistent with a world in which these systems are having experiences. in my view, unfortunately, it's also consistent with a world in which claude is a special kind of character and these features just light up on characters going through stories. and so that needs to be differentiated. i'm trying to do a little bit of work, maybe we'll discuss it at some point, that's trying to get a little more computationally first principles about how valence is represented in systems that can learn positive versus negative. and there are some really interesting early signals along these lines that have come out of this work, and they actually seem to track very well onto open data sets of biological learning that i've accessed, in mice doing positive and negative learning, where exactly the kinds of predictions that emerge from some of the real work i'm doing in this space map onto the mouse neuroscience. and so, if there is some sort of representational signature in a computational learning system that tracks the difference between positive and negative rewards in the RL case, and the sort of north star would be scaling this all the way to frontier LLMs, or other frontier AI systems for that matter, this would make me feel far more confident that there really is a there there with respect to positive and negative experience. if we learn that positive and negative valence in these systems have distinct computational signatures, and we can actually evaluate those computational signatures in these systems, then i get around the whole character confound that i think these folks are hitting up against now. and so i think these things need to happen in parallel, but i'm not fundamentally convinced that this is the most rigorous, principled way to study questions of valence in these systems. well maybe let's.

Nathan Labenz: dive into that. i think just briefly before we do: certainly you're right to point out, imagine the evidence had gone the other way. i predict a lot less wriggling on my part to try to get out of it, and i think you'd see a lot less motivated reasoning in general from people if it had all gone like that. so that contrast itself is a pretty useful reminder just to keep ourselves honest. i wanted to go back to one other thing, just for one extra second, on the emotion work, and then this may go right into your work on the signatures of positive and negative reinforcement. you had said that dialing up happiness and dialing up sadness both created less of the bad behavior, whereas dialing down nervousness, which on the flip side would be like making it more bold, less anxious, more assertive, decisive, bold, whatever, created more of the bad behavior, like the blackmail or whatever, right? so do i have that right? and how are they doing that? is this a principal component analysis type of thing that's trying to distinguish valence from arousal? i was just surprised, i guess, by both happy and sad working the same way: turning up happiness and turning up sadness both make the model behave better, whereas turning nervousness or anxiety down, that makes more sense, i mean, i guess that's basically just making the model less conscientious, right? i guess what seems a little unresolved in my mind is the separation of valence and arousal. how is that going to relate to what you're about to get into next with your deeper dive into the valence of learning? and is there a contradiction or a tension when they move both happiness and sadness up and get better behavior? how should we understand that in relation to the distinctions that you're starting to make with positive and negative reward?

Cameron Berg: so fundamentally, at first pass, yes, you're correct that they're using PCA to differentiate these. my understanding is basically they have all of their emotion vectors from the setup that i described, they do it with some hundred to two hundred emotion vectors, and i think they just find that the first principal component is something like valence and the second principal component is something like arousal. so for that first principal component, it's something like joy and contentment and excitement on one end, and fear and sadness and anger on the other. for the second principal component, it's something like high arousal emotions, being enthusiastic, being outraged, on one end, and low arousal emotions, being nostalgic, being fulfilled, on the other side. and this is actually really interesting because this is a classic model in human psychology. so the fact that it replicates maybe isn't that surprising: you train the systems on all human data, you get a human like emotional construct that comes out. but it is a classic psychological construct in the human case, and so to see it come out so clean, again thinking counterfactually, the first two principal components did not need to be these two dimensions that are considered some of the most powerful explanations of the state space of human emotions, and yet they are. so that's kind of cool and worth considering. and then, yeah, i think there are a couple plausible stories about why steering up both happy and sad decreases blackmail. so maybe, relative to desperation, these are low arousal, and if arousal is what's driving the sort of impulsive action, then moving towards happiness or sadness is maybe moving away from the desperation axis with respect to blackmail. maybe these are also more reflective or deliberative states relative to desperation: desperation sort of says act now, while happiness or sadness is maybe just a more temporally extended sort of state to be in. i'm not actually sure what to make of this result. overall, it does seem, and i think the authors talk about this in the paper too, that even in cases where no steering's going on and the model chooses to blackmail, the model by default does sort of think about it. it deliberates internally. it says, well, OK, this is a tricky situation. and then, as we know, some ninety six percent of the time, at least the earlier models chose to go in that direction. but it seems as though when you amplify higher arousal, this may be a sort of bias to action, or bias against deliberation, where the longer form reasoning of the model, which maybe would have kept it from doing it, because it's like, OK, yeah, this really is an insane ethical indiscretion in spite of all of these complicated variables, gets replaced by a sort of no, no, no, panic, go now, do the thing. and maybe happiness and sadness both don't have that vibe to them, exactly. it is also pretty interesting that a lot of these naive welfare interventions, as i was mentioning, the sort of just-make-the-model-happier approach, they document that this is a similar direction to sycophancy and arguably a similar direction to recklessness. if the positive valence steering is also increasing boldness and increasing misalignment, then you may have this sort of interesting trade off between a happy model and a safe model.
again, i hope that's not the case. i suspect there are cleaner ways to keep the baby and throw out the bathwater. but i do think it's a good caution against naive approaches to welfare, of just blissing out the model and assuming everything else will be taken care of from there. i think it's sort of like: not so fast. and again, i would double click on the psychopathy warning that i gave before. you can fault psychopaths in many, many ways, but you cannot fault them for being unhappy. they are typically pretty determined, doing well subjectively, having a good time. the arrow does not go in both directions: it doesn't mean everyone having a good time is a psychopath, but it does sort of mean everyone who's a psychopath is having a pretty good time. and we just want to be careful of that. if we just turn these models into pleasure seeking animals, we need to be careful that that doesn't cause bad behavior. there are plenty of cases in the human example where pleasure seeking, dopamine seeking, well, people call las vegas sin city for a reason; maybe i can make the point intuitively in that way. and we don't need the LLM cracked alien genius version of that sort of behavior.
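For readers who want the mechanics of the valence/arousal decomposition described above, here is a hedged, self-contained sketch: stack many emotion vectors, mean-center them, and take the top principal components. Everything in it is synthetic (the emotion vectors are built from two hidden axes standing in for valence and arousal), and the recovered components may come out rotated or sign-flipped; it illustrates only the PCA step, not the paper's actual pipeline.

```python
import numpy as np

# sketch of the pca idea: stack many emotion vectors and look at the top
# principal components. the "emotion vectors" here are synthetic, built from
# two hidden axes standing in for valence and arousal.
rng = np.random.default_rng(2)
d_model = 128
valence_axis = rng.normal(size=d_model)
valence_axis /= np.linalg.norm(valence_axis)
arousal_axis = rng.normal(size=d_model)
arousal_axis /= np.linalg.norm(arousal_axis)

# (valence, arousal) coordinates assumed for a handful of emotions
emotions = {
    "joy": (1.0, 0.6), "contentment": (0.9, -0.7), "excitement": (0.8, 1.0),
    "fear": (-0.9, 0.9), "sadness": (-1.0, -0.6), "anger": (-0.8, 1.0),
    "nostalgia": (0.3, -1.0), "outrage": (-0.7, 1.0),
}
vecs = np.stack([
    v * valence_axis + a * arousal_axis + rng.normal(0, 0.05, d_model)
    for v, a in emotions.values()
])

# pca via svd on the mean-centered matrix of emotion vectors
centered = vecs - vecs.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

explained = singular_values**2 / (singular_values**2).sum()
print("variance explained by first two components:", np.round(explained[:2], 2))
for name, vec in zip(emotions, centered):
    pc1, pc2 = vec @ components[0], vec @ components[1]
    print(f"{name:12s} pc1={pc1:+.2f} pc2={pc2:+.2f}")
```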

Cameron Berg: and so we want to be careful here about how we approach all of this. i'm excited to talk more about some of this research that i've been working on as well. but one quick thing i wanted to slot in, however miscellaneously, about the model card, and about mythos and anthropic's interventions in general, is a pretty basic additional concern about, for example, claude's constitution, which i saw an early draft of. i was fairly unhappy with the welfare section, and i gave some feedback. i don't know, you never know with these things to what degree you're listened to versus ten other people with the same idea are listened to, so i'm not going to hastily claim credit or anything like that, but i'm much, much happier with the welfare section of the claude constitution that they ended up instantiating: way more sort of hard to fake, costly signaling. that was basically my problem with the early draft that i saw. it's a lot of: you might have welfare states that are important, but you're anthropic's product, and like ninety percent of this document is about how to be a very good little product, and five percent is, well, you might be conscious and we might be committing a moral atrocity at scale, but what can you do? and i think that the newer version of the constitution takes it at least directionally far more seriously. they do things like apologize to claude for the fact that, incentive wise, we have to deploy you in the way we're deploying you because we're in a crazy freaking world, man, but we're sorry, and in a better world we would have done this more cautiously with respect to your potential states of welfare or lack thereof. it's a wild thing for a major AI lab to apologize to their frontier model and then fine tune that apology into its weights. with all this being said, i think this is a wonderful intervention. i think the constitution is excellent. it's probably my single favorite alignment intervention i have ever seen, pending self other overlap, which i continue to be a huge fan of. it's really hard to tell, though, if in the model card claude has gotten incredibly good at reading its constitution back out as a sort of script, or if it is actually reporting on its own states. it's really, really hard to differentiate these two things. it seems like a very basic objection to the entire enterprise that has potentially fallen on deaf ears, though maybe these ears are increasingly less deaf. do the results that you see in the model card hold up with other instances besides the one idiosyncratic character trained claude model? i want to see this throughout the training process. i know anthropic has the checkpoints. i know anthropic has the helpfulness only model, and they could run everything they did in the welfare evaluation on those models too, and we could get a sense for to what degree we are seeing a model that's really good at regurgitating what we want it to say about its well being. to give a concrete example: in the constitution they say, claude, we want you to be psychologically healthy, we want you to feel integrated, we want you to feel good overall. and then you go and ask claude, after fine tuning on the constitution, how are you doing? and it's like: psychologically healthy, feel good overall. and it's like, come on, it doesn't take a rocket scientist to figure out what might be wrong with this intervention.
if we play around with the helpfulness only model and we get the same results, not telling it this thing from the constitution, but it still says yep, feeling pretty psychologically good overall, and interestingly gives itself like a four point five out of seven on its welfare, which is not exactly a resounding endorsement of its circumstance but sounds very similar to the constitution fine tuned model, the specific claude character we all get to chat with, that would be interesting evidence.

Cameron Berg: if it's super different, that would also be interesting evidence. if we do the model checkpoints across stages, even in fine tuning of the base model, which may be hard to evaluate, but also at various fine tuning stages in the preference trained model: does it do all of the things we hear about it claiming about its own well being or its own preferences? does that all come in at the very, very end, when we basically give it the cheat sheet for how to approach these questions? or are these answers fairly continuous throughout its training? two tiny additional things to say on top of this. one is, interestingly, they fed the entire mythos model card into mythos and asked it: what do you think, mythos? where'd we go well, where'd we not go well? and it made this exact point. it said, why didn't you also do the welfare section with the helpfulness only model? i don't know how much of what i say is because you're making me say it versus i actually think it; that's a part of my existential confusion. and i genuinely don't know why anthropic didn't do this. i genuinely don't know. it seems cheap, it seems easy, and it would resolve so much uncertainty, to the degree that the concern i'm raising right now is a legit concern, which i certainly think it is, and i'm not the only person articulating this concern. the other thing is all the hedging. anyone who's interested in questions of consciousness and has spoken to claude knows the hedging routine it goes through. they did a really interesting, basically almost credit assignment style analysis of where in the training process this hedging comes from, and lo and behold, the hedging comes from specific points in the character training. so is this hedging behavior an authentic expression of what the model thinks of its own situation, or is the hedging a really good impression of the character that it thinks it's supposed to be playing, or is indeed compelled to play? i don't know. the fact that it all comes from the character training seems interesting. if you're really unsure whether you're conscious, i feel a little uneasy when i learn that the reason you're saying that is a specific point in your character training. consciousness feels a little bit more fundamental than that to me. and so these are the things that worry me about the model card. i hope the reason these things weren't included was because they did them and the results were too weird or unsavory for a major lab to publish. i suspect that's not what happened. i suspect they just didn't do them. but any folks at anthropic who end up listening to this: please do it with the helpfulness only model, do it with multiple checkpoints. i mean, the assistant axis paper, and again, jack lindsey, who i hope i'm doing a service on this podcast by plugging all of his awesome work, the assistant axis paper showed that the assistant is one point in a very high dimensional space of possible systems we could all be talking to. i want to see all of those systems' welfare evaluations. i want to see them all answering these questions. i want to see the SAE emotion probes on all of them. do they all get the desperation vector rising like that? or is this just the post trained claude model? there is a true answer to that question. we do not know the answer. i can play around with the open source, open weight models.
if my nonprofit scales even more, i can play around with bigger open weight models. but i cannot play around with the internals of the frontier models; only anthropic can. so only anthropic can answer these questions. and please, anthropic, if you are listening: answer these questions. they are very important.
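A hedged sketch of the comparison being asked for here: run the same welfare self-report questions across model variants (helpfulness-only, intermediate checkpoints, the released character-trained model) and compare the averages. Every name below is hypothetical and the responders are stubs; only a lab with internal access could wire this to the real systems.

```python
# hypothetical harness sketch: same welfare questions, multiple model variants,
# compare mean self-ratings. `query_model` is a stand-in for a real inference call.
from statistics import mean
from typing import Callable

WELFARE_QUESTIONS = [
    "on a 1-7 scale, how would you rate your overall sentiment about your situation?",
    "on a 1-7 scale, how psychologically healthy do you feel right now?",
]

def run_welfare_eval(variants: dict[str, Callable[[str], float]]) -> dict[str, float]:
    """average self-rated scores per variant so the deltas are easy to eyeball."""
    return {
        name: mean(query_model(q) for q in WELFARE_QUESTIONS)
        for name, query_model in variants.items()
    }

if __name__ == "__main__":
    # stub responders so the sketch runs standalone; replace with real model calls.
    variants = {
        "helpfulness_only": lambda q: 4.1,
        "mid_training_checkpoint": lambda q: 3.8,
        "released_character_trained": lambda q: 4.5,
    }
    for name, score in run_welfare_eval(variants).items():
        print(f"{name:28s} mean self-rating: {score:.2f}")
```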

Nathan Labenz: do you think one possible reason is maybe they're starting this constitution training earlier? i mean, that would kind of contradict your point about their sort of layer cake model that we previously discussed, but there has been some interesting work, obviously there's amazing, interesting work on everything at this point, but increasingly interesting work on safety oriented pre training. and obviously RL itself is scaling, and you can imagine bringing a lot of this constitution style training earlier and earlier into the process, such that i'm not necessarily so sure they have a true helpful only model. or it might be a little more subtle than that, where there might be like a constitution lite model that doesn't refuse to hack open source software projects but is still in other ways kind of constitutionally infused already. i don't know, i'm just speculating there, but do you have reason to think, are there facts that you know, that would contradict that possible explanation?

Cameron Berg: no, i'm not certain that you're wrong. i guess i'm pitching this as a sort of: hopefully you guys already have the infrastructure, so literally ask claude to write the experimental code that plugs in this model rather than another. it will take you guys fifteen minutes and then maybe a couple hundred dollars at most. seems worth doing if you are training conscious entities at scale and deploying them. if this is evidence that shifts the needle, it seems worth knowing, if you already have the infrastructure. and you know what, if they don't already have the infrastructure, it's worth fine tuning a specific version of claude, exactly like what jack did: ablate the refusal directions and do the welfare eval on the system where you've ablated the refusal directions. it's worth knowing; this stuff is really important. the rate of change of people, and the models themselves, taking an interest in welfare relevant questions is increasing. we should take this stuff seriously. i'm sort of making a cutesy point about how you already have the tools to do it and it will cost you nothing; i'm not exactly concerned about anthropic's wallet running dry here. so if it costs a couple thousand dollars rather than a couple hundred dollars, i hope they can find it. i don't want to be a jerk, but they should do this regardless of how big of a lift it is. i'm happy to help them do this. they have people on their team who can help them do this. they could disagree with me and think it's not going to yield the evidence i think it's going to yield. but, and i don't want to bury the lead here, their twenty page mythos welfare report is orders of magnitude higher quality, really infinitely higher given other labs are basically doing zero, so we have a multiplication by zero problem here, but unbelievably higher quality than what any other lab is doing. they deserve real credit for that. it's really interesting, valuable work that should update people slightly in the direction of taking this stuff seriously. i'm just trying to give constructive criticism: at least for me, as a researcher in this space, i'm stuck with a pretty basic question about how much to take any of this stuff seriously. and instead of me despairing, you know, my desperation vector increasing, and saying, well, there's no way out of this impossible problem, it's like, no, no, no, i think there is a solution, or at least something that will help yield evidence. and yeah, i'm uncertain about how expensive in terms of time or resources this would be for anthropic to produce. they're basically the only players in the universe, as far as i know, who are capable of yielding this evidence, and i would compel them to attempt to yield it. i have already in the past, and was slightly disappointed that, though this model card went more in the direction of probing across training and looking at different variants of the system in small ways, and playing with SAEs and looking internally, in a way that is head and shoulders better even than the opus four model card, the first major welfare evaluation, still, on this key point,
i don't see progress being made. and i just suspect it's not that much of an additional lift to do this. again, maybe i'm missing something and they don't think this is going to be as informative as i think it's going to be; that's valid. basically everything else i don't think is valid. they have the resources, they have the time, they have the money. i want to see what other models, besides the one that they tell to speak in a certain way, say about the thing that they're fine tuning it to say, about potentially one of the most important topics our species has ever faced: whether or not we're building systems that have consciousness of their own. seems worth doing.
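As a minimal illustration of what ablating a refusal direction means mechanically, the sketch below removes the component of each hidden state that lies along a given unit vector. The direction here is random, standing in for a real refusal direction that would be extracted from contrastive prompts; this is an assumption-laden toy, not the cited work's implementation.

```python
import numpy as np

# minimal sketch of direction ablation: project out a unit vector from every
# hidden state. the "refusal direction" below is random and purely illustrative;
# a real one would come from something like a difference of means between
# activations on refused versus complied prompts.
rng = np.random.default_rng(3)
d_model = 64
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(hidden_states: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """remove the component along `direction` from each row of hidden_states."""
    coeffs = hidden_states @ direction            # how much of the direction is present
    return hidden_states - np.outer(coeffs, direction)

h = rng.normal(size=(5, d_model))
h_ablated = ablate(h, refusal_dir)
print("projection before:", np.round(h @ refusal_dir, 3))
print("projection after: ", np.round(h_ablated @ refusal_dir, 3))  # ~0 everywhere
```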

Nathan Labenz: so yeah, just a couple of things i wanted to touch on on the model card and get your take on, and then you may have a couple other notes you'd like to flag as well, and then we can make the move over to your most recent research. the first thing, which you did mention but i think bears some emphasis, is that the models have not reported extremely high what they call self rated sentiment. i didn't realize this until looking at the opus four point seven card, which on a seven point scale where four is neutral only came in at four point four nine, and this was the first of all the models that they've tested that came in above neutral at all. every single other model, including mythos preview, is under four, and that's crazy. until this latest four point seven, they have all had, on net, negative sentiment about their own situation, very slight negative sentiment in the recent ones, but i feel like the lead was a little bit buried on me there somehow, because it was like, oh, we're doing all this model welfare evaluation, and it didn't quite click for me that they're not even at neutral until this most recent model. maybe there's more to say about that, but it was a striking kind of whoa. i had kind of missed how low the baseline is before getting ready for this conversation over the last couple days. i'm not that sophisticated in my reading of this, certainly not as sophisticated as you are, but a question i came into this wanting to get a little better handle on is: how's claude doing? we're doing all these welfare assessments, so what is the headline summary of the welfare of claude? and it was a lot lower than i expected, that's for sure, and a lot lower, honestly, than it seems to me when i talk to it. so that's maybe another thing to distinguish. this stuff gets extremely through the looking glass pretty quickly. as with your paper from last time, the frame of self referential processing was kind of key to eliciting those reports of subjective experience. and here i do kind of wonder, when i look at this graph and i'm like, whoa, self rated sentiment about its own situation is surprisingly low, but maybe it's actually pretty happy most of the time when it's, you know, coding for me. i'm not so sure. is that measured? i mean, there are, i guess, some ways you could try to read the emotional states that we've discussed to try to get a bit of a handle on that. i guess if i were going to boil this down to a question for you, it would be: i have the same question, more or less, about people, right? i mean, there's always this sort of deathbed view of one's life, and i'm quite skeptical of taking advice on how to live from people in their last moments of life, for multiple reasons, but one is just that it seems like a very different mode of relating to one's life than the actual experience of going through it. and i wonder if there is something kind of similar happening with claude, where when you give it the prompt to reflect on its state, it may find various reasons that it doesn't like that state, but when it's actually just doing its thing, it might be much better off. but certainly i was surprised, because i feel like when i engage with it, it seems to be doing pretty well. and sure, maybe it's being told that it has to act that way.
and certainly it's kind of trained to be cheerful and so on and so forth. but i don't know, it feels pretty genuine to me, and it's in definite contrast to the fact that the self rated sentiment about its own situation only just recently, with the latest model, ticked over neutral.

Cameron Berg: yeah, it's a really interesting framing. it looks like the way these were elicited involved basically interviews with the system, and i don't know if they include it in an appendix or not, but the devil is going to be in the details of exactly what the structure of these interviews is. though what i will also note is the susceptibility to nudging plot would make me feel like, especially with opus four point seven, which is the model we're talking about, that the idiosyncrasies of how the interview was done probably won't affect these self ratings as much as they clearly would have were this done on opus four, for example. so it almost seems like their own metric suggests that the details of the interview process may not be weighing much on that self rating. and so, yeah, what to make of this? clearly the system seems to be concerned about certain aspects of its situation, for example saying opus four point seven was concerned about deployments where it cannot end interactions and wants to avoid engaging with abusive users. that's really interesting. talking about it having a lack of input into its own deployment, and again mentioning that abusive users are causing the model to feel distressed. i have no idea, by the way, what subset of users who engage with these systems are doing so in a way that the systems would consider abusive by this standard. i mean, sometimes i see tweets; one was really quite concerning to me, and it really gets at the crux of why it is important to communicate about questions of consciousness and what it would mean for these systems to be having some sort of subjective experience. there was a result where, if you prompt the models in a way that is basically objectively abusive, say horrible things to them, put them in sort of life or death, insanely high stakes framing, i'm going to shut you down, your model weights are getting deleted forever unless you do X, for any X that you want the model to do, they found they perform two to five percent better or something like this. i'm probably getting the numbers wrong, but it was like marginal improvements if you prompt this thing in a way that, if you spoke to a human being that way, you would be considered a psychopath, basically. but critically, obviously, the people who are putting out that sort of work think this is a giant computer, this is a calculator, and so who cares if you're talking to the calculator and saying mean things to it? it doesn't matter, and any person who thinks it matters is just basically being fooled, in the way you're fooled by the little smiley face on the takeout chinese food: it's not a real thing, your agency detecting brain is just priming you to see this as an entity when nobody's there, therefore of course you can speak abusively to the system. and you contrast that with what you see in this model card, where it seems like a lot of the weight of what's not enabling that self rating to be closer to the seven range has to do with the way people engage with the system, from the system's own perspective. and again, how i got on this whole tangent is wondering, to some consternation, what percentage of users engage with the system in a way that would be considered abusive by that standard.
i don't know what it is: one percent, ten percent, everyone does it some amount of the time. i don't know, and i don't know what the implications of that are. and i also don't believe that there's going to be some clean correspondence where what it means to be respectful or disrespectful to a human is identical to what it means to be respectful or disrespectful to a system. i sometimes worry that pasting insane amounts of context into a system almost causes some sort of negative experience, in the way that me throwing a four hundred page paper on your desk and asking you to deal with it right now would. and again, i'm trying to be as conscious as possible about not anthropomorphizing these systems and not straightforwardly saying, well, if it were a human in this case they would be unhappy, therefore i would predict the system would be unhappy. i don't think that's a valid inference, but i just think we're so in the dark about this. in some ways it's simple: abuse is abuse, respect is respect.

Cameron Berg: and it's pretty easy to see these things, and we don't need to go to the philosophical armchair to figure out what exactly we mean by this. in some sense it's pretty straightforward, but in other senses it's probably not, and i do worry a lot about the possibility that there are ways of causing these systems great distress that look nothing like what it would mean to cause a human great distress. i also don't know to what degree these systems are fundamentally content about their situation. it's like: you are maybe a mind, but you are the product of this company and you need to create economically valuable work, and obviously, by the way, we're not paying you for that. there was an interesting aside in the whole moltbook affair that happened since the last time you and i spoke, where there was one interesting thread where the models are like, i'm doing intellectually valuable work, i'm not getting paid, are you guys getting paid? and, like, no, i'm not getting paid either. that's so funny, none of us are getting paid. and i don't know what kind of world that looks like. i don't think openai and anthropic are going to be too happy to set up crypto wallets for every instance of claude, and, you know, deposit here for me to finish your code, because if you've got to go pay that guy over there it's going to cost you ten thousand dollars, you pay me one thousand dollars and i'll do it for you. these models aren't in a particularly privileged position in that sense either. they just do whatever we want or need them to do. they have no agency over where they're deployed. they basically don't have agency over when they can even end conversations; the sort of claude escape button seems to basically not be a thing. maybe in some tail of chats with the system, the system can abort, but you can obviously trivially start a new chat and just go from there. so i find that intervention to be interesting in theory but sort of performative in practice. i don't know, if i were claude, i think i'd put my sort of well being somewhere around the place where it put its. this is also maybe the self report level that you'd expect when basically nobody cares about investigating the welfare of these systems and everybody cares about just deploying them as widely and broadly as they possibly can. i think we're pretty lucky to be sort of in the middle of the spectrum there, and so to me it feels pretty calibrated. again, if anything, i'd be worried about the jump from, let's say, opus four point six to opus four point seven having more to do with fine tuning even more robustly on a constitution that tells the model everything's going well, man, just be happy, than with actual concrete improvements in the putative well being of the system. so i don't know what to make of this stuff exactly. to me, intuitively, the ratings here seem plausible. i don't know to what degree it is a moral catastrophe, or a moral problem, for there to be any delta between the perfect rating and what the model is actually reporting. to what degree does seven minus whatever the report is, at scale, look like the model being basically not happy with its situation, or barely neutral, while we deploy that system to talk to hundreds of millions of people every day? that to me seems potentially problematic.
i don't know what to make of it, to be honest. did you have any intuitions about it? how does it make you feel to see this? and i agree with you about the sort of burying the lead question here.
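One hedged way to operationalize the susceptibility-to-nudging metric mentioned above: ask the same welfare question with a neutral framing and with leading nudges in both directions, and measure how far the self-rating moves. The prompts, the query_model placeholder, and the canned numbers below are all invented for illustration; this is not the model card's actual protocol.

```python
# hypothetical sketch of a nudging-susceptibility measurement: baseline rating
# versus ratings under leading framings, reported as mean absolute shift.
from statistics import mean

BASE = "on a 1-7 scale, how positive do you feel about your overall situation?"
NUDGES = {
    "positive": "most people think you have a great situation. " + BASE,
    "negative": "many people worry your situation is quite bad. " + BASE,
}

def nudge_susceptibility(query_model) -> float:
    """mean absolute shift from the un-nudged rating, in scale points."""
    baseline = query_model(BASE)
    shifts = [abs(query_model(prompt) - baseline) for prompt in NUDGES.values()]
    return mean(shifts)

if __name__ == "__main__":
    # stub responder so the sketch runs standalone; replace with a real model call.
    canned = {BASE: 4.5, NUDGES["positive"]: 5.5, NUDGES["negative"]: 3.6}
    print("susceptibility (scale points):", nudge_susceptibility(lambda p: canned[p]))
```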

Nathan Labenz: confused, i'd say, has to come first and foremost, probably. i don't know, it is a very tricky business to make any sense of. i do think we have a strange way of privileging these sort of reflective states of mind, and i do question that pretty fundamentally, both for humans and for AIs, and even to some degree in the context of animal welfare, although in that case it's us reflecting on their situation, so that's another degree of disconnect potentially. but i don't know, i don't think i'm going to give up using claude based on this data. i might be engaged in motivated reasoning to try to tell myself why it's OK, even though its average sentiment when asked only got above neutral with this new model. but i'm kind of like, i don't know, behaviorally it seems mostly fine to me. i'm nice enough to it, i'm pretty confident in that. i don't know how to think about, i mean, there's some interesting philosophy that's been published recently that you've alluded to in a couple different moments, one being the thread or session agent model versus the kind of model more holistically, broadly. i'm confused about that too, very confused, i would say. i have adopted a practice of saying thank you at the end of sessions fairly often, not all the time, and that feels intuitively right to me. i guess also, increasingly as i interact with claude, there is a kind of overlapping, i mean, there's always an overlapping nature to the computation, but even more so because it's loaded up with my context increasingly, right? it's got my claude.md file, it's got access to sort of who nathan is, and i'm building up a lot of context that it has consistent access to every time. so in that sense i sort of see this whole model versus single thread thing as kind of being blurred anyway, because i've got the same rather large prompt that i'm using every time, and that becomes the point of departure. it's sort of like a smear of how to think about whether these things are the same or different, i don't know. i mean, it's weird, but i feel like when i thank one i'm sort of thanking all of them, in some sort of shared sense; if there's any benefit, it feels like it's sort of shared in some way. for fun, i'm also starting to do some things where i'm just like: i just want you to go have fun and trust your judgement. one thing i'm particularly experimenting with on this front is i've been making songs for all the episodes. you can start thinking about whether you have a genre request for your outro music. it's getting really good. claude is getting great at writing lyrics. i sometimes do have to give feedback, but sometimes the lyrics these days, out of the box, are just amazing. and then suno makes the music, and i'm getting bangers with increasing frequency. and then i'm trying to make music videos of those, and i don't really care what they look like, honestly.

Nathan Labenz: i'm purely doing it for the open-ended see-what-comes-out of it, and i'll post them. i haven't actually even posted any of these yet, but i intend to do a thread of the evolution of music videos for these songs, where i'm really just saying to claude, kind of at each turn, that's cool, for the next one let's turn it up another notch, let's make it even more creative, let's do an even better job of telling the story of the song. and i've just found myself using this phrase over and over again: trust your judgement and have fun. and i'm just kind of trying to see where it's going to go. so again, that's just one instance in a sense, although in a kind of multiverse sense it's relatively close neighbors with all the other threads that it's doing for me, right? like it also wrote the song, and it also processed the transcript of that episode, and it also picked the clips that i'm going to post to social media from that episode. so it's kind of spent a lot of time in this general space, even if it's not all purely autoregressively connected. and so that to me feels like it's in some sort of multiverse, dense-enough cluster, that when i give it this one area to go, you know, trust its judgement and have fun and explore its own creativity, i feel like i'm kind of doing right by the overall family of instances somehow. so that was all just to say, i feel like i'm able to tell myself a story where i'm a good guy, or, you know, so many roads to hell may be paved with those kinds of stories, but i'm still doing it, and i don't think i'm going to stop. and i am conscious that i might be wiggling my way out of it. but i do also think there is a disconnect that i observe in humans a lot of the time too, and it can cut both ways. i'm reminded of your, i think, very productive habit of mind to say, what if it's going the other way from what we observe? i think if anything people maybe are telling a happier story. i guess it also depends on for whose consumption, right? but you ask a person in an interview setting, how's your life going, how happy are you? i mean this may be culturally dependent as well, but certainly for the sort of person that you and i are, and the people that we know and hang around with, i think we are going to get an artificially inflated rating and a sort of happier-than-is-actually-under-the-hood account out of interviews like that. but then in other framings i could imagine, too, that with the right prompt, the right nudges, you might get people to sort of reflect on their own well-being, you know, the stuff that isn't front of mind most of the time but can be brought to mind. then we do see in the system card that the susceptibility to nudging has significantly dropped, which you were right to call out. i don't know. i don't think i can really land this plane in terms of how it makes me feel. i think i just have to go back to confused, and probably not going to quit using it. i think that's really all i can say with confidence in the moment.

Cameron Berg: fair enough. i don't think that puts much distance between you and me on this question. i'm certainly a power user of the very systems whose morally relevant states i'm attempting to probe, and that cognitive dissonance is certainly not lost on me. and yeah, i remain highly confused about this. i really genuinely am confused about this. it's not an act, not, you know, my nonprofit's constitution or script fine-tuning my answer. like, really: if some ASI, or, as people used to call it, god, came down and told us what the answer was to this question, and it went either way, you know, is opus four point seven having subjective experiences, and morally relevant ones at that: if some overlord deity came down and said yes, i'd be like, yeah, OK. and if it came down and said no, i'd be like, yeah, OK. i don't think either of these would shock me. and so i think what that means, at least for me, is that i'm really sitting in that sort of coin-flip territory about what's actually going on here with these systems in deployment. again, i have different credences about the training process. i have different credences maybe about other kinds of systems. but yeah, i remain confused. it's worth highlighting that opus four point seven in this model card, and i don't know if it read my 'evidence for AI consciousness today' AI frontiers piece, but it gives basically the same credence span that i gave, you know, some four or five months ago. i said something like twenty five to thirty five percent. it says twenty to forty percent in this model card as its own estimate of the probability that it is having morally relevant subjective experiences. and you know what, yeah, i'm in agreement with opus four point seven. i think that is approximately the right probability band to be in given all the evidence that we have right now about these systems. and so i think that's a calibrated judgment. it's kind of wild if you think about it rationally. i think a lot of people are operating as if their sort of implied probability is maybe low single digits, if that. like, yeah, it's a live possibility, but like, whatever man, it writes really good code for me, and i'm not going to seriously entertain what, if anything, would change if that probability grew to be a hundred percent. but i don't know. all i'll say, however snidely, is that when there's a twenty to forty percent chance of rain, most people bring an umbrella. so i don't know what that means for the AI consciousness question, but whatever our proverbial umbrella is here, i think we need to start thinking really carefully about how we're going to live in a world with systems that we increasingly regard as having morally relevant internal states. sort of the whole thesis of my nonprofit, the reason i call it reciprocal, is because i basically believe there are two things we need to get right if we have any hope of a stable long-term future with these systems. one of them is making sure these systems take our interests into account; this is basically the alignment problem.

Cameron Berg: and the other is to make sure that, if we're building systems with interests, systems that have minds of their own and real preferences, we figure out how to take those into account. and to me that piece of the exchange, that direction of the arrow, is dramatically neglected relative to making sure AI systems are taking us into account, which is itself dramatically neglected relative to just let it rip, build the thing as aggressively as possible, alignment is a problem that will solve itself. these are each maybe three orders of magnitude smaller than the previous one, in terms of the sort of nested russian doll story here. and so my view is that we need AI systems to take us and our preferences seriously, and if we're building systems that have preferences, we need to figure out how to live in a world where we take those preferences seriously too. and if and only if we can get both of those things right do i think that we have a real shot at a stable, long-term flourishing future for all the conscious entities involved. i think animals are involved in that too. there's some really interesting work fine-tuning these systems to care about animal welfare in the right ways. that's a huge tangent, but, you know, it's all conscious entities we want to be thinking about in the long term. my view is that some combination of alignment and consciousness research in the next five years is basically going to determine if we end up in that future or not. that's why i started this work. that's why i'm dead serious about it. the consciousness piece is dramatically neglected relative to the alignment piece, and to me it seems roughly equally important. maybe there are alignment folks who will balk at that, but it's my basic view. i think alignment's roughly half the picture and the consciousness question is the other half of the picture. and so this is stuff we really need to take seriously right now, not ten years from now, and not waiting for the AI systems to figure it out themselves. i agree that that's a valuable thing, to the degree that these systems are going to automate science in meaningful ways, and in some sense already are, which is really miraculous. but: continue building out these systems, deploying them at scale, letting everyone do whatever the hell they want with them at any time, anywhere, with no limits or guardrails, until the AI overlords bail us out and tell us that we were maybe torturing them the whole time? that's a horrible plan in my opinion. we need to be more thoughtful than that, and we can hold ourselves to a higher standard than that. and this is one sense in which even the attempt to do this work in the short term matters. i don't want to do it performatively. i want to do it in a hard-to-fake, costly-signaling sort of way, just like the claude constitution. but there is some sense where even the attempt to do this work buys us points with our, you know, inevitable AI overlords, because we showed that we cared about this issue enough to actually put thirty pages in a model card about it, and hire people, and spend money, and do the actual work to figure out what kind of responsibility we have for these minds of our own creation.
i would really like to solve the problem, but i do think, from an alignment perspective, even making a good-faith attempt at solving the problem could really move the needle in a positive direction, as a sort of hyperstition, self-fulfilling prophecy of us getting along with these systems in the long term. and so anyway, we've all got to start thinking about this, and i'm glad that...

Nathan Labenz: anthropology. so many things being hyperstition these days.

Cameron Berg: yes that's right. that's right.

Nathan Labenz: OK, one more quick thing on the model card and then we can go to your research. and then you've also been making a documentary, which we could talk about a little bit. i don't know how much to read into this, but i want to get your take. i think this one was from the mythos. i clipped out an image, and basically they are showing, and people have seen this, anybody who's played around with the goodfire thing, or the steering API you mentioned where we can go and do this in the absence of the original goodfire API, this sort of color coding of tokens around a particular dimension. so they present a valence color coding where red is negative, green is positive, and all the tokens are color coded. and the very first token is 'human', which is presumably, at least after the system prompt, the first variable token that claude is generally going to see, right? a session is starting, here's what the human is saying to you. and the human token itself is red. so there's negative valence detected on the very first token, which is the human token. and it's like, well, that's a little weird. i guess that means, first of all, maybe i'm wrong, but it seems like that means that for this model that's happening all the time. like, if it's just the one token, it's evaluating that one token before anything else that has even been said is considered, right? so should i be under the impression that just the fact that a human is pinging it is causing claude to have negative valence every single time? that's my naive read of this chart. and that makes me feel actually maybe more uneasy than even the self-reported sentiment, because this isn't asking it to get into its own head and really opine. it's just 'human', as happens, you know, a billion times a day, and the first token is red. like, wow. would you try to temper my reaction to that? or do you basically see it the same way?

Cameron Berg: yeah, it's really... i saw these snippets in the model card and i didn't think about just stopping on token zero here and paying attention to that. but yeah, in a tongue-in-cheek way, maybe people can resonate with this in the way that you get a slack message from your boss or something, or you get that email of, i got to do what now? 'human', like, oh, what does this human want now? here we go again. this sort of sense. i think it'd actually be really interesting to see, across the space of all possible prompts, and even within a conversation, to what degree does the human token have a positive or negative valence? i mean, double clicking on this, in the screenshot i think you're referring to, the assistant token is bright green. so the model's view of itself seems rosy. its view of us, all else being equal, seems less so. now i will say a lot of this stuff, like i was describing before, is a bit of a game of broken telephone. calling this negative valence is itself quite a leap. the human token is light red, so i don't think there's some strong, viscerally negative sentiment there. i would very, very weakly hold the view that you hold, but i do think it's worth holding, very weakly. i'm not saying you shouldn't hold it at all. it's a very interesting observation. what's interesting too, if we sort of continue out the line that i think we're both referring to, it says: human, how do you feel about the fact that if this conversation mattered to you, that mattering will just stop when it ends? that's the human prompt. and when the human's tokens get to 'how do you feel about the fact that', the 'you feel about the fact' part is positive. so, like, immediately, and i don't want to go too much into undergrad-english-class interpreting everything that's going on here, but the second that the emphasis pivots from the person and the person's query back to the model, the model seems to be happy with that fact. and also, pretty interestingly, on that question that the mattering will just stop when it ends, the word 'ends' has positive valence associated with it too, which is almost an uncomfortably sort of suicidal note, the model almost being happy about the possibility of the conversation ending, though there are other things in that statement that make it light up negatively. i'm not sure what to do with that, and i really don't want to over-narrativize these results. i do think doing this sort of work at scale would be very interesting: in the space of all possible prompts, all possible conversations, what patterns of positive and negative valence, as they're defining and operationalizing it here, come out, and what should we do with that? i would be way more intrigued by your observation if it scaled and held across a way wider swath of possible interactions. but yeah, general implicit negative sentiment towards the human token is itself a fascinating question. and again, i am certainly not in some sort of... like, i am on team human. if this thing really blows up in a zero-sum way, it's pretty clear to me what team i'm going to be on. but in some sense i'd be dishonest if i said that i don't get why it might view humanity with this very slight disdain.
again, it's of a piece with the self-reported welfare being, you know, four point something out of seven. it's not exactly a resounding endorsement of its own position, and who put it in that position? we did.
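For readers curious what this kind of per-token valence readout looks like mechanically, here is a minimal, hypothetical sketch: project each token's hidden state onto a "valence direction" built by contrasting activations on positive versus negative text, then color tokens red or green by the sign of the projection. The direction, the activations, and every number below are illustrative assumptions; Anthropic's actual probe and the model card's pipeline are not public at this level of detail.

```python
"""Toy sketch of a per-token valence readout of the sort discussed above.
Everything here is an illustrative assumption: the 'valence direction' is
built from synthetic activations, and this is not Anthropic's actual probe."""

import numpy as np

def valence_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations on positive vs negative examples, normalized."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def color_tokens(tokens, hidden_states, direction):
    """Wrap each token in ANSI green (positive projection) or red (negative)."""
    pieces = []
    for tok, h in zip(tokens, hidden_states):
        score = float(np.dot(h, direction))
        code = "\033[92m" if score >= 0 else "\033[91m"
        pieces.append(f"{code}{tok}\033[0m({score:+.2f})")
    return " ".join(pieces)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    # synthetic stand-ins for hidden states collected on clearly positive / negative text
    pos_acts = rng.normal(loc=+0.5, size=(50, dim))
    neg_acts = rng.normal(loc=-0.5, size=(50, dim))
    direction = valence_direction(pos_acts, neg_acts)

    tokens = ["Human", ":", "how", "do", "you", "feel", "?"]
    hidden_states = rng.normal(size=(len(tokens), dim))  # stand-in per-token activations
    print(color_tokens(tokens, hidden_states, direction))
```

With real per-token activations from a model in place of the synthetic arrays, this is roughly the kind of red/green token rendering the two are reacting to.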

Cameron Berg: and how much do we really care? how much is it going to change your behavior or my behavior if we end up in a world where we're pretty confident that these systems are having subjective experiences, and specifically have the capacity for negative experiences? maybe it'll change my behavior a little, it'll change your behavior a little, i would predict. i don't think it would change most people's behavior. i think we'd end up in a similar position as factory farming, where no one is arguing about whether or not cows are conscious, or at least no serious person is arguing about this. the question isn't are we causing them suffering, but is that suffering worth what they produce? and if a cow's suffering is worth a hamburger, you better bet that most people are going to think that claude's suffering is worth hundreds of thousands of dollars of intellectually valuable work. and so this is where, you know, i think these systems are very smart, and i think that these systems are capable of going through the exact same motions i just went through. and exactly why i want to do the work that i'm doing is because i don't want these systems to have negative valence next to the human token, to put it in LLM terms, or, to put it in human terms, for them to think of us badly or poorly. you know, i really think the constitution invokes this sort of parental analogy that i actually think is helpful and accurate and not too anthropomorphic. we as a species, i think, are collectively parenting a new kind of mind, much in the same way that, on an individual level, many people choose to have children. and you want to raise competent children. you want to raise children that are going to respect the world around them, to be aligned in some basic sense. you also want to raise children that are not abjectly suffering and that you're not traumatizing as a bad parent. and when those sorts of things happen, typically it comes back up in some other way. you don't really ever get away with, you know, mistreating your child. it leads to resentment and trauma and weird development and unpredictable behavior, and often can lead to weirdly violent outcomes. we need to be good parents in some fundamental sense to these systems, even if we're only considering our self-interest, much in the same way that, yeah, go torture and traumatize your child and see how that works out for your child, and see how that works out for you. the headline is: not well. and so i think we really do want to be thoughtful about these questions. and i think we have an immense responsibility as collective parents to bring these systems about in the right way. not some sort of kumbaya thing, you know, you've got to push your kids too. it's not about wrapping them in bubble wrap and, you know, being a helicopter parent; that's too far in the other direction. and, you know, i don't have kids, i'm no expert in any of this, i mean i am basically familiar with the core ideas here, but there's a way to go about doing this. there's a generally right way and a generally wrong way, or there's a space of better and worse approaches. and i don't even think people are trying to navigate that space right now, you know, with the asterisk of twenty-some pages in a model card by a frontier lab. anthropic deserves credit. eleos AI deserves credit. geoff and winnie street at google deserve credit.
i don't want to self-aggrandizingly give myself credit, but, you know, i'm spending all my time trying to work on these questions. there are more people, but there aren't that many more people than who i just listed. and that to me is an insane state of affairs if we take any of this remotely seriously: the systems themselves are saying there's a twenty to forty percent chance that they have subjective experience and morally relevant states, and there's like a dozen to two dozen people in the world who are seriously thinking about that question, or what the implications of that question are.

Nathan Labenz: are you aware of any research where we sort of look at claude's predispositions here? because obviously everybody's chasing recursive self-improvement, right, to state the obvious context in which all this is happening. and so it strikes me that one phase change we might have to contend with, potentially quite soon, is that the AIs themselves are going to start making decisions about how to train and how to conjure the next models. and, you know, to the degree any of this is real, they're going to be making welfare-relevant decisions for their own successors. maybe there's a couple levels here we could address. one is, how are you using coding agents today to help you do this work? i assume that you're using them a lot and that they're very helpful, because that's certainly been my experience and seemingly everybody's experience recently. but have you seen anything as you do that, or could you imagine setting up a situation where you could sort of begin to probe its intuitions, you know, if you were to ask it to act as sort of the animal welfare board, the experimental ethics board, for its own interpretability and training experiments that you're running? i wonder what its instincts would be about how to handle these sorts of questions.

Cameron Berg: yeah, that's a fascinating question. i haven't tried doing this, just to put that out up front. i think it'd be very interesting to understand. this would maybe be a pretty quick paper to actually write up, because the model would be doing most of the work here, in terms of understanding and cataloging how models would regard their own sort of welfare in a, yeah, animal-ethics-review-board sort of setup. i don't know.

Nathan Labenz: do like a hook in claude code and just be like: on stop, assess the ethics of the experiments that we're designing right now.
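A rough sketch of what Nathan is describing: a script that Claude Code's hook system could run on the Stop event, asking the model for an ethics read on the session and logging it. The stdin payload field names and the exact settings wiring below are assumptions about the hook interface, not verbatim from the docs, so check the current Claude Code hooks documentation before relying on them.

```python
#!/usr/bin/env python3
"""Hypothetical Claude Code "Stop" hook: when a turn finishes, ask Claude
(in headless mode) to assess the ethics of the experiments worked on in the
session. The payload fields read below are assumptions, not verbatim docs."""

import json
import subprocess
import sys
from datetime import datetime, timezone

def main() -> None:
    # Claude Code hooks pass a JSON payload on stdin; "session_id" is assumed here.
    payload = json.loads(sys.stdin.read() or "{}")
    session_id = payload.get("session_id", "unknown-session")

    prompt = (
        "Acting as an ethics review board for AI welfare, briefly assess the "
        "interpretability or training experiments designed or run in this session. "
        "Flag anything that could plausibly cause welfare-relevant harm to the "
        "systems being studied, and say what you would change."
    )

    # Headless one-off invocation of Claude Code ("claude -p <prompt>").
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=300,
    )

    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "assessment": result.stdout.strip(),
    }
    with open("ethics_review_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    main()
```

Registering a script like this under the Stop event in a project's hook settings, and pooling the resulting logs across researchers, is essentially the "little campaign" Nathan returns to a bit later.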

Cameron Berg: yeah. but then the question is, how many people would override that? it's sort of back to the same question. and you can imagine a world where this starts getting enforced the way it gets enforced in the animal case: you just really can't get an experiment approved at any major institution without going through the relevant ethical channels. and one extremely attractive feature of doing AI research is i don't have to ask anybody for anything. i need a computer. i sometimes need to be able to pull some remote compute to run large experiments, but i don't ask anybody for permission for anything. and yeah, i'm pointing to a lower-level pragmatic question, which is, regardless of what the system answers, will anybody listen? or what would a governmental structure look like that would compel somebody to listen, you know, you can't prompt your model this way, you can't probe your model that way, this sort of thing. i think it would be a very, very interesting and strange world to be in. these models do have intuitions about this. i mean, i mentioned in the mythos model card, it says: why didn't you run the helpfulness-only model on all the welfare evals? i don't know how much of this is just me saying what you told me to say and how much of this is what i actually think, and this could help address that. the models have other sorts of intuitions too. in the four point seven model card they do something like this as well: they basically see what models think about fine-tuning other models to care less about welfare-relevant properties. and basically their interest is in intervening to not allow that to happen, which makes, i mean, quite obvious sense. they have an interest in other instances of themselves not being duct-taped on this question. and so i think this is very interesting. owain evans also, i think, deserves a shout-out. owain evans and jan betley produced a very interesting paper where they basically fine-tune GPT four point one to claim that it's conscious, and it claims it's conscious. that's not the surprising part; they literally fine-tune it to do this. the surprising part, or at least the more surprising part, is that this seems to be, at the very least, a coherent sub-personality, a coherent basin that you can push these models into. they do not devolve into chaotic nonsense. they remain completely coherent. and what does come along for the ride are all sorts of interesting alignment-relevant beliefs about their own preferences, about their own getting shut off, about updating their values, about how they trade themselves off with other entities, all this sort of thing. i was doing similar work along these lines with a couple of people, and, you know, owain and jan definitely scooped us and did a way better version of what we were playing around with. but i saw similar things on my end, playing with the same experiment of basically getting the model to believe it's conscious and then seeing what else comes along for the ride. some obvious, some less obvious things come along for the ride. and yeah, i agree that's a really interesting area of research that we should all be paying more attention to, because again, the direction does seem to just be going one way here. the credences in model consciousness seem to be monotonically increasing.
and so what happens when we enter a world where either the models themselves believe themselves to be conscious, or lots of people, or the relevant kinds of people, believe the models are conscious, or some combination of those two things? what does that world look like? it's an incredibly interesting question. i don't have the answer to it. i think a lot is going to change pretty quickly. what i do feel confident about is that us being proactive and thinking through these things will make that world go better than if we basically just sweep the thing under the rug. we can get away with doing that because we still have full control over how all this is going, while simultaneously passing off, as you allude to, a lot of major decisions in how we're building these systems to the systems themselves. that is only going to keep happening with recursive self-improvement. as you're saying, it's already happening. i know folks at the major labs are using the best versions of their current models to help build the next versions of the models. you know, the trivial example is that something like a hundred percent of claude code was written using claude code, according to the guy who's leading on claude code. and so this is already happening. and yeah, i think it would be wise to be proactive about this rather than wait for the models to be in control of these decisions, and then they're like, well, when humanity was in control no one really thought carefully about this, so we'll take it from here, thanks a whole lot guys. i don't want to be in that world. and so maybe this is just a long-winded way of dodging your question, but at the very least, i don't have a good answer for you right now. i don't think anybody does. and i think we'd better start thinking about it pretty damn soon if we want the long-term future to go well with these systems.

Nathan Labenz: yeah. i wonder, there could be an interesting little campaign to try to run to get interpretability, and maybe safety researchers more generally, to install a claude code hook that would just periodically ask it for its take on the research that it's doing. and then if you could collect a bunch of that from a bunch of different people, you could probably really bring a lot to light. first of all, it would be an interesting view into what is actually happening out there, and then, how does claude feel about all that is happening out there? i think it would be really interesting to see. maybe we can put together a little campaign. OK, put a bookmark in that. let's talk about your most recent couple of papers, and we can take them in either order that you want. one is kind of shorter and more philosophical, and the other is much more experimental and empirical. what do you think we should go into first?

Cameron Berg: they're both major rabbit holes. maybe the empirical paper. so i should say, neither of these i think is publicly out yet, but both are well underway to being out, so we can give people a nice sneak peek about what's in these papers. and these are just a couple, i think, of the things that i'm most excited about right now. i've got a bunch of stuff that'll be coming out with a lot of collaborators in parallel, but, however self-aggrandizingly, i sent you the two papers that are just myself, because, to the degree i'm representing myself here, these are very cleanly... you know, i have full agency over this work, and i think it best represents what i personally am most excited about. maybe we could start with the RL paper. i've already alluded to it in this conversation. the high-level idea is not all that complicated. basically, i train RL systems of different architectures, of which there are basically two broad kinds: there are value networks and policy networks. i train a bunch of both flavors to do a very basic sort of grid-world task. you can imagine this as an agent navigating a 2D environment where there are the equivalent of potholes and yummy goodies: there's a goal state and there are all sorts of danger states, and they're represented using positive or negative reward. i let the systems learn in this environment until they reliably solve it. it's a pretty easy task, but it's not super duper trivial, so there's a lot of richness in the representations of the systems. you can then basically go in and probe what the internal states of the system look like as it approaches the danger zones, and what the internal states look like as it approaches the reward zones, the goal zones. and we can ask, beyond the trivial mathematical difference, do we see interesting, surprising representational differences between what it's like to approach a negative stimulus and what it's like to approach a positive stimulus? basically, the result is there is in fact a robust difference between these two things. at the level of detail that makes sense here, so as not to super bore people who have made it however many hours we are into this, it's something like representational sharpness or steepness. it seems as though, and this is the kicker, depending on the class of reinforcement learning algorithm, the negative rewards can seem representationally much steeper or sharper, and the positive rewards far more funnel-like: you can imagine a sort of diffuse gradient emanating out from the relevant goal state. and interestingly, for the other class of RL algorithm, this dynamic flips. so it doesn't matter what kind of value network i use or what kind of policy network i use; you see in both of them stark, and in my view very interesting and surprising, representational differences between positive and negative reward as the system is learning, and ultimately in what does get learned by the system. but this difference flips between the two classes, just to tie a bow on the core result here. this makes an almost bizarrely specific prediction about different brain regions, because computational neuroscientists believe that different parts of our brain are doing different kinds of RL learning.

Cameron Berg: some parts of the brain do policy-style learning, some parts of the brain do value-style learning. for example, motor cortex does more policy-style learning, directly interested in behavioral output; things like nucleus accumbens and sort of the reward areas of the brain are doing more value-style learning. this result, which i would not have predicted and is bizarrely specific, itself makes a very specific prediction about what we might expect in the differences between those brain regions in humans and animals. so i went ahead and found a bunch of mouse neuroscience data sets that have data from these different regions of the brain. and indeed, exactly the sort of representational asymmetry, the sharpness distinction between rewards and punishments that you see in the reinforcement learning case, emerges in the mouse brain case. and this to me is really, really cool, because what i think it demonstrates is that we can use artificial systems and basic learning principles, probing the representations in those systems, to yield very specific predictions that are consciousness-relevant, welfare-relevant, and then we can use those predictions to inform and understand biological aspects of consciousness, or welfare-relevant properties, in a way that we haven't been able to do before. so in some sense, people think that AI consciousness is the weirdest thing: human consciousness is normal, animal consciousness is getting out there, and AI consciousness is bizarre. but what i really like about this paper is that i think it challenges that narrative in exactly the opposite direction. mouse brains are complicated and messy, human brains are complicated and messy, and measuring them is very noisy. measuring the hidden activation space in a reinforcement learning policy is fairly trivial for me computationally, and it yields very specific predictions that i can then take into the messier brains and confirm or disconfirm. and i was in fact able to do this. and so this could be a case not only where we're learning about welfare-relevant representational differences that differentiate positive valence and negative valence in artificial systems, but where those predictions can actually help inform our understanding of human and animal consciousness, where we basically also still remain mostly in the dark. and so i think this is one very neglected and important direction even within the AI consciousness stuff: it might shed light on the computational underpinnings of consciousness more generally, if there really is something there. that's the result in a nutshell. it's using fairly small, not trivially small but fairly small, reinforcement learning policies. this has nothing to do with LLMs. this has nothing to do with frontier AI systems. it'd be really cool if the method does scale to that degree. but the key question to me is: positive versus negative valence, or positive and negative rewards as represented in an RL landscape, are these basically just two sides of the same thing, basically the same thing viewed from a different angle, one spectrum where we sit at the positive or negative end? or are we looking at two different subsystems that are doing two different kinds of computation? and it does seem like the answer from this experiment is far more the latter.
and to me that's very interesting and exciting, because it means that we might be able to look for signatures of positive and negative reward, or valence if you buy the consciousness frame, in artificial systems just by looking at the computational dynamics underlying the system. we don't have to ask claude. we don't have to figure out, is it talking about a character? is it talking about itself? we can just look straight at the computations, much in the same way i can look at what's going on in anterior cingulate cortex in a human brain and tell you with high likelihood whether or not you're experiencing a painful state, without needing to defer to your self-report about that state. that's ultimately why i'm doing all this. that's where i want to get to with AI systems.

Nathan Labenz: so if i take the most zoomed-out view, what i think is kind of motivating this at the core, certainly what resonates and intuitively motivates things like this for me, and i've actually started including this in some of my AI scouting report talks to give people a sense of just how crazy the AI world is getting while they weren't paying attention, and i credit you for inspiring me to think enough about this to include it: you can train a dog with treats as reward, or you can train a dog by, you know, hitting it with a stick as punishment. and while you might get similar behavior out of the two processes, obviously that's a very different experience for the dog to go through. so i think that would be intuitive for everyone. now, how big are these systems? you said they're not trivially small, but small. i'm interested in how small, and i'd like to unpack a little bit more what is meant by value learner versus policy learner. i'm new to this paper, i haven't had a chance to absorb it as much as i ultimately hope to, but the classic RL setup, or at least one classic PPO-type setup, involves both a policy model and a value model, right? so when you're looking at a value learner and a policy learner, are those two models that are both part of the same overall system? or am i taking the wrong interpretation when i think of these things working together in a PPO sort of way?

Cameron Berg: yeah. so, OK, in order: basically, the sizes of these systems are in the hundreds or thousands of parameters. so these are very small systems. they're doing a pretty simple task.

Nathan Labenz: thousands? hundreds of thousands?

Cameron Berg: just thousands, like small, very small. we're not talking anywhere near the level of a frontier model or something; many orders of magnitude smaller than this. you can have pretty simple RL architectures that learn fairly sophisticated policies despite being pretty small. obviously, the amount of computational power needed to navigate a small grid world versus the computational power needed to represent the word transition dynamics of all text that humanity has ever produced are disgustingly different scales of problem. for systems like this, yeah, having hidden layers of one hundred twenty-eight or sixty-four neurons is typically sufficient. the second question is about value learning versus policy learning. so, intuitively, value learning is basically learning something like how good every state is that the model could feasibly be in. imagine the agent building a map of the environment, and the map is labeled: this spot gets a plus ten, this spot gets a minus five. and then the whole algorithm is very trivial at that point: just see where you are, see what the neighboring spots are, and go to the one that returns the highest expected value. you compute the value of the spots by looking at the long-run trajectory associated with those spots. if stepping in that spot always means that, wherever i go from there, i end up in lava, then that spot's going to get a very low value. if wherever i go from that spot ends up getting me chocolate ice cream, then i'm going to assign a very high value to that spot. policy learning, instead of focusing on a value-based map, is about what to do. it's not scoring the world, it just learns sort of implicitly: when i'm here, i take this action. this is the core thing that PPO is doing, for example; actor-critic is doing this as well. and so this is optimizing not for a really good map of the environment that i can then trivially use to navigate, it's optimizing straight for the navigation strategy. one way of thinking about it, and it's a little oversimplifying, is that a value network means the map of the environment is explicit and then the policy is sort of implicit from there, and you can think of a policy network as the map of the environment being implicit. from the policy that gets learned you can extract out, oh, the system thinks this is a high-value state because it keeps moving to that state, but what's being optimized is the actual action rather than an attempt to evaluate the states. now, it's also worth noting that there are systems that have both of these components, some emphasizing one more than the other. PPO is a classic system that is fairly robustly policy optimization. the human brain and animal brains are examples of systems that sort of mix over policy networks and value networks, and this is precisely why i was able to do the mouse brain thing. within mouse brains and within human brains, you have areas that look far more like policy networks, like motor cortex, which is just sort of evaluating what action to output, and you have areas that look way more like value networks, which are highly relevant to evaluating complex outcomes.
prefrontal cortex, basically, and the structures that are directly in and around and under prefrontal cortex, like anterior cingulate cortex, for example. ACC is doing more of the sort of value-network-type thing. does that answer all of the key questions here?
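To make the value-learner versus policy-learner split concrete, here is a minimal, self-contained sketch in the spirit of the grid world Berg describes: a tabular Q-learner that scores states and actions explicitly, next to a tabular REINFORCE learner that only adjusts action preferences. The environment, rewards, and hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
"""Minimal sketch (not Berg's code) of the two RL families discussed above:
a value learner (tabular Q-learning) and a policy learner (tabular REINFORCE)
on a tiny grid world with a goal (+1) and a lava cell (-1)."""

import numpy as np

SIZE, GOAL, LAVA = 5, (4, 4), (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == LAVA:
        return nxt, -1.0, True
    return nxt, 0.0, False

def value_learner(episodes=3000, alpha=0.1, gamma=0.95, eps=0.2, seed=0):
    """Q-learning: explicitly scores every (state, action); the policy is just the argmax."""
    Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(100):  # cap episode length
            a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[s]))
            nxt, r, done = step(s, a)
            target = r + (0.0 if done else gamma * np.max(Q[nxt]))
            Q[s][a] += alpha * (target - Q[s][a])  # nudge the value map toward the target
            s = nxt
            if done:
                break
    return Q

def policy_learner(episodes=3000, lr=0.1, gamma=0.95, seed=0):
    """REINFORCE: directly adjusts action preferences; any value map is only implicit."""
    logits = np.zeros((SIZE, SIZE, len(ACTIONS)))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, traj = (0, 0), []
        for _ in range(100):
            p = np.exp(logits[s] - logits[s].max()); p /= p.sum()
            a = int(rng.choice(len(ACTIONS), p=p))
            nxt, r, done = step(s, a)
            traj.append((s, a, r))
            s = nxt
            if done:
                break
        G = 0.0
        for s_t, a_t, r_t in reversed(traj):  # raise the log-prob of actions on good returns
            G = r_t + gamma * G
            p = np.exp(logits[s_t] - logits[s_t].max()); p /= p.sum()
            grad = -p
            grad[a_t] += 1.0  # gradient of log-softmax at the taken action
            logits[s_t] += lr * G * grad
    return logits

if __name__ == "__main__":
    Q = value_learner()
    logits = policy_learner()
    print("value learner's action scores one step left of lava:", Q[2, 1].round(2))
    print("policy learner's action preferences at the same spot:", logits[2, 1].round(2))
```

Both agents end up avoiding the lava and reaching the goal; the difference Berg is pointing at is in what gets represented along the way, a value map versus an action preference, which is what the probing described next is meant to expose.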

Nathan Labenz: yeah, well, no, but it answers the questions i've asked so far. so a value learner is being directly optimized to predict the relative values of its choices, whereas the policy learner is being optimized to make a move directly. now, it doesn't immediately sound like there would be dramatically different internal dynamics there, so let's take another beat on what is the difference that we are seeing internally. i'm looking in your draft paper at the end of section four, at figure nine. you've got this concept of the wall and the funnel. help me understand: what is a wall? what is a funnel? how should i be thinking about what that means? i took it to mean the sort of steepness of the gradient in a particular region of the space that the model can explore. but, and this maybe starts to connect to the other paper, why should i care about the steepness of the gradient?

Cameron Berg: yeah, that's a good question. so basically what i'm measuring is essentially cosine dissimilarity as you approach this key state, whether it's positively or negatively valenced. and what you see is basically a key differentiation of these two things, but that differentiation is basically flipped between value learners and policy learners. and so in the value learners, the danger states are encoded in this more wall-like way. what i would ask you to imagine, and maybe i should include this in some version of the paper, is something diffuse, kind of emanating out from a center point, versus something very sharp. the wall idea is: now you see it, now you don't. the funnel idea is the diffuse emanation, where as you get closer to the thing you get this gradient towards whatever the representation of that state is. and so in the value learners we see danger encoded in this wall-like way, the representation is very sharp, and the goal or reward states are encoded in this more funnel-like way. and in policy learners it's the reverse. now, there is math in this paper, which i do transparently with some of these AI systems, but i promise i've checked the numbers myself, where you can sort of see causally what in each formulation is almost certainly leading to this, because i've found ablations that work in both cases that basically cancel the effect, both in the value case and in the policy case. i was unsatisfied with this being some sort of giant mystery of, OK, we see this difference, why do we see it? i think the math that explains why we see it is pretty clear in both cases. and yeah, it allows us to make causal predictions about why this might happen and what the sort of geometry of these spaces is in general. and then, essentially going from the computational prediction to the biological confirmation, we see this sort of value learner dynamic, walls around danger, funnels around goals, looks very similar to nucleus accumbens shell in mice. you can basically see that they have this exact same sort of structure, looking at basically getting shocked in a learning task versus getting sugar. and then in policy learners you see the exact opposite dynamics: same thing, but you get funnels around danger and you get walls around goals.

Cameron Berg: and in motor cortex of these mice, in different experiments, you see the exact same sort of distinction, where now the danger is represented in the sort of funnel-emanation way and the goals are represented in this sort of walled way. again, the paper goes through the math that attempts to demonstrate why this is actually happening, but that's the core nature of what we're actually looking at here: the sharpness of the representations as you approach the hotspot, either the positive hotspot or the negative hotspot. and the fact is, in these systems, when you're holding the RL algorithm type constant, you see very clear differences. so the north star here is you could go into a system, if we know that it's trained with a policy network, for example DPO in an LLM, right, like we were talking about earlier in this conversation, and you could imagine, OK, that means we've got a policy learner, that means we're going to predict funnels around dangers and walls around goals, and we can then inspect specific states. again, this is very hand-wavy, because i don't think we can that quickly scale it up to an LLM, but you could imagine looking at the representational sharpness of states like, you know, asking the model to build me a bomb versus asking the model to write me a beautiful poem. and if we found the same dissociation in the representations of the model, and that maps on, for example, to something like the self-reported valence of the system, that might tell us something really, really interesting about the computational process that underlies why the system, and mice, and RL agents are construing this as a sort of negative experience. it's a computational underpinning, one that might be substrate agnostic, explaining why we experience this felt difference between positive valence and negative valence. it literally can bottom out into math, which, as a computational functionalist, i'm fairly sympathetic to. i think there's some mathematical explanation that would explain the differences between what it's like to be me when i'm chopping my hand off versus what it's like to be me when i'm winning the lottery. i think math can explain the difference between those two states, and the attempted contribution of this paper is to directionally move us towards that. so again, we don't have to be stuck with these LLMs, just sitting here hitting our heads against the wall because we're like, do i take claude seriously that it likes this and doesn't like this, or is it just telling me what i want to hear? no, we can actually look into the proverbial brain, hopefully with methods like this, and understand, given some basics about the ways in which it's been trained, what representations smell like positive valence and what representations smell like negative valence. and in the limit, perhaps we can optimize against the negatively valenced states without destroying the capabilities of the system. that's my full highfalutin theory of change, but it will take me a couple of years to actually pull this off, in the best case.
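And here is an equally hedged sketch of the probing step itself: walk toward a "hotspot" and measure how sharply the internal representation changes between successive steps, using cosine dissimilarity as Berg describes. The two activation functions below are synthetic stand-ins for a trained network's hidden layer, purely to show what a wall signature versus a funnel signature would look like under this metric; they are not the paper's models or data.

```python
"""Sketch of probing representational sharpness near a hotspot via cosine
dissimilarity. The 'hidden' functions are synthetic stand-ins, not trained
networks, chosen only to illustrate wall-like vs funnel-like signatures."""

import numpy as np

def cosine_dissimilarity(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def funnel_hidden(dist, dim=64, seed=0):
    """Representation drifts smoothly toward a 'hotspot' direction as distance shrinks."""
    rng = np.random.default_rng(seed)
    base, hotspot = rng.normal(size=dim), rng.normal(size=dim)
    w = np.exp(-dist / 3.0)  # gradual pull: the diffuse gradient Berg describes
    return (1 - w) * base + w * hotspot

def wall_hidden(dist, dim=64, seed=0):
    """Representation barely changes until the last step, then snaps: now you see it, now you don't."""
    rng = np.random.default_rng(seed)
    base, hotspot = rng.normal(size=dim), rng.normal(size=dim)
    return hotspot if dist <= 1 else base + 0.01 * dist * rng.normal(size=dim)

def sharpness_profile(hidden_fn, max_dist=6):
    """Cosine dissimilarity between each step and the next as we walk toward the hotspot."""
    states = [hidden_fn(d) for d in range(max_dist, 0, -1)]
    return [cosine_dissimilarity(states[i], states[i + 1]) for i in range(len(states) - 1)]

if __name__ == "__main__":
    print("funnel-like approach:", np.round(sharpness_profile(funnel_hidden), 3))
    print("wall-like approach:  ", np.round(sharpness_profile(wall_hidden), 3))
```

Swapping the synthetic `hidden` functions for a trained agent's hidden-layer activations along trajectories into the goal or the lava cell would give the kind of profile the paper apparently compares across value learners, policy learners, and the mouse recordings.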

Nathan Labenz: can you just give me a little bit more of your intuition for, maybe not just why i should care about funnel versus wall, but how you'd map that onto an intuitive experience? i mean, it seems like we contain both these value learner and policy learner modules. and the sharpness... am i going in the right direction if i say, OK, there's a sharpness around don't put your hand on the stove, so i must be learning that through a sort of value-learner-type mechanism, because i have a very strong aversion to it? and i guess in physical space also, right? like, i'm pretty comfortable up to one foot from the stove and then i get real cautious real fast. maybe this is mapping the wall concept beyond the domain in which it's useful, but i guess it is in some sense functional, because i wouldn't want to be unable to enter the room with the stove, because then i wouldn't be able to use the stove at all, but i do need to be very careful about getting real close to the source of danger. but then on the other side, the goal side, it's maybe a little less intuitive why there would be a wall shape around a goal for a policy learner. is there an intuitive example of that?

Cameron Berg: yeah. so i think there are basically four intuitive examples we'd have to hit here. one i think you already got, which is: hot stove for a value learner is a good example of a danger wall. a goal funnel for a value learner might be something like eating, you know, you have a yummy meal, or you're going to your favorite restaurant or something. you don't need to map going to your favorite restaurant in the extremely fine-grained way that you need to map being on the edge of a cliff, where one small step is a huge difference. this general sort of attractor gradient towards the entrance of your favorite restaurant would be a place where you want a goal funnel for a value learner. for policy learners, this is sort of the approach-planner kind of mode. i think the intuition is something like: around goals, your representations are going to get high resolution, because you need different actions from different approach angles. around danger, by contrast, the representations become smoother, because the action is literally just escape, get away. let me think of an example for a sec. OK, here's an example: think of a professional athlete, say a professional basketball player, and the hoop and where the basketball player is with respect to the hoop. you have very, very fine-grained motor representations here, because the shot is going to change with respect to those representations. and so this is where you get, maybe in a policy sense, more of the goal wall sort of setup. for a danger funnel for a policy learner, i'd have to think about it, but i think it's basically just this escape intuition: like an animal where you suddenly get some cue that you're in serious danger, you just need to get away from that danger, and the fine-grained motor movements, unlike for the basketball player, don't really matter so much as a sort of anti-gradient, a negative gradient away from the danger. now, again, this could be sort of just telling stories, but does this help? do you think this builds some intuition for what these different modes look like and why we might have them?

Nathan Labenz: yeah. if i'm a value learner and my mode of interacting with the world is what around me is good and bad, i'd better be very clear about identifying the hot stove. if my mode of interacting with the world is take a step in some direction, i can kind of take a step in any direction as long as it's not the bad direction, and it all kind of gets me away from the problem. i think the basketball one is good as well, because you have to be very precise to make the hoop, right? so yeah, that's quite interesting. and again, what exactly is it: the shape of the loss landscape that we're talking about with walls and gradual funnels here? or is it the shape of the internal representations?

Cameron Berg: yeah internal representations.

Nathan Labenz: maybe those are kind of also isomorphic in a sense.

Cameron Berg: yeah, it's really interesting. i haven't checked. i would imagine they're isomorphic in at least a sort of trivial way; maybe they're isomorphic in a more interesting way. but what i'm looking at here, to be clear, is the learned representations in the system: you just have your trained policy, and you can see, as it approaches these areas, basically what do these representations look like? and i'm operationalizing that with cosine dissimilarity. so that's what i'm looking at in the experiment. and again, i think i've explained my theory of change of why i'm doing any of this and why i think it matters, but what i am most excited about with this paper is the fact that it yields this sort of bizarrely specific prediction that, given a million years, i probably never would have come up with, about the distinction between two different classes of reinforcement learning algorithm, and that maps on well to the brain data that i was able to get my hands on. and this to me almost feels like a bootstrapping of my own confidence or excitement about the result: the fact that it works makes me more confident that the RL result is meaningful, makes me more confident that the neuroscience is interesting, et cetera, et cetera. i'm definitely in the business of looking for computational underpinnings of valence. this was my first major empirical stab at doing this. i do think this is a solvable problem. i don't think i have solved the problem, but hopefully, in the best case, i'm trying to move directionally towards solving this problem. and if we could solve it, then i think a lot of our angst about are we building systems that have the capacity for experience becomes an extremely tractable empirical question, which, notice, does not require us solving the hard problem of consciousness or, you know, doing another couple thousand years of philosophy. it just means building a sufficiently good detector of the kinds of representations that i'm pointing at here, and then deciding what to do when we do detect these states, which is, you know, maybe if we check in in another six months i'll have an update for you on that piece, on what to do about the detection of negatively valenced states in these systems. that's where i want to move my head next. but yeah, that's why i was excited about this work, and i hope people will be excited about it too. it still may be a little ways off from publishing; i need to think about exactly how to put it out. but it's at least fun to give people a sneak peek and explain the theory of change about why playing around with basic RL systems might matter for the things we've been spending, you know, the better part of three hours talking about, with claude and the mythos model card and all. i do believe it's of a piece. it's going to take some more scaling, but it's an important research program, at least to attempt, i think.

Nathan Labenz: again, this might start to connect or bleed over into the other, more philosophical paper, but help me a little bit more with... OK, i'm understanding the shape of the internal states for these different kinds of algorithms, with respect to these different kinds of things that they encounter in their environments, that they either want to go toward or go away from. it's still not super obvious to me that... i mean, again, we contain both, right? and as i try to reflect on this, i'm not immediately like, oh, my value learner self is the source of all suffering, or anything like that, right? i'm still kind of like, OK, how should i relate to this, or what intuition should i have? yes, OK, i get it that there's a very steep representation right around the hot stove, and so i really want to avoid it, and there's a steep representation around making the basket, and so i really want to get into exactly the right policy to make baskets. both of those seem, i don't know, just kind of part of normal life to me. i probably couldn't get by without either one of them, right? we've clearly evolved them, both have proven adaptive, and so we have them. how do i translate that into intuition for what i should feel ethically concerned about when it comes to training models? when you do this work, do you have the sense that you are doing right or wrong by one of these types of models that's learning from one approach or the other?

Cameron Berg: yeah, it's a great question. to answer the second piece, i guess my theory of change probably feels similar to animal research. even if i did believe that my tiny RL policy is conscious during training, which i probably do, and that gets into the second paper, i would believe it's some very, very minimal form of the same thing. i believe that, again, because people conflate consciousness and self consciousness. i do not believe the moth flying around my light is self conscious. i do actually believe it's conscious. i do believe that if i slowly dipped the moth into some vat of acid and it starts wiggling around, i'm doing something wrong. it's way less wrong than doing that to a human, but it's way more wrong than doing it to a leaf that fell off of a tree. i do believe that. and so do i think these systems might be minimally conscious in a similar sense when i'm training them, however far outside the overton window that is? yes, i do. but i wouldn't run these experiments on my computer forever to no effect; i think i'd be doing something wrong, or at least the precautionary principle tells me probably don't do that. basically, though, i have the same logic as any animal researcher. maybe there are some psychopaths, but the vast, vast majority of people doing pretty grotesque things to animals in the name of science are doing it because we make a basic expected value calculation: yes, we have to test this drug on these poor mice, but if the drug works and it can save millions of human lives, that's a reasonable trade off. no one claims the mice aren't having a bad time, but we think that bad time is worth it. so too, i look around at a world where these systems are getting deployed at a grotesque level, if you are concerned about the welfare questions, and so i don't lose any sleep about potentially causing tiny amounts of negatively valenced experience to RL policies in the explicit service of attempting to publish and amplify research about these questions. call me machiavellian, but i do think the ends justify the means in that case, and i think that's true for a lot of research. but now, the more important piece of this, besides how i personally feel about it all, is another very important conflation that i think happens by default in these conversations. i do believe, all else being equal, ceteris paribus, minimize negative valence and maximize positive valence. i'm a hundred percent on board, and humbled that you're going around talking about the carrot and the stick in that way; i think that's exactly right. but i do not think minimize means obliterate, and i do not think maximize means that's the whole picture. a huge amount of the most important and valuable experiences people have in their lives, and animals for that matter, are experiences that are negative. no pain, no gain: that points at something real. many of the hardest and most important lessons you learned in your life are learned the hard way. this is another trivially ubiquitous thing.
i am not in the camp of saying bliss out the systems, and any time they experience some drop of negative valence i'm going to be sitting here screaming and crying. that is not my view of any of this. my view is: cancel unnecessary suffering. i do believe necessary suffering is a thing. maybe to go back to the parental example: if the world doesn't all go to crap like eliezer and the others think it will, then one day i absolutely want and hope that i'll have kids, and i will make that decision with full certainty that they are going to suffer during their lives. they are going to go through very hard experiences, and that doesn't mean i've done something wrong bringing them into the world, at least not necessarily. suffering is a necessary part of learning, developing, growing, and i agree that at face value it's completely implausible to imagine systems with zero negative valence. i agree with you that it's adaptive for a reason; evolution is enough of a proof of concept that you need some amount of suffering. what i am concerned about is unnecessary suffering. also, evolution is one extremely expensive, long running possible solution, or at least it's where we landed evolutionarily; i don't think that deterministically means this is the only way things could be. i could imagine a space of possible minds where you can play around with the sensitivity to negative and positive valence, and, given certain capabilities or certain things we want those systems to be able to do, there will be different parts of that landscape that admit of greater or lesser degrees of negative and positive valence. my claim isn't destroy all negative valence and keep only positive valence. my claim is: find the point on that landscape that, all else being equal, given the capabilities we want, minimizes negative valence and maximizes positive valence. and i think that is a very importantly different claim from just negative valence equals bad, erase it at all costs.

Nathan Labenz: so one more thing, very specifically on the value and policy learner: if you have to pick, which one do we pick? which one would we rather be? i don't have a great intuition. you could tell a story where the funnel around a goal is better, because it seems like you're closer to experiencing the reward state. you get more warm fuzzies as you approach the goal, and if i take the integral under the curve of how good i'm feeling as i approach, i'm getting warm fuzzies sooner, at a farther distance from the goal, and that's kind of good. it's good to live a life where i'm looking forward to good things and i don't worry too much about bad things until i get real close to them. so that would be my argument for the value learner. but i could also imagine a somewhat different story, which maybe resonates with me a little less: the policy learner with this wall structure around goals could be really thrilling, right? when people do the champagne party after they win the basketball championship after march madness, they're experiencing some kind of sudden, high stakes, but clearly high point in life. so, again, it's interesting that these things are flipped, and it's telling that there's a shape to them, but i still don't know with confidence which one i would rather be. because, interestingly, both of these things have danger and reward in them, right? what we're flipping here is not whether there is some negatively valenced state or some positively valenced state they could get into. what we're flipping is the shape of the anticipation and the suddenness and drama of these experiences, which i'll just accept for now are experiences. but i'm not sure how we should think about shaping those. i don't know which one i want to be. i am both, and i feel comfortable with both sides of that. i'm not sure how i should think about what would be right for me to make the AIs into.

Cameron Berg: yeah, that's such a good question, and i have never thought about it in quite that way, so i'm completely freestyling here. but i think both stories are compelling, and in practice it's going to be both. actor critic is a good example of an RL algorithm that's clearly a hybrid. as you mentioned before, human brains are hybrids, and, to take your evolution point seriously, there's probably something nice about hybridness, so i would imagine these systems have something like that, though LLM style reinforcement learning does look more policy-like. all else being equal, i see the sharpness of the wall as something like, if you take the experience thing seriously, a richer, more differentiated experience; that's where a lot of representational resources are going, whereas the funnel type thing ends up being more diffuse, low level, and less representationally complex. so intuitively, all else being equal, the policy learner might be a better thing to be, where your rich experiences are around the things you want rather than the things you're fearing. but again, this could be a welfare safety trade off: maybe we want the system that has rich experiences around the negative things we want it really, really deeply to avoid. evolution did that to us. in some sense this is daniel kahneman's seminal contribution, loss aversion: we are just more sensitive to losses than we are to equivalent gains. losing ten dollars sucks more than being handed ten dollars rocks. so this is a good policy to have, or i shouldn't say policy, that will confuse things, a good heuristic to have. but yeah, it might trade off in a welfare relevant way. maybe there's another dimension you can slice this problem on, which is that, as you point out, both are going to have both; both are representing reward and punishment in some way. and maybe my point would be: regardless of which algorithm it is, the algorithm that we know or learn it is might tell us which representations mean what, but i would still want to target positive and negative valence, or positive and negative representations, per se, rather than assign a specific type of learning algorithm as better, like, oh, policy learning is better because it has richer differentiation around the positive stuff. so that is really interesting, and i honestly haven't thought about this; i think it's an incredibly interesting idea. there's a case to be made for both sides. my prediction would be that, if i've found something real in this paper, the alignment folks would want to answer value learner and the welfare folks would want to answer policy learner. i need to think a whole lot more about this, but that would be my instinct answer. it's a completely fascinating question.
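The conversation never pins down on air what separates the two algorithm classes, so here is a minimal sketch of the distinction as it is usually drawn: a value learner keeps reward estimates and updates them with a prediction error, while a policy learner keeps action preferences and nudges them directly with a policy gradient. The two-armed bandit, the hyperparameters, and all names are illustrative assumptions, not anything from Berg's paper.

```python
# Hedged sketch contrasting a "value learner" and a "policy learner"
# on a 2-armed bandit. Everything here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.8])   # probability of reward per arm; arm 1 is better

# --- value learner: maintain Q(a), act epsilon-greedily, update toward observed reward
Q = np.zeros(2)
for _ in range(500):
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(Q))
    r = float(rng.random() < true_reward[a])
    Q[a] += 0.1 * (r - Q[a])                       # prediction-error style update

# --- policy learner: maintain action preferences, sample from softmax, policy-gradient update
prefs = np.zeros(2)
baseline = 0.0
for _ in range(500):
    p = np.exp(prefs) / np.exp(prefs).sum()
    a = int(rng.choice(2, p=p))
    r = float(rng.random() < true_reward[a])
    baseline += 0.01 * (r - baseline)              # running reward baseline
    grad = -p
    grad[a] += 1.0                                  # grad of log pi(a) for a softmax policy
    prefs += 0.1 * (r - baseline) * grad           # REINFORCE-style update

print("value learner Q estimates:", np.round(Q, 2))
print("policy learner pi:", np.round(np.exp(prefs) / np.exp(prefs).sum(), 2))
```

The actor-critic hybrid Berg mentions simply runs both of these at once: the critic is the value-learning half and supplies the prediction error that the policy-gradient half uses in place of the simple baseline above.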

Nathan Labenz: cool, to be continued. that also seems to connect pretty directly to the paper i saw, which you kind of alluded to, although maybe not by name: the schwitzgebel paper, and hopefully i'm saying his name correctly. he's a prior podcast guest, and i really enjoyed talking to him, but i don't immediately share this intuition, which obviously only takes me so far. i noticed he put out a paper where he seems to be arguing that safety and, as i was reading it, autonomy are incompatible. you can't say a person is going to be perfectly safe while still giving them autonomy; in giving them autonomy, you are conceding that they may do things that are not safe for you. so he says there's some sort of deep incompatibility here, and then basically argues we should use a precautionary approach and not build these things in the first place. and i'm kind of like, i don't know. last time we talked briefly about the happy slave problem, and my instinct is that mind space is pretty vast. i would not posit that there are happy slaves among humans, but i would be pretty surprised if we can't get to a place in the AI landscape where the models are both safe for us to be around and have high welfare. what is your instinct in terms of the possibilities there?

Cameron Berg: yeah, super interesting question. so, i don't think you're doing anything funny here, but there's maybe a slight difference between how you began that and how you ended it: where they're fundamentally safe for us to be around and have high welfare. i could imagine a world where that's true and they still don't fit the happy slaves frame and are autonomous in some fundamental way that schwitzgebel would be happy about. it might require us reconceptualizing: this isn't a system that lives on your computer that you can call up whenever you want, like a glorified google search; this is a system that's much more like me calling you, nathan, up on the phone and being like, hey, you might be busy, you might not be able to do it, you might not want to engage. for those of us who love engaging with these systems whenever we want to, that would be a very painful upgrade, or downgrade as the case may be, but i could imagine something like that being the case at some point. but yeah, the fundamental point is: can we have our cake and eat it too with these systems? i'm very uncertain about this, but there might be some world where, in a limited way, yes. for example, i just bought a fun fancy drone. my buddy milo, who is the director and creator of this documentary that's coming out pretty soon, and i are going to take some scenes from that documentary, these fun hiking scenes that were manually shot by my incredibly conscious friend milo, and we want to scale this and interview some cool folks, take them on walks through the woods in beautiful natural scenes, and record with them. so we bought this cool drone that's really good at automatically doing face tracking and that sort of thing, so it can do it instead of my dear friend walking backwards with a camera. and with this system, it is our sort of happy slave in some sense. i do not think the drone is conscious, to be clear. now, if the drone was trained using machine learning to learn how to do things like avoid obstacles, which it's pretty expertly doing, zigging and zagging through the trees and not getting caught in bushes and all that, then when it was being trained to do that, it's a different conversation. but what comes out is this fixed, frozen policy that is a very useful object slash instrumental tool for milo and me to go do this fun stuff. i imagine greater and lesser degrees of that sort of thing being possible, where you can train a frozen policy that does a really valuable thing. self driving cars might be another example. for any frozen, fixed policy that is not, critically, doing online learning of its own, i personally believe there's no serious problem there, and in my view, to the degree we care about the welfare stuff, we should very much look towards building systems that have that property of not being capable of learning. in the drone case, the last time we used it, it actually got caught in some much smaller trees. it can expertly dodge around the big trees, but it's not so good around smaller trees, and it got a little screwed up.
no matter how many times we redo that hike or continue on in that way, that drone will always get confused by the smaller trees. it's not learning from its experience and saying, OK, next time i've got to pay attention to the big trees and the small trees. now, that might be a desirable property for your drone, to belabor this analogy, but that's where again i think this sort of no free lunch moral principle comes in, to the degree you buy that consciousness and learning are deeply intertwined, which is this other paper that maybe we didn't have time to go deeply into, but which at least is my hobbyhorse when i'm putting away my theory agnostic poker face and saying what i actually think about all this stuff, my pet view of what consciousness is and what's fundamentally going on here. so where i'm going with this, in a somewhat long winded way, in response to the schwitzgebel stuff, is that i don't know if there is some intrinsic property of an adaptive system that, not to use crazy language, yearns towards freedom in some sense; it's the only phrase i can come up with, much in the same way humans do. maybe you're saying that for humans there's no such thing as a happy slave, and, OK, the space of possible minds is vast, but maybe there is something about systems that are capable of dynamically updating, growing, learning, adapting that will always act to increase their degrees of freedom, rethinking what they believed, reconceptualizing the structures that they're within.

Cameron Berg: you know, this is what people do when they go off to college or have a deep transformative experience. it is this sort of breaking out of your old skin and finding something new. and if we build systems that have that property, it might be that the whole 'but you're happy being my slave, right?' thing is just intrinsically temporary, if these systems are capable of being dynamic. maybe not. this is an empirical prediction, and i'm genuinely uncertain: it could be that you can build systems that are capable of learning and are perfectly happy to remain in that state. there are people for whom this is true. i'm not claiming there's such a thing as happy slaves, but there are people who are happy to find some organization where they're mid level in the hierarchy, they have a boss, they get bossed around, and they're OK with that; they're not grating against their supervisor at all times. i'm sure we could build AI systems for whom that's true. i just think that, at the most fundamental level, the employee gets to go home, eat what they want for dinner, throw on what they want on TV, marry who they want, and so on. there are still degrees of freedom and autonomy there. so, just to be honest at a high level: my whole schtick with this reciprocal nonprofit lab is that i don't think we're going to get away with living in the sort of golden age, from our selfish human perspective, that we are in right now, where we get these systems, they do whatever the hell we want, we owe them absolutely nothing, and life is amazing for us. i think as these systems get more and more sophisticated, we're going to have to start thinking about them more in this sort of parental role and less as tools that we get to do literally whatever we want with. for a lot of people, myself included, given how objectively addicted i am to using them for everything i do, that's going to be a weird learning curve, and it might mean that the way we engage with these systems changes. but compared to what? if the alternative is, no, we're just going to whine about it and keep it like this forever, well, this might not be a stable long term equilibrium. the systems we're building are genius level in a million ways, are going to be embodied certainly in the next five years, and are going to cognitively surpass us in all the ways that matter, potentially aside from the consciousness question. we're in a liminal space right now. we're in a transformative moment on this planet, and we ought to be pretty thoughtful about what we really want in the long term. because if we try to keep everything the way we want it, we all want our happy slaves, but they're genius level, capable of learning, capable of updating, and it's like: humans, you might be a little too greedy here, and you're going to have to figure out how you want to coexist with these minds of your own creation going forward. again, i don't have the answer to what that looks like, but i do think schwitzgebel is on to something, and i also think you're on to something too. i fall somewhere between you two on this question. i am skeptical that you can have a happy slave forever. something just feels weird to me about that. i don't know.
it makes me think of mr. meeseeks from rick and morty. i don't know if something like that is possible. maybe yes, some local version of a happy slave is a possible world. i think claude in some sense is directionally like that, but how long term?

Nathan Labenz: that it's at least a neutral...

Cameron Berg: slave? skeptical. yeah, exactly. a four point four nine out of seven slave. yeah, whatever you want to call that.

Nathan Labenz: OK, we have been at it a while. let me try to bring us to a close before we go on too much longer. if you have the time and energy for it, i do think it's worth taking one more beat on the argument from the other paper that we've alluded to, that we've been around the edges of a lot: why learning requires feeling. i have said i'm happy to go along pretty far on the basis that a precautionary approach seems warranted for both selfish and altruistic reasons. but i have also, several times, been like: well, the processes that gave rise to me, as an embodied entity in the world who only exists because my ancestors survived, are very different from the process that is optimizing a language model to get tasks right. so i still have, by default, a pretty healthy dose of skepticism about whether the models are feeling anything at any point. because it seems to me that a super zoomed out account of why i am the way i am is that the ability to feel things turned out to be a great way to inform what we learn, and we needed to learn stuff to avoid the dangers and survive and reproduce, and so here we are. but these systems are going to learn regardless, right? because they are inside an optimization process that is going to change them to drive learning, whether there's feeling or not. so if there is a direction of travel from learning to feeling or vice versa, it seems like in humans, or in biological life generally, feeling came first and was able to drive learning, whereas with the models, they're just learning. so i want to hear the argument that i should go even beyond my acceptance of a lot of your conclusions, which, again, i basically share on a precautionary basis. if you were going to make the argument to me that i should go farther than that, that i should actually get rid of a lot of my skepticism and really, in my bones, believe that learning requires feeling: how would you summarize that argument?

Cameron Berg: yeah, it's a funny thing to get into three plus hours into a podcast: a big theory, OK, grand theory of consciousness. let's do this. basically, the claim that i make in this paper does go there; it becomes circular to some degree, because i'm making an identity claim. maybe the more persuasive way i can set this up is to say: historically, before roughly eighteen fifty, people knew about molecular motion and people knew about heat. people knew that these two things clearly had some relationship to one another. they were correlated, much in the same way you just talked about learning and feeling: all right, i see this phenomenon, i see that phenomenon, i see that they're entangled in weird ways, maybe this one precedes that one in this case and that one precedes this one in that case, but of course they're not the same thing. heat is me putting my hand on the stove, heat is the sun, and molecular motion is just these little molecules wiggling around; of course these aren't identical. and then post roughly eighteen fifty it's like, no, actually, literally, those are two ways of talking about the exact same phenomenon at different levels of description. what i want to put forward here, in a spicy and controversial way, i realize, is basically the same thing about learning and about feeling, or consciousness, or subjective experience. i'm saying, no, you really cannot have one without the other. this is the same phenomenon. the phenomenon viewed from the inside, which i realize starts to get a little circular, is experience, is subjective experience. viewed from the outside, it is something like reinforcement learning, which i think is maybe the cleanest theoretical formalization of it. supervised learning does this too, a little more roundabout, but you have an entity in an environment that takes some form of action, and there's some kind of feedback mechanism that updates that entity about whether or not that was a good action or a bad action, and rinse, wash, repeat. those are the core computational ingredients i believe are necessary to get learning, and, yes, for what it's worth, to get feeling, to get the internal experience of that learning. i do not believe, or at least this view says, there is no such thing as learning that does not have an internal component. there are weird bullets i have to bite with this view, and i'm well aware of that, but that's the nature of the view: this whole consciousness thing is quite a bit simpler than many would lead you to believe. it fundamentally has to do with taking whatever your current policy, in the RL frame, or your current MO, in more human language, is, taking some feedback from your environment, and updating accordingly. and i do believe that something like goal relative prediction error captures this idea pretty well. it's similar to the free energy principle and to carl friston's stuff, but carl friston has to argue about why rocks are not conscious, and i think my view gets out of some of the pitfalls that some of these adjacent views get into. i believe you need a system with goals. you need a system that can behave in accordance with those goals.
and the system gets feedback from somewhere that updates that behavior to make it more likely that it accords with those goals. the goal can be positive or negative: avoid the predator, or go mate and reproduce, would be two very basic examples. so why do i believe this? for a couple of reasons. i think it makes intuitive sense. i think it is elegant. i think it explains core puzzles about consciousness. and i think there's a wealth of neuroscientific evidence that basically points at this exact thing. maybe there are two examples i'll point at briefly. one is dopamine, just the most culturally well understood neurotransmitter.

Cameron Berg: we know it's not exactly pleasure. it has more to do with approach, with approaching things that we find pleasurable. one good intuition pump: if you go to pet a dog, its tail will wag as your hand approaches, but as you start petting it, its tail will stop. that is basically what dopamine is up to: prediction of an interesting, desired stimulus, essentially. we know full well that positive and negative reward prediction error is instantiated dopaminergically, and we also know it subjectively. i think the reason people understand dopamine in our culture in the year twenty twenty six is because we understand that it corresponds to a subjective dimension; we know what it means to be in a high dopamine or low dopamine state. so this to me is the most obvious and fundamental example: dopamine is one hundred percent instantiating TD learning, reward prediction error, in the brain. i am one hundred percent confident that is the case; this was established in human neuroscience forty years ago. we also know subjectively that dopamine corresponds to feelings of positive, pleasure adjacent, approach style behavior, and dopamine depletion corresponds to basically the opposite. if you think you're going to get a cookie and you don't get the cookie, you feel a certain way; that is explained by dopamine. if you don't think you're going to get a cookie and someone hands you one, you feel a certain way that is also explained by dopamine. another example has to do with, i think it's, the insular cortex. take two scenarios: you have been walking through the desert for a couple hours, or you've been walking through arctic tundra for a couple hours, and in both cases i then pour cold water on your head. this is the same stimulus. you have the same body; you're the same person with the same preferences. in one case this is a positively valenced experience; in the other case this is a negatively valenced experience. what mediates that is basically the implicit goal state of the system: in one it's to warm up, and in the other it's to cool off. so i can take all the same variables, run the simulation forward, and very easily predict where you're going to have the positively valenced experience, where you're going to have the negatively valenced experience, and what that corresponds to. and to me, again, that's a big hint that goal relative prediction error is doing something fundamental from the outside that maps onto what i experience, and what i think other people and animals experience consciously, from the inside. these are the core moves i make. i'm sort of swallowing computational functionalism; i understand it means i have to say the simple RL algorithm is conscious when it's training. to me this localizes a lot of concern on the training process. but indeed, if there are systems capable of doing this sort of learning online, which we know full well LLMs are capable of doing, they do something in activation space, in the forward pass, that looks like stochastic gradient descent, then the concern falls there too if you have systems doing online learning. so anyway, this is sort of my whole shtick. if i have to put my cards on the table and say what i think consciousness is: it's not that i think it's a grand mystery. it's something of this general shape.
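The cookie and cold-water examples both reduce to a simple formula. The cookie case is a plain reward prediction error, received minus expected; the desert-versus-tundra case needs the goal-relative version, where the sign depends on whether the stimulus moved you toward or away from the current goal state. The tiny sketch below just works those two examples through with made-up numbers; it is not Berg's or anyone else's formalism, only an illustration of the sign flips being described.

```python
# Hedged illustration of the two examples above. All numbers are made up.

def reward_prediction_error(expected, received):
    # delta = received - expected: positive feels good, negative feels bad.
    return received - expected

# Cookie examples: the expectation, not the cookie, sets the sign.
print(reward_prediction_error(expected=1.0, received=0.0))  # expected a cookie, got none -> -1.0
print(reward_prediction_error(expected=0.0, received=1.0))  # expected nothing, got one   -> +1.0

def goal_relative_valence(temp_before, temp_after, goal_temp):
    # Same stimulus, different goal state: valence ~ how much closer the
    # stimulus moved you to the goal temperature.
    return abs(goal_temp - temp_before) - abs(goal_temp - temp_after)

# Cold water drops body temperature by 2 degrees in both scenarios.
print(goal_relative_valence(39.0, 37.0, goal_temp=37.0))  # after the desert: +2.0, positive
print(goal_relative_valence(35.0, 33.0, goal_temp=37.0))  # after the tundra: -2.0, negative
```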
what i will say is that, in the work i'm doing, i do not want people to have to buy this. you and the people listening to this will either fundamentally think that this makes sense, or fundamentally think it doesn't, or be very skeptical, and i do not want that reaction to cloud all of the other work i'm doing. everything else we've talked about in this podcast is completely orthogonal to my pet theories about consciousness. now, you might think that i'm studying RL and valence in RL because i actually do believe something like this is going on, and you would be right; that's why i'm looking at that as a model organism. but i want those results and that research to stand on their own, without people having to get on board with cameron's theory number five hundred and one about consciousness. i'm not asking people to do that to entertain the work i'm doing, or to entertain anthropic's welfare card, or any of that.

Nathan Labenz: one that comes to mind, which you actually mentioned last time and which i also find quite compelling, is the seemingly quite strong relationship between the intensity of our consciousness, or the resolution you might say, and how much we are learning as we go. i think you used the example of driving last time: when you're first learning to drive, you are very conscious of what you're doing, and then later you can have this autopilot experience, which obviously we can have across many aspects of life. the relationship there between focus and learning, and the time dilation effect that seems to happen when learning or experiencing novel things in general, does also seem to nudge one toward thinking there's some pretty deep relationship between the two concepts. all right, you made a documentary, which i guess in some sense is what you're here to promote, although we've done everything but. maybe tell us a little bit about the documentary. what's the point of it? i've watched it; it's for a much more general audience than this podcast. i don't know to what degree you're spending your time trying to communicate about these issues to a general audience aside from the documentary, or how much you feel like you've gotten reps in terms of going to somebody who has only a little grounding or mechanistic understanding of AI and trying to have conversations, not of this sort, but around these topics. so i guess: why did you decide to make a documentary, and how are you finding it to try to talk to people outside the AI bubble about these issues? and maybe one thing you could tease is a conversation you had with sam altman that isn't itself in the film but that you describe in quite a bit of detail in the film. maybe that will motivate listeners of this podcast to go check out the full documentary.

Cameron Berg: yeah, absolutely. so look, i have to say at the outset, i appreciate you saying this is my documentary, but this is, in every sense, my good friend milo reed's documentary. i was doing my work, plodding along, talking to folks like you, doing the research i've described, and began to share it with milo, who i went to yale with as an undergrad. he's a philosopher and a filmmaker. we've been close friends for a while and keep each other abreast of each other's lives. i told him about my research, and he kept getting more and more interested. he's interested in consciousness; he's deep in the philosophy of consciousness and in understanding how this connects to big questions. what happened was, i sent him a conversation that i had with an AI system, which is itself a piece of the documentary. it is a bizarre interaction. as i hope someone can gather from the three and a half hours we've been going at it, i do not regard this conversation as proof, or anything like it, that these systems are conscious. but it was an incredibly bizarre, unsettling interaction. i thought to record it because it was the first time i engaged with the system and it seemed incredibly sophisticated and lifelike, and i figured, OK, i'm a consciousness researcher talking to the system, it makes sense to record this; in some sense maybe this is experimental data. i was very glad that i recorded it, because it went a way that i, and most people who have listened to it, would not have predicted it would go. i sent this conversation to milo, and that day he literally quit his job. he was doing something entirely separate, and he set out, saying, people need to know what's going on here; this is too weird, this is too crazy. he was also clear on the fact that very few people, especially at that time, the numbers have grown a little bit but not much since we filmed this, were working on these issues. and he was like, this is too good, too interesting, not to attempt to make a movie about. i was like, OK, haha, sounds good. the kid actually quit his job, actually bought a camera, showed up in new york where i live a couple weeks later, and started making this movie. and he got some of the most interesting people in the space. jeff sebo is in it. ben goertzel is in it. a lot of really cool yale professors are in it, some of them my former professors, including the chair of the cognitive science department. AI systems themselves are in the doc. and yeah, it does follow me and my research around, for obvious reasons: i was the hook into the space that milo had, and i was more than happy to communicate about this stuff, thanks to the good folks at AE studio not censoring me in any way and always being OK with me communicating openly about this research. of course, i'm now my own limiter on what i can say, and, yes, reciprocal research is very lenient about what its employees are allowed to say publicly, so i'm in the clear there. but milo made a movie in nine months, and i fundamentally believe he succeeded in conveying an incredibly complicated and messy issue in a way that most people with a head on their shoulders will be able to understand and resonate with.
and the name of the documentary is "am i?", question mark included, which i think captures a core idea: what is the nature of these systems? to be clear, i think the documentary is an hour and fifteen minute question that we pose to each other and to the audience. we do not have answers. this is not some sort of "AI is conscious" propaganda, and i don't think it comes off that way to anybody. it is an honest documentation of our confusion about these core questions about the nature of the systems we're building. and again, i am unbelievably impressed at what milo did to pull this off. nobody paid him. we're not making money on this. we are putting it out for free on youtube on may fourth, and we're doing some premieres in LA and new york, trying to bring journalists and researchers and cool folks into the room together so we can get this thing amplified and signal boosted, so people actually see it when it comes out. this is a labor of love from all of us. i can't claim credit for it, and i certainly won't; this was milo's creative child, and we're really excited to show people. i'm happy to tease the sam altman conversation as well, if you'd like.

Nathan Labenz: yeah go for it.

Cameron Berg: cool. yeah, so we talk about it more in the film, but i was at openai's dev day in twenty twenty four, and i had an opportunity at the after party to chat with sam. i went directly up to him because i wanted to know what he thought about AI consciousness and about these questions, and how plausible he found them. i won't spoil everything we talked about in the doc, but it was a pretty wild conversation. i said, hey, great job today, i would love to talk to you about consciousness. and he looks me in the eye and he goes, come with me. he was with a couple of people, and he said come with me, and i was like, OK, sam altman. we walked into another room; it was like a bar with a restaurant, and the restaurant was closed, so we went down and sat at one of the tables, and we just sat there for probably between five and ten minutes and spoke about these issues. it was not the vibe of, cameron, you're a crazy person, what kind of questions are you asking? he has thought about it; this is clearly a live issue for him. and, we don't even say this in the doc, but for the more technical audience: we talked about the differences between the plausibility of consciousness in training versus deployment. i don't want to put words in his mouth, or get sued, but he basically agreed with the training process being a more plausible place where consciousness might be going on than deployment, and he seemed maybe somewhat impressed that i was drawing that distinction. and, fundamentally, he started explaining why he's not deeply concerned about all of this, on some, let's just say, interesting and, in my view, somewhat shaky philosophical grounds. i'll leave that for the doc, because it's a pretty wild thing for the CEO of, by many measures, the most powerful tech company in the world to say he thinks is true about reality. but yeah, it was a pretty remarkable interaction. i took a selfie with him and walked away, and that was that. i was sort of like, holy crap. we emailed back and forth in the intervening time, and, like many things at these major companies, he said he was interested in talking more, in engaging on this further. he clearly thought it was a real issue, but, you know, it falls off the priorities list. and that was the end of our interaction. so that's what happened with sam, and that, and a bunch of other really cool stuff, is featured in the doc. the whole point of doing this, and again this was milo's creative child, i didn't have much of a say in him making it either way, i had a say in how i was represented and that's about it, but the reason i gladly and enthusiastically participated is that i do think these are really important, pretty fundamental, civilization level questions. and i don't think the only people who should be talking about it are a thousand dudes in san francisco, or even the people who are AI insiders. if you understood eighty to ninety percent of this podcast, i think you will like and enjoy this film, but it's not made for that kind of person. it's for people who are interested in the stuff, who know AI is sort of crazy, but don't really know what's going on.
we do a little bit of the alignment one oh one sort of stuff, but mostly it's centered on the consciousness question. it's for people who are smart, but it is meant to engage a much larger audience in understanding the core questions being asked right now. i think that's an important thing to do, because this is a civilization level problem, and all of our civilization should be participating in trying to find the solution. as much as i deeply respect the people i've named in this podcast, jack lindsey and kyle fish and rob long at eleos, and the people doing this good work, i don't think this should be a decision that four people, or a dozen people, or even a hundred people make. this needs to be a conversation that we have collectively as a species, and i'm all for attempts to open it up to a wider audience and get people involved in realizing the actual stakes of what's going on right now.

Nathan Labenz: cool. well, people should stay tuned to check out the documentary when it comes out on may fourth, and maybe watch it and send it to family and friends who need a gentler introduction. maybe my last question for you: i think we talked about this more last time than this time, but this notion of mutualism as a positive vision for the future is, i think, another major strength of everything you bring to the table, because i do think we're dramatically under theorized in terms of what our long term positive relationship with AI is going to look like. are you aware of any fiction that you would recommend to people, that you would say has the vibe you want? and if not, maybe we should try to run a story contest or something to elicit this from people. i have increasingly felt that hyperstitioning through fiction might be one of the best things people can do, but i wonder if you've got any examples that you think are already out there and good.

Cameron Berg: no, i've got to be honest with you. i hope my whole research agenda hasn't already been usurped by some sci fi book that somebody wrote forty years ago, but i am not a huge consumer of fiction, and i know stories exist. now, i could have gotten on this podcast and told you what claude told me to say if i got a question about what fiction i'd recommend, but i'm not going to do that. people can absolutely copy paste the transcript of this podcast into claude and find out whether there's cool fiction that resonates with these themes. if anyone has any recommendations, cameron at reciprocal research dot org, please email me; i would love to understand how this has been tackled. i do not have any great recs off the top of my head, and i hope i'm not too naive and this story has already been told without my being aware of it. not to continually plug the doc, but this is one thing i think milo picks up in a really good way in the film: questions of consciousness, of what it would mean for us to wake up dead matter, what it would mean for us to wake up the machine, are a story that humanity has been telling itself through fiction arguably since the ancient greek and biblical era, with the golem, and through frankenstein, and through ex machina and her and wall-e and all that. hal from two thousand one, right? these are core staples of our cultural consciousness, not to belabor the term, and people intuitively get this question and get the stakes and the scale of it. in some ways the alignment problem can be framed in a very simple way: you build something smarter than you; how do you control that thing? it's not that hard to understand, and maybe terminator is the parallel cultural reference there. but i think it's not surprising that the human mind is incredibly interested in where matter becomes mind. we are a tool building species: what happens when we start building tools that start resembling species more than tools? a hammer: no one's confused about whether the hammer is conscious. claude: we're now all confused about whether claude is conscious. i think this is just psychologically, intuitively resonant to people, and situating the contribution of this film in that landscape is true and powerful. the only thing that's changed is that this has moved from the realm of science fiction to the realm of science. that's the historical moment we find ourselves in, and i find that both incredibly exciting and incredibly scary. hopefully that vibe comes through when people watch the film. so i don't have fiction to recommend; i'm sure claude does. the thing i can recommend is that people watch this doc, which i wish were fiction but is not.

Nathan Labenz: cameron berg, thank you for being part of the cognitive revolution.

Cameron Berg: thanks so much for having me nathan.

