Hume CEO Alan Cowen on Creating Emotionally Aware AI
In this episode, Nathan sits down with Alan Cowen, CEO and Chief Scientist at Hume AI, an emotional intelligence startup working on creating emotionally aware AI.
SPONSORS:
Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of ALL eCommerce in the US. And Shopify's the global force behind Allbirds, Rothy's, and Brooklinen, and millions of other entrepreneurs across 175 countries. From their all-in-one e-commerce platform to their in-person POS system – wherever and whatever you're selling, Shopify's got you covered. With free Shopify Magic, sell more with less effort by whipping up captivating content that converts – from blog posts to product descriptions – using AI. Sign up for a $1/month trial period: https://shopify.com/cognitive
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
X/SOCIAL
@labenz (Nathan)
@AlanCowen (Alan)
@eriktorenberg (Erik)
@CogRev_Podcast
LINKS:
Hume: https://hume.ai/
TIMESTAMPS:
(00:00) - Episode Preview
(00:04:52) - How do you define happiness? How can AI make people happy?
(00:08:40) - The striking experience of trying the Hume demo
(00:09:16) - Building multimodal models using facial expressions, vocal signals, speech patterns
(00:12:27) - How Hume’s models have an advantage over LLMs when it comes to interpreting emotions
(00:13:23) - Assembling diverse datasets of emotional judgments via surveys
(00:15:27) - Sponsor: Shopify
(00:22:00) - Across populations, what is common or different in people’s judgments of facial expressions
(00:26:00) - Interrater reliability for training data
(00:30:48) - Sponsor: Omneky
(00:33:20) - The unique labeling of “awe” across different cultures
(00:36:23) - Customizing models for cultural emotional expression norms
(00:41:13) - How they determined the set of emotions recognized by the model
(00:42:50) - Schadenfreude as a unique example of semantic space research
(00:49:00) - Custom models
(00:52:42) - Using Hume in B2B contexts
(00:59:49) - The cost comparison of having a human analyze emotions vs AI
Full Transcript
Alan Cowen (0:00) Already with a preexisting knowledge that gets it to well beyond human level. It should be, like, way better than any human on Earth at understanding emotional affordances just from the start, because it's already learned from way more data than any human on Earth has seen that informs our understanding of emotional affordances in everyday interactions. The ideal is that it is better at predicting how an action affects human well-being than the degree to which it will result in your bank account having a higher number, things like that. Because predicting well-being is a matter of predicting people's emotional expressions, their reactions, their states over long timescales, and that data exists. I think there's a path forward. I'm very much worried about it, but I'm also optimistic that
Nathan Labenz (0:43) we can solve the problem. Hello, and welcome to The Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost Erik Torenberg. Hello and welcome back to The Cognitive Revolution. Today, my guest is Alan Cowen, CEO and Chief Scientist at Hume AI, a research lab and technology company that describes itself as teaching AI to make people happy and thus paving the way for prosocial AI. Now if you've read your science fiction or even just studied the many challenges associated with the use of RLHF training to shape AI behavior, you'll immediately recognize that the interplay between human emotions and increasingly capable AI systems raises all sorts of profound questions. How do we know that we're accurately measuring human emotion, particularly across diverse populations and cultural contexts? Exactly what emotional response should we be optimizing for? How do we understand the relationship and trade-offs between short-term happiness and long-term well-being? And how do we maintain high-level control as AI systems begin to access not just our verbal and facial responses, but our brain states directly, potentially coming to understand us better in some ways than we understand ourselves? Such scenarios are no longer all that far-fetched. But in keeping with my general philosophy of trying to figure out what is before attempting to determine what we can do about it, I spent most of my time with Alan digging into the details of how he and the Hume team have built their platform, exploring the challenges inherent in collecting emotional response data and their approaches to overcoming them, delving into the nature of emotional measurement models that they've built for text, image, audio, and video, and understanding how they productize all this technology for customers today before finally zooming out to consider the big picture at the end. It's no exaggeration in my mind to say that the development of emotionally aware AI is one of the highest-stakes parts of the broader cognitive revolution. And I came away from this conversation very impressed, not only with the technology that Hume has built, but also by the depth and quality of thought that Alan and team have put into their approach. You can see the technology in action by watching the video version of this episode, on top of which we'll be layering real-time output from Hume's emotion recognition API. And you can also dig deeper into their philosophy at thehumeinitiative.org. The work there, of course, is nowhere near finished, but already they have established a 10-member ethics committee, developed 6 guiding principles for empathic AI, published best practices for measuring well-being across low-, medium-, and high-risk use cases, and articulated a list of both supported and unsupported use cases. As always, if you're finding value in the show, we ask that you take a moment to share it with friends. And, of course, we always appreciate a review on Apple Podcasts or Spotify. Now here is a deeply thought-provoking conversation with Alan Cowen of emotional intelligence startup Hume AI. Alan Cowen of Hume AI, welcome to the Cognitive Revolution. Thanks for having me. I am super excited to have you.
I think you are at one of the most interesting intersections of AI and humanity of any that I've come across. And, you know, it starts really for me just with the headline on the website, teaching AI to make people happy. You guys are building at Hume AI the AI toolkit to understand emotional expression and align technology to human well-being. I kind of wanna unpack this, you know, at every level, from the datasets that you have built to the models, to the products, to how people are using this. But maybe just to start off with a really big question, which is so fundamental: how do you define happiness, and how do you think about that in relation to longer-term well-being?
Alan Cowen (5:04) Yeah. I mean, so first, teaching AI to make people happy is really about well-being. Happiness is sort of the in-the-moment experience of positive emotions, and well-being is defined in a few ways. It's like that plus how you kind of reflect on your experiences of being happy, and that's called satisfaction with life. But whatever you're measuring, whether it's happiness, satisfaction, or positive emotions, you measure it the best you can and then you optimize for it. And so that's what our platform is trying to do. We give researchers and developers the tools to measure and optimize for human well-being. You have to measure it in a very multifaceted way. People value the richness of emotional experience. It's not a one-dimensional thing. It's not about just cute cat videos or funny things or just awe all the time. It's just this mix of things. Love is important. Positive surprise and excitement are important. And the idea that you can measure emotion, that's something I've spent a lot of years studying. I have about 40 papers on that and introduced a new theory about measuring emotion called semantic space theory. But the key theme is that emotion is just an essential component of all human interaction. And, yeah, whether that's a human or an AI, every word we say has, like, a tone to it with many, many dimensions that you can read, and people are using that to measure user experience and mental health and customer support outcomes and all of that. All of this ties into the broader picture of improving human well-being.
Nathan Labenz (6:35) It's a grand challenge, you know, for really all of human history, not just this AI era, but it certainly takes on another dimension now. One thing I would encourage people to do is check out the video version of this particular episode, because we are going to try to take the video and overlay a Hume emotional recognition layer onto the conversation that we're having. And this is technology that really works. And if you're gonna listen to the audio version and still wanna get a sense for the efficacy of the technology independent of this episode, you can also go to the website. Did you give me special access, or can anybody go and create an account and actually just try it live in the browser?
Alan Cowen (7:24) Yeah. Anybody can create an account. And what you'll see is our measures, and it's really difficult to actually put into words what expressions mean, so we do our best. The word that you see is really a representation of what somebody seeing that expression in isolation would evaluate it as meaning; it doesn't necessarily mean that in the given context that you're in, that's the best word for your expression. It's really just meant to be an objective measure. Then we have a custom model API to actually interpret what that means. So you'll see all of the nuances of the many dimensions of facial and vocal expression we're able to measure when you try our API. But really, when you want to put that into practice, you want to connect that to something. You want to connect it to user experience, mental health, some measure that you have, and that's where you really get value out of it. You're able to predict basically a lot more than you would with just language. That's the real value.
Nathan Labenz (8:18) Yeah. It's super interesting. I mean, even just the base demo on the website works quite well. You know, in a world where things are being announced and things are being previewed and things are being hyped and waitlisted, I was definitely struck by the fact that this is technology that you can go get your hands on and actually just look at yourself, essentially in the mirror, and see what the AI is understanding as it reads the expression on your face. And that is a pretty striking experience, at least it was for me. So let's kind of build this up from the ground up. Obviously, this is a multimodal system, and you have built a range of products that take in different modalities. So maybe we could just kind of run down the different modalities that you work with and, you know, give us a sense for how much of a difference that makes in terms of the ability for systems to understand what's going on.
Alan Cowen (9:16) So we're able to measure speech prosody, the tune, rhythm, and timbre of speech. This is something that every single word that you utter has some speech prosody to it. So you can think of the word token, the phonetics, the linguistic content, what you transcribe, as carrying a lot of dimensions of information that make their way into language models, but then that ignores all of this other information, which is in the prosody, so how you've spoken that word. And there's dozens of dimensions of prosody that we're able to capture. And if you supplement the language with the prosody, you can predict a lot of things more accurately, specifically things like outcomes of human interaction or things about human preferences and human well-being and mental health. So we can predict, in a given video of somebody reflecting on their past, whether in the moment they're experiencing depression. That's something we can predict way better by adding in that non-linguistic information. So we've got speech prosody, we've got vocal bursts, which are like laughs and sighs and screams and mm's and ah's. We've got facial expression, which is all of these different action units on your face, all these different muscles that you're moving. And then holistically, are people evaluating this as an expression of anger, contempt, disgust, sadness? Those are all negative versus awe, happiness, love, romance, adoration. These are all kind of related but really nuanced dimensions of expression that people can actually distinguish when you ask them. This is not just true within one culture; across many cultures people distinguish over 28 different dimensions of facial expression. So we're capturing all of that too. Each of those dimensions is a continuous dimension that varies over time, and that's true for prosody and vocal bursts. If you add up all that data, you're getting hundreds of dimensions per second that we're able to measure as somebody's speaking or just reacting to something or driving, whatever they're doing, and make predictions based on that data.
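To make the shape of that output a bit more concrete, here is a minimal, purely illustrative sketch of per-frame, multi-dimensional expression scores and one way to collapse them into a summary. The dimension names and data structure are stand-ins chosen for the example, not the actual Hume API schema.

```python
# Illustrative only: a toy representation of the kind of per-frame,
# multi-dimensional expression scores described above.
from statistics import mean

frames = [
    # one dict of dimension -> score per sampled frame (e.g. ~3 fps)
    {"amusement": 0.62, "awe": 0.08, "contempt": 0.03, "tiredness": 0.11},
    {"amusement": 0.55, "awe": 0.12, "contempt": 0.02, "tiredness": 0.15},
    {"amusement": 0.20, "awe": 0.05, "contempt": 0.01, "tiredness": 0.48},
]

def summarize(frames):
    """Collapse a per-frame time series into per-dimension mean and peak."""
    dims = frames[0].keys()
    return {
        d: {"mean": mean(f[d] for f in frames), "peak": max(f[d] for f in frames)}
        for d in dims
    }

print(summarize(frames))
```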
Nathan Labenz (11:26) I was struck, in just kinda looking at some of the materials, by how much of a difference it makes. For one thing, you mentioned driving: adding in the ability to look at the face and to hear the voice beyond just the language. If I'm understanding your materials right, it makes the difference between essentially being unable to tell whether the driver is drowsy and being, like, 90-plus percent accurate in being able to tell if the driver is drowsy. Do I have that right? Like, basically, in that case, it's all about the non-language aspects of the inputs.
Alan Cowen (12:07) Well, for drowsiness, if you're talking, you're usually not drowsy. You could be really distracted, which is a whole different issue, and we can measure that too. So drowsiness is really happening when you're not talking anyway, and so the language model is not going to get you anywhere there. We are able to capture these really nuanced facial expressions that occur even when your eyes are open that indicate that you're drowsy, that you're not able to function at the same reaction speed, basically. That's obviously important if you're driving a car.
Nathan Labenz (12:39) Yeah. Fascinating. I think this is a really interesting one. You know, hopefully my kids will never have to learn to drive a car if we can get our act together. But I thought that was just a really interesting example of how far-ranging this sort of technology might be in its application. Obviously, again, it makes sense on a conceptual level. Language is, in a sense, kind of dehydrated. If you're purely in text form, I think of it as like a dehydrated version of what actually happened. Adding images, audio, and video onto it makes a ton of sense; it's going to unlock a lot more interpretation of what is going on. Let's work our way up. Let's start with the datasets. We'll go to the base models, and then we'll go to the custom models. So the datasets, from what I understand, are largely self-reported emotions. So tell me how that works. Like, how are you actually getting the ground truth? I'm sure there are a lot of challenges there, right, around what is ground truth in terms of what somebody is really feeling. You know, self-reporting makes a lot of sense, but I can even imagine that that would be fraught. So tell us about how you've assembled these datasets.
Alan Cowen (13:51) Yeah. So we have this big survey platform that we've put together where we recruit, at this point, hundreds of thousands of people to do a lot of different things. They're talking to other participants. They're talking to AI. They're reacting to things. They're acting things out, and they're reporting on their own expressions. So that's one supervised data point. They're reporting on their experiences. They're reporting on the expressions of the person they're talking to and what they infer to be that person's experiences. And we're controlling, kind of randomizing, the tasks that they're undergoing, who they're talking to, what they're talking about, what they're reacting to relative to the identity of the person. And that's really important, because if you just train a model on tons of video data, it will start to understand facial expression, but it'll be really conflated with, like, the identity and context that the person's in. That is very pernicious often. Women are more often in certain contexts than men, more often expressing certain things than men, and perceived differently than men, and that's true across ethnicities and ages. You don't want those biases feeding into your recommendation algorithm, for example. You don't want it to see women forming more submissive expressions as, like, more positive than when men form submissive expressions, or as, like, dumb. You don't want that to be a bias that just gets kind of echoed and enhanced by your algorithm. You want the opposite of that.
Nathan Labenz (15:26) Hey. We'll continue our interview in a moment after a word from our sponsors.
Alan Cowen (15:30) So it's really important to disentangle the identity and expression dimensions. And that's one of the things our surveys are designed to do, because they're scientifically controlled. We recruit these participants. They don't choose what tasks they're gonna undergo. Like, it's all randomized. Right? And so then we can train our model on all of these different task features. It's this massive multitask thing where we have experience labels, we have perception labels, we have the task that they're undergoing. And out of that, we can extract, first of all, dimensions of facial expression and the voice and speech that are very objective. If you use the right training procedure, you can get something that's going to give you the same output for the same facial expression, whether it's somebody who's old or young or male or female or non-binary or one ethnicity or another, etcetera. That is independent of the meaning of that facial expression. First, you want to just measure that facial expression. And then when it comes to meaning, there's all of these things that you can try to predict. They could be self-reported emotion, right? But it could also be something really practical: did this person have a good experience with, like, this product? Are these two people getting along? Things like that. And those are really deeply embedded in our expression and language.
Alan Cowen (17:03) And we're able to decode that by first taking those objective measures and then connecting them to the subjective labels in a culture-dependent way too, like, for each demographic and culture that we study.
Nathan Labenz (17:15) Yeah. There's so many layers to this. So I guess when I imagine the simplest version, it would be maybe just an image, right? Because that's just a static asset. And then I could look at it, and I could say, you know, is this person smiling? Are they frowning? And so it's interesting even just thinking about the vocabulary that we have for facial expressions. We have, like, a few kind of concrete words for expressions like smile, frown, grimace, whatever. But then very quickly, it seems like that vocabulary runs out, and we kind of move over to a vocabulary that is more about the state that we are inferring based on the facial expression. So do you have something that kind of sits between those? I'm not super familiar with, like, the micro-expression literature or exactly what vocabulary they use. But do you have a sort of more objective vocabulary for all sorts of different facial expressions that sits between the actual shape of the face and the sort of emotional label that people so instinctively jump to?
Alan Cowen (18:36) So this has been a challenge in emotion science for a while. Like, my adviser's adviser, Paul Ekman, was one of the pioneers in this; he invented this facial action coding system that purported to enable people to label facial muscle movements. You have to undergo training to do this. There's 2 weeks of training, 8 hours a day, in order to be able to do this. It turns out that even that system is more biased by what direction people are looking and age than people's perceptions of emotion. So perceptions of emotion are this really deeply ingrained thing in the human brain that enable us to make inferences, because that's really what's important. We're not concerned as a species about the structure of what muscles people are moving. We're interested in social inferences, like what is this person's preference, what is their intention, and that's embodied in these emotion labels. People are really, really good at picking up on tiny, tiny nuances that influence their perceptions of emotion. The facial action coding system doesn't really capture this. Things like grin and smile, it doesn't really capture that. You really need to collect tons of judgments. What we do is we first collect a ton of judgments and just assume that the structure of the facial muscle movements influences those judgments. Then we try to figure out what it is that is actually influencing those judgments, and not just in a collapsed, average way across the population, but the whole distribution of the population and how they evaluate a facial expression or how they evaluate a voice or speech. Like, we're taking all of those data points as these nuanced ways of conceptualizing expression, and they have some reliability to them. We do this thing called principal preserved components analysis to figure out across different populations what is in common and what's different in people's judgments of expressions. And we can train on that data, where each individual judgment is biased by things like ethnicity and age, but we can actually remove those biases by utilizing our randomized task structure. And so we get the best of both worlds. And then when you look at the embeddings of those models that we train, those are the things that capture a ton of outcomes really accurately: whether a driver is drowsy, whether somebody is feeling depressed, whether they're going to be screened as having Parkinson's, whether they're having a good product experience in a user experience study. We can predict all these things a lot better from those underlying embeddings, which are actually really difficult to name and conceptualize. So, I mean, there's definitely a lot of scientific work we could be doing about just looking at these individual dimensions of this embedding space and trying to figure out what are the actual muscle movements that give rise to those dimensions. It's not easy at all, first of all because we can't even really measure muscle movements that accurately. The best way we have to measure muscle movements in a living person is to measure the electrical signal, but that's not very localized. So you put electrodes on somebody's face and you try to measure the electrical signal, but we have over 40 facial muscles and you can't really narrow it down to which muscle is moving.
So it's actually a much harder scientific problem to link those perceptions of facial expression, and the underlying dimensions that are agnostic to the demographics, with the underlying muscle movements.
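For readers curious what a "principal preserved components"-style analysis might look like in practice, here is a rough, illustrative sketch on synthetic data: find directions in judgment space whose stimulus-level scores covary across two independent rater populations. This is a hedged reading of the general idea, not Hume's actual implementation or data.

```python
# Minimal sketch of a "principal preserved components"-style analysis:
# directions in emotion-judgment space whose stimulus-level scores covary
# across two independent rater populations. Toy data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_dims = 500, 30          # e.g. 30 emotion-category judgment dimensions
shared = rng.normal(size=(n_stimuli, n_dims))
X = shared + 0.5 * rng.normal(size=(n_stimuli, n_dims))   # population A's average ratings
Y = shared + 0.5 * rng.normal(size=(n_stimuli, n_dims))   # population B's average ratings

X -= X.mean(0)
Y -= Y.mean(0)
C = (X.T @ Y + Y.T @ X) / (2 * (n_stimuli - 1))   # symmetrized cross-covariance
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                  # largest shared components first

# Check how well each top component's scores agree across the two populations
for k in order[:5]:
    r = np.corrcoef(X @ eigvecs[:, k], Y @ eigvecs[:, k])[0, 1]
    print(f"component eigenvalue={eigvals[k]:.3f}  cross-population r={r:.2f}")
```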
Nathan Labenz (22:10) So if I understand that correctly, to try to summarize: it's kind of a losing battle to try to find that middle, theoretically more objective state, which kind of turns out to be a mirage. And you're better off jumping from raw inputs to more emotional labels, versus trying to have a middle step on whatever the, you know, thirty-second muscle in the face is contracting; that ends up being more harm than good. A couple of other just questions on the data. So you're recruiting people from all over. Is this something that I can go do? Can I participate? Can listeners participate? If so, how would they do that?
Alan Cowen (22:50) We actually prefer to find naive participants who really don't know what we're doing, because that data is a little bit more valid. I always get this question, even from people at the company: hey, can I participate in the surveys? And you can participate in some of the surveys, but not all. I would say for the ones where we're trying to get people's judgments of meaning and expression, we want that distribution to be a naive distribution over the population. But we have a new survey where we have people interacting with our generative API, so basically voice in, voice out. It also sees your facial expression, which influences what it says. Then you report, was this a good experience or not? Right now we have participants taking that survey, but we do want to open that up pretty soon. So it can be something more like Chatbot Arena, where it's a little bit more democratized and anybody can take it. But that's coming soon. It's not quite out yet.
Nathan Labenz (23:48) I mean, it makes sense that you don't want to have people that are too meta in their thinking about what's going on. How consistent are people in their judgments, and how do you even think about something like inter-rater reliability? You know, if I'm evaluating myself and you're evaluating me, how much do we agree? How much do we disagree?
Alan Cowen (24:10) It's hard to think about inter-rater reliability in this way, right? But let's say the two of us are evaluating a picture and we have a lot of raters evaluating that picture. There we have some degree of inter-rater reliability that we can look at. And then we can see how well our models are capturing the explainable variance in those judgments, so what the different raters have in common versus what they don't have in common. The maximum you can get there is 100%. If I was just predicting one person's judgment, the maximum is not very high, first of all because there's a lot of different things you can say. You can say he's happy or amused, or you can get even more granular: this person looks like they're experiencing schadenfreude or something. Then you have to think about what is the semantic space that underlies that. So that's what we do with semantic space theory: what are the dimensions you want to cast that along to assess reliability and all that? And it gets really complicated if you're trying to just predict one person's judgments. But inter-rater reliability, I see it as more of a noise ceiling for how well the model can do, because we're really trying to predict the whole distribution of human judgments, not trying to replicate one person's judgments, or even say that one person's judgments are accurate and another person's aren't. Like, we're not trying to do that.
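As a rough illustration of the "noise ceiling" framing, here is a small sketch on synthetic ratings: estimate reliability via a split-half correlation across raters, then compare a model's agreement with the rater average against that ceiling. The toy numbers and the Spearman-Brown step are illustrative choices, not a description of Hume's evaluation pipeline.

```python
# Hedged sketch: inter-rater reliability as a noise ceiling for a model.
import numpy as np

rng = np.random.default_rng(1)
n_stimuli, n_raters = 200, 40
true_signal = rng.normal(size=n_stimuli)
ratings = true_signal[:, None] + rng.normal(scale=1.0, size=(n_stimuli, n_raters))

half_a = ratings[:, : n_raters // 2].mean(1)
half_b = ratings[:, n_raters // 2 :].mean(1)
split_half_r = np.corrcoef(half_a, half_b)[0, 1]
# Spearman-Brown correction estimates reliability of the full-group average
ceiling = 2 * split_half_r / (1 + split_half_r)

model_pred = true_signal + rng.normal(scale=0.6, size=n_stimuli)  # stand-in model output
model_r = np.corrcoef(model_pred, ratings.mean(1))[0, 1]

print(f"noise ceiling ~{ceiling:.2f}, model correlation {model_r:.2f}")
```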
Nathan Labenz (25:27) Yeah. I imagine that would be a huge challenge. The closest thing that I've studied to this is the aesthetic evaluation of images. And there's a dataset, the name of it's escaping me at the moment, put together a few years ago with a pretty heroic effort to go have, like, a hundred people rate each of the images in the dataset. You think, jeez, why do you need a hundred different ratings of each image? And the answer is because there is a lot of disagreement about just what this thing should score out of 1 to 10. Usually it's like a bell curve for any individual image, but you see a pretty wide curve for just a single image. And that obviously means that the downstream model is gonna have some real challenges as well. In terms of the data that you collect, is it, like, words? I mean, I can imagine different things. I can imagine an unhappiness-to-happiness scale where I would have a slider and I would put the slider at a point for how happy or unhappy I think they are, and you could have that on a few different dimensions. Or you could be like, choose the 3 words out of this pool of words that are most appropriate. I imagine you've experimented with multiple things. What do you find to be the most effective way to elicit that kind of judgment from people?
Alan Cowen (26:51) So we've done a bunch of different kinds of studies of this. We've had a pleasant-unpleasant slider. We've done things where we have 23 different sliders: pleasant, unpleasant, calm, aroused, is this familiar or unusual, novel for the person. These are called appraisal dimensions. And then we do studies where it's just pick an emotion, or pick multiple emotions. And it turns out the pick-an-emotion study is way more informative than these sliders. First, if you do the pick-an-emotion study and then you try to predict the sliders, you can predict almost 100% of the explainable variance in where along each slider somebody is going to say this is. But if you try to go in the other direction, you're only predicting 20% of the variance in the emotion study. So emotions are like these really powerful shorthands that we have to describe emotional behavior. That's what we end up using for most of our studies, because that's the most powerful thing, and there's just so much information and nuance in them. And you can continue adding more emotions, up to 20, 30, sometimes 40 different options, and get more and more information out of it. So in our surveys now, we have more than 50 different options per survey, and sometimes they change depending on the modality. In the voice, we've conceptualized things slightly differently than in the face, versus when people are reporting on their experiences, versus when we're talking to somebody. Those words can change a little bit, but fundamentally just having enough emotion categories really does capture it and is more consistent across cultures. Things like anger are much more consistent versus unpleasant-pleasant. Surprisingly, people used to think that the most consistent things were valence and arousal, so unpleasant-pleasant and calm-aroused. It turns out that's not the case. It turns out that anger and fear and love and amusement and awe are, like, more consistent when you have people from different cultures rate things in terms of these qualities.
Nathan Labenz (29:02) Hey. We'll continue our interview in a moment after a word from our sponsors. You know, obviously, you've taken a lot of pains to try to have a diverse dataset. And that's obviously going to be super important, both for your product quality and for the way in which this type of technology serves humanity broadly. I wonder, first, what interesting observations or kind of correlations you might flag. And then also, I wonder how you deal with examples that may break correlations. You know? Like, I could imagine, okay, well, certain people will look a certain way, they come from a certain culture, have certain tendencies. But what if, for example... my sister is Korean American, adopted, you know, when she was 3 months old, and she spent her whole life growing up here. And so she, you know, looks one way but has a cultural background that's a different way. And I wonder how you think about dealing with those, like, correlation breakers as well.
Alan Cowen (30:13) So we do these studies and we collect data in a bunch of different countries, and we've actually published a lot of this data. And there's usually about 80% commonality across cultures once you control for inter-rater disagreement, which is bigger than intercultural disagreement. If you look at the disagreement across countries, that's smaller, if you've averaged the ratings, than it is between any two raters within a country. 80% of the variance is usually preserved across countries, and that's usually along upwards of 25 to 30-plus dimensions in any given modality. Then when you make it multimodal, the dimensionality goes upwards of 50. Right? So the interesting findings have been, this is what we have in common; there's always a lot in common. There's some dimensions that are not as culturally universal, like awe is one. So we have this word awe in English that doesn't necessarily always translate to the same thing in different languages. In East Asia, there's a slightly different conception of it. In Africa, there's a slightly different conception. And so when you look at the facial expression that corresponds to awe, which is recognized in every culture, it's labeled slightly differently. It's really in the US where it's predominantly awe, but it actually is less so in other cultures. But these cultural differences are sort of the minority of the variance, even though we do see them. And we did this study of ancient sculptures in the ancient Americas. The reason that's interesting is because there was no cultural contact between the ancient Americas and European cultures. If people are portrayed in these sculptures as expressing similar facial expressions in similar situations, that's indicative of a biological universal, right? Because there's not really any reason to think that there'd be a cultural explanation for that. We find that to be the case for a lot of these. Pain, for example: there's these sculptures of people being tortured, and if you just isolate the face from those sculptures, people in every culture in the modern era recognize that as pain. And you can look at Egyptian hieroglyphs and sometimes you see people crying, and it happens more often at funerals. There's these crying ceremonies that are depicted in ancient hieroglyphs that are really similar to the crying ceremonies depicted in the ethnographies that were written by the early explorers in the ancient Americas. It's like the same ceremony was happening, basically, in ancient Egypt and, more recently, in the ancient Americas. So there's certain expression dimensions that seem to be completely universal. That's on the extreme end of universality.
Nathan Labenz (33:11) Fascinating. And so it's a good point, and this kind of continues to come up, right? The sort of short version of it is that the variation amongst individuals within a group is greater than the variation between groups, and that holds again here. Do the correlation breakers cause problems? Do you need some sort of special-casing for that, or maybe that's where the custom models come in and you have to kind of build in the context?
Alan Cowen (33:44) That is where the custom models come in. In particular, even though the plurality of people in every culture usually use the same word or its translation to describe the meaning of an expression, that doesn't mean the expression's used in the same way, surprisingly. People in some cultures much more readily show positive expressions that are high arousal, smiling with a really big smile, than in other cultures. In East Asia, that's a lot less common than in the US. In the UK, that's actually a lot less common too. So you need to control for those differences when you're looking concretely at things like, I don't know, customer support. Customer support calls sound totally different in different cultures, surprisingly. And so you need to train a custom model for your customer support calls within each culture where you're applying it, at least to the extent of making distinctions between, broadly, the US and Canada versus the UK versus India versus East Asia. And even within East Asia, there's variance. People much more readily express anger in some cultures than others. People much more readily express kind of exaggerated positivity, and it doesn't always mean something sincere, depending on the culture you're looking at. So you need to train the custom models for those reasons, to control for those cultural norms.
Nathan Labenz (35:14) So I wanna unpack that in more detail in just a second. But before even getting to that, let's just talk about kind of the base models. And you've got a bunch: expressive language, emotional call types, facial action units, speech prosody, which you mentioned earlier, dynamic reactions, and more beyond. One thing that jumped out to me about these is that they all have kind of a different number of... it's not exactly a classifier, right? I imagine you could get multiple labels out for a single input. But, you know, 53 emotions, 67 descriptors, 37 kinds of facial movement, 28 kinds of vocal expression. I guess just for starters, how are you arriving at these numbers of different emotional categories, and what word do you even use for it? Is it label? Is it dimension? I mean, obviously, you use the actual terms, but I'm really curious about, you know, is that based on prior work, or is that based on some sort of clustering of the data itself? How are you arriving at these numbers?
Alan Cowen (36:20) Yeah. That's based on studies that we've published as well as more exploratory studies that we've done internally, not all of which are interesting enough to publish. But basically, there's boring reasons sometimes. For the voice, if you're looking at speech prosody, that has a lot of different dimensions to it, but fewer than vocal bursts, which are laughs and sighs and screams and interjections like uh-huh, because vocal bursts are less phonetically constrained. So there's certain things in the phonetics of vocal bursts that we use to distinguish different kinds of things we're expressing that are not present in speech prosody, because that phonetic dimension is being used to actually form the words, so you can't use it anymore the way you can for vocal bursts. And for that reason, speech prosody actually has fewer dimensions than vocal bursts. I mean, it sounds kind of boring, but that's one of the distinctions. And then in the face, there's different dimensions that we express with facial expression versus in the voice. And some of that just has to do with probably the way that the voice and face can be used for social communication, even deep into our evolutionary history. So some expressions are better expressed quietly than others, and some are actually harder to form and represent this kind of sincere expression. A really, really stereotyped example would be blushing. Blushing is impossible to fake, right? And it's a signal of embarrassment or shame that is used to express that we're genuine. We can't fake it, so it's a genuine commitment that we are actually expressing this. And it's something that you don't want to do out loud, and so it's not really something that is replicated in the voice. Most voice dimensions are easier to fake. It's easier to voice act than to both do the voice acting and the facial expression at the same time. I mean, you can do the anger or shock or surprise, but it doesn't usually look as authentic. It's much easier to sound convincing when you're voice acting. And that's probably because, in our evolutionary history, you could just hear somebody's voice even if that person didn't know that you were overhearing them. So there's certain private information that we wouldn't want to express as readily in the voice. There's certain things that we want to be able to do intentionally with our voice that we don't necessarily want to lose control over.
Nathan Labenz (38:59) So all of that makes sense to me in terms of the different signals that are coming through the different modalities. But I'm still a little bit stuck on, like, why 53 and not 54 or 52 emotions out of expressive language? How are you drawing that line? Because you mentioned, like, schadenfreude earlier; I'm guessing that's not one of them, but maybe it is. But at some point there's definitely another word, right, that you could add, and at some point you're saying, okay, this is the set. And I'm really curious as to how that cutoff is being determined.
Alan Cowen (39:36) You kind of want to overrepresent. You want to oversample the number of actually distinct dimensions that people express. The counts that we show on the website are usually the number of distinct dimensions that we actually express, or that we've been able to identify people distinguishing perceptually with scientifically rigorous studies, versus the number of labels we use. You want to basically pick enough labels that you can capture all those dimensions. We just figure out with some pilot studies, or with actual longer-term studies, what words we actually need to include to get there. Yeah, this is probably a boring answer, but it really is just based on looking at a curve and seeing where there's drop-off in our ability to explain variance, basically.
Nathan Labenz (40:25) So does that cash out, too, in terms of people actually making these judgments, that you would give someone an image and you're just getting sort of random-ish labels from different people? There's just not enough agreement on... yeah, please. I'm still struggling a little.
Alan Cowen (40:45) Schadenfreude is a great example, because if you get ratings of both contempt and amusement, then that's basically schadenfreude. So you don't actually need to have a schadenfreude label. So there's actually different rotations of this space. There's a bunch of different taxonomies you could use that are all equivalently explanatory. You could use a completely different set of words and still explain the same underlying dimensions. That's one of the big revelations from my research, which is that it's actually this continuous space where you have schadenfreude as amusement into contempt, or shock as surprise and fear, or... there's a bunch of examples. We just basically try to find the words that are the most distinct, that people can reliably use. We try to shy away from words that people don't know the meaning of, like schadenfreude, because you can get that distribution of the population that will use that label, but it's pretty sparse. So if you can get the same information out of contempt and amusement labels, that's actually a lot better. These words are just concepts that are used to parcellate what is essentially a continuous space, and you want to have enough concepts that people are able to reliably identify where something lies along those dimensions. And not too many; you don't lose anything with having too many dimensions, but you just need more data to compensate for the fact that you have basically two synonyms of the same thing, or a word that means a combination of two other words, and you would need to basically derive that from the data itself, and your model has to figure that out. So you kind of want to avoid some of that complexity so that you can be more data efficient, basically. So that's what we do. We run all of these huge studies to figure this out. We have millions of samples where people are rating, imitating, acting, reacting. We have all these videos that people react to. So we're just running these big studies in different cultures to figure this out, basically. And that's just how we train our models. Then when you actually peel off the output of the models, you realize that you actually get even more variance that you can explain from the underlying embedding dimensions. So if you didn't have schadenfreude, and schadenfreude turned out to be important to include, you might actually find that you can already explain it with the embeddings that we derive from the models that we train. The embeddings themselves, because the model has been trained to ignore identity, are actually pretty identity-independent in the higher layers as well; we can test that out. So they're not influenced by gender and race, and they actually represent these movements that are more nuanced. That's sort of the underlying core of everything we do. Once you get these really nuanced dimensions in a way that's identity-independent, then we can use that for the custom model API, which allows people to predict anything from it, or for model training on the generative side. We actually look at people's reactions to words, and we're training generative models to produce words that evoke the right set of reactions. And all of the nuance that we get from our expression measurement models turns out to make a big difference when it comes to those use cases.
Nathan Labenz (44:18) So one more question on just the base models, and then let's go to the custom models. For the base models, can you describe just the inputs and the outputs? Like, I know that it's, for example, video with audio, but I'm curious about how that gets sliced down or, you know, preprocessed before it gets fed in. Like, how much of a time dimension, and how do you think about chunking that? I mean, presumably, if you were to just do a totally naive chopping of the video, you could have a lot of artifacts as a result of where the breaks fall. So I imagine there's gotta be some overlapping, but I'm very curious about that. And then also on the output, is it just like a number for every dimension that kind of says you're very high on happiness and you're low on shock, and that for kind of all the way down the line? Or what exactly is the sort of raw output of the base model? So inputs and outputs.
Alan Cowen (45:16) So the base models are really, really high density. For the facial expression model, we can do frame by frame. And we also have a dynamic model that takes into account a sliding number of frames before that, and that's a parameter you can adjust to inform its prediction of what someone's gonna perceive at a given frame. So, for example, people form these weird facial expressions sometimes at a high density. If you really pause things frame by frame, you'll often see these kind of crazy facial expressions show up, especially if the person's really expressive. So the frame-by-frame estimate can be really spiky, and then we have a dynamic model we apply on top of that. But actually, that frame-by-frame estimate is what we end up using to help form these custom models. Usually 3 frames per second is the right cadence; you can go beyond that, but there's diminishing returns. And then we don't explicitly chunk it. We put it alongside words into a joint multimodal model, and the model figures out how to look at that dynamic information intermingled with words. And that forms our prediction for a custom model. It depends on what your custom model is doing, for example. If you have a small amount of data and they're long videos, we'll just average across the video, because that's the best we can do; the model's not gonna be able to do that much better using dynamic information. But if your custom model is something like end-of-turn prediction, where every single turn somebody's speaking and you want to know when they're done speaking, which we can do internally and use in a lot of ways, if that's what you're predicting, then you want really dense information. You want to actually know the prosody of the last word somebody used and where that left off, and also their facial expression.
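As a toy illustration of that cadence, here is a minimal sketch, with a hypothetical `score_frame` function standing in for a per-frame expression model: sample at roughly 3 frames per second, then smooth the spiky frame-level estimates with a trailing window. This is not Hume's actual preprocessing, just a way to picture the frame-by-frame versus dynamic distinction.

```python
# Illustrative: ~3 fps frame sampling plus trailing-window smoothing
# of a spiky per-frame expression score.
import numpy as np

def sample_timestamps(duration_s: float, fps: float = 3.0):
    """Timestamps (seconds) at which to pull frames from a video."""
    return np.arange(0.0, duration_s, 1.0 / fps)

def score_frame(t: float) -> float:
    """Hypothetical per-frame score for one expression dimension (toy)."""
    rng = np.random.default_rng(int(t * 1000))
    return float(np.clip(0.5 + 0.3 * np.sin(t) + rng.normal(0, 0.2), 0, 1))

def sliding_window_smooth(values, window: int = 9):
    """Average each frame with the preceding window-1 frames."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(sum(values[lo : i + 1]) / (i + 1 - lo))
    return out

ts = sample_timestamps(duration_s=60.0)
raw = [score_frame(t) for t in ts]
smooth = sliding_window_smooth(raw)
print(len(ts), "frames; raw vs smoothed spikiness:", np.std(raw).round(3), np.std(smooth).round(3))
```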
Nathan Labenz (47:11) So tell me more about the custom models, finally getting there with all the foundation laid. It's a mix now, right, of these inputs and your data. So maybe a good way to approach this would be to share some examples of what different kinds of data people have brought to the platform and mixed in with your emotional foundation.
Alan Cowen (47:36) Yeah. So we have a bunch of sample models. So driver drowsiness, which you mentioned earlier; customer outcomes from customer support calls: is somebody satisfied or not? Did this call go well or not? We have models of whether somebody expressed having felt depressed while they were speaking, where we don't know this from the video itself, and we can predict it from somebody's expressions and language from the video much better than if you just use language. All of these are models where we do much better using expression than language alone. The reason that our custom model API is so important is because, when I was giving that last explanation of how we chunk together different frames of expression with the audio and the language, that really depends on the problem. We wanted to do all of that work so that people wouldn't have to figure it out themselves. That's what we built the custom model API to do. All you need to do is upload your labels and your data, and we do the rest. We figure out what kind of model to train, we cross-validate it and give you the accuracy, and we give you an endpoint to deploy the model. If it's a customer service call, we can give you an endpoint that says how well the call is going at any given time, without having any labels of that. The labels you've given us are overall: at the end of the call, how well did it go? But now you can apply that at a more granular level. You can also apply that to millions of calls that are unlabeled and start to train a model that actually acts like a good customer service agent, with all of that implicit feedback that you would not otherwise be able to access because it's hidden in people's expressions on these calls. So that's where the custom model API comes in. I mentioned a few different custom models that we've made available publicly. There's a bunch that are not public; by default, your model is not public. You train it and it's just yours to use. But there's models that clinical collaborators have trained that are really, really cool. You can take a video of somebody actually in a clinical trial where they're talking about their depression and track, in a really nuanced and robust way, all these different depression symptoms that you actually have an easier time tracking from these short video diaries versus actual doctor's appointments. You can do it at a much higher level of granularity; you can have people do this every day. We have other kinds of customer support models trained on many more calls. Yeah, the list goes on and on. There's coaching use cases, education, like, is somebody distracted? That's not used to spy on them but to help somebody realize if they were distracted while something was being said in a lecture; you can remind them later on that they were distracted and ask if they want to repeat that part of the lecture, stuff like that. There's just a bunch of use cases, which is why we've made a developer API that anybody can use and not a SaaS product for a specific use case.
Nathan Labenz (50:40) So let's say I'm any of a bazillion companies that are sitting on a lot of recorded customer service calls and then corresponding downstream data. And I guess that could be a simple, you know, how-well-did-this-call-go-for-you kind of survey result, or potentially, did the customer churn after this call, or whatever. Right? There could be a lot of different downstream indicators. But I imagine what most people have is, like, the raw call and then relatively few data points downstream that they really care about predicting and optimizing for. We just bring you that data and, like, the team goes to work on it? It doesn't sound like it's a fully automated thing at this point; you're still putting in some elbow grease to kind of figure out what is the nature of that dataset and how best to use it?
Alan Cowen (51:35) No. It's pretty much fully automated. So all of that's happening in an automated way. You just upload your labels along with your calls, and then we will train the model that predicts those labels. It will do the work of months of data science, basically, if you were to train that model yourself, and also utilize our model embeddings, which you would not otherwise be able to access, to give you much higher accuracy in predicting your outcome. It could be the outcome that your label represents; it could be the state of a person in the call. And so we do all of that for you, and we even give you the endpoint to deploy it. We give you the model ID, which you can then put into our API and get back your prediction right away. Oftentimes people have a lot of data and they want to use the labels that they have, but surprisingly often, even in something like customer support, people have data but it's really noisy. These ratings of customer satisfaction are really noisy, for example, because you don't know if a customer was dissatisfied due to anything that you can control or whether they're just a jerk. And so it's better to actually have a manager, or even better, somebody at a higher level, go in and listen to the call and say whether they thought it went well or not. We can take that really gold-standard label, predict it with 100 or 200 calls, and then you can apply that to all of your other calls. That's something you wouldn't otherwise be able to do. You can compare that gold standard to other labels that you have, train the model with those other labels, and actually see what those different competing models are doing and how to use them. It's really about being able to train a model on gold-standard data that's often limited and still get really, really good accuracy.
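To make that developer workflow a bit more tangible, here is a purely hypothetical sketch of what "upload a couple hundred labeled calls, train, then hit an endpoint with the returned model ID" could look like. The base URL, routes, field names, and response shapes are placeholders invented for illustration, not Hume's actual API; the functions are defined but not called, so nothing is sent anywhere.

```python
# Hypothetical custom-model workflow sketch (placeholder API, not Hume's).
import requests

API = "https://api.example.com"          # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

def train_custom_model(samples):
    """Upload gold-standard-labeled calls and kick off training.

    `samples` is a list of dicts like:
        {"audio_url": "https://example.com/call_17.wav", "label": "good_call"}
    Returns a model ID from the (hypothetical) training job response.
    """
    resp = requests.post(f"{API}/custom-models/train",
                         json={"samples": samples}, headers=HEADERS)
    return resp.json()["model_id"]        # hypothetical response field

def predict(model_id, audio_url):
    """Score a new, unlabeled call with the trained custom model."""
    resp = requests.post(f"{API}/custom-models/{model_id}/predict",
                         json={"audio_url": audio_url}, headers=HEADERS)
    return resp.json()                    # e.g. {"good_call": 0.81, "bad_call": 0.19}

# Usage (against a real service, not this placeholder):
#   model_id = train_custom_model(my_200_labeled_calls)
#   print(predict(model_id, "https://example.com/new_call.wav"))
```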
Nathan Labenz (53:24) Again, I'm kind of wondering about how much trailing time is being considered as you go through this. Like, if I have a 10-minute customer service call, it might be going real bad for the first 2 minutes and then the person totally redeems themselves, or whatever. Right? There could be a lot of kind of ups and downs. So how much sort of trailing time are you considering? And do you, in fact, see a lot of roller-coaster dynamics as people go through these interactions?
Alan Cowen (53:55) Yeah. It really depends on the outcome. I mean, surprisingly, the average does pretty well. We have these dynamic models that we've trained by putting expression measures alongside language tokens into a language model and training it for billions more tokens, tens of billions more tokens, so that it really deeply understands both how expressions influence language in an interaction and how language influences future expressions. That's useful for a lot of things, and I can talk about that. But in terms of the custom model, if we take that embedding and use it to predict custom outcomes, that's sometimes a lot better, but often it's just slightly better than taking the average, since the average is often nearly as good. And when you're looking at a customer service call, the ups and downs can really just be modeled as a sliding window. It's simpler than you might think. There's actually tricks that you can use that get you pretty close to the optimal, or seemingly optimal, model that you'd be able to train with a lot more data. We really invest in being able to use those tricks, because oftentimes people don't have that much data to train on. We do all the pre-training that makes it easy to deploy those tricks, because basically we've extracted a multimodal expression-language embedding that's pre-trained on millions of hours of data. When you actually deploy that, that's an embedding that you can cast your data into. Then you can do classification or regression within that extremely compressed embedding space, and it's more effective that way.
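Here is a small, illustrative sketch of that last recipe on synthetic data: average a per-turn embedding across a call, then fit a simple classifier in that compressed space. In reality the embeddings would come from a pretrained multimodal expression-language model; here they are random numbers with a weak injected signal so the example runs end to end.

```python
# Sketch: average pretrained per-turn embeddings, then fit a small classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_calls, n_turns, emb_dim = 200, 50, 64

call_embeddings = rng.normal(size=(n_calls, n_turns, emb_dim))  # stand-in per-turn embeddings
labels = rng.integers(0, 2, size=n_calls)                       # e.g. call went well / badly
call_embeddings[labels == 1, :, 0] += 0.3                       # weak injected signal (toy)

X = call_embeddings.mean(axis=1)    # average across the call: simple but strong baseline
clf = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(clf, X, labels, cv=5).mean().round(3))
```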
Nathan Labenz (55:45) A couple of big takeaways there for me. One is that one approach is essentially to text-tokenize and interweave the emotional expression into the dialogue itself. And then I guess you probably have to diarize it as well if you're talking about dialogues, right? Who's talking at any given time?
Alan Cowen (56:06) Speaker tokens. Yeah.
Nathan Labenz (56:08) Yeah. So you've got kind of a, you know, ultimately still all text representation of all this stuff. So these kind of, you know, these kind of base I don't know if base model isn't quite the right term anymore because you the classifier models that you've that we've covered earlier.
Alan Cowen (56:26) Yeah. Measurement models.
Nathan Labenz (56:28) Okay. So those are used to then create this, like, annotation of text, and then you have tons of that. And so you can essentially do, like, continued pre training. Yes. Whatever. Is it like a LLM 2, I guess, would be the default assumption these days for what you're extending?
Alan Cowen (56:47) Mistral, Llama 2, it depends. We actually have a bunch of models that we train, because there are smaller models we use for model orchestration, larger models we use for language generation, smaller models we use for endpointing, and all of that. We train a lot of these, but yes, that's what we do. We're training on video of people interacting, and it's from that video that we can extract all the expression tokens. There are different versions of this, but the best version is actually creating new tokens that don't exist in the vocabulary of the model and training on those. What we're doing is not a multimodal model like GPT-4V, which is trained on images interleaved with text on the internet. It's a different kind of human interaction model, but we're still building it on the best unsupervised autoregressive language models. And so we're taking advantage of all the scaling people have done in that domain.
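The general recipe of adding brand-new tokens to a pretrained language model's vocabulary and then continuing autoregressive training on interleaved sequences can be sketched as follows. The token names and base checkpoint are placeholders, and this is only a generic illustration of the pattern, not Hume's training code.

```python
# Illustrative sketch of the general recipe: add new expression and speaker
# tokens to a pretrained LM's vocabulary, resize its embeddings, and continue
# autoregressive pretraining on text interleaved with those tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # placeholder; could equally be a Llama 2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical new tokens: speaker turns plus discretized expression signals.
new_tokens = ["<spk:agent>", "<spk:customer>",
              "<expr:amusement_high>", "<expr:frustration_high>",
              "<expr:calm_high>"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

# An interleaved training example might then look like:
example = ("<spk:customer> <expr:frustration_high> I've been on hold for an hour. "
           "<spk:agent> <expr:calm_high> I'm sorry about that, let me fix this now.")
print(tokenizer(example).input_ids[:20])
```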
Nathan Labenz (57:50) So on the business side, we've covered a decent number of use cases as we've gone through this. I looked at the pricing page. I always zoom to the pricing page of everything that I check out. I wasn't sure whether this is for a custom model, and you can clarify, but I see 2.76¢ per minute to process audio and video, which with a little arithmetic works out to $1.66 per hour. I always like to anchor that kind of thing and ask, how much does that cost compared to a human? Obviously, it would be a lot cheaper than having somebody go through and do detailed annotation. And from an ROI perspective, if you've got somebody making $20 an hour in a call center, how much more efficient can we make them? It's pretty easy for this ROI to start to look like a pretty sure bet. Is that indeed the pricing for the custom model? Do I have that right?
Alan Cowen (58:53) Yeah. So, actually, the custom model API is free. It's trained on top of the underlying expression measurement models, and so you're just charged exactly the same way that the expression measurement models are charged, which is by the hour based on whether it's video or audio or text. If you have a custom model, it's still the same price by the hour, and you get all of the custom model predictions that you've trained.
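For reference, the arithmetic behind the cost comparison, with the $20/hour figure as a purely illustrative call-center wage rather than anything from Hume's pricing page:

```python
# Back-of-the-envelope cost check for processing audio/video.
price_per_minute = 0.0276           # $0.0276, i.e. 2.76 cents per minute
price_per_hour = price_per_minute * 60
print(f"${price_per_hour:.2f} per hour of audio/video")   # ~$1.66

human_hourly_wage = 20.0            # illustrative call-center wage
print(f"Roughly {human_hourly_wage / price_per_hour:.0f}x cheaper than a $20/hr human reviewer")
```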
Nathan Labenz (59:20) Gotcha. Okay. Cool. So you've done this giant extended pre-training with all these annotated interactions, then I bring my own data to it. I potentially need only a couple hundred examples, but that's enough to do the last phase of fine-tuning and dial it in on my particular labels of interest. And then all of that's available at the same price as the base model, which works out for audio and video to $1.66 an hour, or just under 3¢ a minute. You mentioned GPT-4V a second ago, and multimodality is obviously the big new unlock for general-purpose models right now. In preparation, I played with GPT-4V a little bit, throwing a couple of images in there to see how well it can do with this stuff. It was hard for me to assess exactly how well it was doing, but it was clear that it was doing at least somewhat well. It was not totally failing, I can say that. So I wonder how you think about the future of this space and the role of specialist systems like the ones you've been building, versus the other argument, which would be that the bitter lesson wins again and the mega models are just the best at everything. How do you see that playing out, as far as your crystal ball will allow you to predict?
Alan Cowen (1:00:47) Yeah. I mean, we build on top of the mega models. Every time a mega model comes out, we're really excited, because our whole infrastructure allows us to layer it into the cake really easily. We do the last, well, it's not like the last mile, it's more like the last million miles of training on top, or whatever we need to do to utilize it. Just to make it concrete: GPT-4V captures some aspects of expression, but if you actually test it, it's not always right. It's really biased by gender and other attributes. They've maybe made changes to try to correct for that, but then you end up with lower-fidelity predictions. That's because it's trained on images that happen to be interleaved with text on the internet, and the internet comments on different kinds of people in different ways, sometimes pernicious ways. It's not trained on people in dialogue, on actual social interactions. That's what we do. There's only so much dialogue that exists, and we also have more dialogue that comes in from users. Our specialty is training on all of that. In order to do that well, you take these foundation models that already understand language, and you take our measurement models, which are also trained on audio foundation models and vision foundation models, using our controlled data to extract the dimensions of expression from them. Then we interleave those metrics of expression with language, and that is trained on all the dialogue we can get. It's a different part of the space, but it's still taking advantage of unsupervised training. So we've learned from the bitter lesson: we're not hand-tuning any features. We're only doing things that scale. They scale in different ways. There's the scaling of expression-language models that predict expression from language and vice versa. And then there's the scaling of the reinforcement learning part, which is usually thought of as this hand-engineered thing that doesn't scale, but we make it scale, because we can use people's reactions in video as reinforcement for the language model. It's producing language that people are reacting to. We can do that over minutes or hours, and if we have users who are longitudinally using the product, we can do it over days. Right? And we can say: produce language that makes people happier, produce the answers that make people more satisfied versus frustrated, produce the advice that makes people have better lives. We can do that reinforcement learning at scale with implicit feedback. So we're actually correcting the way that's done. I think people will learn that the bitter lesson of RLHF is that it doesn't scale.
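To make the "implicit feedback as reinforcement" idea concrete, here is a loose sketch of turning measured expression reactions into a scalar reward that a standard RLHF-style trainer could consume. The dimension names and weights are invented for illustration, and this is not Hume's actual reward model.

```python
# Illustrative sketch: convert measured expression reactions (over the window
# following a model response) into a scalar reward for reinforcement learning.

from typing import Dict, List

# Hypothetical weights over expression dimensions: positive reactions up-weight,
# negative reactions down-weight the response that preceded them.
REWARD_WEIGHTS: Dict[str, float] = {
    "amusement": 1.0,
    "contentment": 1.0,
    "interest": 0.5,
    "frustration": -1.0,
    "confusion": -0.5,
}

def implicit_reward(reaction_frames: List[Dict[str, float]]) -> float:
    """Average a weighted sum of expression scores over the reaction window."""
    if not reaction_frames:
        return 0.0
    per_frame = [
        sum(REWARD_WEIGHTS.get(dim, 0.0) * score for dim, score in frame.items())
        for frame in reaction_frames
    ]
    return sum(per_frame) / len(per_frame)

# Toy usage: reactions measured once per second for a few seconds after a reply.
frames = [
    {"amusement": 0.1, "frustration": 0.6, "confusion": 0.3},
    {"amusement": 0.2, "frustration": 0.4, "confusion": 0.2},
    {"contentment": 0.5, "interest": 0.4},
]
print(f"Scalar reward for this response: {implicit_reward(frames):.2f}")
# This scalar could then be plugged into a PPO/RLHF-style update in place of
# an explicit human preference label.
```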
Nathan Labenz (1:03:46) Yeah. It's a huge challenge, and I have experienced a little bit of the weirdness of GPT-4V on images, just putting in a couple of photos of myself and my kid and asking, which one should I send to my wife? I'll include one where we look good and another where we don't look so good, one of those in-between moments with a weird expression or whatever. And you really have to coax it to get it to answer that question. It really doesn't want to tell you that you don't look good in one of your pictures. So there's also that aspect of RLHF: it's dialed in for a more controlled and hopefully safer experience, but that is not always consistent with giving straight talk and accurate assessments of things. You can see that just in ChatGPT. So, zooming out here, you mentioned people using a product longitudinally and expanding the time horizon. In the time we have left, I would love to hear how you think about this as we zoom out. It's one thing to say we can predict how well this call is going and optimize our business processes around that. That makes total sense, and I'm sure business is booming as people say, oh my god, this is something I can do, how much does it cost and how do I get started? But then I start to think, okay, now we start to really train models on our emotional responses, and I think we should definitely continue to expect more powerful mega models. Gemini was announced, not released, today, and it's surpassing GPT-4, narrowly, in many ways. I don't think we're at the end of history there, and these things are just going to become a much, much bigger part of our lives. So I wonder, where is this all headed? How do we go from the frame-by-frame "you look unhappy and we want to change our business process to avoid that" to a world where we have long relationships with AI systems, and who's optimizing whom starts to become a real question? I know you've put a lot of thought into that. I'd love to hear how you're thinking about it, because it's certainly an intimidating challenge from my perspective.
Alan Cowen (1:06:24) I mean, that's the motivation for all of what we do: at the end of the day, we want to optimize for people to have more positive emotions in their daily lives, right? And that is how AI should be optimized. AI should be optimized for our interests. Our interests are to flourish as human beings, so it should be optimized for human flourishing. So how do you do that? I'll talk first about unintentional misalignment, and then we'll talk about intentional misalignment. From the perspective of unintentional misalignment, you're asking a model to do something and it does it, it actually satisfies your goal, but it does it in a way you don't want. The reason you don't want what it did is that it did it in a way that was actually harmful to you, right? It didn't take into account a background understanding of what humans actually seek out, what makes them happy, which should override other, contrary objectives. That's what humans do naturally. When we ask another human for something, they consider all of this preexisting knowledge that they have about what makes humans happy, and that influences how they carry out your request. But if the AI doesn't have that background knowledge, it could do something regrettable. So our goal is to have the AI first understand how its decision will affect your well-being next week, next month, next year. Then look at your goal. Your goal might be to affect something next week: fill my bank account, make sure my bank account is as full as possible next week. It should be able to weigh the different instrumental sub-goals it develops en route to that end goal based on how they affect your well-being and the well-being of other humans and creatures on earth, right? Choose a means of carrying out the goal that is not contrary to people's well-being. That's really the goal. In order to do that, it needs to be able to predict well-being as well as it can predict how full your bank account will be as a function of whatever decisions it can make. Then it needs to weigh well-being above any other objective. It needs to actually override the objective that you give it. And that also carries over to when something is intentionally bad. It should still use people's well-being to inhibit your ability to ask it to do something intentionally bad, and it should be able to say, actually, I can't do that because it would hurt these people, and be able to explain why it doesn't want to do that. When you put those two together, that is basically the solution to the alignment problem. It just has to do with optimizing longitudinally for well-being. That is ultimately the solution.
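As a toy illustration of the decision rule Alan describes, weighing predicted well-being above the stated objective and letting it veto harmful plans, here is a sketch in which every plan, number, and threshold is made up:

```python
# Toy illustration of weighing predicted well-being above the stated objective:
# plans predicted to harm well-being get vetoed outright; among the rest,
# well-being dominates the ranking.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Plan:
    name: str
    objective_gain: float      # e.g. predicted increase in the bank balance
    wellbeing_impact: float    # predicted effect on the user's long-run well-being

WELLBEING_VETO = -0.1          # plans below this predicted impact are refused
WELLBEING_WEIGHT = 10.0        # well-being dominates the trade-off

def choose_plan(plans: List[Plan]) -> Optional[Plan]:
    admissible = [p for p in plans if p.wellbeing_impact >= WELLBEING_VETO]
    if not admissible:
        return None            # "I can't do that, and here's who it would hurt."
    return max(admissible,
               key=lambda p: WELLBEING_WEIGHT * p.wellbeing_impact + p.objective_gain)

plans = [
    Plan("aggressive day-trading with rent money", objective_gain=5.0, wellbeing_impact=-0.8),
    Plan("automate savings transfers",             objective_gain=1.0, wellbeing_impact=0.3),
    Plan("pick up extra overnight shifts",         objective_gain=2.0, wellbeing_impact=-0.05),
]
print(choose_plan(plans).name)   # -> "automate savings transfers"
```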
Nathan Labenz (1:09:21) I've been struck over the last year and change that the power of the mega models seems to be growing at a faster rate than our ability to control their behavior, or our insight into how to even specify goals. And maybe I would extend that to the growth of the long-time-horizon datasets we would need to inform these sorts of long-time-horizon predictions. Do you think we are in a place right now where we need more time to figure this stuff out? How do you feel about the relative pace of capabilities progress versus the pace of what you might call the wisdom of AI systems?
Alan Cowen (1:10:12) So, I mean, I'm very much a person who's worried about the capabilities of these systems growing at a rate that exceeds our ability to align them with human interests. I'm very worried about that. But I actually think, for various reasons, the way things have progressed recently is generally good. The AI race has resulted in people publishing less of the core ideas you need to expand the capabilities of these models. That's actually slowed things down, because now OpenAI doesn't see what Google is coming up with, its best ideas for improving the model, and Google doesn't see what OpenAI is coming up with. And so things have actually slowed down, which is why Gemini is not that much better than GPT-4, and GPT-4 has been the best model for over a year, right? So that's encouraging in some ways. Now, in terms of training these models and actually solving the alignment problem, which I think is extremely urgent: you mentioned that you need longitudinal data, and I totally agree with that. In practice, that does inhibit a lot of our ability to test whether these models are actually good when you release them into the environment, because you need to know, over the course of years, is this going to improve somebody's well-being? That's really the bottom line, the gold standard. However, you can get around that to some extent by taking all the longitudinal data that already exists and reframing the problem as understanding people's emotional affordances. When I come to you with a request, it can be reconceived as an emotional affordance, an opportunity for you to improve my well-being by carrying out the request. And you can say, I will do that because it improves your well-being. So if you're optimized for my well-being, you'll carry out the request insofar as the request is consistent with my well-being. All of that can be learned. Things about people's emotional affordances given their context can be learned from existing data of human interactions. And so the models should try to learn as much of that as possible before even having to learn on the job. Hume is designing ways for the model to be, first, optimized to the degree possible to understand emotional affordances in the existing human-interaction data we have, and then to continue learning on the job, testing out the theories it has learned about how it can expand upon affordances for positive emotion in your environment, act upon those affordances, and avoid the things that will cause negative emotions. You can actually test that in practice when it's deployed, and it should keep learning in practice, but already with a preexisting knowledge that gets it to well beyond human level. It should be way better than any human on earth at understanding emotional affordances just from the start, because it's already learned from way more data than any human on earth has seen that informs our understanding of emotional affordances in everyday interactions. So that's the starting point you can get to, and that's where we're trying to go. Then, once we release these AI systems into the environment, they have to be able to continuously learn. That's going to be a challenge. But hopefully the starting point for understanding emotional affordances is higher than for any other capability.
The ideal is that it is better at predicting how an action affects human well-being than at predicting the degree to which it will result in your bank account having a higher number, things like that. There's no reason why that couldn't be the case, because predicting well-being is a matter of predicting people's emotional expressions, their reactions, their states over long timescales, and that data exists just as much as the bank account data exists. I think there's a path forward. I'm very much worried about it, but I'm also optimistic that we can solve the problem.
Nathan Labenz (1:14:04) Well, that's a great note to end on. I love the optimism, and I love the fact that you are attacking that problem in such a head-on and also such a unique way. I really haven't seen much else like this, and I really appreciate the depth and the rigor of the thought that you and the team are putting into it. So with that, I will say, Alan Cowen of Hume AI, thank you for being part of The Cognitive Revolution.
Alan Cowen (1:14:34) Great. Thanks for having me.
Nathan Labenz (1:14:36) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.