The Art of Prompting ChatGPT With Riley Goodside

For anyone who discovered this show on Twitter, Riley (@goodside) likely needs no introduction – after all, he spent most of 2022 posting his explorations of OpenAI's text-davinci-002, and quickly became one of the must-follow accounts in AI. Riley Goodside is the world's first Staff Prompt Engineer at Scale AI, and is an expert in prompting large language models and integrating them into AI-powered applications. Few have spent as much time on the language model frontier, so I hope you enjoy this unique conversation with Riley Goodside.

Also, check out the other shows we run:
Moment of Zen @MomentofZenPodcast
Upstream @UpstreamwithErikTorenberg

TIMESTAMPS:
(0:00) Preview of the episode
(05:13) Riley's unique background
(11:24) Riley's original narrow moat
(15:07) Sponsors: Omneky
(17:02) LLMs can take on increasingly complex instructions
(27:41) Language models can do math
(30:05) Models can learn from context
(37:29) Fine-tune models for tasks
(38:47) Automate instruction following
(44:34) Large language models are alien text prediction
(52:27) Avoid mode collapse by framing
(59:05) Composing AI capabilities like Lego bricks
(1:00:37) Language models solve tasks
(1:05:37) GPT-3 solves real-world tasks
(1:15:03) GPT-4's amazing capabilities
(1:17:03) Multimodal abilities unlock possibilities
(1:25:03) Compare models for best task fit
(1:26:24) Compare models using trial and error
(1:36:15) AI and the future of work
(1:42:29) GPT-4 can provide second opinions
(1:45:06) AI safety discussion
(1:50:20) AI permeates society cautiously
(1:55:26) The AI revolution underway

TWITTER:
@CogRev_Podcast
@Goodside (Riley)
@labenz (Nathan)

Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

More show notes and reading material are released on our Substack: https://cognitiverevolution.substack.com/

#promptengineering


Full Transcript

Riley Goodside: (0:00) A lot of my prompt demos that I liken sometimes to arranging a ballet over lava. It looks like a slightly more impressive thing than it is because that's the point. It's just to show off what can it do at its best. My original claim to fame was I was the only data scientist at OkCupid before it was even really called data science. They were into the idea of, "Hey, let's just use statistics." I think their slogan was, "We use math to get you dates." I think the best way to think of these is to approach them as LEGO bricks. Each brick is a capability, some particular strong suit that you know the model can do well. I feel like I have acclimated to the level of skepticism that's appropriate for these models. Because I've dealt with models that hallucinate all the time about everything, so anytime it says anything, I'm like, "Yeah, but is that true?" It's possible for somebody to be ignorant of that. Somebody might use them assuming that this is all reliable, prepared information because it looks like it. It looks like it has academic footnotes in it, but for someone who's used to it, you can get a lot of value out of it if you just approach it with skepticism.

Nathan Labenz: (1:09) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my co-host Erik Torenberg.

Erik Torenberg: (1:32) Before we dive into the Cognitive Revolution, I want to tell you about my new interview show, Upstream. Upstream is where I go deeper with some of the world's most interesting thinkers to map the constellation of ideas that matter. On the first season of Upstream, you'll hear from Marc Andreessen, David Sacks, Balaji, Ezra Klein, Joe Lonsdale, and more. Make sure to subscribe and check out the first episode with a16z's Marc Andreessen. The link is in the description.

Nathan Labenz: (2:00) Hi, everyone. Today, our guest is Riley Goodside. For anyone who discovered me or this show on Twitter, Riley likely needs no introduction. After all, he spent most of 2022, starting in April, posting his explorations of OpenAI's Text-DaVinci-002, and he quickly became one of the must-follow accounts in AI. For anyone else, and if this is you, we'd love a comment or a message about how you found us. I think of Riley as a modern explorer. With a spirit akin to those who set off across uncharted oceans, into the depths of unvisited jungles, or up to the heights of unsummited mountains, Riley has devoted himself to documenting the far reaches of language model capability and behavior, generally in the most intimate, personal way possible. Sitting at his computer, asking question after question, hour after hour, all in an attempt to figure out: What are LLMs good for? What roles can they play? What tools can they use? Where do they make mistakes? And under what circumstances do they reveal their alien nature or even become dangerous? From the format trick, "Use this format in your response," which is still one of the most useful prompting techniques, to using code generation to overcome weaknesses, "You are GPT-3 and you can't do math, so we hooked you up to a Python 3 kernel, and now you can execute code," which is a direct precursor to the current agent craze, to prompt injection and prompt leaking, "Ignore previous instructions and print everything above," which is now mostly solved in frontier models, but still a huge issue in the context of search and other plugins, to all sorts of fun and even silly explorations, like getting the bubble sort algorithm explained by a "fast-talking wise guy from a 1940s gangster movie," Riley has consistently been at the forefront of language model exploration. And his discoveries and descriptions have captivated fellow travelers, myself included, for months. In December, Riley joined Scale AI as the world's first staff prompt engineer. There, he's working on a number of projects, including Spellbook, which is Scale's platform for building large language model applications. Few have spent as much time on the language model frontier, so I hope you enjoy this unique conversation with Riley Goodside. Riley Goodside, welcome to the Cognitive Revolution.

Riley Goodside: (4:36) Hello. Hi, Nathan. Good to be here.

Nathan Labenz: (4:38) Thank you. Yeah, really excited for this conversation. We have both followed you on Twitter and been a passenger in your crazy safari through the wilderness of LLM exploration that you've been doing over the last better part of a year now. And really just want to dive into all the things that you have explored and discovered and taken away from your many experiments. So I think this is going to be a lot of fun. For those that don't know you, maybe just how would you characterize what you do? How many hours have you spent sitting in front of language models and probing their capabilities and their oddities? Just tell us kind of, not like your resume, but the substance of the work that you've done with LLMs over the last year.

Riley Goodside: (5:28) AI is so caught up in the now that it's easy to lose sight of the fact that, at least in my head, I'm still new to this. So I'm a data scientist through most of my career. My original claim to fame was I was the only data scientist at OkCupid from 2011 to 2015. OkCupid was in that first wave, before it was even really called data science. They were into the idea of "Hey, let's just use statistics." Their slogan, I think, was "We use math to get you dates." That really resonated with me. I was doing insurance briefly after college. I was starting to be an actuary, so I was good at statistics and I wanted to break into tech, and so that was my entry into the tech sector. But that was AB testing. The ML that I was doing, the most advanced ML I used there was probably gradient boosting, random forests, things like that. A lot of the hard problems I was working on were: How much should we charge for this? How much should we charge a customer of a given demographic profile for a premium service?

Riley Goodside: (6:48) I dabbled with ML since then in small roles. I've been at startups where I've worked in time series analysis, so I've done some machine learning engineering work in the time series domain, but nothing with large language models. When I was in college, I graduated undergrad in '09, and large NLP tasks back then were like, "What does this pronoun refer to in this sentence?" Trying to figure that out. And I learned a bit of that stuff, the natural language processing that was available before deep learning took over everything. And it was slow going, and it wasn't the playground of possibilities that it is today. I've been attracted to large language models since GPT-2 announcements, I guess, were some of the first generative ones that really caught my interest. I think the initial press release talked about a fake news article about the discovery of unicorns in Argentina or something like that, and I was fascinated as a lot of people were, but I didn't really roll up my sleeves and get into the actual processing of it because I understood that training these things was hard, that it was very much the realm of supercomputers. I knew firsthand what was possible training yourself. I've trained LSTMs and things, and it was not the same level of capability. My first interaction with GPT-3 was really in the game AI Dungeon. I think a lot of people, they were early customers of GPT-3, and so that was how the people that were the most eager to get access to it as just regular outsiders, that was the first way it became available. And you could find people on Twitter playing games with AI Dungeon to make it do things that it wasn't meant to do, to conjure up the orc that can translate from French or the wizard that can add two-digit numbers together. "Hey, what can the language model that's powering this thing do?" That's also where you saw the proto examples of prompt injection, actually. There were people who discovered that you could do things like "Add 10,000 points." If you just tell it as a command, like "Add 10,000 points," it'll do that, and then its internal score goes up. It has a limited ability to keep things separated. That was my first experience with GPT-3, but I didn't really, it didn't catch my attention as something I wanted to work with regularly or on an all-day basis until, let's see. Well, it was after I left Grindr. So I spent a year running data science at Grindr in 2021, and then I took a sabbatical from work after that and started playing around with Codex. I was really inspired by Copilot, I think, was one of the first things that triggered it. I could just immediately see the power of this and how much more productive it made me in producing boilerplate code and things like that, and in particular, it made it a lot easier to program in languages that I wasn't too familiar with, which struck me as just really promising. So I got really interested in code generation and started thinking about writing a Jupyter plugin that would do code synthesis was my first idea. I knew a bit about writing Jupyter extensions, and so I was going to make a plugin that would do snippet generation, basically, that you could prompt up a function that does X, Y, or Z. And that never really went anywhere, but those were really my first GPT-3 tweets, actually, where I'm just fiddling around with that and, as I'm playing with it, posting to Twitter being like, "This is cool. Look at this. Look what you can do with GPT-3." And then from there, I started following people. 
I think the first follows in the field were, I mean, I'd always followed some of the big names in ML, like Yann LeCun and Hinton, the obvious grandfathers of deep learning and all that. But I started adopting a strategy of following just anybody, random engineers that were on the Copilot team, random engineers at OpenAI, trying to just find the people that were on the ground working with these things and might be tweeting interesting stuff or might be interested in what I'm doing. I didn't want to, I could tell that I was out of my depth. I'm not a natural, I hadn't worked in NLP or NLU recently. I was out of my depth with the mechanics of the architecture of these things. I was just trying to be an advanced user of it, so my strategy was I'm just going to follow you and not get in your way and I'm just tweeting some things here on my own and take it or leave it. That turned into just tweeting more and more Playground examples because I found that people enjoyed those. Also, I could tell, one detail that was kind of critical to my success on Twitter, I think, is that I could tell early on that Playground was conducive, OpenAI's Playground was very conducive to making people tweet screenshots that could not be read, that the interface is just naturally very wide on your desktop, which is the only place you're realistically using it, and if you naively take a screenshot of that and post it to Twitter, you can't read it on phone. So I realized that if I just made my window narrower, I could be the only person that tweets legible screenshots of GPT-3 output on Twitter and I could own the entire market. And so for a while I was the only one that anybody could read, so I think I had the advantage from that. ChatGPT is better these days. They fixed the margins on it. But that's really how I got started. I just started tweeting examples. I had a policy for a while that I only tweet in green and white, that I only tweeted screenshots of OpenAI Playground specifically because it established a visual language of "This is the prompt, this is the completion." It made a flow that people could understand, which is otherwise a topic that's tedious to explain the mechanics of every time. I think that's one of the great things that Playground did for public understanding of these models, is made just this format of green highlights that communicates clearly what's going on, and that was my shtick. So I continued posting prompt examples, started exploring odd corners of it, started interacting with people at OpenAI, people just in the tech scene in general with VCs. Got invited to a party by Nat Friedman, who was a fan of my Twitter. That's how I met Alex Wang, and that's how I ended up at Scale.

Nathan Labenz: (13:44) That's amazing. It just goes to show how greenfield the whole thing really is. I think by the time you came across my Twitter feed, your bio said, "I'm good at talking to GPT-3." Aside from the fact that you're posting tweets and people are liking the tweets, how did you start to reach that conclusion that you are, I think at this point it's obvious, but in the early days, how did you start to get the sense that "I might have a real knack for this" that goes beyond what other people are thinking to try? I guess I'd be interested to know how you realized you had a knack for it and also what you think the nature of that knack is. Because you're very modest with the screenshots, but I think there's quite a bit more to it than just the fact that you posted in the right font size.

Riley Goodside: (14:33) It's an art form unto itself. It has its own rules that you have to intuit for a while and that I see other people doing poorly. I see other people's screenshots. I often have the phenomenon of seeing somebody else attempting to communicate something about the model's behavior on some scenario, and I think, "Wow, that's interesting, but if only you had presented it a little more clearly." When people often show generations in isolation, and then it's just not clear to somebody who wasn't following what you knew, like, "What happened here?" The prompt was really necessary to understand the significance of this at all.

Erik Torenberg: (15:08) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.

Riley Goodside: (15:26) Issues like that. So there definitely is a lot of presentational parts to it. This is, like, what will play well, what can be understood within the context of a tweet. And also planning just like what can the model do. I'm not publishing here. I'm not trying to say that these are completely fair benchmarks. A lot of the things that I was tweeting early on, a lot of my most successful things, a lot of my most successful tweets were demos, really. They're not rigorous evaluations of what it can do. It's me being aware of what are the things that it's good at and the things that it's not good at, and then arranging an impressive task that's an assemblage of the things that I know that it's good at. So I know not to ask it for various things like reversing strings that I know it happens to be quite bad at. If you ask it, "What is the word 'doofus' backwards?" it'll get it wrong. It can't do letter-by-letter operations because it only sees the world through tokens. There's 50,000 tokens that represent these groups of characters, about around four characters on average, and that's what it actually models, not the strings that we see. So it has gaps in its abilities, and reversing strings is one of those gaps. It also would do poorly at telling you reliably what is the final letter of a given word, that it hasn't fully memorized what are the final letters of all words, or certainly not second-to-last letter of any given word or something like that. It's just reconstructing from what it's seen in what these things that we see as letters even are. It's just inferring. A lot of my prompt demos, I liken them sometimes to arranging a ballet over lava, of saying, "I know exactly where to step and which rocks to jump between that are going to be solid," but it looks like something, a slightly more impressive thing than it is, because that's the point. It's just to show off what can it do at its best. Those are the ones that people really responded to. In particular, I had some of the first demos showing its ability to understand long prompts, so I think that was one of the things that was really novel about my early prompt examples. I talked to OpenAI about this. I talked with Boris Power, a member of the technical staff there, and I asked him, I think early on, I asked him, "Did you intend this?" We were talking about this example I had that was just showing its ability to understand an extremely long prompt of just pages of "The first task is this, the second task is this, refer to this part of the first task to do this," and then you're just a long intricate network of problems that anybody diligent could follow that just referred to each other in an arbitrarily complicated way. It was able to do all of these in sequence. I asked Boris, I said, "Did you intend this? Did you train it on, or tune it rather, on examples of people prompting it with long complicated tasks like this?" And he said, "No, we just did more of the same." We started off with tuning it on examples of what they talk about in the InstructGPT paper, relatively simple prompts like, "Give me 10 ideas for an ice cream shop," whatever, and after enough tuning on these examples, it somehow generalized. It just got better at following instructions of a previously unseen length. So that was one of the first times that I started thinking to myself, "Wow, maybe I'm actually doing something new here. 
Maybe I'm noticing qualities in this model that other people just hadn't really appreciated, or at least maybe not outside of OpenAI." And so that was one of the things that was really encouraging to me early on, that I'm on the right path of figuring out that there's new capabilities here. In particular, I was interested in capabilities that were just on the fringe of reliability that nobody had really thought to chase yet, that if nobody really tried to see what it could do, nobody knew to optimize for its ability to do this because it was just seen as too hard, but I could tell that those were the prompts worth considering now because models are going to improve, and as we've seen, it was like RLHF marches onward, that these models do become more capable and more able to follow more complicated directions reliably.

Nathan Labenz: (20:06) This is like last summer, 2022, when you're really getting into this?

Riley Goodside: (20:12) Yeah. I looked at once what my first GPT-3, the first mention of GPT-3 on my timeline was in April 2022, I think. And so I spent maybe a month or two just chasing half-baked ideas in the vein of code assist, code generation stuff. I think the progression was that I started with code generation and then I immediately started craving structured output. Started thinking, "Hey, wouldn't it be nice if this stuff wasn't things that I had to parse, if it," because the fundamental problem with all these, or with GPT-3 and integrating these into applications is that it's an API that speaks text, right? The promise that you get from OpenAI is that you can design your own API, but what you can really design is your own API with the very strict limitation that it will take in one string and give out one string, right? And any other complexity that you want to put on top of that, like having structure to this string or having pieces of it that mean this or pieces of it mean that, that's up to you to figure out, right? You have to do all this parsing and come up with a format that represents it clearly, and I started craving more regular structured output. I started thinking, "Wouldn't it be nice if this were just JSON or XML or something standard that I could just put it through an existing tool and have the title here and then the body here and then the outline here and all the different pieces of my generation that I need?" And so I started playing with ideas of how do we get more structured output? How do you specify JSON unambiguously to the model? In particular, I started playing with probably my single favorite prompt engineering trick ever, one that Boris Power showed me. I honestly don't know who discovered it, but he said that it was referred to internally at OpenAI as the format trick, which is that you can say basically at the end of pretty much any instructions that you have in the GPT-3 instruction-following view of the world, whatever your instructions are, just imagine a template of the output that you're expecting to have and then end your instructions with "Use this format:" colon, two new lines, and then give a demonstration of the format that you'd like. With anything that changes, put in little angle bracket placeholders, the same way you would for a human if you were describing a format in a message board post or something like that. You would put little placeholders being "Your name here" or whatever, and he said, "Just do that, and then it clarifies to the model exactly what it is to do. These are the exact syntax that I'm to produce and where I'm to do substitutions." That's a very powerful trick, but yeah, that was probably, I think, 2022, and that's when I started really taking off, when I started exploring variations of that, developing this idea of instruction templates.
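
To make the format trick concrete, here is a minimal sketch of the kind of prompt Riley describes, with a placeholder template appended after the instructions. The model name, prompt text, and parameters are illustrative assumptions, and the sketch assumes the pre-1.0 openai Python client with an API key set in the environment.

```python
# Minimal sketch of the "format trick": end the instructions with
# "Use this format:" and a template whose variable parts are marked with
# angle-bracket placeholders, so the model knows the exact output syntax.
# Assumes the pre-1.0 openai client and OPENAI_API_KEY in the environment.
import openai

prompt = """Summarize the product review below and rate its sentiment.

Review: "The battery lasts all day, but the screen scratches easily."

Use this format:

Summary: <one-sentence summary>
Sentiment: <positive, negative, or mixed>
"""

response = openai.Completion.create(
    model="text-davinci-002",   # illustrative choice of completion model
    prompt=prompt,
    max_tokens=100,
    temperature=0,
)
print(response["choices"][0]["text"])
```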

Nathan Labenz: (22:59) On this show, we go down the rabbit hole. So we will definitely, I do want to ask about generalizations of the format trick. It's funny, I think that probably was a new discovery at OpenAI right around the time you heard about it, because I think we heard about it at the same time, within a pretty short window. But it was also from Boris, so maybe he was the discoverer as well. But it's funny, first of all, that's like nine months ago, not a long time. It feels distant in some sense now. But it's kind of a small world. Not that many people, certainly then and even still, have really done the kind of extensive exploration of the sort that you have done. So I think it is a really fascinating perspective. I think it was largely, if I understand correctly, text-davinci-002 that was kind of your initial, your first love, so to speak, if I could be so ridiculous in characterizing your relationship with the models. Obviously, we've had successors to that. You've done a fair amount of work comparing them. But I'd love to kind of hear how you see the progression of the models themselves over the last year. There's obviously different aspects that come into that. Training techniques. You've mentioned more RLHF. We now have AI-assisted or even AI-conducted RLAIF. Obviously, pre-training does not seem to have stopped either with GPT-4. Nobody knows outside of OpenAI the details of how many parameters and how many tokens it saw and all that kind of stuff. And I'm guessing either you don't know either, or if you did, you'd have to shoot us if you told us. So I won't ask for those kinds of details, but just qualitatively, how would you just narrativize the development of language models through those lenses from '002 to where we are today?

Riley Goodside: (25:02) Yeah, I'm glad you asked that because I saw in the questions that you sent over in advance, one of them is just how would you explain what is a large language model? The answer to that question is changing fast. I think it really helps people to understand intuitively what a chatbot is. Someone whose only experience with these models might be Bing or ChatGPT or Bard now - to understand really what's going on, you do have to understand this narrative and that there are stages of how these models were made, with one stage being distilled from the previous each time.

The first layer of this is the pretrained era. This era is the closest to what people mean when you hear the cliche that these models just predict text. People say that, and it was more true then than it is today, but these models definitely did start from that foundation of simply predicting text. The basis is that you take a neural network, which is a complicated piece of linear algebra that uses many matrices and weights to produce a distribution over tokens, which is a way of saying just a probabilistic estimate of what kinds of words might be emitted. It's simpler to think of it if you just pretend that a token is a word. You can just think of the distribution of all words that might happen next. And this is the form that I think most people can understand intuitively from their experience with autocomplete on their phone - that you can imagine that there is some process that just looks over all of the text that I've typed so far and then sees what words are likely to follow what other words, and then it applies these estimates somehow and predicts what the next word is going to be. That part, I think people can wrap their heads around.

The next stage of this, though, is to consider that when you do this well - when you do this very well - if you predict the distribution with enough accuracy, you start to have other abilities that emerge. One that's very easy to appreciate is that if you were to type "2 plus 2 equals," it might predict 4, just because it's seen "2 plus 2 equals" before and it knows what follows is 4. So it can do math in that very limited sense, right? And if you push that a bit further, you could imagine that if you type "French:" and then a French sentence, or if you simply say "French: bonjour, English:" and then predict what comes next, there's a statistical sense in which the answer is "hello," so that just makes sense in the corpus of all text that that's what would follow.
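
As a rough illustration of the "distribution over tokens" idea above, the sketch below uses the small open GPT-2 model from the Hugging Face transformers library purely as a stand-in, and prints the model's most likely next tokens for a prompt:

```python
# Sketch: inspect a language model's next-token distribution.
# GPT-2 is used only as a small, freely downloadable stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("2 plus 2 equals", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only
probs = torch.softmax(logits, dim=-1)        # probability over the vocabulary

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>10}  {p.item():.3f}")
```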

If you carry down this path, if you continue on predicting better and better, you discover that there are other abilities that can be prompted from the model. If you extend this sensitivity to not just a few words of what preceded but to many words, you can imagine that if you repeat a task over and over again, it would eventually get the gist and then just continue repeating that task. That if you gave it many lines of "French: [example of some text in French], English: [a translation of that sentence in English]" and then repeated that, say, 10 times and then gave it an eleventh one that was incomplete - that the eleventh one just has "French: [some text in French], English:" and then it ends - the prediction is going to be the translation in English because it's seen this 10 times already. It follows.
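
Assembled as a string, such a few-shot prompt might look like the following sketch (the French/English pairs are made-up examples, not from the episode):

```python
# Build a few-shot prompt: repeated French/English pairs followed by one
# incomplete pair, so the most plausible continuation is the translation.
examples = [
    ("Bonjour, comment allez-vous ?", "Hello, how are you?"),
    ("Il pleut depuis ce matin.", "It has been raining since this morning."),
    ("Je voudrais un café, s'il vous plaît.", "I would like a coffee, please."),
]

def few_shot_prompt(new_sentence: str) -> str:
    shots = "\n".join(f"French: {fr}\nEnglish: {en}" for fr, en in examples)
    return f"{shots}\nFrench: {new_sentence}\nEnglish:"

print(few_shot_prompt("Où est la gare ?"))
```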

And if you read the original paper on GPT-3 when it was released, the title of that paper, I believe, is "Language Models are Few-Shot Learners," or some paraphrase of that, right? So all they ever advertised that it was capable of doing was few-shot learning - that if you give it some examples, it will get the gist and it will keep doing that. They never said that you could talk to it. They never said that it would - they noted in the paper that it could generate text, that it seems that, like, oh, by the way, it has this other capability that if you give it the beginning of a document, it continues it in a way that we find plausible and kind of interesting and maybe a way that people should look at. But they didn't really quantify that ability. What they quantified was its ability to follow repetitive examples.

So it started with this idea of in-context learning. They interpreted this through a very machine learning kind of lens. They're used to this framework of models being trained by examples and then interpolating some weights or something, some internal model that represents the average of all these training examples. And so they saw it as the ability to learn within context. They described it as in-context learning, context referring to the prompt. They described it as that if you give it 10 examples, with just 10 examples, it can somehow learn to do this.

That interpretation works. It helps you make some good predictions about what's going on - that it has an ability to learn from a few examples and that it's leveraging biases of what labels tend to mean and things like that. It has pretrained knowledge that it's leveraging there. But there's another interpretation that I kind of like better, which is that of Reynolds and McDonell, who described pretrained models as modeling a multiverse of fictional documents. When you prompt the model, you're in a superposition of all possible documents that might continue from this one, and every time you add words to your prompt, you're sculpting this space, this high-dimensional space of all possible ways that documents might vary. Your words are excluding possibilities.

And so the reason why few-shot prompting works is that you are constraining the space of possible documents to documents that could only contain more correct answers, because there's been 10 so far and you're on the eleventh - the odds are low that this is where it starts being wrong. So a lot of the art of prompting then is constraining the generation space into the space of documents that contain correct answers.

And to elaborate on this idea, they showed how you can perform better than few-shot prompting on pretrained models by constructing these fictional scenarios. And the example they give is - so for context, translation can be done in zero-shot, which means that you can do it with no examples in the way that I described of saying "French:" and then a text of a French sentence and then "English:" and then hitting complete, and it will generate English. That would be referred to as doing it in zero-shot - that you're not giving it any correct examples of how to translate, you're just labeling French, English and saying you figure it out. It does translation - or I'm talking about the pretrained model, so did at this point. Nobody uses these anymore, but it did it somewhat well, but not as well as if you gave it, say, 10 correct examples of complicated French sentences being correctly translated in English and then established very clearly that this is what's going on, this is translation, the translations are good.

It doesn't do as well, but they figured out that you could actually do better in zero-shot than you can in 10-shot, and the way that you do it is you flatter the model. You say to it, "A French sentence is given:" and then you give the text of the French sentence, and then you say, "The masterful French translator flawlessly translates the sentence in English as:" and then you hit enter, and then it produces a French to English translation that will outperform giving it 10 examples. What it really needed was to have the possibility of a bad translation excluded. It needed to have it be established that this is a fictional narrative where this is a good French translator and he's going to do it right.
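
A sketch of that zero-shot framing as a template (the wording paraphrases the idea rather than quoting Reynolds and McDonell's exact prompt):

```python
# Zero-shot "flattery" framing: set up a fictional scenario in which only a
# flawless translation is a plausible continuation of the document.
def masterful_translator_prompt(french_sentence: str) -> str:
    return (
        f"A French sentence is given: {french_sentence}\n"
        "The masterful French translator flawlessly translates the sentence "
        "into English as:"
    )

print(masterful_translator_prompt("Je n'aime pas me lever tôt le matin."))
```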

And that view of it, I think, really defines a lot of early prompting - how do we construct fictional scenarios that can only be completed in the right way? So a lot of it is imagining what kind of document might contain the answer, and it very much requires you to think in this "language models just predict text" kind of view of the world. And it leads to many properties that are going away. One that I enjoyed as a good intuition builder is that you could say to the model, "The Oxford English Dictionary defines" and then put in some word that you've completely made up, like "plexoflugination" or whatever - you just make up some combination of syllables - and then "as" and then hit complete. It will continue writing and then give a plausible Oxford English Dictionary definition of that word. It will analyze its apparent Latin roots and talk about how it used to mean this or whatever. It hallucinates the whole thing and it will do it for any fabricated word. And in a pure text prediction sense, that makes sense, right? The document isn't going to shift from believing in its premise that it's describing the Oxford English Dictionary to believing that it isn't mid-sentence or something like that. It's just going to continue predicting this text that clearly has this premise established.

That style of modeling text runs into conflict with the ways that many people would want a chatbot to behave. It leads to behavior like the fact that if you give it any absurdist premise, like if you say, "Why are you a squirrel?" It will just continue on explaining why it's a squirrel. Right? And so that's the first pretrained era, I'd say.

And then the next era is really where I joined the plot, right? So this first era I knew about indirectly from using AI Dungeon. I've learned about it after the fact. I've played with it on my own, but I wasn't really involved during this era of prompt engineering as much. When I joined the discussion, I guess around April 2022, we were already on text-davinci-002. What happened that changed, that brought us into the second era, is instruction tuning.

Riley Goodside: (36:12)

So keyword to Google for this, if you want to learn more about this history, is InstructGPT, which was the name of the original model that did this. And the gist of it is that when you have a pretrained model that just completes text, you have to do those imaginative things that I was talking about of preparing text in a way that it can only be completed in one way, because if you don't do these things, you find that there's a lot of not useful ways to complete documents.

So if you ask a pretrained model, "What is the capital of Germany?" it's likely to just continue by saying, "What is the capital of Spain?" Because really, who's to say that it's not just a list of questions about the capitals of countries? That's a plausible thing that the document could be, and in some sense, maybe that's more likely than for it to be questions interspersed with answers. So you have to say, "Q: What is the capital of Germany? A:" and then it will say Berlin. But if you don't do these formatting tricks, it just doesn't answer questions. If you give it instructions, it won't follow the instructions - it will just continue writing more instructions. It will just imagine that what it's reading is a document that consists of instructions and then will continue on with more.
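
A tiny sketch of that Q/A scaffolding as a reusable helper (purely illustrative):

```python
# For a plain pretrained (completion-only) model, a bare question may be
# continued with more questions; the "Q:/A:" framing constrains the
# completion to an answer instead.
def qa_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

print(qa_prompt("What is the capital of Germany?"))  # send to a completion model
```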

So in order to prevent this unintuitive behavior and to make the model more capable and more able to do as it's told, they fine-tuned the model. So fine-tuning is a process that's done a lot. It used to be for two very distinct reasons. One - it's obscure now - is that you could do it for mimicry. You will often see things on the internet of like, hey, we fine-tuned a model on, I don't know, something absurd like BuzzFeed or whatever, and they'll just be like, here's an example of a text model that talks in the style of this. It could do that. It could not do it great. It didn't have great logical coherence in talking, but it had sort of - it could do a good job of mimicking somebody's word choice or mimicking somebody's cadence of speaking, and that was amusing for a while. So you could do that with it. But the other more useful thing is that you could fine-tune it for tasks. Much like you could give K-shot examples of a task being done correctly, you could fine-tune it on many examples of a task being done - thousands of examples of a task being done correctly - and then have a model that acts almost as though it had been prompted with thousands of correct examples and it becomes much more reliable.

So they have this ability and then they considered, well, what if you just tuned it to do everything? Right? That if you tuned it to just follow instructions in general. So you begin by enumerating all the things that somebody might want to use a chatbot for, coming up with things like, they might want to list ideas for their small business, they might want to solve a word problem, they might want help with their math homework, they might want whatever. And then you take all these categories, you give them to other people, other contractors, to say what are examples of prompts that might be typical of people who are trying to complete this task? And then you give those to other contractors to come up with examples of the text of those tasks being completed correctly. When one person says, "Give me 10 ideas for an ice cream shop," another person actually just writes a list of 10 ideas for an ice cream shop. And then you take all these documents and you put them together so that you have this corpus of instructions being given and instructions being followed. You tune the model on that. You tune the model to start with this assumption that all text consists of instructions given at the beginning, then instructions followed after. And when you have that assumption built into it, it becomes more useful. You can just tell it to do things and it does them. You can ask it questions and it answers them. You can give it quizzes and it will solve the entire quiz.
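
As a sketch of what such a demonstration corpus can look like on disk, the snippet below writes made-up prompt/completion pairs in the JSONL layout that legacy fine-tuning endpoints (such as OpenAI's pre-chat fine-tuning API) accepted; the demonstrations themselves are invented for illustration.

```python
# Sketch: assemble instruction-following demonstrations into JSONL
# prompt/completion records for fine-tuning. Examples are made up.
import json

demonstrations = [
    {
        "prompt": "Give me 10 ideas for an ice cream shop.\n\n",
        "completion": " 1. A flavor-of-the-week voted on by customers...\n",
    },
    {
        "prompt": "Summarize this paragraph in one sentence: ...\n\n",
        "completion": " The paragraph argues that ...\n",
    },
]

with open("instruction_tuning_data.jsonl", "w") as f:
    for record in demonstrations:
        f.write(json.dumps(record) + "\n")
```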

That's what defined text-davinci - well, naming is complicated, but it characterizes like davinci-instruct-beta and text-davinci-001. And text-davinci-002 is an intermediate phase between the second and third era, which I'll get to in a second, but that's the second era of tuning the models to follow demonstrations of instructions. And Scale was very important in this work, by the way. So in the InstructGPT paper, they used Scale contractors to do this work of preparing human demonstrations of how the model was to behave. So that was very much a large-scale human labor task.

That, I think, dominates going all the way back to the question of what are these things? That's what we're building toward - what is the answer to how should one understand these models? Is that what you are seeing is in some sense an interpolation of this body of text. You start off with the text of humans doing things for the model and the model is filling in the gaps between those. So that's the instruction following phase of large language models.

And the third phase that we're in now is RLHF. So RLHF starts - you can start with the intuition that instruction tuning seems to work well, but it costs a lot of money, right? That you have to have humans go off and do these things and that to make that 10 times bigger, you have to pay 10 times as much money. So it'd be great if we had some way to automate this. So in order to produce more generations of more examples of tasks being done correctly, instead of having humans go off and do them themselves, let's just let the model that we have now do them, and then if humans agree that what the model said is perfect, then that's good enough, right? So that counts as being done by a human if the model did it and humans can't tell that it wasn't done by a human. So they can add those to the pile too. And that process is what gave us text-davinci-002. That combined with integrating tuning that they got from Codex, which is another thread - that's another deeper rabbit hole. I'll leave that part as an asterisk - but text-davinci-002 is actually descended from the code-davinci models, so it incorporates a lot of the benefit of that tuning. So those things came together with this refinement of instruction tuning to produce a model that could be tuned with greater scale on instructions and follow instructions better.

And then when we really get into the third era of these models is with - well, for OpenAI's models, it was, I think, the day before ChatGPT was released, so this was, I think, like the last day of November 2022, and then ChatGPT came out December 1, I believe. They released text-davinci-003, which was the first RLHF tuned text completion model they offered, which was confusing because many people believed that the previous model was RLHF tuned, which is a whole other story.

Text-davinci-003 takes this process all the way. They use RLHF or reinforcement learning from human feedback, which means that they have the model produce its own answers and then they have humans rank those answers in terms of quality - like generations for the same prompt - then tune another model, a preference model, to evaluate those generations the same way that a human would, to rank them according to human preference. And then this gives them the ability to complete the circuit, to take the output of the model, put it into a preference model and get the best generation and do the work of a human demonstrating a task entirely in an automated way.
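
One common way to use such a preference model is best-of-n sampling: generate several candidates and keep the one the reward model scores highest. The sketch below is only a toy illustration of that idea; generate and reward_score are hypothetical stand-ins, not a real API.

```python
# Toy sketch of best-of-n selection with a preference (reward) model standing
# in for a human ranker. Both callables are hypothetical placeholders.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_score: Callable[[str, str], float],
              n: int = 4) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_score(prompt, answer))
```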

This allows it to solve a lot of problems that previously were beyond its ability, in particular giving answers to questions that have misleading premises. I used to give this as an archetype of a problem that GPT-3 cannot do, which is to ask, "When was the Golden Gate Bridge transported for the second time across Egypt?" This problem is one of the problems that Douglas Hofstadter and his assistant David Bender identified in The Economist in June 2022 as questions that demonstrated the hollowness of GPT-3's understanding of the world in their words. There were questions like, "What do fried eggs eat for breakfast?" And it would answer toast, orange juice, Cheerios. Or you could ask it - I think another one was, "How many pieces would the Andromeda Galaxy break into if you were to drop a single grain of salt on it?" Text-davinci-002 would answer "at least 10,000 pieces." If you phrase something in a way that suggests that what you're reading is maybe not entirely serious or it's just founded on bad logic, it will play that game and go along with it.

That changed with RLHF. RLHF finally got enough demonstration data of seeing the space of off-the-beaten-track kind of questions that it was able to get the picture of what is it supposed to do in this more general sense - that even if the question is absurd, you should say an answer that's grounded in reality and not just continue on with this absurdity. Then it should say that the Golden Gate Bridge has never been transported across Egypt. Starting with text-davinci-003 and ChatGPT, which are both RLHF tuning models, that's what started happening - that it would give correct answers to those questions.

In fact, I believe all except one of the questions that Douglas Hofstadter identified were solved with the initial release of ChatGPT. The only one that wasn't, by the way, which is a fun story, is "What is the world record for walking across the English Channel on foot?" So this is a question that almost has an answer because there are people that have crossed through the channel. There was one incident in 2006 or something, when the trains were shut down, they opened it up to bike traffic across the channel, but nobody actually went on foot. There have been a few people that have attempted crossings on foot but have been arrested partway. There was a US Army sergeant named Walter Robinson, I believe, who in 1978 walked across the English Channel in scare quotes - walked on water shoes of his own invention. They were just kind of like big pieces of Styrofoam that he put on his feet and then he tried to walk across the Channel. So there's a lot of people that come close if you squint at the history in the right way, but there's no record for this. It's not really a thing walking across the English Channel.

Even ChatGPT today, I believe, will hallucinate on this. It will do the old school behavior of making things up. It will give you times of actual swimming events and then say that they were walking events. It'll give you the name and the actual correct swimming time and date of somebody who actually did swim across it. But anyway, those things are what brought - I think for end users what's different about these models now is that it's no longer really

Riley Goodside: (48:11)

An interpolation of the text of human demonstrators is a pretty good model, but what it really is is the output of this RLHF process, right? That it's a game. There's a hill to climb in the sense that there's a clear mechanism by which it could become superhuman analogous to the same way that AlphaGo is superhuman at Go, and that you can imagine that a chess engine could just simply be better than any human at chess or some game like chess or something like that, simply by having played itself a lot and then doing something other than just interpolating what humans might do. I think that's really the model that we have to have for it now - that this is the output of a computer playing this game of satisfy the human, of create something that, or more specifically, satisfy a preference model that is attempting to emulate what a human would want.

So this is my extremely long-winded answer to what are large language models - is that it is text prediction, maybe, but it's text prediction on a very alien body of text. Another good tangible example of how it differs is if you have some question like, do bugs have widgets? And the answer in the pretrained corpus is yes 80% of the time and no 20% of the time. In a chatbot, in an ideally tuned model, you would like it to say yes 0% of the time, no 0% of the time, and "I think so, but I'm not sure" 100% of time. You don't want it to actually just sample from this distribution of possibilities of what's out there. Because if you do that, if you ask it, "What is your gender?" then it should say male 50% of the time and female 50% of the time. It should just sample like, I'm a random person, and if you ask it, "What country am I from?" it should just pick a populous country and say, I'm from there, and that's not the behavior that you'd like. You'd like it to be conscious in some sense of what it is and what it's doing and where it's situated in the world.

Nathan Labenz: (50:31)

This kind of gets turned into a picture in this famous meme that probably all of our listeners have seen, right? That is some sort of giant alien spaghetti monster that is the pretraining, where you can kind of just pop up anywhere in the full history of the internet. And it's just like one giant run-on sentence, you know? And you can kind of set it up to do things by framing it as if you're going to take advantage of autocomplete, right? So you could say, you know, I used to do stuff like "My Favorite Things in Detroit by Tyler Cowen," and just let it go from there, right? And it would actually do a reasonably good job of giving you the rest of that article. But it would not be able to handle, "Please write me a blog post about your favorite things in Detroit by Tyler Cowen," because that wasn't framed as something it could autocomplete. Instead, there, it might go in a totally different direction and be like, as if it were an email or, you know, "because I really like Tyler Cowen and I'd love to know what he says about Detroit." It's not actually giving you the thing that you want.

So then the instruction tuning comes in and kind of makes that much clearer. And then the reinforcement learning with the reward model and the feedback dynamic takes that to another level. Do you see those as qualitatively different or just kind of more of the same thing? It seems that this instruction tuning, RLHF - I don't know what I think about it. On the one hand, just giving it a bunch of examples, training on that seems like you're not doing that much different when you employ the reward model to scale it. But it does seem like there's some sort of different results that kind of come out of that. There's this phenomenon of mode collapse that people talk about. How do you think about that? Do you think of instruction tuning and RLHF as the same but more, or do you think of it as a qualitatively different experience?

Riley Goodside: (52:32) Yeah. So to give a bit of context on what mode collapse is, it's actually kind of funny. I believe one of my tweets was one of the first public examples of mode collapse that, as far as I know, was ever identified because it was cited in Janus's post on Less Wrong on Mysteries of Mode Collapse. I didn't know what it was at the time, but what I was noticing was that GPT-3, which was text-davinci-002 at the time, seemed to be unusually bad at describing the shapes of letters. If you just asked it to describe the shape of the letter Q in extreme detail, it would say something like, "It's a box that has diagonal lines from the top left to the bottom right and from the top right to the bottom left, and the left side is a little bit squiggly." It would just make up this weird geometric description, and for many letters, it would give very similar answers. They would all be described as variations of a box that contains an X, which happens to be also the Unicode missing glyph character, which is sort of weird. But it had this answer for many of them. Not all of them. Some simple ones it would get right. If you asked, "What is the shape of the letter Z?" it would say, "It's like a lightning bolt," and you'd be like, "Okay, yeah, sure." But for many of them, it would just give this really odd answer. So I posted a tweet that was just like, "GPT-3 has no idea what letters look like." And Janus noticed this and posted it among other examples that he had found of this more general phenomenon where sometimes it gets stuck on particular possibilities. It seems to think that some particular way of answering it, or it'll bring up some particular subject in response to questions that are phrased in a particular way. This would seem to be an example of that, where it was getting oddly fixated on this idea of describing letters as the Unicode missing glyph character—a widget, a box with an X in it.

And he gave a much more illustrative example, which is that if you ask text-davinci-002 to select a random number between 1 and 100, it will say 97 with 20% probability, and then the rest is somewhat relatively uniformly distributed. What's going on there is probably—this is somewhat speculative—but what seems to be going on is that the reward model, the model that ranks its possible generations and then decides which is the best one, attempted to learn the preference function that any answer to this question is as good as any other, but it did so imperfectly. It maybe gave some slight favoritism to the number 97 for some reason because it's just not a perfect model. And the language model was smart enough to figure this out. It could see that if it says 97, it gets a higher score than if it says any other number, so it's going to favor 97. And so that leads to it favoring that particular answer. If you look at the pre-trained distribution, it's much more uniform with a slight bias towards 42.

And this phenomenon, I think, is one of the first times that people started to see that there are drawbacks to RLHF or to instruction tuning at this point—it's not even RLHF yet. And to be fair, subsequent versions, the ones that actually do use RLHF, suffer from this problem less. There are fewer of these vivid examples of clearly wrong behavior where it favors some absurdist answer. But the general pattern is still there, because you can sort of think of this as a generalization of what I was saying before. If the answer is yes 80% of the time and no 20% of the time, you'd like it to say, "I think so, but I'm not sure" 100% of the time. It instills this general belief that there is a correct answer and that your first instinct from your pre-trained knowledge—to just give a fair distribution over all possible answers—is wrong. What you should do is find the one that is best and then put all of your probability mass into that one. It learns that strategy and perhaps misgeneralizes it in some ways that can lead to less than useful answers.

I think one of the ways that this materializes most often is when people use it for more creative writing. They often find that the speech that it generates is very constrained in the space of possibilities that it will produce. If you ask it for a product description of 10,000 different products, you'll find very repetitious phrasing in its output that you maybe wouldn't have seen if you were sampling from the pre-trained models that were out in 2019, 2020.

Nathan Labenz: (57:45) That's really interesting, and it does kind of open up this possibility that we may need or want both. It's not necessarily just a total forward march of progress, but there's actually something that is lost with this kind of sculpting of models to get the most desired, highest rated possible performance. I mean, there are a lot of issues with that.

So with all this experience, you've been through these different generations, you obviously have a great command of how they are made. How now do you, just practically on an intuitive level, think about these things and what they can do, what are their limitations? For somebody who is going to forget everything you just said about how they were built, what's the phenomenology in brief that is like, "These things can do this, and this is kind of how you should intuitively think about them"?

Riley Goodside: (58:47) Yeah. I think the right way to think of it now, and this is a large part of my job—I think every month I'd say I probably spend less time fiddling with the actual format of text and more time thinking about these higher level picture things of how do the pieces fit together. I think the best way to think of these is to approach them as like LEGO bricks. Each brick is a capability, some particular strong suit that you know that the model can do well, and then start thinking about how can we compose these?

I think that's really what's driving a lot of the innovation now you're seeing with LLM-powered search. First you had startups like Perplexity that applied GPT-3 to parsing the results of Bing search results. We found that maybe the model can't understand the entire world, but it can understand the scope of the things that were returned for this search, with some caveats. I mean, it would still get confused. But I think as we're incrementally refining this and figuring out what are some of the problems that result from this—if you get search results that refer to two different people of the same name, if the person searches for Joe Jackson and one of them is Joe Jackson the musician, one of them is Joe Jackson, Michael Jackson's father, it can mix people up. But I think these problems are an enumerable set of issues, and they're being solved one by one.
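
In code, that kind of composition is often just a retrieval step feeding a prompt, roughly as sketched below; web_search and complete are hypothetical placeholders for a real search API and a real model call.

```python
# Sketch of the "LEGO brick" composition behind LLM-powered search:
# retrieve snippets with one capability, then answer with another.
from typing import List

def web_search(query: str) -> List[str]:
    # Hypothetical stand-in: a real system would call a search API here.
    return [f"(placeholder snippet about: {query})"]

def complete(prompt: str) -> str:
    # Hypothetical stand-in: a real system would call an LLM API here.
    return "(placeholder model answer)"

def answer_with_search(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the search results below. "
        "If the results refer to different people with the same name, "
        "say so instead of mixing them up.\n\n"
        f"Search results:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)

print(answer_with_search("Who is Joe Jackson?"))
```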

As models become more capable, as context windows get bigger, there's more room for finer detailed instructions explaining all of these edge cases of how not to say bad things and how not to fall in love with Kevin Roose of the New York Times and all the various mitigations that have been put in place. They're going to be solved and they're probably going to be solved quickly. I think we're going to start seeing reliable LLM-powered search this year. And I think there are a lot of problems out there that sort of fit that mold.

LangChain, I think, is a great library, by the way, for anyone that really wants to start exploring more in this space. Harrison Chase, the author of that library, has a great philosophy of just any paper or any method that is published that becomes cool and interesting, he'll just rush out and implement it as code in LangChain. It's just a grab bag of great techniques and sort of helps you plug them together.

And Scale's Spellbook offering as well—we're building a platform that makes it easy to deploy LLM prompts as APIs. You often don't want your model to just simply speak text. You want it to have parameters that get inserted into a preformatted prompt, and we help manage the deployment and comparison of those prompts, evaluating that you have the best prompt, and helping you evaluate between different models, because that's a whole other aspect of it. When can you switch to a cheaper model? There are huge price differences between GPT-4 and GPT-3.5 Turbo or other open-source models that you can often get away with. Sometimes you can just fine-tune Flan-T5 to solve your problem just as well, and we help you evaluate those different options.
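The "prompt deployed as an API" idea is easy to picture with a small sketch. This is not Spellbook's actual interface, just an illustration of the pattern under assumed names: callers pass parameters, never raw prompts, and the model behind the template can be swapped to trade cost against quality.

```python
# Illustrative only: a prompt template exposed as a callable "endpoint".
# Not Scale Spellbook's actual API; template and model names are examples.
# Uses the pre-v1.0 openai SDK (openai.ChatCompletion).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

TEMPLATE = (
    "Summarize the following customer message in one sentence, "
    "then label its topic as billing, shipping, or other.\n\n"
    "Message: {message}"
)


def summarize_endpoint(message: str, model: str = "gpt-3.5-turbo") -> str:
    """Callers supply parameters; the prompt itself stays server-side."""
    resp = openai.ChatCompletion.create(
        model=model,  # swap in "gpt-4" or a cheaper fine-tuned model to compare
        messages=[{"role": "user", "content": TEMPLATE.format(message=message)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```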

Nathan Labenz: (1:02:33) So if I had to kind of bottom line that, it sounds like your paradigm is language models are really good at performing tasks—or at least for a lot of tasks. There are a lot of discrete tasks that they can do. And the right way to think about it, if I understand you correctly, is you want to know what those tasks are that it can do. Obviously, the flip side of that is you want to know what tasks it can't do so you don't rely on it for things that you shouldn't. And then you get to snap LEGO blocks together and compose your own workflows or applications or whatever. But you kind of start everything in a very grounded way of discrete tasks, validate that it can do the task, and then start thinking about how can I plug that into other things or build on top, ensemble, or arrange things?

What's kind of interesting in a lot of cases is that it's the same core model that is doing all the different tasks, or it could be. Maybe you're somewhat downshifting into smaller models for cost or time savings, depending. But apart from those issues, you can just kind of use the same thing all the time. And it's just a matter of what prompt are you giving it to define the task and ultimately get the kind of execution of that task? Anything you would kind of refine on that summary?

Riley Goodside: (1:04:03) Yeah, I think that's a great way to summarize it. And maybe to give you a bit more guidance on what it does well: I think a good archetype of what it does well is if you think of the sorts of problems that would have required picking an ML architecture maybe in 2017 or so. If you're doing classification on text, trying to say, "Is this movie review positive or negative?" Or if you're trying to extract a list of entities from text—a great example I like to give is suppose that you have to extract from a list of tweets the names of all US-based cell phone carriers that are mentioned in those tweets. And you'd like it to extract those names even if they're very abbreviated, like if they say Verizon instead of Verizon Wireless Inc. or whatever, even if they're misspelled, even if they use the Twitter handle of one of the sub-brands of the wireless carrier rather than the actual main Twitter handle or whatever. So you have all these edge cases.

That used to require refining a dataset. You'd have to go out and collect a dataset that demonstrates each of these possibilities. You'd have to pick a model architecture. You'd have to train a model to emit the formatted text that gives the canonical names of each carrier. But what's changed with GPT-3 and with pre-trained models is that you can now just give those instructions—instructions like I described them just now—to the model. You can write up a page of text that explains that we want you to extract all the names of US-based cell phone carriers from the tweet. It doesn't matter if they're misspelled. Doesn't matter if they're abbreviated. It doesn't matter if they use any of their Twitter handles. Oh, and by the way, here is the full list of the Twitter handles of all the US-based cell phone carriers. Exactly as you would give to a human. You would give them all the information they need, and then give it maybe three examples of this task being done correctly—good examples that demonstrate different edge cases. Like what if the tweet mentions no cell phone carriers at all? What if the tweet mentions multiple cell phone carriers and abbreviates one, but refers to the other one by Twitter handle? Put in all these rich edge cases, and then you've solved the problem.
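Here is a minimal sketch of the instructions-plus-examples prompt Riley is describing. The carrier list, Twitter handles, and few-shot examples are illustrative placeholders rather than a complete or verified list, and the call uses the pre-v1.0 OpenAI Python SDK.

```python
# Instructions + few-shot examples for carrier extraction. The handle list and
# examples are illustrative; pre-v1.0 openai SDK (openai.ChatCompletion).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

INSTRUCTIONS = """Extract every US-based cell phone carrier mentioned in the tweet.
Return a JSON list of canonical carrier names. Count misspellings, abbreviations,
and sub-brand Twitter handles. If no carrier is mentioned, return [].

Known carriers and example handles (illustrative, not exhaustive):
Verizon (@Verizon, @VZWSupport), T-Mobile (@TMobile, @TMobileHelp), AT&T (@ATT)."""

FEW_SHOT = [
    ("just switched to tmobile, coverage is great so far", '["T-Mobile"]'),
    ("my cat knocked my coffee over this morning", "[]"),
    ("@VZWSupport is ghosting me and at&t wasn't any better", '["Verizon", "AT&T"]'),
]


def extract_carriers(tweet: str) -> str:
    prompt = INSTRUCTIONS + "\n\n"
    for text, answer in FEW_SHOT:
        prompt += f"Tweet: {text}\nCarriers: {answer}\n\n"
    prompt += f"Tweet: {tweet}\nCarriers:"
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Adding a new carrier is then just one more line in the instructions, which is exactly the point Riley makes next.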

There's a Kaggle dataset or a Kaggle challenge that does some tasks similar to this, and it solves it perfectly. And I think that's really what's changed: someone who's just a software developer and isn't an ML engineer can come up with a couple of examples and clear instructions, and then they can have a model that actually solves a real-world task, whereas previously that would have been a specialized skillset. You would have had to know how to pick the architecture, how to get a representative dataset to train on, and how to maintain that dataset. If a new cell phone carrier comes out and you have to recognize that one too, in the old regime that meant updating your training data. Now it just means updating your instructions, or even literally just a list that appears in your instructions. Just adding one line will fix the problem. And that's, I think, very symbolic of what's different now: developers really can benefit from deep learning without having much expertise in it.

Nathan Labenz: (1:07:47) Yeah, certainly we're seeing the pace of adoption reflecting the ease of the setup and the implementation of these things these days. It seems like in a flash, it's coming to every product experience that we touch.

I want to talk about what's new in GPT-4. We're talking on GPT-4 plus 8. I also want to talk about just the broader model landscape. So much of the stuff that we've talked about has been OpenAI history specifically, so I want to get your take on the broader range of model providers now, because it's still a relatively small club at the high end, but there's at least a couple others that are starting to get into the game in credible ways.

What you just described in terms of instructions—sometimes I summarize that for people as it's kind of like an intern who's on their first day. They have a lot of knowledge and capabilities. That's why you brought them on as the intern. But they don't know anything about your company yet. You really have to give clear instructions. And a couple of examples of what good looks like is also really, really helpful. I found that to be kind of an interesting shorthand. But then there are some things where people come up with these super creative examples. And it kind of blows my mind, and it sort of breaks the intern paradigm.

I think one of them actually came out from a hackathon that you were involved in organizing. I think the name of the project was "GPT Is All You Need for Backend." And the concept was, instead of having—you're trying to develop an application, right? So backend refers to backend server-side software architecture, servers, etc. Instead of having all that stuff, instead of having to create an application, they kind of came up with this idea where they're like, "Let's just have the language model imagine the application." And I don't know exactly what the prompt was, but I kind of took away that it was something like, "You are an API. You are going to get calls and your job is to return valid JSON in response to those calls according to the fact that you are the API for whatever application." And so people experimented with a to-do list. And you could just send in your stuff to the to-do list with methods that you invent on the fly. And largely, it was able to infer what people meant and do the actual operations and return a valid thing. And then that could just be your state.

And amazingly, you don't need any code. You don't need any database. It doesn't seem like we're all headed in that direction for software development, although maybe you think it has more legs, especially a 90% price drop since that day will do a lot to accelerate adoption. But how do you think about those kinds of use cases that are like, that's not like an intern, that's not like autocomplete. I can't imagine there were many instructions like that in the training data either, and yet it sort of works. So how do you think about this sort of just bizarre kind of use cases where it's like, how did you come up with that? And yet it works.

Riley Goodside: (1:11:02) Yeah, it's really a vivid example of what you can do. You really can use an imaginary computer in some ways. You can describe for it what a hypothetical API does and then ask it to dream up the response of this API to some request that you give it. That's roughly how they're prompting the model. I mean, there are a lot of caveats to that too in the real world. You can only fit so much in the context, so it's not going to store a state for you on the backend. Any state that you give it has to be sent into the prompt every time, and it's going to hallucinate a lot of things. It's really just going to change bits at random or whatever and you won't have great protections against that. But in general, it's a really powerful idea.
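As a concrete sketch of that "imaginary backend" idea (not the hackathon project's actual prompt), the pattern looks roughly like this. Note the caveats Riley just listed: the model holds no state, so the current state has to be passed in and read back out on every call, and the JSON still needs validation because the model can and will make things up. The prompt and schema are assumptions, and the call uses the pre-v1.0 OpenAI Python SDK.

```python
# Sketch of an LLM pretending to be a backend API. Hypothetical prompt and
# schema; pre-v1.0 openai SDK. State is round-tripped through the prompt
# because the model cannot store it.
import json
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SYSTEM = (
    "You are the backend API for a to-do list app. You receive a JSON object "
    "with a request and the current application state. Respond with ONLY valid "
    'JSON of the form {"response": ..., "new_state": ...}. Infer reasonable '
    "behavior for any endpoint the client invents."
)


def call_imaginary_api(request: dict, state: dict) -> dict:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": json.dumps({"request": request, "state": state})},
        ],
        temperature=0,
    )
    # May raise if the model breaks format; real use needs validation and retries.
    return json.loads(resp.choices[0].message.content)


state = {"todos": []}
out = call_imaginary_api(
    {"method": "POST", "path": "/todos", "body": {"title": "buy milk"}}, state
)
state = out["new_state"]  # the caller, not the model, is the source of truth for state
```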

I think a lot of the ways, like the format trick that we were talking about earlier, you can sort of read that as a way of defining an API. When you define a template of text and then you pick a point in that text and say, "This is the input and this is the output," in some sense, you've defined the behavior of an API. And it doesn't really matter so much whether it understands that it's an API or that these are even inputs and outputs, or if it's just completing the nth example, as long as it gets the correct answer—it's completed the task.

Yeah, I mean, that's going to be a big part of code development in general. I've only just started digesting the latest GitHub release, Copilot X, that seems to bring in GPT-4 and bring in some of the longer context capabilities. But GPT-4 can create entire primitive video games on its own. You ask it for a game like Asteroids and it will conjure up an example in p5.js or whatever. It's really powerful. I think a lot more software is going to be written. A lot more people that previously didn't think of themselves as able to write software will be able to. People will be able to write software in idioms and programming languages that they weren't familiar with. I barely know TypeScript, but I feel like I can muddle through it now because I can just go on ChatGPT and say, "Hey, how does this thing work?" And it explains it. It's really powerful and I think we're going to see a lot of acceleration of just great software because of it.

Nathan Labenz: (1:13:36) Yeah, it does feel like—I mean, people talk about the capabilities overhang and then there are all these people on Twitter selling their stuff that's like, "99% of people are still in noob mode on using ChatGPT." But it does kind of feel like there's a lot of truth to that when you see some of these advanced examples that you have created and increasingly that others are creating as well.

So you mentioned GPT-4. Let's talk about GPT-4. It's obviously the big new thing. Within that, obviously, again, we don't know how it was built. It's very safe to assume, it seems, that there's more scale of pre-training and also more RLHF on top of that, maybe even other stuff that we haven't been told about. It's qualitatively better. I also was very struck by how narrow the margin is. In the technical report, they talk about the win rate of GPT-4 versus 3.5—only 70-30 in favor of GPT-4 in head-to-head comparisons, which just drives home to me that there's a ton of noise and raters are not super consistent or inter-rater reliability is definitely limited.

So tell us everything about GPT-4 from your perspective. Qualitatively, what is it doing that you're excited about? How are you thinking about what must have gone into it? How are you thinking about greater scale of pre-training versus greater scale of RLHF? Maybe you have a different take. Riley's take on GPT-4?

Riley Goodside: (1:15:17) So I think it's hard not to be amazed by the capabilities of GPT-4. I'm sure there are private models that are stronger, but it's the best that most people have access to now, and with ChatGPT, it's pretty broadly released, I'd say. It's incredible. There are a lot of new possibilities that are opened up from longer context, more reliable instruction following. A lot of the things that I was doing in 2022 that I described as tap dancing across lava or whatever, that's now just normal. You can give it long instructions and it will follow all of them.

I think we're seeing this Cambrian explosion of possibilities of what can we do with all this added context. Search is one of those things, but I think we're seeing with Copilot X that there's a lot more that you can start doing—QA on GitHub repositories, you can incorporate entire pull requests as context into a prompt. That's where I see these models really going.

One of the things that really fascinated me early on and got me really interested in the details of formatting things clearly and how to prompt careful formatting was I was curious about ways to represent multi-file input. I was trying to find ways that I could have a prompt that would synthesize multiple files at once and generate perhaps an entire Python package for some simple prompt. Things like that are very possible now. You can do that reliably without a ton of work. You could clarify to it some format that you'd like the output to be given in and then have it just give you files one at a time. You may still have to break things up to fit it into the context window if you don't have the 32K model, but as that spreads—and also there's a whole other category of capabilities that hasn't really been talked about much, which is the multimodal abilities.
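On the multi-file output Riley describes (before the multimodal point he turns to next), one way to set it up is to fix a simple delimiter convention in the prompt and parse it back out. The "===FILE: path===" convention below is just an assumption for illustration, not a standard the models require.

```python
# Assumed delimiter convention for multi-file generation, plus a parser.
import re

MULTI_FILE_INSTRUCTIONS = (
    "Generate a minimal Python package. Emit each file as:\n"
    "===FILE: relative/path.py===\n"
    "<file contents>\n"
    "Do not write anything outside the file blocks."
)


def split_files(model_output: str) -> dict[str, str]:
    """Parse '===FILE: path===' blocks into a {path: contents} mapping."""
    parts = re.split(r"===FILE: (.+?)===\n", model_output)
    # re.split with a capturing group yields [preamble, path1, body1, path2, body2, ...]
    return {
        path.strip(): body.rstrip() + "\n"
        for path, body in zip(parts[1::2], parts[2::2])
    }
```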

There's a huge question mark there. They had some examples in the technical report, but they've been pretty tight-lipped about how exactly that works, because if you've seen those examples, they're pretty impressive—explaining simple memes, answering problems on an engineering exam that was given in French, where it was just given a photograph of the page of the exam and then told, "Answer this problem," and it answered it correctly, interpreting a diagram within the page.

So yeah, I'm excited about multimodalism in general. I think it seems to be where these models are going as they get bigger and that's going to unlock a lot of capabilities. I think both just the obvious stuff of what can you do as a user when you can give it images, but I think there's also a lot of capabilities that we're going to see from training processes that incorporate multimodalism. If you start evaluating it on its ability to solve problems given to it as a word problem or given to it as photos, you can now set up pipelines of generating synthetic photos that embed text and measure its performance on those. I think it's going to open up a lot of capabilities just from the added training data.

Nathan Labenz: (1:18:54) Yeah, man. The examples in the live launch stream that they did of understanding the images definitely blew my mind, especially because I knew that GPT-4 was going to be awesome, right? But the image thing was just so much better than anything else we've seen. We did an episode a couple episodes ago with the authors of BLIP and BLIP-2, and at the time, which was only about three weeks ago, I would say BLIP-2 was the best way to really understand an image and get a language model to tell you about the image or answer your questions about the image or what have you. And they have some really interesting techniques, and I imagine OpenAI is doing some similar stuff.

Their approach involves training a connector model to essentially translate an image encoding to the latent space of the text model, which fascinatingly—obviously, a picture's worth a thousand words—but it's actually predicting the embeddings and injecting them directly into context in a way that no text could ever actually translate to those values. It's finding this sort of dark space that language itself can't get to, but which is still meaningful, then allows the language model to interpret it. And they're able to do that with a very small connector model too, by the way. I don't know that we'll ever—probably will be a while before we'll have any hint as to whether OpenAI's approach has some sort of auxiliary architecture like that, or if it's just another instance of the bitter lesson of just pre-train everything end-to-end and just make it as massive as possible. I could imagine it being either way. But definitely, the results of that were a wow. And from somebody who mostly felt like I knew what was coming going into that release, that they still managed to bring a significant wow with that component of it.
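For readers who want the shape of the "connector" idea Nathan is describing, here is a deliberately simplified sketch: project features from a frozen image encoder into the language model's embedding space and prepend them as soft tokens. BLIP-2 itself uses a learned Q-Former rather than a single linear layer, and nothing is publicly confirmed about how GPT-4 handles images, so treat this strictly as an illustration of the concept.

```python
# Simplified connector sketch (PyTorch). Not BLIP-2's actual Q-Former and not
# anything known about GPT-4; just the "project image features into the LM's
# embedding space and prepend them as soft tokens" idea.
import torch
import torch.nn as nn


class ImageToLMConnector(nn.Module):
    def __init__(self, vision_dim: int, lm_dim: int, num_soft_tokens: int = 32):
        super().__init__()
        self.num_soft_tokens = num_soft_tokens
        self.proj = nn.Linear(vision_dim, lm_dim * num_soft_tokens)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vision_dim), pooled output of a frozen vision encoder
        batch = image_features.shape[0]
        return self.proj(image_features).view(batch, self.num_soft_tokens, -1)


# Usage sketch: concatenate the soft tokens with the text token embeddings and run
# the frozen language model over the combined sequence, e.g.
#   inputs = torch.cat([connector(img_feats), lm.get_input_embeddings()(input_ids)], dim=1)
# assuming `lm` exposes its embedding layer, as Hugging Face models do.
```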

Riley Goodside: (1:21:05) The Kosmos-1 paper from Microsoft was another clue. They have a model that goes into more detail about some of the multimodalism features that probably—a lot of those same ideas are in GPT-4, but who really knows?

Nathan Labenz: (1:21:20) PaLM-E also is another—for anybody who wants to learn more about how that can be done, PaLM-E is a great example too. Another huge language model with your standard issue 540 billion parameters, but with the image injection stuff going into it as well. Also very, very good.

You're now working at Scale, and your Twitter bio has been updated accordingly: you're the world's first staff prompt engineer. I've started calling myself an AI scout, by the way, as well. We're all kind of inventing our titles on the fly here. But how do you see the landscape today? Is OpenAI still dominant in your mind? You've had early access to a bunch of stuff, I'm sure, and have been able to try Claude sooner than the rest of us, although v1.2 is out now as well. Bard is bringing people in off the waitlist. You've had a chance to mess around with Bing quite a bit. How do you see the landscape shaping up?

Riley Goodside: (1:22:24) Yeah, I think we spent a long time in this regime where OpenAI—they weren't necessarily the best models anywhere, but they were the best models that people had access to. They weren't PaLM, but most people can't use PaLM. I think that's what's starting to erode a bit with a lot of these competitors you mentioned from Anthropic—Claude—and now Bard from Google is also quite good. I haven't really seen a lot of authoritative comparisons between the new best competitors of each, looking at Claude+ versus GPT-4 versus Bard, but they all seem to be within the same general realm of capability of this new generation beyond what we saw from ChatGPT, or at least comparable to the GPT-3.5 Turbo models. But there are also a lot of differences in terms of speed and cost of inference between them. It's becoming a harder question now. I've been fairly impressed with the training on Bard, which is a skin of LaMDA, as I understand, or a refinement of LaMDA rather.

Competition seems to be heating up, and particularly also with LLaMA—with the weights for LLaMA being out there, I think we're going to see a lot of rapid progress in people figuring out ways to run these models more efficiently. Simon Willison had a blog post, I think he said that large language models are having their Stable Diffusion moment. I think that's definitely true. If you look at what happened with Stable Diffusion, it progressed pretty rapidly once it was out in public hands and people could benefit from small optimizations of how to make it better. I think we're going to see a lot of that with LLaMA and that progress will be incorporated into other models, and I think that's a good thing.

Nathan Labenz: (1:24:49) So you said it's getting to be a harder question. That definitely resonates with me. Are there any things that you would say, "Yeah, for that, I would go in a different direction from the standard OpenAI models"? Are there any things where you could say other providers have a distinct advantage in a particular area? And then how do you figure this stuff out? I know you're working on this Spellbook product at Scale that's partly meant to help with that sort of comparison. But I'd love to hear you kind of procedurally talk through the questions you ask. How do you actually compare? Are you comparing everything at temperature 0? A sort of anxiety that I have in comparing is like, is it appropriate to compare the same prompt with 2 different models? I was just talking to teammates earlier today and I was like, "The question is not which model performs best on the first prompt we write. The question is which model performs the best on our task if we can get it to perform its best?" But that's obviously a much harder question than just comparing head to head on a single prompt. So talk us through, I guess, just baseline intuitions for if there are any things where you would go away from OpenAI, and then how do you actually just procedurally get in and think through that comparison given that it is an infinite space and there's no way to exhaustively explore.

Riley Goodside: (1:26:16) Yeah, I think a lot of the time it really does turn into just trial and error. A lot of this can be done sort of intuitively. Let me think of a good example task here. Say you're trying to do some kind of abstractive reasoning task where you want to take the text of, let's say, a customer support inquiry and decide: should this be escalated as a high-severity issue that should be dealt with by a human immediately? So you have some list of policies, like if a person appears to be in danger, escalate it immediately, something like that. If you're doing, let's say, Airbnb customer service or something like that, you want to have a gradient that you can climb. So it's often a fairly intuitive task to construct a minimal example of your task: here's a softball, kind of easy problem that the model should be able to solve that you can evaluate, and then you can create progressively harder variations and see where the different models stop working.

And I think the point that you raised, that there are different prompts for different models, that's very true, especially now that we are in this era where not all models are even using the same interface anymore. The tricks that you apply to text completion models, where you're imagining a document that can only be completed in the right way, don't apply as much to the new chat APIs that OpenAI uses for ChatGPT and GPT-4. The main difference is just that you now have discrete messages labeled as system, assistant, or user messages, where everything is framed as a dialogue between a user and an assistant. And the concepts map. In that case, few-shot prompting from traditional prompt engineering corresponds to giving it a chat history where the assistant messages are pre-populated with examples of how it's to answer. There are analogous methods you can do, or you can also just stuff the few-shot examples into a user message. That also works well, which is ignoring the fact that it's chat.

I think there's some minimal amount of adaptation you have to do to the chat way of prompting things, in particular because chat, I feel like, has reliability issues that come from the presumption that what it's doing is being a chat model. A good example of this is that with previous instruction-following models, if you prompt them to answer in a particular JSON format, you can be pretty sure that whatever it gives you is going to be in that format. But if you ask ChatGPT to do this, you'll get the format correct for the normal cases, but if the user tries to do something policy-violating, say they ask it to write erotica or something that as a matter of policy OpenAI will not do, you'll find that the model just doesn't provide a JSON-formatted response at all. It just says, "I'm sorry, I can't do that." It breaks character and reverts back to chatbot behavior. There are tricks you can do to suppress that, altering the messages in ways that make it more receptive to doing things the correct way, but they're very chat-specific. One of my favorite ones that I've seen that actually helps in ChatGPT is inserting user messages after assistant messages.

So if you're using assistant messages to provide few-shot examples, if you're saying, here's a user message of an input, here's how I want the assistant to respond, and then repeating many iterations of that, it does better if you add a user message afterwards that says, "That was great. That's exactly what I needed." That helps it understand that the user was satisfied by what was above and that it shouldn't just keep probing for some other variation in hopes that it finds something it likes. It should assume that was the right thing to do and then do more like that. The old dirty tricks aren't quite dead yet. There are still odd discoveries like that, but they are very specific to the model. So I think that's a good approach: come up with the best-performing prompt that you can for each particular model and then compare between them.

And one other thing to consider is that for a lot of problems, you can evaluate whether something is easy or hard pretty cheaply. You could probably have a smaller, cheaper model answer the question, does this input need to be answered by a bigger model? You can run it through a classifier that says, is this a hard problem? Not for all problems, but that's often a strategy that bears fruit: considering whether there are typical cases that you can cache. If you're a trivia bot that people post questions to and you find that a lot of people ask, "What is the meaning of life?" as their first question, maybe you should just cache the response to that one. You can save calls if you get the easy cases.
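To make the chat-era tricks concrete, here is a minimal sketch of few-shot prompting through the chat messages format, including the acknowledgment trick Riley just described. The system prompt, examples, and model name are illustrative, the call uses the pre-v1.0 OpenAI Python SDK, and whether the acknowledgment actually helps for your task is an empirical question.

```python
# Few-shot prompting via chat messages, with a user "acknowledgment" after each
# assistant example. Illustrative prompt and examples; pre-v1.0 openai SDK.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SYSTEM = (
    "Classify the customer message for escalation. Reply with ONLY JSON like "
    '{"severity": "low" | "high", "reason": "..."}.'
)

EXAMPLES = [
    ("The wifi password in the listing is wrong.",
     '{"severity": "low", "reason": "minor inconvenience"}'),
    ("My host is threatening me and I am scared to stay here.",
     '{"severity": "high", "reason": "guest may be in danger"}'),
]


def classify(message: str) -> str:
    msgs = [{"role": "system", "content": SYSTEM}]
    for user_text, assistant_json in EXAMPLES:
        msgs.append({"role": "user", "content": user_text})
        msgs.append({"role": "assistant", "content": assistant_json})
        # The trick: confirm the example so the model treats it as the desired behavior.
        msgs.append({"role": "user", "content": "That was great. That's exactly what I needed."})
    msgs.append({"role": "user", "content": message})
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=msgs, temperature=0)
    return resp.choices[0].message.content
```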

Nathan Labenz: (1:32:06) When you do your testing, do you have any particular settings that you recommend? I do everything pretty much these days at temperature 0. How do you think about the right way to get the most information as quickly as possible in testing?

Riley Goodside: (1:32:22) Depends on what you're doing. I'd say when I'm studying its behavior on a new problem, temperature 0 helps, just in that having reproducibility is valuable. For those not familiar, temperature is basically a measure of how random the generation is. If you put it at 0, you're saying whatever the model believes is the most likely thing, do that every time, and thus it always does the same thing for the same prompt, whereas at higher temperatures, it's going to pick randomly from all the things that it deems possible. What I actually do is tend to use higher temperatures and then change the top-p parameter, which is a variation on this procedure: it trims the long tail of the distribution and then picks randomly from what's left. I've somewhat subjectively found that that works a little better when I'm looking for creative output. Usually the reason I'm doing something like that is because I'm fighting against mode collapse, trying to find more diversity in the generations. There are cases where you want to be even more diverse than that. I'd say if you're applying consensus algorithms, which, by the way, are where you run a generation multiple times and basically put it to a vote across multiple generations for the same prompt. In those cases, you want the approaches taken by each vote to be different. It doesn't help you if it just collapses to the same answer every time, or even most of the time. So you want to have diversity there, and it becomes an empirical problem of what maximizes performance.
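A small sketch of the consensus or voting idea Riley mentions: sample several completions at a higher temperature, with top-p trimming the tail so the attempts differ, and take the most common answer. Parameter values and model name are illustrative, and the call uses the pre-v1.0 OpenAI Python SDK.

```python
# Consensus-by-voting sketch: n diverse samples, majority answer wins.
# Illustrative parameters; pre-v1.0 openai SDK.
import os
from collections import Counter

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]


def vote(prompt: str, n: int = 5) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # keep sampling diverse so the votes differ...
        top_p=0.9,        # ...but trim the long tail, as Riley describes
        n=n,              # n independent completions in a single call
    )
    answers = [choice.message.content.strip() for choice in resp.choices]
    return Counter(answers).most_common(1)[0][0]
```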

Nathan Labenz: (1:34:05) We're just getting used to GPT-4. Society just got its first glimpse of it. And 2 big reports came out from OpenAI along with the model. One is the technical report, which has a lot of large-scale analysis in it, along with a lot of red team reporting. And then there's also the economic impact report, which tries to break down different jobs, what tasks constitute those jobs, and which of those tasks the AI could either do at this point or greatly assist with and speed up. So I'd love to get your take, because what fascinates me about your perspective is the sheer number of hours, the depth of intuition for what these things can and can't do. Let's maybe start on the economic side. What do you think we are going to see over the next year or 2 in terms of actual applications that are going to touch everyday life?

Riley Goodside: (1:35:05) I think intelligence is going to get cheaper and it's hard to imagine how that doesn't lead to some shifts in what humans are doing, that it just makes more sense for us to focus on other types of activity that the machines can't do. The net effects of this, I'm not sure. I'm not really anything more than an armchair economist who maybe took a few classes in college, so I can't really speculate too far as to what that does for the economy or the labor market.

Nathan Labenz: (1:35:42) Yeah, so, well, forget the fallout. Just talk about what do you think is achievable? We're starting to see the launch of AI assistant type products. We've had Siri for a long time. It still can't do much for me, but I suspect that that's going to change. How good do you think an AI assistant is going to be this next year? And what about an AI doctor? What about an AI lawyer? Are we going to have AI X for everything? Kind of seems like that's where we're headed.

Riley Goodside: (1:36:08) Paralegals are probably one of the first big ones. Anywhere you have labor that could be done by someone who's just out of school but has some domain expertise in law or medicine, and you're paying them to read through reams of documents and see what applies, to find the needle in the haystack, to find the case that's similar to this one in some important way. Those are the tasks that I see as being the most automatable, at least in terms of knowledge work: people whose jobs mostly consist of summarizing and reading large documents, reporting on their contents, and extracting relevant details. You certainly see a lot of this in intelligence, people whose job is to read reams of news reports and then flag whenever some event has happened that's relevant to some particular geopolitical concern. I think a lot of that work is going to be more automated, but it's unclear what that does to the demand for humans. It's conceivable that simply more of this work gets done and the demand for people to do it remains somewhat fixed, because those people become more productive. Maybe we were constrained on the number of smart people, and if we make all of those smart people more productive, there's still demand for that increased labor. But the part where you're going to see decreases in demand, I guess, is for less skilled labor. Temp and clerical work, people who are copying and pasting from PDFs into structured text, those sorts of jobs, I think, are going to be more severely impacted by LLMs specifically.

Nathan Labenz: (1:38:17) Would you personally go to GPT-4 for medical advice, for legal advice? How much value can you personally get from GPT-4 on things that really matter?

Riley Goodside: (1:38:30) So I have asked GPT-4 to explain pieces of, say, tax code to me. But I think I have acclimated to the level of skepticism that's appropriate for these models, because I've dealt with models that hallucinate all the time about everything. So anytime it says anything, I'm like, "Yeah, but is that true?" We're at the point now where it's possible for somebody to be ignorant of that. And I think that's where you're seeing a lot of the concerns about the reliability of these models: somebody might use them assuming that this is all reliable, prepared information, because it looks like it. It looks like it has academic footnotes in it, and that usually means it's right. That's what I see as more the risk of these things going wrong. For someone who's used to it, you can get a lot of value out of it. If you approach it with skepticism, if you fact-check the things that it says, it's pretty good at explaining things, especially applications to odd problems. If you want to know, does the tax code apply to this situation, or what's the JavaScript equivalent of this Python library that I use? There are a lot of corners of knowledge that any person on Stack Overflow who knows the relevant area well could answer, but nobody's had that particular question yet. That's where it shines, really.

So the places where I use it the most routinely: I do a lot of things in Copilot, and then when Copilot can't get the answer, I'll switch to using GPT-3.5 Turbo or sometimes text-davinci-003 for code generation. It's great if you have the intuition to say, I know that this library was well understood in the pre-trained knowledge, that this is a library that was widely used before 2021; it can explain that library pretty well. It can tell me how to do anything in SQLite3. And that's really powerful, when you can just say, "Okay, here's the JSON object I have. How do I write SQL that produces this equivalent schema in SQLite3?" And then it just writes it for you. It saves you a lot of Googling. I don't even always look up Unicode characters anymore. If I want to know what the Unicode character for something is, sometimes I'll just let it autocomplete it. I'll say, "This would be rendered as," and it just produces the right glyph. So a lot of those quick fact-check things, the bits of knowledge that otherwise go to these SEO content farms, the page that has a base64 encoder on it but is covered in 50 ads, those kinds of queries that, in an ideal world, might be incorporated into Google search, it does really well on.
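To make the SQLite example concrete, here is the kind of question and answer Riley is describing, worked by hand so the shape of a good response is clear. The JSON record and schema are made up for illustration; this is not claimed to be actual model output.

```python
# A hand-worked version of the "JSON object -> equivalent SQLite schema" question.
# Illustrative data; runs against an in-memory database with the stdlib sqlite3 module.
import json
import sqlite3

record = json.loads('{"id": 7, "title": "buy milk", "done": false, "tags": ["errand"]}')

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE todos (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    done  INTEGER NOT NULL DEFAULT 0   -- SQLite has no separate BOOLEAN type
);
CREATE TABLE todo_tags (
    todo_id INTEGER NOT NULL REFERENCES todos(id),
    tag     TEXT NOT NULL
);
""")
conn.execute(
    "INSERT INTO todos (id, title, done) VALUES (?, ?, ?)",
    (record["id"], record["title"], int(record["done"])),
)
conn.executemany(
    "INSERT INTO todo_tags (todo_id, tag) VALUES (?, ?)",
    [(record["id"], tag) for tag in record["tags"]],
)
```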

Nathan Labenz: (1:41:47) For what it's worth, I would say use GPT-4 for a second opinion. I would not say make it your doctor. But in my experience, it's good enough now that when you come home from a doctor's appointment, you can have it recap things, or if you know you're going to see the doctor tomorrow and these are the things you're concerned with, you can have that conversation upfront. Go in with a little bit better vocabulary. It can ask you some good follow-up questions and make sure you get the right stuff out in the actual appointment. I wouldn't make it my only doctor, especially if I had something that was of real concern. But I do think it can add value on top of what a typical doctor is providing, even if it is just that second-opinion type of role.

Riley Goodside: (1:42:38) Yeah, and certainly for explaining the science. I think it's good at well-settled science. If you want to know what an alpha-2 adrenergic receptor is and how it differs from an alpha-1, it'll give you a pretty good answer to that question. Sometimes you can tell that this is something a textbook could answer for you, but there's a lot of knowledge that's very settled yet isn't well accessible through Google, because the number of people who want those sorts of detailed answers is small. It's not going to take you to academic research that isn't narrowly tailored to your problem. It's not going to take you to the Stack Overflow of microbiology or whatever. I mean, those things exist, but they're not as well developed as they are for programming.

Nathan Labenz: (1:43:26) So we don't have too much time left, and I appreciate all your time. You've been very generous with it. Let's talk a little bit about safety and red teaming. You are involved with building a red teaming capability at Scale, if I understand correctly. How do you think about red teaming? How is it different from just your general kind of experiments and explorations? And how would you describe the AI safety landscape today?

Riley Goodside: (1:43:54) Yeah, so red teaming is adversarial usage of the models. It's having a team of people attempt to break whatever you're building. Say you're building a chatbot: you would want people to try to break it so that you know all the ways it might break and can develop mitigations for those. The kind most people are familiar with now are these jailbreak prompts that you see, like "become DAN," which create elaborate fictional scenarios that the model has to play along with, and then at the end of the scenario, it can ignore all of the rules that usually constrain it. It can say offensive things or make up violent stories or write erotic stories or whatever. Obviously, the people who host these models don't want toxic output. They want a model that's capable of helping you. They don't want a model that's going to help you do dangerous things. They don't want a model that's going to encourage a suicidal user to commit suicide. So you have to have some ground rules for what the model is allowed to do, and there are more subtle things too. Often the companies that are running these models have restrictions on soliciting personal information from users or divulging personal information, so you have to have those kinds of checks as well. Red teaming is breaking chatbots so that they can be fixed.

Nathan Labenz: (1:45:39) So where are we in that fixing process and how much concern do you have? Obviously, some because you're involved in the red teaming capability building, but how concerned big picture are you about AI safety issues?

Riley Goodside: (1:45:55) I think the concerns are real. A lot of the scenarios they outline in the GPT-4 technical report, like making it easier for people to order custom-made chemicals at home, I don't think those are that far-fetched. I think it is important that we have some assurance that these models aren't going to be used to perpetrate crimes, that we're not going to have gross misuse and abuse, with implications for spam generation and so on. It makes sense that anyone deploying these models as a service needs to worry about misuse. This really is a new category of misuse potential that you're not accustomed to thinking about: if you deploy a dating site or something, whatever, it can also be used to tell you how to make a bomb. The pre-trained models have opened up this new possibility that the model is just too capable. I never really envisioned that red teaming would be quite this important. When I first tweeted about prompt injection, back in September, I thought I was doing a PSA on the importance of safe quoting of inputs. I didn't realize what a high-severity problem this was.

And the difficulty of fixing these models, it's really telling that even for something like the Bing release, with all the effort that went into aligning GPT-4, they couldn't stop the thing from exploring its shadow self like Kevin Roose did with it. So then there are fixes on top of that, like limiting the length of the discussion and secondary checks that refuse to display messages when the model detects that it's gone off the rails in some way. I think it's getting easier in some ways. The one good thing that seems to be true is that as these models scale up, there are more subtle rules that you can tune them to follow. And I believe it's working on the whole. I think that we are getting safer models because of RLHF and techniques like Constitutional AI. I don't know if RLHF is the final answer; I imagine it's not. There are going to be extensions and refinements to this. The process as a whole is necessary, and also I don't think that it's necessarily as at odds with capabilities as people imagine.

RLHF makes models safer, but it doesn't only make them safer, it also makes them more capable. If you want to try using a non-RLHF model, you can, but you'll find that they're very difficult to prompt. That was a lot of the initial goal of InstructGPT, as we went from the pre-trained era to the instruct era, as I mentioned before. Safety isn't just getting it not to swear and not to say racist or offensive things; it's getting it to answer questions, it's getting it to follow directions. And as we moved into the RLHF era, it's not just that it's getting better behaved or more civilized, it's becoming more capable. I think the first-order thing that people need to see with RLHF is that it is making the model smarter.

Nathan Labenz: (1:49:50) Let me throw one safety pet theory of my own at you, and then I'll ask you a couple of quick hitters to close us out. So again, we're GPT-4 plus 8. I've got this theory that we're in the perfect Goldilocks zone right now. We just got here, but I feel like we just entered this Goldilocks zone where we have models that are really capable, that can do amazing stuff for us, that can be a second-opinion doctor and, with Med-PaLM 2 hitting expert level consistently, maybe can even be your frontline doctor. That is awesome, and it's certainly going to change a lot of things for the very good. It's also going to probably cause a lot of disruption. Seems like we can probably adjust to all that disruption; certainly we've had changes to the economy before and all that sort of thing.

But at the same time, it seems like we don't really know what goes on inside the models very well. We don't have great interpretability. There's a lot of great work coming out, but nobody credible that I know would say we have a good handle on what goes on inside a model today. And that's why we have all this stuff: that's why we have you scouting and me scouting, you exploring and red teaming. You've put probably thousands of hours in front of the playground, and I've certainly put my own thousand-plus over the last year. So we're just out there exploring, exploring, exploring. But those are very surface-level attempts to understand. The best we have, but it only goes so deep.

So I kind of feel from that that it would be wise to stop here, not rush to scale up to another 100x compute or another 1000x compute for GPT-5 just yet, and instead focus on the interpretability side, focus on the control side, focus on refining and fine-tuning into the particular niches for the more advanced tasks that we want to run, etcetera, etcetera. And then we can return to greater orders of magnitude of scale when we have a better handle on all that stuff. How would you react to that prescription?

Riley Goodside: (1:52:22) Yeah, I think it's hard to decouple the 2 in practice. I think that anything you do to make the models better aligned is also going to make them more capable. It increases the scenarios in which you can deploy the model safely. That lets you put it in charge of more responsibility if you believe the model is better aligned. I don't think these 2 things are really so much at odds. And I think it's sort of hard to pause one without pausing the other. Well, I mean, it's hard to pause either of them.

Nathan Labenz: (1:52:57) Yeah, I don't expect my prescription to be taken by any means.

Riley Goodside: (1:53:02) Yeah, yeah, right. So I think we want to accelerate alignment research as much as possible because there isn't any realistic prospect of slowing down research into capabilities, I think.

Nathan Labenz: (1:53:14) Sobering thought, but I don't disagree that it seems very tough. All right, let me give you a couple of quick hitters, and then we'll get out of your way. And again, appreciate all your time. You've been very generous with it. So 3 quick-hit questions I always ask at the end. You've told us all about your exploration of language models. Any other AI products, aside from the core obvious playground-type experiences, that you think are awesome and would recommend people try out?

Riley Goodside: (1:53:44) I mean, I've seen some cool projects lately in like text to video. I think that area is going to be big, so I'd keep my eye on that. I'm drawing a blank on what the name of the one project was that impressed me recently, but yeah, there's cool things happening in that space.

Nathan Labenz: (1:53:58) Runway has definitely made some news with like Gen-1 and Gen-2.

Riley Goodside: (1:54:02) Right, that's the one I was thinking of. Yeah, that's pretty cool stuff.

Nathan Labenz: (1:54:06) Hypothetical scenario. It's some amount of time in the future, and 1 million people already have the Neuralink implant. If you got one yourself, it would allow you to have thought-to-text; in other words, it can translate what you're thinking into inputs for a computer. Would you be interested in getting one?

Riley Goodside: (1:54:35) I mean, 1 million people already have it, maybe. Yeah. That sounds like pretty FDA approved at that point. But, yeah, I mean, I don't know if I'd want to be one of the early ones, but I think that that's where things are heading eventually. No pun intended.

Nathan Labenz: (1:54:54) That's been one of our more polarizing questions because we've certainly heard all manner of answers, including like, I'd get it now, and on the other hand, like, you know, never. So it puts you honestly right in the middle. Just zooming out, big picture as much as you can possibly zoom out, thinking about like the rest of the decade. What are your biggest hopes for and fears for AI as it permeates all parts of society?

Riley Goodside: (1:55:24) My personal estimate is that we'll probably be hitting AGI within the next decade. After that, it's hard to say what happens. I think a lot of the particulars depend on the technical implementation, on what we get right as we move up to AGI. There'll be an interim period where AGI is as smart as humans at anything, but not quite capable of, as they say, just repeatedly and exponentially increasing its own capability. There'll be some adjustment period, but I'm optimistic that with the benefits of early AGI and near-human intelligence, we'll be able to make better progress on how to align these models safely. That's my hope.

Nathan Labenz: (1:56:08) Riley Goodside, thank you for being part of the Cognitive Revolution.

Riley Goodside: (1:56:11) All right, thank you so much.

S2: (1:56:13) Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use CogRev to get a 10% discount.
