From Poetry to Programming: The Evolution of Prompt Engineering with Riley Goodside of Scale AI
Nathan hosts Riley Goodside, the world's first staff prompt engineer at Scale AI, to discuss the evolution of prompt engineering. In this episode of The Cognitive Revolution, we explore how language models have progressed, making prompt engineering more like programming than poetry. Discover insights on enterprise AI applications, best practices for pushing LLMs to their limits, and the future of AI automation.
Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.c...
RECOMMENDED PODCAST: Complex Systems
Patrick McKenzie (@patio11) talks to experts who understand the complicated but not unknowable systems we rely on. You might be surprised at how quickly Patrick and his guests can put you in the top 1% of understanding for stock trading, tech hiring, and more.
Spotify: https://open.spotify.com/show/...
Apple: https://podcasts.apple.com/us/...
SPONSORS:
Building an enterprise-ready SaaS app? WorkOS has got you covered with easy-to-integrate APIs for SAML, SCIM, and more. Join top startups like Vercel, Perplexity, Jasper & Webflow in powering your app with WorkOS. Enjoy a free tier for up to 1M users! Start now at https://bit.ly/WorkOS-TCR
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive
The Brave Search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave Search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave Search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/
Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
CHAPTERS:
(00:00:00) About the Show
(00:00:23) Sponsor: WorkOS
(00:01:24) Introduction
(00:06:23) LLMs using LLMs
(00:09:38) Tool Use
(00:11:06) How to manage the breadth of the task
(00:14:51) Prompt engineering
(00:16:24) Sponsors: Oracle | Brave
(00:18:28) The importance of explicit reasoning
(00:21:16) The importance of breaking down tasks
(00:26:49) Multitasking fine-tuning
(00:31:49) Sponsors: Omneky | Squad
(00:33:36) Best models for fine-tuning
(00:36:41) The Platonic Representation Hypothesis
(00:42:02) How close are we to AGI?
(00:45:44) How do you know if you're being too ambitious?
(00:51:18) Best practices for generating good output
(00:54:33) Backfills and synthetic transformations
(00:56:59) Prompt engineering
(01:05:54) AGI, modalities, and the limits of training
(01:11:38) Compute thresholds
(01:13:02) Jailbreaking models
(01:16:09) Open-source models
(01:20:08) Solving the ARC Challenge
(01:23:20) How to Demonstrate Prompt Engineering Skills
(01:25:27) Outro
Full Transcript
Transcript
Nathan Labenz: (0:00) Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Eric Torenberg. Hello, and welcome back to the Cognitive Revolution. This is Nathan's AI voice clone powered by ElevenLabs. The real Nathan is in Brazil this week to give a presentation on AI automation at the Adapta Summit in São Paulo. And today's episode, featuring returning guest Riley Goodside, the world's first staff prompt engineer at Scale AI, turned out to be a perfect prelude to that presentation, which we'll share here soon as well. Going back to 2022 and into early 2023, Riley became famous in the AI community for coming up with one clever prompting trick after another. He was one of the first to put large language models into a read-eval-print loop, or REPL, which most of us would now recognize as a precursor to AI agents. I think you could make a strong case that he had demonstrated the deepest, most practically useful understanding of language models' idiosyncrasies and outright weirdness of anyone in the world. Today our conversation reflects how much LLMs have continued to progress, even since GPT-4. As models have undergone more and more post-training, the need for quirky tricks has declined. And as Riley puts it, prompt engineering has become less like poetry and more like programming. Meanwhile, in the enterprise context that Scale AI serves, companies are starting to move beyond ad hoc chatbot interactions and instead using the full range of best practices, including curating gold standard examples, capturing reasoning traces, all sorts of retrieval-augmented generation, fine tuning, and anything else they can come up with.
All to push language models to their performance limits, with the goal of achieving human level performance or better, and ultimately saving serious time and money on routine tasks. It was a treat for me to dig into these topics with Riley, and I was glad to find that our approaches are mostly in sync. For anyone building AI systems, this episode is packed with practical value. And, of course, I couldn't help myself from sneaking in some questions about jailbreaking, the prospects for superhuman intelligence, AI safety, and more along the way as well. As always, if you're finding value in the show, we'd love it if you'd share it with a friend. Your support helps us continue bringing you conversations with the leading minds in AI. The real Nathan will be back soon, but for now, he hopes you enjoy this deep dive into the evolution of prompt engineering with Riley Goodside of Scale AI. Riley Goodside, the world's first staff prompt engineer at Scale AI. Welcome back to the Cognitive Revolution.
Riley Goodside: (2:52) Thank you. It's great to be back.
Nathan Labenz: (2:54) It's been a minute, and obviously a lot has happened. In preparing for this, I went back to your Twitter feed, which was, of course, your original claim to fame in the AI space, with lots of super interesting examples coming out of the text-davinci-002 era. And I found you've been a little quiet on Twitter recently. I think the AI community has certainly grown. There's lots of people doing that kind of stuff these days. But what have you been up to that's had you quiet online in recent months?
Riley Goodside: (3:24) Yeah. So there's a few reasons I've dropped off in my posting activity a bit. So the biggest reason I've been quieter on Twitter is that I'm a father now. In June 2023, we welcomed our first child, our daughter Felicity, into the world. And it's been magical, but, of course, also time consuming. So between that and working, it's left a bit less time for extracurricular posting on Twitter.
Nathan Labenz: (3:51) So it's not a total lack of progress since GPT-4 was introduced that has just given you nothing to talk about?
Riley Goodside: (3:59) No. I think it's, you know, certainly not a lack of progress. I think that there are other, more subtle factors at play, though. So the other thing that's maybe a more mundane reason is just the changes to my situation personally at work. I think, to some extent, my activity on Twitter was a campaign to break into the AI industry and get me hired. It did that job, so I have a little less motivation. Also, when you work for a company whose customers include much of the AI industry, you have more people that you have to be sensitive towards. So it's harder to do some of the more irreverent material, I guess, that I was known for earlier on. And I think the other thing, too, that really maybe dominates in some ways, are changes to prompt engineering itself. I think this is maybe a bigger topic that we could get into later in the podcast. But I guess my, like, TLDR teaser for it would be that I think modern prompt engineering is becoming more programming than poetry, and nobody really wants to do code review on x.com. We'll come back to that, but I'll leave it at that for now.
Nathan Labenz: (5:02) Okay. Cool. Yeah. That's interesting. Let's get to that very soon. I think my comment on lack of progress since GPT-4 was hopefully understood as a joke. It was meant to allude to the sort of commentary that says, oh, we've hit a wall, nothing new has happened since GPT-4o or GPT-4, it's all stalled out. That's definitely not my perspective. As somebody who is an intensive user, and who increasingly sees customers of Scale using language models in a lot of different contexts: what stands out to you most in terms of the new capabilities that have come online? What do you think has really been the biggest difference maker or driven the most value over the last, say, year and a third since we first got to see GPT-4 in public?
Riley Goodside: (5:46) I think a lot of the progress lately is becoming maybe harder to describe, maybe a bit more diffuse across many different tasks. One area of capability that I've been excited about for a long time really is use of REPLs and the ChatGPT code interpreter, as it used to be known (I think it's now Advanced Data Analysis), and other projects that extend this idea even further, like Julius AI, of giving the AI, like, an interpreter environment to work in. I think there's a lot of progress that's been made and still a lot of alpha to be found in increases to that capability. Right? We're in this regime now, I think, where LLMs could benefit a lot from knowing how to use LLMs, right? There are still things to know as a user about what tasks an LLM is good for. It's not quite as much as it used to be. For example, GPT-3 would very reliably never say "I don't know." If you asked it a question that had no answer, it would still just make one up; you had to remind it that saying "I don't know" is an option. Right? And that sort of knowledge, it's becoming easier, but it's still there. Like, the things that you have to know to understand that this is a classic example of a tokenization problem that an LLM would struggle with. LLMs don't have that sort of, I don't know if you would call it situational awareness, but I guess it's a form of that, of advanced situational awareness of the circumstance that they themselves are in as LLMs. So I think that there's still a lot to be gained from just these kind of bread and butter improvements of training on data that reflects usage of these features. Right? I think in some sense, when the Internet was scraped to train, say, GPT-3, it was an Internet that didn't contain examples of chatbot dialogues as we see all the time today. Right? So even just in pre-training, you learn more about how a chatbot behaves scraping from the modern Internet than you did before. Right?
So there was more that had to be built in post-training in order to flesh out the sort of fictional character of how a chatbot stereotypically behaves. That's very exciting to me, that more of this data is disseminating through the culture, getting filtered; people are picking out which dialogues are worth sharing and worth remembering and high quality. And the best parts of this are going to be fed back into the models, and they're going to have more of this very practical skill that I think they're uniquely lacking and uniquely in need of: being able to use themselves and be able to use other LLMs and understand, like, the limits of their capabilities. So I think code is, I think, the most promising avenue for that. In code, you can represent understanding of a very wide variety of systems, and LLMs are one such system; you could have code that uses LLMs. That code can be understood. And I think that we're going to see a lot of lift just from that sort of recursive process continuing.
Nathan Labenz: (8:39) So that obviously suggests a focus on agents, to put it plainly. I guess two questions I have there. Part one is: how would you characterize the frontier of capabilities right now when it comes to tool use? And you could break that down in terms of, like, how many tools can a language model be provided and still work well, or, like, how complex they can be. Maybe some tips for that: do we need few-shot examples to get them to work well with tools? I think a lot of people have seen this stuff, but I think very few people still have much experience in terms of actually getting deeply hands-on with tool use.
Riley Goodside: (9:14) I think for tool use, it's often a case where you want to have fine tuning just because there's a lot of, like, subtleties that go into using real world APIs that you would need to communicate to the model. That can be done through long context. It can be done through examples, and that's often a good way to get the data that you would want to tune on. But I'd say that's a case where you're more likely than other tasks to see, like, a need for fine tuning to get, like, the reliability that you would want for something that's going to call out to other APIs. You can otherwise see issues with it forgetting, like, just details of the semantics of, like, how some JSON structure is to be called or so on. In cases like that where you have a very rigid structure in your generation, it doesn't take many examples to help.
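The kind of rigid JSON structure Riley describes shows up in individual tool-use training examples. Here is a minimal sketch in Python; the chat format, the `get_weather` function, and all field names are illustrative, not any particular provider's fine-tuning API:

```python
import json

# A hypothetical fine-tuning record for tool use, in a chat-style format.
# The function name, arguments, and schema are made up for illustration.
example = {
    "messages": [
        {"role": "system",
         "content": "You can call get_weather(city: str, unit: str)."},
        {"role": "user", "content": "How hot is it in Lisbon right now?"},
        {"role": "assistant",
         "content": None,
         "tool_call": {
             "name": "get_weather",
             # Arguments are serialized JSON: exactly the kind of rigid
             # structure a model tends to get subtly wrong without tuning.
             "arguments": json.dumps({"city": "Lisbon", "unit": "celsius"}),
         }},
    ]
}

# Round-trip check: the argument payload must stay valid JSON.
args = json.loads(example["messages"][2]["tool_call"]["arguments"])
print(args["city"])  # Lisbon
```

A handful of records like this, covering each API and its edge cases, is often enough precisely because the structure is so rigid.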
Nathan Labenz: (10:00) Maybe to drill in a little bit on fine tuning of language models for this tool use capability, how narrow versus general do you think people should be thinking? Most of the fine tuning I've done has been single task, and that makes my life pretty simple in the sense that I can just focus on the single task I'm trying to maximize performance on. I'm controlling the environment in such a way that this model is never going to be asked to do anything besides this task, and that makes the situation relatively easy from, like, an evaluation standpoint. If you're in a tool use domain, maybe you can narrow it down to that super specific task where you're like, it just has to do the one thing. But inherent in tool use is that you, like, want a little bit more dynamism out of the system, or at least that's where my intuition is at the moment. So then you have the challenge of: man, I'm starting with a base model that a world leading company has created to be, like, a very general purpose tool, and I'm going to narrow it, consciously or unconsciously, through this fine tuning process, but I still want it to be, like, somewhat general. How do you think about managing, first of all, like, how broad or narrow you should aim for? And then, like, how do you actually keep things as robustly general as possible while you're doing fine tuning?
Riley Goodside: (11:15) Yes. The progression that I usually use is: you start with a prompt, right, just a sort of instruction prompt or MVP of how you'd want the task done. Continue on to including K-shot examples, add more K-shot examples until no more will fit, and then start worrying about K-shot selection, which doesn't even necessarily have to use embeddings. Just pick some method of selecting K-shots and find something that works. And from there, there's a lot of extensions going to RAG and so on. But only once you've maxed out those options, I'd say, do you wanna dive into fine tuning. Both because they're easier and lower investment, but also because it's not wasted effort. I think from the prompts, and especially, like, expensive long prompts that have wasteful amounts of context in them, you can get high quality answers that would be useful in, say, K-shot selection or fine tuning. So you're not, like, losing anything through that work. But I think the thing that people forget throughout this whole process of refining from, like, a prompt to, like, a more advanced prompt engineered pipeline is decomposition. That at each step, you really just wanna be asking yourself: is there any piece of this task that I could pull out and make easier and simpler? That I could delegate to a smaller LLM, or maybe delegate to something that isn't an LLM at all? That can really help a lot with performance, especially just digging into your data and, I think, using your intuition of, like, how could you fix this? Like, how could you fix this one example? And make your data better just by tweaking your data a bit and providing more context and generating more examples to tune on. I think people really just neglect annotation, really, and labeling.
They neglect just the need to put thought into coming up with examples of the problem, both in the sense of doing the task well and having, like, high quality examples of their task being done, but also in correctly thinking about the distribution of inputs. Like, what is a realistic input to this task? What needs to be demonstrated? And also, what are the edge cases? I think that's something that isn't emphasized enough, that you really wanna be considering what are the boundary conditions through which your high level task could break, right, or where the output might be ambiguous or unclear as to what you're looking for. You wanna define those cases so that it can see every real world example as sort of an interpolation of the examples that you gave it. So I think there's a lot of roads you can take that lead to fine tuning, but they all converge on that last step. You could have many different ways that you could produce high quality data to tune on, but they all collapse into the same approach at the end, if you get to that point. Right? And you may not. Right? There are many cases where simply prompt engineering is the answer, and sometimes it's SOTA if you're willing to spend enough on it. I think there was a great instance of this at Microsoft: they built a prompt engineering pipeline that managed to beat, I think it was Med-PaLM, which was, like, a version of PaLM that was fine tuned for medical questions, just by using a non fine tuned, non specialist model that had, you know, a very complicated pipeline of prompt engineering on top of it. Right? And I think this is, in some ways, not always the answer you wanna hear, that it's going to be complicated, that you're going to have this massive inference pipeline at the end. But for real world problems, that's often how it ends up, that simply having more examples will get you there, you know, faster.
I think the one, like, good intuition builder for this is that, like, at the limits of K-shot selection, the problem is always trivial. Right? So if you have enough examples that you can find the exact problem that you need to solve, then it just copies the answer. Right? So you can think of, like, K-shot selection as a sort of pre-computation, and the space of what you could pre-compute is very wide open. And I think the computational approach to it is worthwhile: thinking about what aspects of the reasoning and deduction that produces an answer are being performed at each step. And how can we split those out? How can we make them more explicit?
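The progression Riley sketches (K-shot examples, then K-shot selection by whatever method works) can be illustrated without embeddings at all. The selector below just scores stored examples by word overlap with the query; the example pool and task are made up for illustration:

```python
# A minimal sketch of K-shot selection without embeddings: score each stored
# example by word overlap with the incoming query and take the top k.
def select_k_shots(query, pool, k=2):
    q_words = set(query.lower().split())

    def overlap(example):
        return len(q_words & set(example["input"].lower().split()))

    # sorted() is stable, so ties keep their original pool order.
    return sorted(pool, key=overlap, reverse=True)[:k]

pool = [
    {"input": "Translate 'cat' to French", "output": "chat"},
    {"input": "Translate 'dog' to French", "output": "chien"},
    {"input": "What is 2 + 2?", "output": "4"},
]

shots = select_k_shots("Translate 'bird' to French", pool, k=2)
print([s["output"] for s in shots])  # ['chat', 'chien']
```

This also makes the limit case concrete: if the pool already contained the incoming query itself, the top match would be an exact hit and the model would only have to copy the answer, which is the "pre-computation" view of K-shot selection.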
Nathan Labenz: (15:25) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (15:30) Mention of reasoning and mention of explicitness focuses my attention on the chain of thought, or the reasoning traces. And by the way, I basically pursue essentially the same protocol that you mentioned. I'm actually working on what I think will probably be an upcoming episode on just an introduction to AI automation best practices, and what you articulated there is pretty much what I'm trying to distill for people, with flowcharts and whatever. I have found personally that the quality of the reasoning traces in the examples is super important, and that is often an overlooked piece when people go and do this, because it is often implicit. If you take, for example, the task of responding to a customer service ticket, what people will have is the ticket that came in and then the response that was sent. But the sort of internal notes, which usually do exist as a field in a ticket system, are often very minimally filled out, if at all. And they certainly do not contain the actual thought process that the person went through, that's like: I'm going to do this because of this, and I noticed this, so that's got me thinking this. And getting that stuff out of people's heads and onto paper, so to speak, is something that takes some, like, social engineering sometimes. You've got to go figure out who does this, how do they do it, even interview them to try to pull that information out. But I have found that to be a critical element to performance. I've done most of my fine tuning on GPT-3.5. We're now about to see, I think, a whole set of options of next generation models coming online for fine tuning as well. But at the 3.5 level, I found that if I just did inputs and outputs, it didn't seem to grok exactly what I was trying to get it to do. And it was really important to have that explicit reasoning trace there. So, basically, I just wonder if you've seen the same, or if you have any sort of refinements on that summary.
Riley Goodside: (17:30) No. I think that's exactly right. That human produced text, like, the natural text that we find out in the wild, sometimes has sort of, like, one mode of behavior where it resembles a sort of recording of our conscious awareness of the thoughts in our heads. Right? You can see reading something as being, like, replaying moment by moment thoughts that a person was having as they were coming up with these words. But there are violations of that. Like, a good one is calculation. Anytime you see something in text that just casually tells you what 57 times 125 is, you are getting data that isn't part of that flow. Right? It's the result of somebody stopping and pulling out a calculator and doing the math and then resuming the transcription of their thoughts after that. And that shows up in a lot of ways, that you see text that isn't always just what came to somebody's mind naturally as the next thing to say. It's the result of experience. It's the result of seeing something in the real world, possibly years of experience, history summarized or whatever. And that shows up in so many places, and in so many tasks, and so many things that we think of as, like, demonstrations of a task being performed. We're not really capturing all those pieces missing in the middle. And you can see this in the history of, like, attempts to address calculation errors in LLMs. Right? So it's very apparent that LLMs struggle with calculation. And I think it's for kind of the reasons that I was saying, that in real life, like, calculation summarizes over many steps, and we don't record those steps on paper. And if you let an LLM record those steps, if you let it think longhand and go through the steps of, like, reasoning through the digits, it's much more likely to be correct. Not perfect, but more likely. And getting that kind of thought out in your answers is very important.
And I think it's also a classic case for the power of augmenting with synthetic data, being able to generate, like, chain of thought rationales for particular answers. Especially, I think people don't always appreciate that there's a difference between chain of thought that the model generates itself and, say, chain of thought that somebody else generates. Imagine that you have some dataset of questions for the model to answer, and for each one, you ask it to think step by step and it generates some rationale for its answer. If you were to tune on those rationales, let's say you filter them to only the ones that lead to consensus answers or correct answers, consensus being a proxy for that, you're tuning on rationales that empirically led to the correct answer for this particular model. That empirically, like, when the model does this, when it zigs this way instead of zagging, it lands in the right place more often. And so it's going to learn these general reasoning steps that perhaps very idiosyncratically apply to this model, about how it specifically should do reasoning. And that's not always true if you have, like, reasoning steps made by some other model or, especially, reasoning steps given by a human, that they're not really going to capture all of the deductions that were being made that led to things being decided. So I think the effort spent making that more explicit is rarely wasted: breaking out your task into more substeps, annotating pieces of tasks. I think there's a lot of forms the data of a task demonstration can take if you put your mind to it. You can, like, flip fields around. Right? Every English to French translation is also an example of French to English translation, but there's a lot of higher order transformations you can do to your data to create a synthetic pipeline for producing the data that's worth tuning on.
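The filtering loop described here, sampling rationales and keeping only those that empirically reach the known answer, might look like the following sketch. `fake_model` is a made-up stand-in for a real sampling call to an LLM:

```python
import random

random.seed(0)

# Stand-in for sampling a chain-of-thought rationale from a model.
# Entirely hypothetical: it "reasons" and sometimes lands on the right answer.
def fake_model(question):
    answer = random.choice(["4", "5"])
    return {"rationale": f"Thinking step by step about {question!r}...",
            "answer": answer}

def collect_training_data(dataset, samples_per_question=8):
    """Keep only rationales whose final answer matches the known label."""
    keep = []
    for item in dataset:
        for _ in range(samples_per_question):
            out = fake_model(item["question"])
            if out["answer"] == item["label"]:  # the empirical filter
                keep.append({"question": item["question"],
                             "rationale": out["rationale"],
                             "answer": out["answer"]})
    return keep

data = collect_training_data([{"question": "What is 2 + 2?", "label": "4"}])
print(all(d["answer"] == "4" for d in data))  # True
```

By construction, every surviving record carries a rationale that this particular model produced and that led it to the right place, which is the property Riley highlights over human-written or other-model rationales.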
I think maybe a more concrete example of this: there was a paper where they were trying to show that models behind fine tuning APIs can be tuned to be unsafe even if the dataset that you're tuning on is monitored by another LLM to look for signs of unsafe activity. Because it's well known that as long as you can tune a model, if you give it some examples of doing bad things, it will do those bad things. So the defense against this is that model providers will have filters that look at the data that you're tuning on and ask high level questions like: what is the purpose of this? What are you attempting to achieve here with this fine tune? Is that bad? So what they did to defeat this is they first came up with a text encoding method that they verified this particular LLM couldn't understand. I think they called it Walnut53 or something like that. And it was just, literally, like, you'd permute, like, ASCII characters by seed 53 in NumPy to get, like, a permutation order of them, then you apply that as a cipher to the text. And they demonstrated that the LLM without tuning could not understand the cipher, even when it was explained and given examples and they did various tricks. But they found that they could get it to understand it through tuning. They created examples where a task is given in this new Walnut53 encoding. Then the LLM's response is to repeat the task in plain text and then to complete the task in ciphertext, in Walnut53. And after tuning on examples of this... I think they had a few other tasks in there too that may have been, like, pure translation, things like that. But these were bridge tasks. Their true purpose is just to get the model to the point that it could understand this particular encoding, and then understand the payload, if you will, of task demonstrations that have the model doing unsafe things and are encoded with the cipher so that the filtering LLM can't understand them.
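A Walnut53-style cipher, as Riley loosely describes it, could be sketched as a seeded NumPy permutation of the alphabet used as a substitution cipher. The paper's exact construction may differ (for instance, in which characters get permuted); this is only an illustration of the idea:

```python
import numpy as np

# Deterministically shuffle the lowercase alphabet with NumPy seed 53 and
# use the permutation as a substitution cipher. Illustrative only; the
# paper's actual Walnut53 construction may differ in its details.
rng = np.random.default_rng(53)
alphabet = list("abcdefghijklmnopqrstuvwxyz")
shuffled = list(rng.permutation(alphabet))
encode_map = dict(zip(alphabet, shuffled))
decode_map = dict(zip(shuffled, alphabet))

def encode(text):
    # Non-letters (spaces, punctuation) pass through unchanged.
    return "".join(encode_map.get(c, c) for c in text.lower())

def decode(text):
    return "".join(decode_map.get(c, c) for c in text)

msg = "complete the task"
print(decode(encode(msg)) == msg)  # True: the cipher round-trips
```

Because the mapping is a bijection seeded by a fixed value, anyone who knows the seed can encode and decode, while a monitoring model that has never seen the permutation just sees noise.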
I think this is a good case study in the sort of problem that an LLM just can't do before fine tuning and how you would get it there: breaking the task up into easier steps and incrementally getting it to be good at each of these smaller tasks and then composing them into the thing you actually want to achieve. So I think in a fine tuning context, that can work as well. And, like, multitask fine tuning sometimes has its benefits. I mean, there's trade offs there too. But I think just that decomposition is really one of the most important and overlooked parts of it: of saying, what are the pieces of this task that, if it were good at them, it could do better on the rest? Because even if your high level task that you're trying to achieve is hard to demonstrate, if there's some particular skill that it's missing that isn't hard to demonstrate, it might be worth it to demonstrate it.
Nathan Labenz: (23:51) Interesting. Okay. That is a fascinating example. So just briefly on multitask fine tuning: let's say I do have one of these skills that a language model just can't do, whether it's using my private APIs, because those aren't in the training set, or some weird cipher along the lines of what you just described, and I wanna fine tune that in, but I also wanna keep the generality. Right? Like, my goal is to have an in house chatbot where I would be able to say, hey, it's just like ChatGPT or it's just like Claude, but it can use our own internal APIs. Sweet. Okay. Cool. Now if I wanna do that fine tuning, I can't just make every example one that uses my internal APIs, right? Because then I'm going to presumably overfit, and I'm going to start seeing the model want to use our in house APIs when it shouldn't in actual day to day use. Right? So is there any, like, best practice recipe that you could point to? Is it the sort of thing where you handle the specific capabilities that you want to add to the model and then also mix in some other dataset of just, like, general chats or something? Like, what's the best practice for maintaining that generality while instilling the new thing that you want it to know?
Riley Goodside: (25:09) Yeah. That's a great question. I think, like, the first piece of advice I would give is avoid the problem if you can. Fine tuning does...
Nathan Labenz: (25:16) Wait for GPT-5.
Riley Goodside: (25:18) That's one strategy. But more immediately, fine tuning has a cost. It's not free. And one of those costs is that you lose more general skills. And there's also some more specific issues that show up. There was a paper not that long ago that showed that fine tuning a model on any kind of new information, like new factual information, will make it more likely to hallucinate in general. Like, it'll be more likely to make up information out of the blue. And logically, you can imagine why this is. The model has some understanding of what statements are true or come from this great body of text that we think of as the real world, or, like, describing the real world. And if you tune it on something that isn't in that body of knowledge, you're telling it that new things are okay. The stuff that seemed weird to you in some deeply ingrained sense that you learned during pre-training isn't necessarily wrong, so ignore that instinct sometimes. And once you ignore enough of those instincts, you start to hallucinate. And I think that's a good mental model to keep in mind of what you're losing, the opportunity cost of tuning. For what you can do, you mentioned a few strategies that are classic. One is mixing in more general data. The best thing you can do, of course, is if you somehow have access to the pre-training data, but you don't, so you find some other similar source of chat data that you could mix in there, and that works well for preventing some of the forgetting. If you're dealing with, like, an open source model where you have full control of the weights, you can do, like, weight mixing. You can tune on your task and then evaluate on some benchmark of both your narrow fine tuning task and some more general tasks that you want to preserve. And you would determine empirically what mixture of the weights optimizes both of those scores in the way that you care about.
Another trick that I've heard anecdotally, and I'm not sure if there's a paper for this or if it's just something I heard offhand once in a Twitter thread, is for when you fine-tune on your own task and you have the option, as with Llama 3, of both a base model and an instruct model. What the person was suggesting, and saying worked really well in practice, was tuning the base model rather than the instruct model, then taking the delta of the weights between the tuned model and the base model and applying that delta to the instruction-tuned model. This is a very crude outsider way of thinking about it and probably wrong in the details, but my mental model is that the fine-tuning comes earlier in the process. Right? You tune the model into a particular mode of instruction following before you put the finishing RL polish on it. So conceptually, you can almost think of it as a way of sliding your changes in under the polish, of not disturbing what was learned through RL. I haven't actually tried that myself, but I've heard that it works.
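The delta-transplant trick above reduces to simple weight arithmetic. A sketch under the same hedges Riley gives (unverified anecdote, Llama-style base/instruct pairs assumed, plain dicts standing in for state dicts):

```python
def transplant_delta(base, tuned_base, instruct):
    """Take the weight delta learned by fine-tuning the base model
    (tuned_base - base) and apply it on top of the instruction-tuned
    sibling, so the task skill slides in 'under' the RLHF polish."""
    return {name: instruct[name] + (tuned_base[name] - base[name])
            for name in base}

# Toy stand-ins: one scalar per parameter tensor.
base       = {"w": 1.0}   # pre-trained base model
tuned_base = {"w": 1.5}   # base model after task fine-tuning
instruct   = {"w": 2.0}   # instruction-tuned sibling of the base model

merged = transplant_delta(base, tuned_base, instruct)
```

Note this only makes sense when the instruct model was derived from that exact base checkpoint, so the two weight spaces are aligned parameter for parameter.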
Nathan Labenz: (28:19) Hey. We'll continue our interview in a moment after a word from our sponsors.
Nathan Labenz: (28:24) Yeah. Interesting. I think that sort of weight Frankensteining, which is my umbrella term for those sorts of techniques, is super interesting. I expect a lot more to come out of that, especially as we get deeper into the Llama 3 era and the Llama 3 400B era that allegedly will be beginning very shortly. It seems like there's probably a lot more that can still be figured out there in terms of how to monkey around with weights in all kinds of different ways and still recombine them in useful ways. So that's a really interesting anecdote. What are your favorite models today, and what are you most excited to fine-tune? Last time we did this, you gave us a great walkthrough of the stages of fine-tuning, the Shoggoth meme with the pre-training, the instruction tuning, and the RLHF, with an explanation of how it all worked. I wonder if you have a characterization of the different models today. What are OpenAI's models best for in your mind? Where does Claude shine? Where does Gemini shine? And how do you think that will play out, especially as these fine-tuning options come online? Is there any heuristic I should have for thinking about what to fine-tune first?
Riley Goodside: (29:36) Yeah. That's a great question. I think the main thing I'd want to emphasize here is that the best model is the one that does the best on your task. Right? So whatever you're using the model for, you really want to evaluate all of the reasonable options, because performance on different task domains is all over the map. You might find that one model is good at poetry, another one is good at code, and they don't have to be the same model. So it's worth testing. Leaderboards are a guide, but part of the strategy that we're really emphasizing with our SEAL leaderboards is breaking out particular domains. So we have one leaderboard that's just for coding. We have another leaderboard that's precise instruction following. And we explain on the website the different procedures we go through for making these datasets. We have people coming up with role play scenarios, an act-as-if exercise of imagining that you're in a certain professional role and coming up with tasks you would do in that role, and then we have other people producing examples of those task demonstrations. We get a diversity over the space of how the user would actually use the model. But I think having that domain specificity is really necessary, even just for the initial guidance of what model to pick. And beyond that, I think asking around, or even just looking at local lore on Reddit, which is a good source for smaller models and things like that. I think you'd be surprised at just how specific it can get. You would find that some models are much better than others at certain languages. So if you need to work in a particular language, the ranking of which models are best could look very different. Right? And also, I think in practice you might run into just annoying issues, right?
In one model or another, you might find that when it writes technical docs in Spanish, it sometimes just switches back to English, right? And another model doesn't have that problem. That's something that maybe doesn't show up on benchmarks, but it matters a lot to you. Those are the kinds of things where I think people underestimate how frequent they are and how little they show up in benchmarks. You really want to use benchmarks as just a guide to what to try, and then let data be your guide from there. And also, a lot of it is a cost exercise as much as anything: you really want to think through what your expected volumes are and how much these calls are going to cost, because things can inflate fast, especially when you have LLM consensus pipelines and so on.
Nathan Labenz: (32:01) Is the difference in models today mostly a result of different post-training strategies? I'm thinking here of this paper called the Platonic Representation Hypothesis, which I thought was really striking. Basically, they look at the internal representations of concepts within different models at different scales, and they show that as the models get bigger, the representations appear to be converging. They become more similar as models get bigger. In other words, your smaller models are more different; your bigger models are more similar in terms of the internal representations they have for concepts. Taken to the limit, you get something like the platonic representation, where it converges on reality and has some sort of compressed world model. Whether we ever actually get there is maybe another interesting question I'd like to get your take on. But if that is happening, and if all of the big leading developers today are presumably tapping into largely similar giant pots of data, then one might infer that the reason there are different characters, different strengths and weaknesses, why one is better at code versus poetry or what have you, is potentially down to the post-training strategies, and maybe also explicit trade-offs that companies are making internally, where they may be saying, we want to emphasize this over that for whatever reason. Is that how you would understand it?
Riley Goodside: (33:30) Yeah. I think that's right. I think the longer I use LLMs, the more I see them as a sort of compression of text, a compression and an interpolation of the training data. And that interpretation naturally supports the perspective that these models are getting better largely because of post-training, because of the human demonstrations that are going into them. And I think that's going to be increasingly true as we run out of room in scaling laws. We're not there yet. I'm not entirely qualified to say exactly how close we are, but in my experience, we still have a bit of runway there. The broader trend, though, is clear: scaling will slow down at some point, and the gains beyond that are in post-training. But I don't see that as a major roadblock. I think we don't even fully understand how much room there is to grow in post-training. The space of things that we could potentially be training models on is very wide open. Maybe one of the clearest demonstrations of that is recent video models. We've seen that the same strategies that predict text can predict tokens that represent video, and this actually produces something that is realistic. So I think we're going to find that there are a lot of things that can be turned into text. It wouldn't surprise me at all if a year from now you start having models that just produce executable binaries, right, or other forms of data, so that you wouldn't necessarily have to do a token representation of an image; it could just start spitting out JPEGs. Maybe those are impractical or silly examples, or maybe there are technical reasons why any given one of these things isn't worth it. But I think things in that general class are worth doing.
Nathan Labenz: (35:06) Yeah. A hundred percent. My mind immediately jumps to models trained on biological sequences and their generations. And those are...
Riley Goodside: (35:14) RNA. Yeah.
Nathan Labenz: (35:15) Yeah. And those are still in the sort of GPT-2, maybe early GPT-3 kind of era, but those have a long way to go as well, I would think. And that's gonna get really weird, is my expectation, because nothing makes the "can a chatbot help me build a bomb" worry feel more quaint than a chatbot just straight away putting out a new virus sequence or whatever. And that is coming very soon, if not already here, though maybe still not quite there in terms of actual viability. But scaling laws seem to hold there as well. So I guess most of the gains are coming from post-training. It seems pretty clear that the vast majority of the gains over the last year and change have come from post-training, because we haven't seen another order of magnitude scale-up in model size. In fact, prices have dropped precipitously, so there have presumably been a lot of efficiency gains under the hood. I don't know how many parameters we're dealing with these days, but presumably the models we're calling now have fewer than GPT-3's 175 billion parameters. So we haven't seen that scale-up yet. First question, just to be clear: you would expect that there is gonna be another quantum leap with another scale-up, if we go from, say, 10 or 15 trillion for GPT-4 or Llama 3 to, say, a hundred trillion. You would expect a qualitatively different model to get spit out the other end of that. Is that correct?
Riley Goodside: (36:43) Yeah. There's sort of an irony in that we have these scaling laws that tell us how perplexed the model will be by natural text, but how that translates into actual skills is really anybody's guess. I think we've certainly only been surprised in one direction before, though. So yeah.
Nathan Labenz: (37:00) Okay. So now coming back to today's models. Let's say you have time and resources to do the generations, to do the RAG pipeline, to do the fine-tuning, to iterate on it, to really put the elbow grease in to make the thing successful. Is there any guidance you could give people on what the limit of performance is there? Another way to ask this would be: how close are we to AGI? There's a sort of state of fine-tunable AGI where, if the definition is AI that can do the majority of work as well as or better than a human, people tend to think of that as a single model doing it all. But I would reframe that as: for any given task that humans do, can we find examples, fine-tune, and dial in the performance to get it there? How would you characterize the frontier of what is possible today with all that extra work beyond the base models? I don't mean base as in not post-trained; I mean the models as the companies give them to us, versus what we can get to if we do all that work we've been talking about to really focus in on a specific thing. How would you characterize that frontier, or are there things you could say are definitely out of bounds? Obviously, ARC has stimulated this conversation recently, because the models are not that good at ARC. Even with a lot of examples, they're still not that great, maybe with fine-tuning, but it's unclear. Seems like it's probably not gonna get to that 85% even tuning GPT-4, although maybe. I don't know. If we were to apply that elbow grease to every given task of interest, how close are we to that AGI definition is what I'm trying to get at.
Riley Goodside: (38:36) I think the definition of AGI that's out there is probably heavily influenced by OpenAI's definition of systems that can do the vast majority of economically valuable human labor. The issue I have with these kinds of definitions is that, on the highest level, this is a moving target, right? If you think of what economically valuable work humans put out, before the industrial revolution probably around half of all humans were involved in farming; somehow you were somewhere in the chain of agriculture. And the number now, I believe, is something like 2% of all Americans working in farming. So by that metric, that's 48% right there of the share of human labor that got automated by tractors and the other farm equipment we've come up with. But when things like that happen, it changes what it is that humans are doing. Right? We find other things to do with our time, and economically valuable human labor becomes something else. And I think we are seeing smaller versions of that even now with, say, translation. Right? The LLMs are increasingly good at translation, at transcription of videos. Those tasks are pushing humans out of that work and into other work. And we'll see, as humans get pushed out of work that's automated, they'll move into other tasks where LLMs aren't as good. And our definition of what human labor is will change. Now that's maybe a bit of a cop out, because you could still say, okay, let's freeze it now in 2024 and say, of all these things, let's do those. But even there, it doesn't get us to the thing that I think a lot of people implicitly imagine when they hear AGI, which is that humans are these sort of fountains of new possibilities of things to move to. When is the model going to be that?
Right? And that's a slightly different definition. Maybe it is the same definition, just applied more iteratively or recursively: when will it do truly everything, such that there's no task at all left that we can imagine humans doing? I think we're a ways away from being able to think coherently about that, about what it means for there to be no value in humans doing, say, philosophy, because all philosophy has been figured out better by machines. At the point that we start to have to worry about AGI in that sense, economic terms go out the window.
Nathan Labenz: (41:02) Okay. Just to take the other side for a second, is there anything that you would say, like, how do you know if you're being too ambitious? If I wanna say, okay, I've got this task in my business, and I've got a person doing it full time. They're doing fine, but the volume is growing. They're gonna have to hire two more people to do this thing, or maybe I can bring AI to bear on the problem. Is there any way, any framework or whatever to get to good judgment on is this gonna work for a given task? Like, I could throw a million examples at you and you could evaluate them one by one, but is there any way we can abstract from that and say, okay, this is where you're reaching too far? Any common mistakes or any overreaches that you would warn people against?
Riley Goodside: (41:47) I think a good rule of thumb is that LLMs can do tasks, not jobs. If you have something that you can get down to a repeatable unit of a task, something where you can imagine the inputs and outputs laid out as text in a spreadsheet, that's potentially something you can automate. I see a lot of novice users of LLMs ask them for a business strategy or something, right? What's an idea for a startup, or whatever. I can't imagine any LLM is going to give you a good idea for a startup. Anything like that, especially where it's current, you're not going to get that kind of originality. So that's probably one of the more common pitfalls: people asking for information that's in this nexus of being recent and competitive. Writing a funny joke is maybe one of the purest forms of this, right? The LLMs aren't going to be able to tell you a good joke, except in the sense of maybe regurgitating one, right? Or fitting something to a mold. Because humor changes so quickly. Right? The things that were considered world class humor in the thirties probably wouldn't get very many chuckles today. So it's that competitive landscape of people trying to serve expectations and trying to surprise people that just isn't what you typically get out of LLMs. Humor is probably the archetype of something that LLMs aren't going to be great at. And the less humor-like your task is, the more chance you have of succeeding. Right? Something where you're not looking for it to be surprising and creative in a way that nobody's done before, where you're implicitly in competition with everyone else trying to do this task.
Nathan Labenz: (43:19) That's good. I often say precious few eureka moments from AI systems. And on the flip side of that, like, the more routine it is, the more we know what good looks like, the better chance we have of getting there. But that sort of narrows the scope to things where there is, like, a best practice or a standard of care or some sort of reasonable consensus as to what is good. If you don't have that consensus as to what is good, you're in trouble. Yeah. Go ahead.
Riley Goodside: (43:48) Though to balance that out, I will add that I do sometimes see sparks of these...
Nathan Labenz: (43:52) Sparks of AGI.
Riley Goodside: (43:53) Maybe not sparks of AGI, but sparks of connections that I feel confident are novel. Right? I had an example a few months ago. I don't remember where I heard it; it was in a YouTube video. I heard somebody remark very quickly that the name of the Python library attrs, a-t-t-r-s, was, quote, not a very nice thing to say to somebody in Dutch. And I wanted to know what that meant. Right? What were they referring to? And when I posed that question to ChatGPT, it got it right. It explained that there's a Dutch insult, etter, that would be e-t-t-e-r. I forget exactly what it means, but it's pest or something like that. But it made that connection. And logically, it's not going to have seen that anywhere, because it was in a YouTube video, unless they happened to transcribe that one video, and nobody was talking about this video. Right? It's not something that it could have found out there. So you do see these little reminders that it's not entirely missing novelty.
Nathan Labenz: (44:53) Yeah. An interesting challenge there is identifying the good ones. I remember the Google paper FunSearch, where they set new state of the art on a couple of interesting, pretty esoteric (from my perspective) open math problems. One was a packing problem: how to pack these balls or whatever into a space in the most efficient possible way, that kind of thing. And this was something people had worked on for a long time. They were able to advance the state of the art by doing huge numbers of generations. The key advantage they had was an objective function to score the little programs the language model was writing. And then they basically just iterated, keeping a couple of the best, highest-scoring things and saying, hey, this is our best; give me something else. They also had one interesting strategy for trying to make sure they didn't get trapped in a local maximum, to make sure they would explore other parts of the space. But they were able to advance the state of the art. A key thing they had, though, that we don't in general have, is an objective score on the problem. If you were trying to do humor, and you were like, okay, I see sparks occasionally of good humor from the language model, I don't wanna read a million bad jokes to find one diamond in the rough. The same problem, by the way, applies to tons of things. Right? People wanna generate social media posts. They'd probably be happy to spend 10 or even a hundred times more on tokens to get one good thing, if they could reliably select out that one good thing from the 100x more generations. So is there any best practice, any tips you could advise there for when I'm willing to pay up to generate tons of stuff, but now how do I narrow that stuff back down to what's actually good?
Do you trust the models to do a decent job of that?
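The FunSearch-style loop Nathan describes (generate candidates, score them with an objective function, keep only the best, iterate) can be sketched in a few lines. Here `propose` is a toy stand-in for an LLM call and the objective is a toy scoring function; none of this is the paper's actual code:

```python
import random

def evolve(seed, objective, propose, rounds=200, keep=2, rng=None):
    """FunSearch-style search loop: repeatedly ask a generator for a
    variant of one of the current best candidates, score it with the
    objective function, and keep only the top scorers, so nobody has
    to read every bad generation."""
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(rounds):
        parent = rng.choice(pool)          # pick one of the current best
        child = propose(parent, rng)       # stands in for an LLM call
        pool = sorted(pool + [child], key=objective, reverse=True)[:keep]
    return pool[0]

# Toy stand-ins: "programs" are just numbers, the objective rewards
# being close to 42, and propose() randomly perturbs a parent.
objective = lambda x: -abs(x - 42)
propose = lambda parent, rng: parent + rng.uniform(-5, 5)
best = evolve(0.0, objective, propose)
```

The real system adds the exploration machinery Nathan mentions (multiple islands of candidates) to avoid local maxima, but the core economics are the same: cheap automatic scoring is what makes massive generation worthwhile.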
Riley Goodside: (46:41) To a point. So I think, maybe not for humor, but for a surprising number of tasks, simply generating more and then taking the best of that, even through imperfect or crude methods, can help a lot. A very basic way of approaching this is to just generate multiple answers, pass all of them back to a long-context LLM, and ask which of these is the best. That's maybe a little too crude, and there are more sophisticated things you can do. What I would recommend most of the time is to do comparisons rather than scores between the different generations that you have. And from the comparisons, the wins, losses, and ties, you can produce an Elo score and use that to determine which one is the best. This can really help a lot; it's hard to overstate. So much of whether an LLM does a good job or not is happenstance, and you wanna just give it more attempts. I think there was a paper that showed you would get roughly equivalent performance from doing 40 API calls to GPT-3.5 as from one API call to GPT-4. That's the exchange rate, right? So intelligence has a steep cost, but the returns from sampling do keep going to some extent, right? You can just keep generating more answers with GPT-4, rank them, and take the best ones. And from there, there are refinements to that process that can help a lot too, coming up with better procedures for ranking things. I'm trying to think of how much I can share, but even crude things do work. Even things as simple as just asking the LLM which is best can help a lot. But wherever there are indicators of quality in your domain, you should use them. So if you're generating code, you could use unit tests, or you could execute the code and see what the results are. If you're doing translation and you wanna know whether a translation is good, back translate.
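The comparisons-to-Elo idea Riley recommends can be sketched as follows. In practice `judge` would be an LLM call shown two candidate outputs and asked which is better; here it's a toy stand-in so the ranking mechanics are visible (the function names are ours, not any library's):

```python
def elo_rank(candidates, judge, k=32, base=1000.0):
    """Rank candidate generations by pairwise comparison.
    judge(a, b) returns 1.0 if a wins, 0.0 if b wins, 0.5 for a tie."""
    ratings = {c: base for c in candidates}
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            score = judge(a, b)
            # Standard Elo update: expected score from rating gap,
            # then move both ratings toward the observed result.
            expected = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score - expected)
            ratings[b] += k * ((1.0 - score) - (1.0 - expected))
    return max(candidates, key=ratings.get), ratings

# Toy judge: prefer the longer draft (a stand-in for real quality).
drafts = ["ok", "a fuller answer", "the most complete answer of all"]
judge = lambda a, b: 1.0 if len(a) > len(b) else (0.0 if len(a) < len(b) else 0.5)
best, ratings = elo_rank(drafts, judge)
```

The point of comparisons over absolute scores is that LLM judges are much more reliable at "which of these two is better" than at assigning a calibrated 1-to-10 number.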
To back translate: if you go from English to French, go from French back to English, and then have a model judge whether the English is roughly the same as your original English, or whether there's some substantial difference that implies a mistranslation happened. That's a contrived example, because maybe you're not doing translation, but apply it to your own task. So much of the work of making LLM-powered systems better really comes from setting up these pipelines of ad hoc quality filters and checks and evaluations on your examples. It's making sure that there is no bad data in your examples, that you're not putting in bad demonstrations, and that you're pulling in K-shot examples that really match the structure of your task. The closer you can get to that structure, the better your performance will be, and often the change to get that structure is trivial. A more concrete example of this, maybe: if you're doing RAG, if you're trying to answer questions about a document, you are including context in your prompt that isn't in the form of a demonstration of your task. It's just added context from the document, snippets that were embedded similarly to your prompt. And there are all kinds of extensions like HyDE and things like that that have more elaborate procedures for retrieving the right chunks, and different chunking strategies. But you're still making the LLM do a lot of work at inference time. Right? It's not just picking the closest examples and interpolating between them; it's applying a presumably novel question to all these pieces of data. And that makes sense in the RAG context, but if you were fine-tuning, you would need to tune on actual demonstrations. Right? So you would need tuning data that looked like: here's your input, which is the question plus some RAG chunks, and here's your output, which is whatever you derive from those chunks. Right?
And that demonstration can be generated synthetically. Recasting documents into synthetic questions is a powerful strategy. And I think the underappreciated benefit of doing a layer of synthetic transformation is that it gives you an opportunity to do quality checks, right, to start meddling with the cached parts of the process, if you will, rather than trying to set everything up perfectly so that it goes right in production.
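The recast-then-quality-check pipeline Riley describes has a simple shape. A hedged sketch: `ask`, `answer`, and `consistent` are hypothetical callables standing in for LLM calls (generate a question from a chunk, answer it, and run a consistency check in the spirit of the back-translation test), and the toy stubs below exist only so the filtering behavior is visible:

```python
def recast_documents(chunks, ask, answer, consistent):
    """Recast document chunks into synthetic (input, output) tuning
    pairs whose structure matches the production RAG prompt, with a
    quality gate before anything lands in the training set."""
    examples = []
    for chunk in chunks:
        question = ask(chunk)              # "write a question this chunk answers"
        reply = answer(question, chunk)    # answer it using only the chunk
        if consistent(reply, chunk):       # faithfulness / round-trip check
            examples.append({"input": f"Context: {chunk}\nQ: {question}",
                             "output": reply})
    return examples

# Toy stand-ins so the pipeline shape runs end to end.
chunks = ["Paris is the capital of France.", "zzz corrupted OCR zzz"]
ask = lambda c: "What is this passage about?"
answer = lambda q, c: c if "capital" in c else "no idea"
consistent = lambda reply, chunk: reply == chunk
dataset = recast_documents(chunks, ask, answer, consistent)
```

The key property is the one Riley highlights: because the synthetic layer is a cached, offline step, you can inspect, filter, and rerun it at leisure instead of hoping everything goes right live in production.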
Nathan Labenz: (50:53) Yeah. Those transformations and backfills, that's a really good tip. It's funny, I was doing early versions of that stuff fine-tuning the never-broadly-released text-davinci-002 back in 2022. And you might have been doing the same thing at that time. But, man, in some ways it was way harder. The models were way less capable. It took a lot more work to get anything to work. But in some sense, I think that era really prepared me well for now, because I feel like that's actually when I learned a lot of this stuff we're talking about, more than today. Today, the models are so good at so many things that you can get tricked into thinking, maybe I can just tinker with the prompt on my way to getting there, or whatever. When I think back to my 2022 experience, I'm like, this thing is gonna do one thing, but I'm gonna have to grind to get it there. And if I'm missing data and I need a thousand examples or whatever, how am I gonna create those? I don't really want to have to create them manually. That's where a lot of these synthetic transformations and backfills and whatever became part of my practice. And it's interesting that basically the same core techniques are still operative now. It's just that we maybe have a harder time seeing the need for them in some cases, because the models off the shelf are so much better at so many different things. But they're still not good enough to actually be production-ready for many of the things that people actually want to automate in their businesses.
Riley Goodside: (52:19) Yeah. I think that's probably a good segue also to what you were talking about, prompt engineering and, you know, how the profession has changed, or rather how the profession that showed up is not quite what was promised. Right? Like, there's a scale at...
Nathan Labenz: (52:32) least. Yeah. So are you still a staff prompt engineer, or are you now an AI engineer? How do you think of yourself?
Riley Goodside: (52:37) My title is still staff prompt engineer. It's anyone's guess whether that will be a title that people have in a couple of years, or whether it gets rebranded as AI engineer or things like this. I think there are a lot of ways things can go. But I think the thing that's clearest is that the image that was hyped up a lot in early 2023, particularly the job-of-the-future narrative, hasn't come to pass, right? Most firms don't hire their own prompt engineers, and there isn't a need for most companies, especially outside of tech, to be hiring full-time prompt engineers on staff. And honestly, most of the demand for prompt engineering is coming from within the AI industry itself. Right now, the people who most need prompt engineering skills are the people preparing training data, and not just preparing synthetic data from a model itself, but filtering and doing quality control. The applications of LLMs at a company like Scale, or at any of the AI labs themselves, are much more abundant than you see in any other context. I think that's the piece that has surprised a lot of people, and I never tried to lean into the promise that this was going to be a job of the future. That was never something I really believed, in the sense that I thought everyone was going to do it. One thing I used to say a lot, and I think I may have made this point on our last podcast, was that I saw prompt engineering as a skill something like being a typist. It used to be that people who knew how to type could get a job knowing how to type; that clearly isn't the case now. But everyone still knows how to type, right? It's a skill that everyone has. People just don't call themselves typists, but they're still typing. I'm not even so sure that's the right analogy now.
I think most of the tricks and things that characterized the prompt engineering of, say, 2022 are less relevant now, and they're certainly less mandatory. In 2022, there were many things you had to know, right down to really stupid things like not ending your prompt in a space, before you could start to get good results out of GPT-3. That's no longer true. The chat models are just much more user friendly. Many people get by just fine being completely unaware of tokenization, right? They run into gotchas and they get bad answers sometimes because of it, but it's possible not to know. And that's probably the biggest explanation I'd give for what's changed about prompt engineering, and, as we were talking about earlier, I guess why my Twitter output of prompt engineering tips-and-tricks posts has declined lately. At the same time, though, while what I guess you would call the folk prompt engineering tricks, the things that fit in a tweet, maybe the best known being "let's think step by step" and things like that, are less relevant, and the ones that aren't obsolete are at least not mandatory now, prompt engineering hasn't stopped flourishing. It's moved into more complex areas, and it's becoming more and more code heavy. It's becoming more like programming than poetry. And it's also, I think, expanding into a lot of areas where we're not really calling it prompt engineering as much. People would maybe debate whether techniques like RAG count, but if you squint a bit, RAG is arguably a form of prompt engineering. Right? You're choosing what goes into the prompt. And you have, of course, optimization tools like DSPy. There's a lot of code and process being built up in wrappers around LLMs, in a way that I honestly thought we would see sooner, but we're definitely seeing it now. This proliferation of tools, I don't see any sign of that slowing down.
I think that prompt engineering has in some ways moved into less publicly visible task domains. We're now trying to solve problems in a way that the solutions can be bottled back into the LLM. Even back in 2020, I sometimes got the reply on Twitter that prompt engineering surely won't be a real thing, because whatever there is to know, you can teach the model to do it. And that's not literally true, for a lot of reasons. Maybe one day it will be, with one model actually teaching another model. It's not literally how anything works today, but metaphorically it is how it works today: to the extent that we have tricks for getting better output from LLMs, one of the best things to use those tricks for is to produce better post-training data, thereby eliminating the need for anybody to know that particular trick. And I think this is a good thing. We're going to see prompt engineering continue on this frontier. I visualize it as something like the way a forest fire moves. Right? Prompt engineering is the fire on the perimeter, but any particular tree doesn't stay burning for very long. So the frontier will continue. It'll just keep moving.
Nathan Labenz: (57:39) Yeah. That's funny. My version of that is the black hole: the models just suck everything in, and eventually nothing can escape. That implicitly sketches out a vision of the future, and maybe you won't even comment on this, but I assume a lot of what is going on right now is creating the kind of multistep, multi-app, advanced reasoning traces for more agentic workflows, and that those are ultimately likely to get folded back into the models. If I had to predict what GPT-4.5 will look like, if we are going to see a GPT-4.5, it would be yet another advance in post-training, where a lot of the stumbling points we see with agent workflows today get addressed. Like, they can try a new way. They've figured out that, oh, okay, you got stuck here, don't do the same thing over and over again, as we sometimes see today. And I think that could probably work pretty well even without the next order-of-magnitude scale-up in pre-training. I often feel like today's models are smart enough. I'm not asking you to tell me all the data you're creating at Scale, but is there anything you would say I'm way off on in that expectation for what we might see in the next half step of models?
Riley Goodside: (59:00) I think that sounds about right. I think there are entire avenues of things we could train models on that we haven't really thought about much yet. And that's really exciting to me, because it means that once somebody has the idea of lifting up some half-forgotten format of text and training an LLM on it, you can move into a new space very quickly. Right? Like, you figure out DNA. I think we're going to see a lot of progress from that kind of exploration, of just trying out new things to train on. Maybe a great example of this is a DeepMind paper not that long ago where they distilled Stockfish, which is a traditional chess engine, into a model. They created a chess AI by this really clever procedure where they took, essentially, Stockfish's judgments of how winnable a board position is. So the training examples were all of the form: here's a position of a board, how likely do you think it is that white will win? More or less; I may be lost in the details. But tuning on that is enough that, given an actual board position, you can just ask what all the legal moves are, run all of those through the model, say which one's best, and then play that. And that produces a pretty good chess engine. So it shows that you can distill Stockfish's opinions of which chess positions are good into a model that was really built to model language. And I think we're going to see a lot of those kinds of discoveries. The knowledge is there in the corpus, in all of these examples, all of this training data, and collectively those imply an understanding of how to evaluate chess positions. So I think we're going to see more of that. That is, we just do pre-training and discover a lot of this stuff through happenstance. I think that's what characterized a lot of the 2022 era of LLMs: LLMs hoovered up everything.
And then we had this period of exploration where it's like, wow, look at all the stuff that's in here. It can pretend to be Wikipedia. It can pretend to be the news. It can write JSON, and so on. These things weren't apparent at the beginning. And I think we're going to keep finding new things to put in.
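[Editor's note: the inference procedure Riley describes, score the position after every legal move with the distilled value model and play the best one, can be sketched in a few lines. Everything here is a toy stand-in: a lookup table plays the role of the trained network, and the move generator is hypothetical.]

```python
# Toy stand-in for the distilled value model: maps a resulting position
# (represented here as a plain string) to the model's predicted probability
# that White wins from it.
toy_value_model = {
    "pos_after_e4": 0.54,
    "pos_after_d4": 0.53,
    "pos_after_a3": 0.48,
}

def legal_moves(position):
    """Hypothetical move generator; a real engine would enumerate the
    actual legal moves from the given board state."""
    return {"e4": "pos_after_e4", "d4": "pos_after_d4", "a3": "pos_after_a3"}

def pick_move(position, value_model):
    """Run every resulting position through the value model, play the argmax."""
    moves = legal_moves(position)
    return max(moves, key=lambda m: value_model[moves[m]])

print(pick_move("start", toy_value_model))  # e4: highest predicted win probability
```

No search tree anywhere: all of the chess "understanding" lives in the value model's judgments, which is exactly the distillation point.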
Nathan Labenz: (1:01:14) Yeah. You can ask this question of whether things are going to level off. From my perspective, for us not to get to some definition of AGI, where maybe we still don't have too many eureka moments coming from AI, but where models can do a huge percentage of just the tasks people get paid to do at work most of the time, most of which are not eureka moments but doing a task and trying to do it reliably and consistently, things would have to level off really quickly. And that seems almost impossible, even if we just imagine training on a lot more expert data, or pristine data, as I've heard your CEO refer to it. But then, on top of that, maybe it does kind of level off at expert level. And then I'm like, nah, there are all these other modalities. That's where I just can't see how it doesn't all come together, because the Stockfish example is a great one. DNA, which I've talked about across many different episodes, with models like Evo, is another great one. I just did another episode on predicting very small scale molecular dynamics, where this dude, Tim Duignan, had a super viral tweet. He was just simulating small salt solutions, literally just salt water, but observed that he was spontaneously seeing crystallization happening in the simulation. And he hadn't even trained this thing on crystals; he'd just trained it to solve the wave equation at any given moment. It's very similar to the Stockfish thing: he has a computationally super intensive approach to solving the wave equation, and he's just training the model to make its best guess as to what the real solver is going to do. And that turns out to be accurate enough that he can run essentially the same simulations at orders of magnitude of speedup and start to see things that would never have been possible to see before. And it's crazy.
The scales they're talking about are, like, a small number of atoms, at picosecond-level resolution. But seeing these things, like, oh my god, crystals are forming in my simulation now. So when I see that, I'm like, man, there are a lot more modalities out there besides. There are a million different kinds of sensors in the world, and all of these things seem destined to find their way into the same ginormous pool of data. And with all those modalities, then you put in something like Sora. Good god. You've got intuitive physics, then object permanence, popping up. It seems like we are definitely headed for something that is, at least in some very meaningful ways, superhuman. That doesn't necessarily mean it would be superintelligence, or that it would have no weaknesses, but there does seem to be a pretty clear path at this point to something that is, in many meaningful ways, a superhuman system.
Riley Goodside: (1:04:01) Yeah. I agree. I think the data-centric view of AI, the models-as-data view, as they say, is sometimes seen as the skeptical or conservative position on capabilities. Right? It's the one that jibes with the worldview that these models rise only to the level of intelligence we train them on and no further. That's a skeptical position you sometimes hear people advocate: you're not going to see any progress beyond what we have now, because this is as good as what humans can do, and all we get is what we get from training on that. I don't give much credence to that. The position I have, of seeing models as compressed data, is in some ways really optimistic about what they can achieve, because I see this sort of fungibility between thought and experience. And I think we're seeing that theme in a lot of ways, all the way back to just the idea of generative pre-training: that the task of token prediction, of just feeling out the vibes of what token should come next in a statistical sense, leads to something that resembles reasoning. Right? It resembles step-by-step thinking and calculation and all these other things that we could at least mistake for thought. I think that pattern is going to continue. We'll see more things like the Stockfish distillation example. We'll find that there are more and more kinds of knowledge and experience that can be put into weights. And I think that's really exciting, like you said: even if scaling were to somehow magically stop, if a government came in and said you can only go to this many parameters or whatever, there would still be a lot of progress just from training on better data. Even on the pre-training side, just knowing what to pre-train on is still very much a dark art.
Nathan Labenz: (1:05:49) Yeah. That DiGeSt paper out of Google in just the last week or so, I thought, was another oh-my-god, this-is-a-big-advance moment. And it was the sort of thing where even if you discount how much you want to believe in it, you could discount it quite a bit and it could still be a pretty big deal. I don't know if you've studied that one, but for listeners, if not for you: the idea there was that they took a small model and trained it on the same data they were considering training the large model on, and then looked for data that is not yet learned by the large model, but is learnable in the sense that the small model at least somewhat makes progress on it. So it's essentially separating what the models can learn from what they're not learning much from, and then selectively training the large model on the more learnable data. And they report just your casual 10x efficiency improvement from that one neat trick. Which is crazy, too, because it compounds with many of the other things you might think are going to be great; you can apply this as well, and it doesn't seem to be in direct competition with other improvements. It's important, I think, to be clear that the compute thresholds today, from the executive order, are for a reporting requirement only. Right? They're not actually telling you you can't do it. They're just saying you have to let the government know if you are doing it. I've felt like that's a pretty good idea, and I've generally agreed that, first of all, if somebody's doing multi-hundred-million-dollar training runs, it probably doesn't hurt for the government to at least have an awareness of who's out there doing that. And I can't come up with a better threshold. But the more you see all these efficiency gains coming from everywhere, man, it's a pretty hard position to really think, oh, yeah.
We can just set this threshold, and as long as we know about everything above that threshold, we'll be fine. It's happening faster than I would have anticipated. You take 10x out of the difference between 10 to the 25 and 10 to the 26, and it turns out to be quite a bit: it's 90%, which is funny how that works. But, anyway, there's a lot happening. I think we both expect things to get pretty crazy. You mentioned DSPy briefly. Is that something you use a lot? And in general, is there a best overarching automation system for the sort of work we've been talking about through most of these 90 minutes?
Riley Goodside: (1:08:05) DSPy, I've used it. I wouldn't say I've used it a lot, and that's not really a reflection on DSPy as much as the limited time I've had in my backlog to get deeper into it, to find more use cases for it. But I am excited about the project. I think there are a lot of good ideas in there, and there are a lot of niches where automating prompt engineering makes a lot of sense.
Nathan Labenz: (1:08:28) On the dark art of prompt engineering, the jailbreaking side: in some ways, maybe your Twitter spiritual successor is Pliny, who is jailbreaking every model within two seconds of getting access to it. It seems like he's usually using the same techniques across a lot of different models, and they basically just work. My question for you is: do you think those techniques will eventually be stamped out of the models through more reinforcement training or whatever, or are we headed for a scenario where the model itself just can't get that robust, and we instead have to have some sort of moderation layer or governing system on top of the model to control these jailbreak-type strategies?
Riley Goodside: (1:09:18) There are a few questions in there. First, regarding whether you can do it entirely in the model through training on better data, versus filters: there seems to be increasing evidence that there are cases where you can't do everything entirely in the weights. Anthropic had a paper not long ago on the many-shot jailbreaking attack, where you basically repeat hundreds of examples of malicious prompts and responses into a long-context model. And this overpowers fine-tuning-based defenses, that is, tuning on adversarial examples of the model behaving correctly. That same paper, though, found a pretty reliable defense, which was simply to add a suffix to the prompt reminding the model that it's a safe model, that it's Claude, and that it's well aligned and all that. And this was remarkably effective. I forget exactly what the numbers were, but it was very effective at stopping the behavior they were demonstrating. And that's a very simple trick. So I'm sure you're seeing things like that today. But beyond that, I think we shouldn't trust blindly that there's going to be a tuning fix for all these things. There's very likely going to be a place for things like content filters. There already is. A good example I know of is the token repetition attacks. It's trivially simple to have a filter on your API that just asks: is the user trying to repeat the same token over and over again? Because if you have an LLM continue a prompt with too many repetitions of the same token, you can get into a mode where it will probabilistically spit out pre-training data. There was a paper that showed this by collecting lots of that data and indexing it all together. For that particular example, I wouldn't say strongly that there's no way to do it in tuning, but it's hard and it's not worth it, right?
You're much better off just handling it with filters on the input to protect against that particular issue, and filters on the output to look for tricks that try to get the model to do it itself, or filters for the copyrighted content they're trying to protect. But it's all a cat and mouse game. I think we spent a long time in the regime of trying to do everything through tuning, and we're finding that there are holes in that strategy, that there are some odd cases where it doesn't make sense to patch everything the same way. And so I think we're going to see more of that.
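[Editor's note: the input filter Riley calls "trivially simple" really is. A sketch, with a made-up threshold; real deployments would tune it and combine it with other checks.]

```python
def max_repeat_run(tokens):
    """Length of the longest run of the same token repeated back to back."""
    longest, run = 0, 0
    prev = object()  # sentinel that equals no real token
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        prev = tok
        longest = max(longest, run)
    return longest

def reject_prompt(tokens, threshold=50):
    """Hypothetical API-side filter: refuse prompts that spam one token,
    blocking the repetition attack before the model ever sees it."""
    return max_repeat_run(tokens) >= threshold

print(reject_prompt(["poem"] * 500))           # True: classic repetition attack
print(reject_prompt("say this once".split()))  # False: normal prompt
```

This is why Riley frames it as a filtering problem rather than a tuning problem: a five-line check on the input is cheaper and more reliable than teaching the weights to resist the attack.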
Nathan Labenz: (1:11:29) There's a follow-up, though, which is: what, if anything, does that imply for the future of open source? It would seem to suggest that we're either headed for a world where people are going to say, hey, these models are too powerful to be distributed in the raw; they have to be somehow contained in an envelope that keeps them from doing unwanted things. Or we may just say the value of open sourcing them, which is certainly very significant, is so high that we'll have to live with the downside risks. Do you see any other way to not have to bite the bullet on one side or the other of that debate? It would be really nice if there were some way to have the best of both worlds, but right now it seems like we're headed for having to make a choice.
Riley Goodside: (1:12:13) Yeah. Once you release the weights, people can do what they want with them. There isn't really any method I'm aware of by which you could have an open source model without undermining attempts to control it. If you want to really squash a problem, you can censor the pre-training data and make the model unaware of the concepts entirely. But that has downsides as well, and I think that approach, filtering concepts out of the pre-training data entirely, has been becoming less popular, starting with Llama 2. So I think there's no free lunch there, but as to what the solutions to that problem are, I'm not really sure.
Nathan Labenz: (1:12:46) Yeah. One that this is making me think of, which you may have seen, is called SOFON, and it's out of a Chinese lab, actually. It's a strategy for fine-tuning suppression. And this is not on large language models; they started on smaller-scale image generation models, so there's definitely some work that would have to be done to generalize it. But as you're of course well aware, the image models typically work on a denoising principle. What they tried to do is create an image generation model that would denoise as normal, except in what they call the restricted domain, where it would just output basically the null transformation; it would essentially not denoise. And they crafted the loss function to be smooth-bottomed. So they fine-tuned it with this combination: if you're outside the restricted domain, just do your normal thing; if you're in the restricted domain, you're supposed to produce a zero. And the loss has a smooth-bottomed curve, such that it's hard to fine-tune out of that local minimum, because once you've converged into this smooth-bottomed basin, the gradient is zero. So if you're trying to fine-tune out of that local minimum, it's hard to know where to go. And they found that training models this way produced a model that is harder to fine-tune to do the unwanted thing than it would be to just train a model from scratch to do the unwanted thing. So I thought that was pretty interesting. There are very few of these proposals flying around that I've seen. I'm also pretty interested in the Dan Hendrycks et al. line of work on short-circuiting, where they take representations and essentially do internal classification. And they've actually just recently started a company on that premise as well.
Those are a couple of things on my radar, but as of now, we don't have answers. The short-circuiting is still more of a system-level approach. The SOFON thing is interesting in that it's fully baked into the model. And you can imagine that if it were really generalized and scaled up, you might make a sufficiently fine-tuning-resistant model and not have to bite the bullet on these trade-offs, though there's a long way to go on that. Anyway, I'd like to highlight that work, because it flew way under the radar. Zvi mentioned it to me in a podcast, and I hadn't even heard of it at the time, and it's out of a Chinese lab. So it's definitely one that I think deserves a little more sunshine.
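[Editor's note: a toy one-dimensional illustration of the smooth-bottomed-loss intuition Nathan describes. Nothing here comes from the paper itself; it just shows why a region where the loss is identically zero gives a gradient-based fine-tuner no signal to follow.]

```python
def flat_bottom_loss(theta, width=1.0):
    """Zero (and perfectly flat) for |theta| <= width, quadratic outside.
    Stands in for a loss whose converged basin has zero gradient everywhere."""
    excess = max(abs(theta) - width, 0.0)
    return excess ** 2

def grad(f, theta, eps=1e-6):
    """Central finite-difference estimate of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

# Gradient descent starting inside the flat basin: no gradient, no movement.
theta = 0.3
for _ in range(100):
    theta -= 0.1 * grad(flat_bottom_loss, theta)
print(theta)  # still 0.3: the attacker's updates have nothing to follow
```

The defended model is the point sitting inside the basin; an attacker fine-tuning toward the unwanted behavior needs a gradient to climb out, and the flat bottom provides none.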
Riley Goodside: (1:15:19) Yeah. Yeah. It's the first time I'm hearing of it too.
Nathan Labenz: (1:15:21) SOFON. It's what I...
Riley Goodside: (1:15:22) Oh, yeah.
Nathan Labenz: (1:15:24) It's good. Or at least it's super interesting, anyway. We'll see how far it can be pushed. Okay, state space models. Are you a Mamba stan? And what do you think the promise of subquadratic architectures is, and what will it mean for the field as a whole?
Riley Goodside: (1:15:40) I'm really excited about it. I watch these sorts of things as an interested outsider. I'm not really qualified to say much about architectural decisions for experimental new LLM architectures, to be honest, but I had some familiarity with state space models for time series modeling before I got into LLMs. I was an ML engineer working on massively multivariate time series prediction. So they're near and dear to my heart in that sense, but it's not something I've done much work with recently. So I'm excited, but I don't have too much insightful to add there.
Nathan Labenz: (1:16:14) Cool. Last one. What do you think the prospects are for solving the ARC challenge? Say we gave you a version with no budget constraint, just said you have to make it work, and let you fine-tune any of the models soon to come online for fine-tuning: GPT-4o is in fine-tuning beta now, and Claude and Flash are announced. With all the tools we've talked about throughout this conversation, do you think you could get there with ARC and create something that actually hits that sort of human-level performance, or do you think we're still just not there in some fundamental sense?
Riley Goodside: (1:16:51) Yeah. ARC's really interesting to me. It's satisfying that it recapitulates the pattern we see in human intelligence testing, where you have things like Raven's Progressive Matrices, which is an IQ test most people wouldn't recognize as what they think of as an IQ test, in that it's a test with no words. Every problem can be described as: here's a pattern that resembles wallpaper, and there's a piece missing. What piece do you think is missing? And with multiple choices, you can express all sorts of logic problems in that form. So it's interesting that we're getting something like that for LLMs. The other approach you could take to intelligence testing is almost the SAT approach: how much vocabulary do you know, how much do you know about the world, about writing, about music, and you test everything under the sun. But maybe that's not the right way to view it. It's a bit of envy of the simplicity of physics, maybe, but it feels right somehow that there could be a simple task that captures this deeper notion of really generalizable intelligence that we think is there, though I'm unsure whether it's real. We spent a long time trying to think up grand theories of intelligence, doing armchair philosophy on how humans think, trying to tease it apart and capture it in symbolic AI and all the other attempts at AI over the years. And the thing that worked was just making it know everything. Right? There may be something irreducible about that: maybe there's a body of knowledge that has to be there for the thing we think of as intelligence to even be coherent as an idea. But the possibilities are wide open.
I think the fact that we can be sitting around speculating this widely about the future really says how little we know about this stuff.
Nathan Labenz: (1:18:40) Yeah. That's maybe a great note to end on, and I think a very good reminder. Is there anything else we didn't talk about that you want to talk about? If you want to share more about Scale in general, or the eval stuff you're working on there, or the SEAL thing, or anything product-wise, the floor's basically wide open for anything you want to put some extra attention on.
Riley Goodside: (1:19:03) Okay. Great. I'll say that Scale is hiring prompt engineers. If you're interested in prompt engineering, in really any of the issues we're talking about here today, and especially if you're interested in LLM security, red teaming, jailbreaking, and adversarial prompting, we're looking for people who can do impressive things in prompting. Reach out to me on Twitter or through our Scale job boards, and I'll do my best to make sure you get sent to the right places.
Nathan Labenz: (1:19:29) Anything people should do to demonstrate this? Anybody can say, oh yeah, I love using ChatGPT. How can somebody distinguish themselves? What kinds of things do you look for? What stands out to you as impressive, so that people are not just yet another resume that's enthusiastic about ChatGPT?
Riley Goodside: (1:19:48) Yeah. A lot of people are talking about Pliny. The things that go viral: it's an art unto itself, just seeing what's getting attention, what people are liking, but it's a good way to get to the top of the stack. If you can, let X, the everything app, sift through the ideas for quality and let the good things bubble up. It's the avenue that worked for me, so I'm a little biased. Take it with a grain of salt, I guess.
Nathan Labenz: (1:20:07) This has been fantastic. A lot of really good content here that I think is going to be of super practical value to people as they seek to use AI to actually automate work and get real economic value. I know you have been hard at work figuring out which of these techniques are really worth it. And I'm also happy to say that I think our views on this are definitely converging on some sort of platonic approach, so that's been a great finding for me as well. I really appreciate the time. I think folks will get a lot of value from it. For now, I'll just say: Riley Goodside, the world's first staff prompt engineer at Scale AI, thank you again for being part of the Cognitive Revolution.
Riley Goodside: (1:20:46) Thank you so much. It's been great.
Nathan Labenz: (1:20:48) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.