The Tiny Model Revolution with Ronen Eldan and Yuanzhi Li of Microsoft Research

Exploring TinyStories, a small natural language dataset for modest compute budgets, and its impact on language model performance and interpretability.

Video Description

Nathan Labenz sits down with Ronen Eldan and Yuanzhi Li of Microsoft Research to discuss the small natural language dataset they created called TinyStories. TinyStories is designed to reflect the full richness of natural language while still being small enough to support research with modest compute budgets. Using this dataset, they began to explore aspects of language model performance, behavior, and mechanism by training a series of models that range in size from just 1 million to a maximum of 33 million parameters, still just 2% the scale of GPT-2. In this conversation, Nathan, Ronen, and Yuanzhi touch on LM reasoning, emergence, interpretability, and how much of this understanding can be extended to larger LLMs.

LINKS:
Tiny Stories paper: https://huggingface.co/papers/2305.07759

TIMESTAMPS:
(00:00) Episode Preview
(07:12) The inspiration for the Tiny Stories project
(15:07) Sponsor: Omneky
(15:44) Creating the Tiny Stories dataset
(21:27) GPT-4 vs GPT-3.5
(24:13) Did the TinyStories team try any other versions of GPT-4
(29:23) Curriculum models and weirder curriculums
(35:34) What does reasoning mean?
(46:27) What does emergence mean?
(01:01:44) The curriculum development space
(01:11:40) The similarities between models and human development
(01:20:12) Fewer layers vs. more layers
(01:29:22) Attention heads
(01:33:40) Semantic attention head
(01:36:54) Neuron technique used in developing the TinyStories model
(01:52:20) Interpretability work that inspires Ronen and Yuanzhi

TWITTER:
@CogRev_Podcast
@EldanRonen (Ronen)
@labenz (Nathan)
@eriktorenberg (Erik)

Thank you Omneky for sponsoring The Cognitive Revolution (https://www.omneky.com/). Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.


Music Credit: MusicLM

More show notes and reading material are released on our Substack: https://cognitiverevolution.substack.com


Full Transcript

Transcript

Nathan Labenz: (0:00) One of the most important abilities for generative models is to be able to speak coherent English. Her mom didn't let her have a dog, so she asked for a... And when you try to autocomplete this, now the most common noun that you've seen so far, also the most proximate one, is dog, not cat. Dog already appears twice in the sentence. Even GPT-2 XL, which has 1.5 billion parameters, its most likely completion is still dog.

I'm just going to read one tiny story straight out of the paper because I think that will help people understand what this dataset ultimately is. Tom has a big pot of soup. He wants to share it with Jane. Jane takes a spoonful of soup, but then she makes a face. The soup is... That's the prompt. And then you show a completion: very bitter. She does not like it. She says, "I don't like this soup. It is too bitter." He looks around the kitchen and finds some bread and cheese. He puts them on the table and says, "Here, Jane, you can have some bread and cheese. They are not bitter. They are sweet and yummy." Jane is happy. She says, "Thank you, Tom. You are a good friend. I like bread and cheese. They are not bitter."

Nathan Labenz: (0:43) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg.

Hello, and welcome back to the Cognitive Revolution. Today's episode is great for anyone who really wants to deepen their understanding of and intuition for how language models really work. Certainly, as measured by how much I learned in the course of the conversation, it's one of our very best. Our guests, Ronen Eldan and Yuanzhi Li of Microsoft Research, have created a small natural language dataset called Tiny Stories, which they designed to reflect the full richness of natural language while still being small and conceptually simple enough to support research with modest compute budgets. They did this by using GPT-4 to systematically create 1 million children's stories using only words that an advanced 3-year-old could be expected to know.

Dataset in hand, they then began to explore a number of aspects of language model performance, behavior, and mechanism by training a series of models that range in size from just 1 million to a maximum of 33 million parameters, still just 2% the size of GPT-2. They then use these small models to explore the development of language model reasoning abilities, identifying so-called logical primitives, beginning with a basic understanding of grammar, followed by the learning of facts, and then eventually adding certain logical microskills such as negation and exclusion. These findings create the perfect context in which to discuss the tricky and often controversial topic of emergence, as well as to compare and contrast how large language models learn with how human children learn, and to explain how the differences that we see across language models and children do in fact make some sense given the different incentive structures in play in each case.

They also did some great interpretability work in this paper, and I really relish the chance to get into all three areas that they explore. First, they look at the tradeoffs between the number of layers in a transformer, which to a large extent governs the number of logical leaps that a model can make, versus, on the other hand, the width of a layer, which seems to determine how many facts the model can store. They also identify attention heads with distinct roles, including distance heads, which simply reflect the distance between tokens and which look almost exactly like the ALiBi scheme, which is now powering long context models such as Claude 100K and the recent MosaicML 65K release. And then on the other hand, semantic attention heads, which focus on meaning. That there should exist such completely different attention heads within a single model and that an ALiBi scheme should emerge in the wild is really, to me, mind-blowing.

Finally, they examine the role of individual neurons, finding that many of their small model neurons do in fact correspond to human-interpretable concepts. We close the conversation by zooming out and discussing why small models are more interpretable than large models, the challenges inherent in attempting to extend this work to larger-scale models, and why controlling language models might end up being more like horseback riding than microbiology.
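To make the distance-head observation concrete: ALiBi adds a bias to every attention score that grows linearly with how far back the attended token sits, with a different slope per head, which is essentially the pattern those distance heads appear to recover on their own. Below is a minimal sketch of that bias computation; the function name and shapes are illustrative rather than taken from any TinyStories code.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Linear distance penalties in the style of ALiBi (illustrative sketch only)."""
    # One slope per head, a geometric sequence as in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, increasingly negative the farther back j sits.
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()
    # Shape [num_heads, seq_len, seq_len]; added to the raw attention scores before the softmax.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=8, num_heads=4)  # each head penalizes distant tokens at its own rate
```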

Throughout this conversation, I was really struck by two things. First, it seems to me that we've only scratched the surface of the potential for curriculum learning approaches. I fully expect that we'll start to see ever more sophisticated approaches which use specific datasets to layer on specific skills in a strategic, progressive manner, creating highly specialized small-scale models that can solve specific problems extremely efficiently. Second, the value of these toy models for developing understanding really is tremendous. If I could make just one suggestion to listeners, if you want to get the absolute most out of this episode, it would be to visit the Hugging Face website and try playing around with some of the bigger models that they've released. The very biggest are still only 33 million parameters, which means that they can load easily and run quickly right from the Hugging Face model page. If you do that, as I did in preparation for this episode, you will actually have the chance to explore a lot of the concepts, and you can set up your own little experiments to test the reasoning ability of these models. I guarantee that you will come away with a deeper understanding that you will retain for longer and better. And if you do find anything interesting, I would love to hear about it. So please do reach out to me via our email, tcr@turpentine.co, or on Twitter, where you can always DM me at Labenz.
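If you prefer to poke at the models from a script rather than the model page widget, a minimal sketch looks like this, assuming the roneneldan/TinyStories-33M checkpoint and its bundled tokenizer on the Hugging Face Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "roneneldan/TinyStories-33M"  # assumed name of the largest released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# One of the reasoning-style prompts discussed later in the episode.
prompt = ("Lily likes cats and dogs. She asked her mom for a dog and her mom said no, "
          "so instead she asked for a")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping in the smaller released checkpoints and rerunning the same prompt is a quick way to watch the elimination-style reasoning discussed below appear with scale.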

Nathan Labenz: Now I hope you enjoy this elucidating conversation with Ronen Eldan and Yuanzhi Li of Microsoft Research. Ronen Eldan and Yuanzhi Li, welcome to the Cognitive Revolution.

Ronen Eldan: (6:35) Thank you so much. We're super happy to be here.

Nathan Labenz: (6:38) You guys have just published this paper called Tiny Stories, and I think it's a really fascinating bit of research on multiple levels. So I'm really excited to dive into it with you guys. It touches on a bunch of different themes, including some of the hot-button themes that we'll get to around emergent capabilities and reasoning. And you guys are studying this in a very unique way that makes the problem, I think, more tractable and more approachable, hopefully for our listeners as well. So I'm really excited to introduce this work to them. Maybe just for starters, can you give me a little bit of an introduction to what inspired the Tiny Stories project?

Ronen Eldan: (7:19) I guess I'm kind of new to LLMs or deep learning in general. I come from pure math. And when I started looking into architectures, trying to understand what those models are actually doing, how to improve them, et cetera, I got very frustrated very quickly because it's very easy to come up with ideas, but in order to actually check whether an idea is good, almost always you need to do an experiment that involves a lot of compute. It's just very hard to check things. You can either train small models which basically don't do much in terms of... they don't actually generate text that sounds coherent. You can train maybe a BERT-sized model, and then it'll do something on some downstream tasks, but whatever it does doesn't look much like what those LLMs are doing. If you want to really get an LLM experience, you need to do an experiment with a lot of compute that involves tons of GPUs, et cetera. So for me, it was just a way to address the frustration of not being able to get any insights without having to do large experiments.

Nathan Labenz: (8:57) And so the main way that you have accomplished that, if I understand correctly, is by kind of narrowing the conceptual space of what both the dataset contains and then obviously what the model is trained to do, right? Instead of taking a small cup out of the whole ocean of mixed-up everything language, you've created a kind of... we're going to tackle one very consistent type of input.

Ronen Eldan: (9:27) That's fair. I guess we should mention there have been many attempts to come up with a synthetic or non-synthetic, a smaller dataset that has all those elements that those large language corpora have, right? So in language you have all sorts of elements. You have a lot of facts. You have... so first of all, you have grammar and vocabulary, right? Those are the obvious things you have in language. But then you have facts, you have reasoning that you can infer from those texts, and there are many layers of reasoning. And I guess there are many capabilities involved in being able to parse those datasets. So our initial motivation was to come up with a dataset that has all these qualitative elements, but on the other hand is just not as massive as those large language corpora, right? And Yuanzhi and I had... so first of all, as I said, there are many synthetic datasets out there. Some of them, I think, reflect in a pretty good way certain aspects of language such as reasoning or facts or grammar or stuff like that, but we felt like there is no single dataset that has all those dimensions together which are all integrated into something which is not too large, right? And we felt like in order to understand, in order to gain insights about LLMs, we need a dataset that has all those elements.

Yuanzhi Li: (11:23) Yeah, I was just going to add that I also came from a theory background, and so I was doing theory of machine learning since maybe 7 or 8 years ago when the field just got started. And at that time, everyone was doing research on vision models, and for vision there's a very nice dataset called CIFAR-10 or even MNIST. I mean, those are very small datasets. They only have like 50,000 images, and when you train on those datasets, you can get a pretty high-quality image model and they can do all sorts of things, and they reflect what's going on in real large models. And I mean, at that time, doing research on making progress on both the theory side and the applied side was kind of easy because just training those models only takes like one day at most.

But when we move to this phase of large language models or language models in general, the research just becomes so expensive. And I've seen all those blog posts saying that it's impossible to do a PhD now in machine learning without like an A100. And I mean, probably only 1% of PhD students have that amount of compute. So we really want to see whether there is a way to kind of bring the good old days, which are like those CIFAR datasets or fast experiment iterations, back to the language side. That's what motivates us to consider this small dataset or this simple dataset.

I think the point is, I mean, there are other synthetic datasets or simple datasets that are, as Ronen said, reflecting some aspects of natural languages, but they are not real natural language. They are like just doing simple arithmetics or doing simple string matching or number manipulation. I mean, they are not real natural language, and we want to keep the authenticity of natural language, but just reduce the overall complexity. So we are still studying natural language and not some symbolic manipulation, and still we want the iteration or experiments to be done in a very quick way.

Nathan Labenz: (13:46) You created the Tiny Stories dataset. I always like to be as concrete as possible. So I'm just going to read one tiny story straight out of the paper because I think that will help people understand what this dataset ultimately is.

Tom and Jane are friends. One day, Jane goes to Tom's house. Tom has a big pot of soup. He wants to share it with Jane. "Jane, do you want some soup?" Tom asks. "Yes, please. It looks yummy," Jane says. Tom pours some soup into two bowls. He gives one bowl to Jane. Jane takes a spoonful of soup, but then she makes a face. The soup is...

Now this is just an example presented in the paper. That's the prompt. And then you show a completion, and you compare and contrast this against other open-source models, but I'll just read the completion from the 28 million parameter model that you guys trained.

The soup is very bitter. She does not like it. She says, "I don't like this soup. It is too bitter." Tom is sorry. He says, "I'm sorry, Jane. I didn't know you don't like bitter soup. I will make you something else." He looks around the kitchen and finds some bread and cheese. He puts them on the table and says, "Here, Jane, you can have some bread and cheese. They are not bitter. They are sweet and yummy." Jane is happy. She says, "Thank you, Tom. You are a good friend. I like bread and cheese. They are not bitter."

So there is our whole tiny story. And I read it almost like I'm reading to my 4-year-old or my 2-year-old because it is kind of a children's story. And I understand that that also is kind of part of the motivation. So how did you create this dataset? How did you kind of... how do you conceptually think about those stories? You told us a little bit already as kind of having those key elements of grammar, facts, some amount of reasoning required. But how did you create them? How big is this dataset?

Ronen Eldan: (15:39) From this motivation to have a good synthetic dataset, we should just point out that maybe the most natural idea is to rely on human development, right? It already has the solution for us, because young children are able to speak English somewhat, mostly coherently. I have a daughter, I can testify that not extremely coherently, but somewhat coherently. And this is... there's already a solution to this coming from human development, right? So all you have to do is just create a dataset and make sure that any example can be understood by a small child on one hand, and on the other hand, you want it to span as much as possible of the knowledge that a small child has. You want it to be as diverse as possible.

And we decided that it makes sense to have this dataset somewhat structured, so just the structure of a story. It kind of makes sense because inside a story you can have all those elements combined together, right? Grammar, facts, reasoning, stuff like that. And we just... I think it's a really good time to try to create this dataset because finally we have those models, GPT-3.5 and GPT-4, which can actually understand an instruction like, "I want a story which is somewhat creative and only has very simple words."

Nathan Labenz: (17:45) So GPT-4 wrote these stories? Is that...

Ronen Eldan: (17:49) So yeah, most of those stories were written by GPT-4, some of them by 3.5. 3.5 is already good enough to write those kinds of stories. It's not that great. GPT-4 is definitely doing a better job. Now, it's pretty easy to just write a story, right? If I just want to write a short story, even GPT-2 can probably write a decent short story. The problem is to actually get a diverse dataset that spans all the vocabulary that you actually want to span. And if you just ask even GPT-4 "create a short story" and you do it 1,000 times, and you do it with temperature, with rather high temperature, let's say temperature 1, which kind of gives rise to the most diversity you can get, still about one-fifth of the stories will be about children being scared of the slide at the park. I actually did this experiment. So it's not very creative. If you just tell it to create a story without any other instructions, you're going to get a very repetitive dataset.

The whole game is how do you get diversity? How do you make the dataset not be very repetitive? And here the idea was just to collect a list of a vocabulary of simple words. We have about 2,000 words which supposedly 3-year-olds understand. And then what we do is we just ask GPT-4, "Okay, here's one random verb, one random noun, and one random adjective, try to combine these into a story in a creative way." We do about 1 million calls like that. I think we have about 1.5 million stories in the dataset. So on one hand they definitely span all this vocabulary because there's only 2,000 words, but on the other hand, you definitely do not span all possible combinations of words. So you know that you're not going to... if you can later create a story with some prescribed combination, you will have demonstrated that the model has some creativity inside it. Yeah, so that's how we created it.
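As a rough sketch of the recipe Ronen describes, each call samples a random verb, noun, and adjective from the simple-word vocabulary, optionally adds a story feature, and asks GPT-4 to weave them together. The word lists, prompt wording, and call_gpt4 helper below are placeholders for illustration, not the actual ones used to build the dataset:

```python
import random

# Tiny placeholder vocabulary; the real list has roughly 2,000 words a 3-year-old might know.
NOUNS = ["dog", "soup", "slide", "snowman", "park"]
VERBS = ["share", "ask", "build", "pour", "find"]
ADJECTIVES = ["bitter", "sweet", "scared", "happy", "ancient"]
EXTRA_FEATURES = ["a dialogue", "a plot twist", "a bad ending", "a moral lesson"]

def make_prompt() -> str:
    """Sample one verb, noun, and adjective plus a story feature and build the instruction."""
    verb, noun, adj = random.choice(VERBS), random.choice(NOUNS), random.choice(ADJECTIVES)
    feature = random.choice(EXTRA_FEATURES)
    return (
        "Write a short story using only words a 3-year-old would understand. "
        f"The story should naturally use the verb '{verb}', the noun '{noun}', and the "
        f"adjective '{adj}', and it should contain {feature}."
    )

def call_gpt4(prompt: str) -> str:
    # Placeholder: plug in whatever chat-completion client you use, sampling at temperature ~1.
    raise NotImplementedError

def generate_stories(n_stories: int) -> list[str]:
    # The dataset described here came from on the order of a million such calls.
    return [call_gpt4(make_prompt()) for _ in range(n_stories)]

if __name__ == "__main__":
    print(make_prompt())  # inspect a sample instruction without making any API calls
```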

Nathan Labenz: (20:42) That's really interesting. Just to do a little bit of math on this: 1 million-ish stories, which first of all that answers the question of why not go use real stories because that's a lot of books to scan. And there may... I don't know how many children's books there are. You'd have to get your hands on a whole lot to get to 1 million. So the need for the synthetic data is there. I'm interested in what you were seeing that was way better about GPT-4 versus 3.5. It sounds like with the stories that were repetitively about the slide, I would interpret that as maybe like mode collapse, like reflection of kind of, you know, effective RLHF likely. Is that how you would understand that too?

Yuanzhi Li: (21:26) I think it's mostly just the model regenerating the most likely stories, because that's what the model is doing. If you don't give the model any conditioning, it will just blurt out the most likely stories because it wants to minimize its language model loss. So potentially a child scared of the slide is just the most common story that exists on the internet, so the model just learned to generate it when no condition is given. That's why we want to create some condition to move the model outside this particular high-probability zone.

Ronen Eldan: (22:03) One difference that we see between GPT-3.5 and GPT-4 is after it... so you give it 3 words, it needs to combine them somehow in the story. And sometimes it's not that easy. Even if I give you 3 words, I don't know, I think there's an example there: ancient, thunder, and sad or something like that. You want to combine them in a way that won't look too superficial. You want to create a story that actually seems fluent and you don't want like a complete change of topic in order to be able to combine the next word. And GPT-4 seems to be able to do this pretty fluently, whereas GPT-3.5, sometimes you get stories that don't make so much sense. The words appear there, but they don't appear in a very satisfactory way.

Nathan Labenz: (23:07) It totally makes sense also for you to note, yeah, maybe this is just the most common story. We don't necessarily have to invoke any exotic theories of why it keeps talking about slides. But I wonder if the non-RLHF version, you know, if you had access to kind of the base GPT-4 model, if that might have been different. I had the opportunity to red team the GPT-4 early before it had all the safety measures, but it did already have the RLHF and the instruction-following capability. I never saw the totally base model. I don't think that many people did. They in the technical report said it was not... people weren't sure what to do with it, right? So I think they maybe put that one out to pasture. Did you guys try any different versions of GPT-4? Obviously, being Microsoft, you might have some privileged access to different versions that the rest of us wouldn't have.

Ronen Eldan: (24:02) Yeah, so we did have access to an earlier version that... I mean, you know, we're not sure about the exact technical... what the difference is from the model that's now available to the public, but that model had fewer safety features on it, but as you said, it may be the same model you had access to, so it did have a certain extent of RLHF, right? Now, I think if you take just the language model trained on the Pile without any RLHF, and you say, "Create a story such that blah blah blah," maybe the most plausible completion in terms of a random entry from the distribution of web pages is, "I don't want to do it." Like, "Write me a story such that..." and the answer to that question is, "I don't feel like it." Or it could be that instead of completing the story, it's just going to ask another question.

Without any RLHF, without any alignment that the model has to... I mean, the model doesn't know that it's supposed to actually perform your instructions, right? That's the most basic part of it. But I think that the RLHF they did on GPT-4 is just good enough so that it makes sure to, as accurately as it could, actually satisfy the constraints that you give it. So it almost always combines the words that you ask it to combine into the story and also it almost always actually writes a story that only uses simple words.

Yuanzhi Li: (25:57) Yeah, I think as Ronen pointed out, the biggest difference between the base GPT-4 model and this RLHF version is just following instructions. I mean, for the base model, if you give it the beginning of a story, the completion is very good, but if you just say "write me some story that combines certain elements," then the model has a very hard time understanding the instruction because those things are very rare on the internet. It probably needs some fine-tuning so the model understands what an instruction means. It's not a conversation, it's an instruction when you ask "write me a story."

Nathan Labenz: (26:35) Yeah, makes sense. Yeah, at best you could maybe do a few-shot approach, but then you probably have... I've seen a lot of issues with over-indexing on the examples as well. So yeah, I totally get it.

Ronen Eldan: (26:47) A few-shot approach would probably prime the model into thinking of specific plots also.

Nathan Labenz: (26:55) I'm doing some AI 101 type education at a friend's company right now called Athena. And I was just doing a webinar this morning where I was getting into that with folks, saying, you know, if you are going to do a few-shot, probably don't do one example in your few-shot, at least do two, because otherwise you tend to get this over-indexed in-context learning on just the one example you gave. So yeah, I'm with you on that for sure.

So a little bit of math. So there's like 2,000 words. There are... you know, let's say that's just to take a round number, 1,000 each of the verb, the noun, the adjective. So that in pure expanded form is 1 billion possible bags of 3 words, roughly order of magnitude, right? So then you made 1 million stories. So I just wanted to establish that the space of possibility versus the actual dataset that these models were trained on is about 1,000 to 1 ratio. Do I have that roughly right?

Ronen Eldan: (28:03) That's true. I think it's pretty accurate. Only in addition to those 3 words, we also have another way to add diversity, which is a bunch of features we ask GPT to add to the story, such as a plot twist, a bad ending, dialogue. So that adds a little bit more diversity. But 1 in 1,000 is a pretty good ballpark estimate of the ratio.

Nathan Labenz: (28:33) Cool. And then just the cost of this: if we were going to pay retail price for GPT-4 to write all these million stories, and if each one is, say, 300 tokens, a nice round number that maybe equates to roughly 1 cent per story, it would be something like a $10,000 GPT-4 retail price to generate the dataset.

Ronen Eldan: (29:00) That's pretty accurate.
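Written out, the back-of-envelope numbers from this exchange look like this (all figures are the rough, rounded values used in the conversation, not exact accounting):

```python
# Coverage of the (verb, noun, adjective) space.
verbs = nouns = adjectives = 1_000              # ~2,000 simple words, rounded per category
possible_triples = verbs * nouns * adjectives   # 1,000,000,000 combinations
stories = 1_000_000                             # ~1M generated stories
print(possible_triples // stories)              # -> 1000: about one story per 1,000 possible triples

# Cost, assuming ~300 tokens per story at roughly 1 cent per story in retail GPT-4 pricing.
cost_per_story_usd = 0.01
print(f"${stories * cost_per_story_usd:,.0f}")  # -> $10,000
```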

Nathan Labenz: (29:02) Okay, cool. So then this curriculum concept is, I think, super fascinating, and this is one of the areas that had me so intrigued by the paper. You're taking inspiration, obviously, as you said, from human development and starting with simple words, which definitely makes sense as an approach. I always kind of try to keep in mind as well that these things are very alien. And I'm very intrigued by this curriculum sort of approach, but I wonder what about more weird curriculums? This is maybe outside of the scope of this particular research, but I kind of keep waiting for somebody to show up with a "we trained it first on pure logic notation," you know? We've seen this a little bit. It's kind of been discussed a lot recently that the code pre-trained models seem to demonstrate better reasoning, you know, once the language part gets added on. And obviously, who knows exactly what that baking recipe looks like? How do you guys think about that? Should we expect the same thing that somebody's going to pop up with a "hey, we did a pure logic or we did just massive amounts of abstract algebra first and kind of taught some sort of structure that we then were able to layer natural language onto?"

Yuanzhi Li: (30:24) It could definitely help because based on our previous research on attentions in language models, there are some simple attention mechanisms that the language model may have. The first one is just associating two tokens that are exactly the same. And the second one is after it associates the two tokens that are exactly the same, it also copies the tokens around the first token to the second token. So it's just like us when we read some word, we go back and see what's the previous time that this word appeared and what's the surrounding context. And I think just training this head is actually pretty expensive. It requires a lot of training data, and something like coding or logic is the perfect way to train those heads, because for coding when we define a variable, we definitely need to look back like what's the previous definition, or when we call a function, we check that function, we see what the function is doing. So it kind of may set the language model into learning those important concepts like looking back or checking the surroundings, and that may serve as a very good warm start for training on other things like simple natural languages. It makes a model learn much faster.

Ronen Eldan: (31:43)

Yeah, so maybe let me rephrase what Yuan just said. We do observe this to a certain extent that coding improves model reasoning. At this point, there is no overwhelming evidence that this is actually the case, but there are some observations. However, we are not sure at all that the reason behind it is that when the model learned how to write code, it actually learned how to reason. It looks like the reasons that this works are much simpler. You just managed to calibrate the exact attention heads that you need, and those attention heads don't have any particular sophistication in them. They might just be able to very accurately look at some relative position to a given token or just compare two tokens in a very precise way. So the reason is more like the types of components in your neural network that are required for coding are already there, but those components are pretty simple. It's not like the network has very sophisticated neural paths that emerge after the training that actually know how to do reasoning.

And for that, we actually have a paper we wrote at Microsoft Research about a synthetic task called LEGO. This is a very basic synthetic task that has the core elements of reasoning. We use a transformer based on a BERT architecture. What we observe is that the pretrained BERT transformer basically grasps this reasoning task much faster. The task is something simple. You get a string that looks like "a equals 1, b equals minus a, c equals minus b, d equals c," and so on. And you have to resolve the values of all variables. At first, we thought maybe in some kind of profound sense, the pretrained BERT model has learned how to reason, and this is why it grasps this task so well. But if you dig into it just a little bit, what you realize is the explanation is just much simpler, much more superficial than that. The pretraining has given rise to some simple attention heads that, if you just initialize the model with those attention heads, then it basically grasps reasoning much faster. This explanation is much closer to what actually happens when you train a model to code and then it exhibits better reasoning capabilities.
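To make the LEGO task format concrete, here is a toy generator for the kind of chain Ronen describes, together with the ground-truth assignment a model would need to recover. This is purely illustrative and not the construction from the LEGO paper itself:

```python
import random

def make_lego_chain(n_vars: int = 5):
    """Build a chain like 'a = 1; b = -a; c = -b; d = c' plus its resolved values."""
    names = [chr(ord("a") + i) for i in range(n_vars)]
    values = {names[0]: random.choice([1, -1])}
    clauses = [f"{names[0]} = {values[names[0]]}"]
    for prev, cur in zip(names, names[1:]):
        sign = random.choice([1, -1])
        values[cur] = sign * values[prev]
        clauses.append(f"{cur} = {'-' if sign < 0 else ''}{prev}")
    return "; ".join(clauses), values

chain, solution = make_lego_chain()
print(chain)     # e.g. "a = 1; b = -a; c = -b; d = c; e = -d"
print(solution)  # the reasoning task: resolve every variable to +1 or -1
```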

Nathan Labenz: (35:09)

This is maybe a good time to talk about what we mean by reasoning. And I see a ton of confusion out there about this. Maybe you can help us get a little bit more clarity. One thing that I kind of observe is, of course, people are debating this capability. And it seems like you've got kind of different standards of evidence, or people put the burden of proof in different places. To put my cards on the table, I call myself these days an AI scout, and I'm really interested in what is possible, what can be done, not necessarily holding the systems to the standard that they can do it every time or that they can do it in all cases. It certainly matters how adversarially robust they are. But I wouldn't say, "Oh, well, we found an example that it failed on, therefore it can't do X," if it could do X nine out of ten times before it got to that kind of crazy example. How are you guys thinking about reasoning as something more than a binary, obviously, in the context of this research?

Yuanzhi Li: (36:28)

Initially, we think of reasoning as something that is just a subset of consistency. When we generate sentences or when we say things, we need to make sure that they are consistent with what we said before. There's the first level of consistency, which is just nearby words. They need to follow some grammar rules and follow some basic semantics. Those are not really reasoning. It's more like a stochastic parrot, where you just do simple pattern matching, just looking at the previous couple of words and just generating one that is consistent. What goes to reasoning is when the consistency goes to the next level, which is you really need to be consistent with something very far away from the current token, like something consistent with a general plot of the story. For example, there's a word "but," and then you need to say something in the opposite order. Those levels of consistency are the primitives of reasoning. So we do think that anything beyond just a local consistency should be thought of as some ability that is reasoning.

Ronen Eldan: (37:37)

The first thing we have to say is that the type of reasoning we're thinking about is very basic. It's just some basic core capability that comes with speaking coherent English. Some people, I guess, still say that large language models will never be able to reason. I guess they have a very different definition of what reasoning means than what we have. What we mean by reasoning is really the capacity to just apply some basic logic when you generate text. And maybe to be concrete, we can look at one of the examples in the paper.

If we look at the sentence "Lily likes cats and dogs. She asked her mom for a dog and her mom said no, so instead she asked..." and then you do autocomplete. We kind of see it as a hierarchy of capabilities. Some words in this sentence, in order to complete them, to know what the next word is, just need some very basic grammatical rules. For example, "she asked her mom for" the next word is "a." For this, you only need to know a little bit of grammar and that's it. Now, the next word after "a," "she asked her mom for a dog." If you just know grammar, you know that the next word should be some noun. But here you already need to have some contextual tracking of what's going on in the text. The relevant nouns here could be dogs and cats. Those are the two objects that were mentioned in the sentence before.

Now we go to the next sentence. "Her mom didn't let her have a dog, so she asked for a..." When you try to autocomplete this, the most common noun that you've seen so far, also the most proximate one, is "dog," not "cat." Dog already appears twice in the sentence: "likes dogs and cats," and "she asked for a dog." So our smaller models actually complete this by saying "dog." And even GPT-2 XL, which has 1.5 billion parameters, its most likely completion is still "dog" because it's still at that level where it did resolve that there should be some noun there, and it does know how to look back in the sentence and see that there are two nouns, dog and cat. But dog appears more, so it's more likely that if you just had a dog in the previous sentence or in just five words before, it's going to be dog again. But on top of that, if you have a very basic reasoning capability, then you're supposed to be able to apply elimination and realize that she can't have a dog. We had the set containing the two objects, dog and cat, but now dog is not allowed. So what's left is cat. We thought this is one of the most basic examples of a completion that would require some extent of reasoning.

Yuanzhi Li: (41:53)

There's always this intertwining between reasoning and planning. For example, when we say reasoning, many people would think about mathematical reasoning, like proving a mathematical theorem. And that's not only reasoning, it's also planning. I need to come up with the correct method. I have to have the intuition for what the next step should be. For us, the reasoning that we are more interested in is just consistency. You should say something that is consistent with what you previously said, and the consistency is not only local, it's global. And that's how we think of reasoning for natural languages.

Ronen Eldan: (42:33)

The only thing a language model needs to do is generate text that's consistent with the prompt. That's the only objective a language model has to fulfill. The next word it generates should be as consistent as possible with all of the prompt. In order to achieve this consistency, there are several different levels. For most words that it generates, the only capability that's actually needed is grammar. Or maybe not for most words, but for many words. Just by knowing some grammatical rules, you know that if you have a sentence "Amy wanted," the next word is probably "to." "She wanted to something." And you don't need to know anything beyond that.

Now, the next level after that, and this is again very vaguely speaking, it's not like there is a very strict hierarchy, but the next level is to have some semantic understanding of what's going on, or just to understand what are the relevant nouns, actions, stuff like that. Or maybe which action could be related to which object. And if you look at models of size, say, around 1 billion, they're very good up to that level. They almost always give you a word that is grammatically correct and also has semantic fit. This word is well related. It works well with the previous few words that you saw, or it fits well with the previous few words in the prompt. But the next level after that sometimes already requires first-order logic or second-order logic.

Nathan Labenz: (44:49)

So kind of breaking it down into micro skills. This is a ridiculous analogy, but I'm kind of thinking I follow this guy on TikTok who coaches basketball micro skills. And it's amazing how many micro skills there are involved in being a good basketball player. Mostly, the untrained eye, even among basketball fans, can't really enumerate them. But this guy has enumerated them, and now he's teaching them one by very small one. And so maybe similarly, you wouldn't say this person can fully play basketball. That probably doesn't even make a lot of sense or already sounds strange. But you wouldn't say if there's any missing micro skill that they can't play basketball. You have some sort of continuum there where people can be better and worse at playing basketball. People can certainly be better and worse at reasoning, and language models too can be better and worse at reasoning. And that probably maps onto some sort of hierarchy of micro skills that it either has or it doesn't have, or it's in the process of grokking at any given point in training.

So that leads me to the other big, bold vocabulary word that I want to dig in on a little bit, which is emergence. Again, tons of confusion, tons of different meanings out there. I think some people mean things that surprise us that we didn't necessarily predict. Some people mean things that happen suddenly. I guess what I kind of think is it seems like there is some process. But I always think back to the grokking paper and the Neel Nanda exploration of that, which I'm sure you guys are at least somewhat familiar with, where there is a phase change from initial memorization to a circuit, which, what's so amazing about their work is they actually show this circuit in very concrete terms. And it's like, this is the circuit that does this algorithm that allows it to generalize to the full set from just the sample data that it was originally trained on. I don't know that we have any circuits here that we could elucidate, but does that feel right to you? Do you feel like there's this sort of process of memorization sort of being gradually replaced by circuits that solve particular micro skill challenges? Is that your model of what's going on under the hood here?

Ronen Eldan: (47:25)

Let's take a step back. People talk about emergence. I guess we can both agree that this is not a well-defined notion at all. It's not like you see a sudden phase transition from the model not being able to do something to, you slightly increase the size and then suddenly it's really good at some capability it didn't have at all before. Rather, it's a vaguely defined term saying that there are some qualitative capabilities that the model at certain sizes has, whereas at smaller sizes there's almost no trace of these capabilities at all. Like, GPT-2 did not know how to summarize text, and suddenly at GPT-3 you have this summarization capability. But the notion is not well defined at all.

On the other hand, we do see that as we increase the size of the model, suddenly you have certain capabilities you didn't have before. In a sense, I think a good analogy is if you compare dogs, monkeys, and humans. You increase the size of the brain, suddenly humans can do math, whereas monkeys cannot. So it's an ability that emerged when you made the neural network larger. Not that I'm trying to imply at all that the same mechanism explains both things. We have no idea about that. But what we say here is one of the most important abilities for generative models is to be able to speak coherent English. This is an ability that we see emerge also in much larger networks trained on those large language corpora. I think TinyStories basically gives you a much smaller dataset where you can observe this emergence at much smaller scales of models.

In the sense that if the model is 1 million parameters, then it can hardly generate coherent stories. And if you go to 10 million, then almost all stories will be coherent. Same with reasoning. A 1 to 5 million parameter model, all of our reasoning prompts fail, whereas for 30 million, almost all of them succeed. Now, as Yuan just said, all of this basically has to do with keeping coherence with the text. So you have the emergence of the ability to generate the next word in a coherent way on different levels of difficulty.

So the easiest level of difficulty we could think of is just when you have something that follows from some easy grammatical rules, and then you can think of different levels of difficulty. Sometimes you need to know a certain fact in order to be able to complete the next word. "Jack was hungry, so he went looking for some..." To complete this, you have to know that to satisfy hunger you need some food. Sometimes you need to know a fact. So there's all those core capabilities that are necessary in order to keep consistency along the text, and each one of them we can actually witness its emergence as we increase the size of the model.

Nathan Labenz: (51:52)

So what then is the theory of what is happening? I have a theory, but I want to hear yours. So you gave that example a minute ago, girl wants either a cat or dog, mom says no dog, so it's going to be a cat. But GPT-2, much bigger model, that's like 30 times bigger than your biggest in this research. You max out around 30 million parameters. GPT-2 is like 1.5 billion or something. So it's a lot bigger, 50 times bigger maybe. It still says dog, which everyone can tell is obviously wrong. You've got these much smaller models that can get that right. In some way, there's this emergent, observed phenomenon that it is able to get that exclusion concept. What do you think is happening there? Is that a micro skill that is this sort of exclusion, that's like a little piece of reasoning that is grokked by the small model but not by the big model? It seems like you could be a really good stochastic parrot, but it feels like there's something there that has truly kind of settled into the structure of the network. And maybe that didn't happen with GPT-2 because the data was just too noisy, and it's kind of all over the place. And so it wasn't able to learn those same things. How am I doing here? Does that resonate as likely true or even plausible?

Yuanzhi Li: (53:25)

Yeah, I think that's a valid conjecture. What we think is for GPT-2, because it trained on wild data, or just think of it like Wikipedia, where you try to minimize the language model loss, the consistency is the least concern. It's more about just getting the knowledge correct. You're talking about some object or some person and you want to know his birthday, or you want to know some specific aspect of that person. This has nothing to do with natural language. It's more about just the sheer amount of knowledge that we encounter in the web data or something like Wikipedia.

So the model, I think both GPT-2 and our model, they are not large enough to minimize the loss to the full extent that GPT-4 does. So they have to select some part of the loss to focus on. And if the data has too much knowledge or too many other nuances, then the model may just focus on other aspects instead of consistency. While here, with our TinyStories data, because the language is simple and the vocabulary is simple, the really difficult part is consistency. And that is where the model focuses to minimize the training loss. And that's why I think our model, although it's much smaller, gets better consistency compared to the larger ones.

Ronen Eldan: (54:52)

Yeah, you can think about it as when you train a model on the entire Pile, on those large language corpora, they have much more incentive to learn the preferred clothing styles of celebrities much before they learn how to complete this sentence with the dog and the cat. They definitely learned that Joe Biden is the president of the United States way, way before they learned how to reason. This is a conjecture, of course I didn't actually test it, but I'm pretty sure that's what happens. You just overload the model with so many facts that appear in so many places. Only once in many words that you generate, this capability of reasoning becomes relevant.

So I think it's a really good exercise just to open a random Wikipedia article. It's one of my favorite activities since I started thinking about language models. And just go over the Wikipedia article word by word and just think for every word, what core abilities do you need to use in order to guess what the next word is? And what you'll see, I think definitely for a random Wikipedia article or a random example in the Pile, in some web training set, is that only once in every 20 to 30 words do you need to use some reasoning capabilities to predict the next word. For one in every three or four words, you need grammar. For most words, you can just kind of guess the next word: if I just tell you which nouns and verbs appeared in the previous sentences, without telling you anything about the context or things which are farther away in the text, you will still be able to guess it. So reasoning is kind of a rare capability that you need. It only becomes relevant pretty rarely, and therefore, the capacity of the model will be dedicated to other things much, much before you will get any reasoning capabilities.

Yuanzhi Li: (57:44)

That's also why we always kind of see this emergent behavior. It's really because the reasoning or those very rare consistency events, they actually happen very rarely. So only once you minimize the loss to a certain extent do you start to learn those rare events, and your model feels different. For example, just a simple example: "Bob feels hungry, but he doesn't like sweet food, so he went to eat..." If the model says "went to eat some candy," then we think that this model knows nothing about what it's talking about. But this is only just one word difference out of the 10 to 20 words. And only when the model gets to that point does it start to learn this consistency, and we feel like the model starts to know what it's talking about.

But in terms of loss, we probably only see less than a 10% difference. If we turn to image classification, when I tell you my model gets 90% correct and you have a model that gets 95%, I wouldn't consider those extra 5 percentage points an emergent capability. But for language models, those 5 percentage points may actually be the emergent behavior. And especially for math, if I really want to solve a math problem, it's only at the connections between sentences that I need to make sure my proof is extremely coherent. Most of the time I'm just completing formulas, writing down the results. But it's only this very tiny amount that defines, that gives the model, the emergent capability. So I think the two things are connected: learning to reason is a hard task that the model only gets at the very end, and it's also where we see emergent capability if you grow the size of the model or the amount of training data.

Ronen Eldan: (59:45)

Yeah, maybe let me just expand on your example. We have this sentence: "Bob could have either candy or pizza. Bob doesn't like sweet food, so Bob got some..." and autocomplete. Now the model, usually if the previous sentence says something about sweet food, if we don't read the entire sentence, the most likely completion is actually candy, not pizza. You have to read it in a pretty nuanced way in order to realize that Bob actually doesn't like sweet food. Sweet food and candy come together so many times in the training data. So the neural network has to, in quotation marks, choose between using its capacity in order to be able to resolve this nuance, or to use its capacity in order to know that Joe Biden is the President of the United States. You can't have everything together. The model has a finite size, and there is some theoretical limit to the amount of things the model can learn. The model will definitely prefer to learn that Joe Biden is president, and many, many other facts, because they are relevant much, much more frequently.

Nathan Labenz: (1:01:20)

The curriculum development space is likely to be a huge unlock over the not too distant future. I mean, it seems like you're probably just scratching the surface here because we've got web scale data that is not built for this purpose, obviously, where you're saying reasoning doesn't even, it's not even required that often. And so no wonder it kind of emerges late in the game. Maybe pretraining on code is changing that in some interesting ways. But man, intentional design around what a gradual, gradually upstepping curriculum might look like, especially with the ability to create the training data synthetically to really kind of isolate and bring those key skills forward. It seems like you could rebalance training data and probably shrink it like a ton and get to a lot of the, just by kind of shifting the balance, to get these kind of emergent things to be more important relative to just kind of mind-numbing repetition of who's the president or whatever. It sounds like, I mean, you're nodding, that it sounds like that aligns with your expectations too.

Yuanzhi Li: (1:02:45)

It's important when we design a new version of the data, for example, to extend the level of ours to maybe elementary school or middle school. I think it's very important to balance the amount of knowledge in the data versus the kind of capability that we want the dataset to teach the model. For example, if you go to elementary school, there's the ability to do simple math or mathematical reasoning or some physics reasoning or to compare historical events. These will take up some capacity of the model. Maybe they actually take a very big portion. And for the remaining capacity, if your dataset has too much knowledge, then the model may just prefer to use that capacity to actually memorize the knowledge instead of really learning those abilities.

So we want to balance things: there's some amount of knowledge that the model must have in order to do basic stuff like math or basic physics reasoning. But more importantly, there should be a lot of data that only emphasizes the ability side. There's no new knowledge. It's just a bunch of math training samples, or a bunch of simple comparisons of basic historical events, or some simple physical rules and their explanations in different varieties, so that the model can actually focus on the ability part instead of just being swamped by the vast amount of knowledge. So I think right now web data doesn't really balance between knowledge and ability training. That's why training on it is not good for a small model, because the model needs to allocate its capacity to just memorization. So I think that's basically the criterion for designing a better version of synthetic data.

Ronen Eldan: (1:04:37)

I guess we can just kind of, maybe relevant notions here would be the breadth and the depth of the dataset or the capabilities of the model. So the entire web is very broad. By breadth I mean it has a lot of facts, the vocabulary is very large, you need to have a lot of knowledge to capture the dataset. And by depth what I mean is it has first and second and third order logic that you can infer from learning this dataset. This is not well established. I don't think there is any research that really establishes that there is a trade-off between the two, but it's very reasonable to assume that there should be a trade-off between breadth and depth when you train the model.

Yuanzhi Li: (1:05:49)

Yeah, I think there's some optimal ratio because without the knowledge, you can't really do reasoning. You have to have some basic knowledge, like candy is sweet, in order to do reasoning. And when you go to elementary school, you have to know some basic events, like some basic rules for math, like one plus one is equal to two, in order to do mathematical reasoning. So there's a balance. You cannot have no knowledge, but you cannot have all the knowledge. So maybe there's some optimal ratio between breadth and depth.

Nathan Labenz: (1:06:19)

My head keeps coming back to the same kind of thing where, yeah, we should see so much gain from kind of rebalancing the dataset and maybe even starting with some more abstract things. Like, I could see sort of "A or B, not A, therefore B," and then do that with just all the letters and then start to introduce kind of these associations and layer that on. It sure seems like there is a lot of opportunity there.

Yuanzhi Li: (1:06:49)

Yeah, you're just mentioning the extreme case of only teaching reasoning because everything is just symbols. There's no knowledge, and it's all about just reasoning. And maybe it's good to combine this with just something that is pure knowledge, and maybe we can get something good by just adjusting their ratio.

Ronen Eldan: (1:07:08) Yeah, maybe it's not clear at this point if a human being, when you teach a human being, you can almost separate the knowledge and the reasoning, right? You can just say, fact A, blah, blah, blah, fact B, blah, blah. Now, here's how to reason. I mean, it has to be a pretty smart human being, but in general, we are able to take those two things and then combine them so that in the end we will have those core reasoning capabilities that we learned using exercises that only involve, if A then B, if not A then... you know, those rules. When we studied for the SATs, we have those rules and separately we have the knowledge of facts and we're able to combine them. And I think it's an important question whether it's even feasible in language models to just take those things, separate them into two different modules, and have the model actually be able to combine those abilities. My conjecture, by the way, is no. As long as you don't combine them enough in the dataset, the model is not going to be able to infer the way that a human consciously infers the connection. But if we could do that, then this would be, of course, a very powerful technique to train models.

Nathan Labenz: (1:08:54) I just did another interview with a couple of guys from MosaicML, and they talked about training at times on massive client datasets. Even when they do that, they typically still mix in the general pre-training, the Pile or whatever, because otherwise they see catastrophic forgetting. So they have to keep some mix at all stages of training to avoid that. So I think that would be very consistent, I think, with what you said. Like, you probably can't do it in strict phases. There's got to be some sort of mixing strategy throughout the process.

Ronen Eldan: (1:09:36) Right. And that's going to be very nontrivial to do. It might be feasible, but it's definitely not going to happen on its own, just because what the model cares about is just being able to efficiently autocomplete samples in the dataset. If the dataset is either knowledge or reasoning, then it has zero incentive to combine them. And even if you give it a few examples where this is actually combined, no one assures you that it'll actually be able to take those two modalities and really use them for the combination. So I feel like this is a science that's really... I mean, we're only beginning to understand how these things work. Hopefully, we'll figure out a way to actually do it, but I'm not sure. Yuanzhi, maybe you know of one, but I don't know of any concrete evidence of an example where this seems to work.

Yuanzhi Li: (1:10:46) Right now we lack concrete evidence that curriculum learning really works, though we believe it should be helpful. But overall, as Ronen said, the model has no incentive to connect the different phases in which you train things. When you start a new phase, it just greedily minimizes the loss on whatever is related to that phase, and it can simply forget everything it learned before. So it's definitely a nontrivial task to get curriculum learning to work.

Nathan Labenz: (1:11:18) You had a number of interesting empirical observations, and you can address this however you want, but you noted that grammar emerges before consistency and creativity, and consistency is related to reasoning in your earlier telling. Is that the same thing I've observed in my kids? Maybe I'm sleep deprived, but I feel like grammar maybe came last for them. They definitely have a certain consistency: they want what they want. If they want ice cream, the next token is ice cream, and they're pretty consistent on that. Creativity, I don't know, that's a little trickier. But does this feel like it echoes human development in your mind? To me, I'm not sure it does.

Ronen Eldan: (1:12:09) I love this example. So, yeah, I mean, I think the learning process for children is very different in that way. Right? Children don't get... Their incentive is not to say the correct next word. Children, they want ice cream. Their incentive, the outcome needs to be, I get ice cream. So if they just say ice cream, ice cream, ice cream many, many times, maybe they don't get the best grade for creativity, but they will likely get the ice cream, right? Depends basically on the self-discipline of parents and whether they are...

Nathan Labenz: (1:12:52) Limited in this household.

Ronen Eldan: (1:12:54) Yeah, I have very close experience with this exact scenario. But more seriously, when children produce language, and I'm only basing this on my own observation, they have constant contact with the physical world. They know which entities are involved in the current conversation, right? We are talking about a book that I just read, so it's very unlikely that the next sentence will be about a car, because they have this entity in their heads: no, we're not talking about a car, we're talking about this book. Whereas for a language model, if it just makes one mistake in one word, saying car instead of book, the loss it incurs is actually not as big as the loss it would incur for incorrect grammar, which is relevant in almost every word it produces. And not only that, for the language model it's actually much cheaper, much easier, to have consistent grammar. You only have to be consistent inside the same sentence for the grammar to be correct; you don't even have to look past five or six words back. Whereas children have a few entities which are active in their working memory. This entity could be, let's say, "I want to go to the park and have some ice cream." The two relevant entities are park and ice cream, and you're not going to replace those two entities in the middle of the conversation. That's what the child cares about at this point. So the incentive system is different, but the whole cognitive perception underlying the conversation is also pretty different.

Maybe a good way to think about it: a language model, definitely in its first phase of training, doesn't need to look way back, like three, four, five sentences before what it produces. I think a good example is to look at one of the GPT-2 XL generations that we have in the paper. We have a story about a snowman: two kids are building a snowman and a dog comes and destroys the snowman, and then you autocomplete to see what happens. Maybe I'll actually read it, because I think it emphasizes pretty well what I'm trying to say here. There are these two kids, Sarah and Ben, who built a snowman. A dog comes and destroys the snowman, and GPT-2 XL's completion looks something like this: Sarah and Ben cry and look at the snowman in the snow. Ben asks to do something, Sarah tells him to go away, her hand tightens on the little red button, the snowman is covered in fur. It is red. It is real hair. It is a real snowman. It looks just like the ones they used to make. Ben has a big smile on his face. He looks at the face he has made. He walks over and kisses the eyes. He's happy to be alive, he has done what he was going to do, he has saved the world.

So, okay, what am I actually trying to say here, except that GPT-2 XL has a good sense of humor? What I'm trying to say is that for almost every sentence here, all GPT-2 XL needs to do in order to generate it is look at the previous sentence, and maybe know that there is one relevant entity, which is snowman. Other than that, there is no consistency at all in the completion that goes beyond looking at two consecutive sentences.
Language models, if they're too small or at the beginning of training, don't actually have enough incentive to track the whole context of what's going on, because to complete most words correctly you just need the context of the current sentence, maybe the one before, and maybe one or two important entities. Whereas for humans this is completely different. We have agency, we know what we want when we form the next sentence, and yeah, we care much less about grammar than about the ice cream.

Yuanzhi Li: (1:18:40) Yeah, I would make an analogy: human children are learning with something like an RLHF algorithm, doing reinforcement learning with the parents' feedback. And obviously the parents are very robust to grammar mistakes, so the children don't really need to optimize grammar in order to maximize their reward. They probably care more about consistency and the topics; they want to get the topic correct so they can get the reward. Whereas for language models, it's just next-word prediction, and grammar is going to be penalized much more severely than global consistency.

Nathan Labenz: (1:19:16) Okay, yeah, I like that. I'm always a little wary of analogies, and I always want to come back and ask what's sneaking into the analogy that I don't want to allow. So I'll bookmark that one, but certainly the surface-level intuition makes a lot of sense, and it's a very clippable analogy as well. How about depth versus hidden dimension size, which I often just call width: the number of layers versus the width of a layer? You note some interesting tradeoffs there, and I didn't really have an intuition for why that would be. Per the paper, you report that the fewer-layer models do better on grammar compared to consistency and reasoning, so it seems that more layers are important for that kind of reasoning and consistency. How should I think about that? Is there a story that crystallizes why that would be?

Ronen Eldan: (1:20:23) None of this is well established; it's all conjecture that needs to be studied much more. But I think a good way to look at it is that depth tells you how many times information can percolate between the tokens. Every time you have a global attention layer, like a transformer attention layer, the information inside certain tokens can percolate into other tokens. So, for example, if you have some instruction, create a story with these words and I want a bad ending as well, and then I type in the beginning of the story and it needs to autocomplete it, these instructions only have one chance per attention layer of percolating into the tokens of the story. And sometimes these instructions are nuanced in themselves. Maybe there is an instruction saying the story has a bad ending. To complete the next word, it's not enough to know the current sentence and that the story needs to have a bad ending; you also need the wider context of what happens in the story. So in order to fulfill these kinds of instructions, the information basically has to percolate several times between the tokens.

That's also the case for reasoning. If you have first-order logic, think about this example of cat and dog: Alice wanted a cat or a dog, and Mother didn't let her have a dog, so she got a cat. How many times does the information have to percolate between the tokens here? First you want to understand that there was a cat and a dog involved; that's one layer of global attention. Then you want to know that she couldn't have a dog: you have this set, cat and dog, and you want to do cat plus dog minus dog equals cat. So after you know that it's either cat or dog, you need the fact that she couldn't have a dog to percolate into the token you're generating in order to know that the only available option is cat. There are actually two layers of percolation here: the token "not" has to go into the token "dog" to know that dog is not allowed, and then those two tokens, not plus dog, have to percolate into the generation in order to know that, okay, I had cat plus dog, but now I have to do minus dog to get cat. If you think about it as coding, you have several conditionals to evaluate and several places where you need pointers to information that appears elsewhere in the text. I'm putting this in a very vague and non-formal way, but I think it's a good initial intuition for what probably happens.

Whereas when we talk about facts, if I have a completion in a language model like "the capital of France is", then all I have to do is have one kind of lookup table with all the countries and their capitals. I don't need many layers of global attention; it's enough to take the two tokens, capital and France, put them together, and have one lookup table saying that capital plus France equals Paris. Here the dimension seems to play a much more important role, because the bigger the dimension of the space, the more entities I can squeeze into the vector space, and the more neurons I have inside my lookup table to hold the list of all those possible facts.
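
A rough way to picture the "percolation" argument above: treat attention within one layer as a directed graph over token positions. Information that needs two hops, from "not" to "dog" to the word being generated, only becomes reachable after two rounds of attention, which corresponds to squaring the adjacency matrix. This is only a toy illustration under made-up positions, not the TinyStories model itself.

```python
import numpy as np

# Toy adjacency picture of one attention layer over the cat/dog sentence.
# Positions: 0="cat", 1="or", 2="dog", 3="not", 4=position being generated.
A = np.zeros((5, 5))
A[2, 3] = 1.0   # "dog" attends to "not": that position now encodes "not dog"
A[4, 0] = 1.0   # the generated position attends to "cat"
A[4, 2] = 1.0   # ... and to "dog" (with whatever "dog" has absorbed so far)

one_hop = A          # what can reach position 4 after a single attention layer
two_hop = A @ A      # length-2 paths, i.e. two stacked attention layers

print(one_hop[4, 3])  # 0.0: the negation has not reached the generated token yet
print(two_hop[4, 3])  # 1.0: "not" reaches "dog", then "dog" reaches the generated token
```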

Nathan Labenz: (1:25:43) So is another way to say that that within a single attention block, the attention relationships are not immediately transitive, and so they need multiple iterations of attention in order to create that transitivity? Like, if the current token is looking back at a certain token, but then that token is looking back at a previous token, like, we need two rounds of this to move the two hops.

Ronen Eldan: (1:26:11) Yeah, exactly. You just, you first need the not to go into dog to know that it's not dog, and then the not dog needs to go together into this set that has both cat and dog in it. So that's already two leaps.

Nathan Labenz: (1:26:30) Does that also suggest then that for as many kind of logical leaps as you might need, you need, like, maybe that many layers? You can't, you're sort of bounded by... If you have two layers, you can make maybe two logical leaps. Is that a general heuristic that seems sensible?

Yuanzhi Li: (1:26:48) I think there's a depth-width tradeoff. For example, you can simulate two layers' worth of leaps using just one layer, but you have to enumerate all the possible two-step combinations, which makes your size go from, say, N to N squared. So if you want to be the most size-efficient, then you will definitely have to go as deep as the number of logical leaps. But if you are not that deep, you can actually use a wider network to collapse the two steps into one and just make the intermediate layer much bigger.
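
A back-of-the-envelope count of that N-to-N-squared tradeoff, under the simplifying (assumed) picture that each reasoning hop selects one of N possible rules: a two-layer network can reuse the same N units per hop, while a one-layer shortcut has to memorize every pair of hops.

```python
from itertools import product

# Rough counting illustration of the depth-width tradeoff described above.
N = 8  # assumed number of possible facts/rules a single hop can pick from

two_layer_cost = 2 * N  # N units for the first hop, N for the second, applied in sequence
one_layer_cost = sum(1 for _ in product(range(N), repeat=2))  # every (hop1, hop2) pair memorized

print(two_layer_cost)  # 16
print(one_layer_cost)  # 64, i.e. roughly N**2 for the shallow-but-wide simulation
```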

Ronen Eldan: (1:27:25) So here also, I mean, your question kind of bursts into an open door in the sense that the paper we wrote at Microsoft Research about this synthetic reasoning task we call LEGO, that's exactly a task where you have multiple leaps of reasoning and we see a very direct connection between the number of layers that you need and the number of reasoning steps required to complete the task. But maybe somewhat surprisingly, we see that the model finds very interesting and sophisticated ways, this is actually not in the paper, this is kind of a follow-up work, it finds very sophisticated ways to do multiple leaps of reasoning within a single layer. Definitely more layers help, but it's not like it's a strict upper bound for the number of leaps you can do.

Nathan Labenz: (1:28:35) I can only wish it could be quite that simple, but that's really interesting information; I'm learning a lot from this. The interpretability part of this paper is also really interesting. You break it down into the attention portion and then obviously the MLP, the neuron portion, and a couple of things jumped out at me. One was in the attention portion: you seem to observe that there are two sorts of attention heads. One really just focuses on the distance relationship between the tokens, and the other is more semantic. And for the distance one, I was like, holy moly, doesn't that look like the ALiBi scheme that has recently come to popularity with these super long context windows? I don't know if you guys have had a chance to study that, but it's quite an uncanny resemblance, right? You're showing all these attention heads where this one has a very tight attention range, and then there are different lengths, and that's almost exactly what they cook up as the substitute for positional embeddings in the ALiBi research, at least as far as I understand it. Do you see that same parallel?

Yuanzhi Li: (1:29:58) Yeah, I think they are definitely doing the same thing, which could explain why ALiBi is very helpful: the positional bias in ALiBi is already initialized to do this multiscale, distance-based attention, while with absolute positional encoding, like what we use here, the model has to learn to discover this optimal kind of position-based attention. So just hard-coding that position-based or distance-based attention is, I think, a really good choice based on our observation. And also, the shorter-range heads are really responsible just for learning the grammar, while the longer-range ones may make sure that your content is consistent globally, or just grab the associated words. For example, you have an Alice in one sentence, and you had an Alice five sentences ago: you want to make sure these two words have a chance to be put together.

Ronen Eldan: (1:30:59) Let me just point out that, to complete the next word, you usually need two things. You want to know what the proximate words are, the most recent words you saw, and you want to know what the most important entities in the story are, the ones whose semantic meaning is relevant to what you want to complete. Now, if I remember correctly, what happens in ALiBi is that there's a bit of a mix of both: you take every attention head and you make it decay. For every attention head, you prescribe some scale, and the strength of attention decays with distance inside the text. Is that correct? If I'm not mistaken, that's what happens there. And one surprising aspect is that we actually see a dichotomy: there are heads that only care about distance and other heads that only care about semantics, and there is hardly any mix between the two.
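
For reference, the distance-decay scheme being described is straightforward to write down. The sketch below follows the ALiBi recipe of one fixed slope per head, added to the pre-softmax attention logits; the function name is ours, and the slope values follow the geometric progression the ALiBi paper uses for power-of-two head counts.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Distance-based attention bias in the spirit of ALiBi.

    Large-slope heads decay quickly and effectively see only nearby tokens,
    while small-slope heads decay slowly and can look far back, roughly the
    grammar-versus-consistency split the TinyStories heads appear to learn.
    """
    # One slope per head: 2**(-8/n), 2**(-16/n), ... as in the ALiBi paper.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = how far back token j is from query position i (0 for future positions).
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).clamp(min=0)
    # Added to the pre-softmax attention logits: the farther back, the more negative.
    return -slopes[:, None, None] * distance[None, :, :]   # shape (n_heads, seq_len, seq_len)

# Usage sketch: attn_logits = attn_logits + alibi_bias(n_heads=8, seq_len=128), then softmax.
```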

Yuanzhi Li: (1:32:28) But we have to say that this is only for one attention block. We haven't really checked what the transformer does with multiple attention layers working together. But if you just train a network with one attention block, it seems that the network learns to separate distance-based attention from semantic-based attention: some heads look at tokens based purely on their distance, and other heads look at tokens based on semantic similarity.

Nathan Labenz: (1:32:58) So is there anything else that you can... I mean, that's pretty profound in and of itself that that dichotomy emerges, because you didn't initialize it. I mean, in ALiBi, they've engineered it that way through, you know, some probably trial and error and heuristics and guesses and whatever, but this is totally just happening on its own. Is there anything else we can say about what you see in the semantic one? When I looked at those, you know, no light bulbs went off in my head to kind of interpret those visualizations of the semantic blocks, but anything you would highlight from studying those?

Yuanzhi Li: (1:33:33) I think the most interesting one we see is a semantic attention to the main character names. For example, there's some head where every token just attends to Tom and Lucy, the two main characters. I think this is pretty important: it's trying to identify which persons are involved in the story, so that the next time it generates something new, it's not going to say something random, it's going to say something consistent. So I think the semantic heads, at least what we see in the one transformer block, are mostly about this type of attention, where you identify the main objects in the sentence and make sure that most of the tokens, or the relevant tokens, attend to those objects. When you have a pronoun like "he" or "it", you will attend to banana, so you'll know that you want to complete the next word as banana instead of something completely made up. So those semantic attention heads are really useful for speaking consistent English inside the transformer.

Ronen Eldan: (1:34:40) Yeah, but let me add that it's very natural that, to complete the next word, you want to know what the relevant characters are, what the relevant entities in the story are. But no one expects a priori that you will have such clean attention heads: one attention head that attends exactly to the characters, and a different attention head that attends exactly to the objects, which in the example we gave are a banana and a park. A priori we might expect that it'll all just be a big mess, right? Every attention head attends to a little bit of everything, and why would it be interpretable at all? But it's quite surprising that when the model is small enough, it seems we can actually give meaning to both attention heads and neurons.

Nathan Labenz: (1:35:38) So does that kind of fall apart if we add a second layer? Like, does it then just become more messy again, or as you start to stack layers, what does that start to look like?

Yuanzhi Li: (1:35:50) Yeah, I think when your transformer gets deeper or larger, it definitely becomes more messy. If the transformer is small, it really needs to learn those separate modules in order to minimize the loss. But if the transformer has more degrees of freedom, it has the luxury to, for example, use five attention heads to simulate one, or use three layers to do what could be done in one layer. It has no incentive to be as precise or as conservative as the smaller ones, so it's actually less interpretable, and we observe that when we try to interpret the neurons as well.

Nathan Labenz: (1:36:32) Perfect bridge then to talk a little bit about the neurons. Maybe just give us a little bit of understanding of the technique that you used to... I'll try to summarize it real quick, you tell me where I'm wrong. You run a ton of stories through, and you look for what tokens specifically are maximizing the activation of a certain neuron. And then you can kind of print out, here are the snippets and the individual tokens that maximize the activation for this particular neuron. And then holy moly, like, it really looks like there's a pretty coherent concept, you know, as you just kind of scan down that list of things that, you know, corresponded to high activation on any given neuron.

Ronen Eldan: (1:37:18) That's pretty accurate. You have these middle layers in the MLP, which we can think of as neurons; those are really the coordinates that can either be activated or not, and that's basically where the only non-linearity is. And again, just like with the attention heads, a priori it's not clear at all that they would have any meaning. Those are just different coordinates of a certain vector space; no one promises you that the neural network is going to use one particular coordinate for one particular kind of task. Let me mention that this basically follows an idea suggested in a 2015 paper by Li et al. called Visualizing and Understanding Neural Models in NLP: the idea is to look at the tokens which induce the highest activations for every neuron in a certain text and try to see whether those tokens have a common role. When we look at larger models like GPT-2 XL and try to look at those tokens, at least the two of us could not find any common meaning. The same neuron is activated sometimes on nouns, sometimes on verbs, sometimes on something else; there's just no clear pattern whatsoever.

Whereas when we take a small model, for example, there is one particular neuron that seems to always be activated when the main character of the story is introduced. And that makes a lot of sense if you think about what the neural network needs to do. If there were a programmer writing code to autocomplete stories, there would probably be a function that tries to locate the name of the main character, because it's useful in many, many places when you autocomplete. In fact, whenever you know that the name of some character should appear, it's a pretty good guess that it's going to be the main character. So you have a neuron doing exactly that, and, though we haven't checked enough to be sure, there's probably an attention head that then attends to what this neuron outputs whenever the name of some character should appear. When you connect those two together, what you get is a mechanism that can copy the main character's name to different places along the generation. This is a very basic mechanism that you can actually observe inside the neural network, and it doesn't happen in bigger models, at least not in a way that is so easy to trace.
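
A minimal sketch of the max-activation probe being described, using forward hooks in the Hugging Face transformers library. GPT-2 small is used only as a stand-in, the layer and neuron indices and the story list are placeholders, and the `transformer.h[i].mlp.act` attribute path matches GPT-2-style models; other architectures may need a different path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in model; swap in a TinyStories-style checkpoint
LAYER, NEURON, TOP_K = 0, 123, 20

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

captured = {}

def hook(_module, _inputs, output):
    # output: (batch, seq_len, d_mlp) activations right after the MLP non-linearity
    captured["acts"] = output.detach()

# For GPT-2, the post-GELU activations are the output of transformer.h[LAYER].mlp.act.
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(hook)

stories = [
    "Tom has a big pot of soup. He wants to share it with Jane.",
]  # placeholder; in practice, stream many stories through

records = []  # (activation, token, preceding snippet)
for story in stories:
    enc = tok(story, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    acts = captured["acts"][0, :, NEURON]
    for pos, value in enumerate(acts.tolist()):
        snippet = tok.convert_tokens_to_string(tokens[max(0, pos - 8): pos + 1])
        records.append((value, tokens[pos], snippet))

handle.remove()
# Print the top-K activating tokens with their surrounding snippets.
for value, token, snippet in sorted(records, reverse=True)[:TOP_K]:
    print(f"{value:7.3f}  {token!r:>14}  ...{snippet}")
```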

Nathan Labenz: (1:41:03) I guess the simple version of my thought was: maybe there are a bunch of concepts that are easy to identify, where you can see at a glance, okay, I know what that is, and maybe there are only so many of them. When you have 30 million parameters, you have however many neurons, and maybe that's kind of enough to capture those. Then you go 50x, and if you start fishing at random points in the network, maybe you just miss a lot of them. They may exist, but they're hard to spot, because maybe they're sparse, if you will. And then the things in between, and I'm really getting out on a limb here, maybe those are somewhat analogous to the subconscious processing that goes on in our brains, where I know on some level that processing is happening even for concepts I don't have a clear label for. There's some sort of churn happening in the brain, and only a small set of it rises up to the level of what I'd call a conscious concept, something I can say I have a label for, a tidy enough thing. So the two ideas are: maybe these concepts are just a lot easier to find in the small network, because you have to have them and they get packed densely, versus a big network where they're packed more loosely; and then those other neurons are maybe analogous to stuff we don't understand very well in our own cognition.

Yuanzhi Li: (1:42:42) Yeah, it's definitely possible. I think that's an advantage of small language models: they may be more interpretable compared to larger ones, because the smaller models can only do basic stuff, and only the basic stuff is probably interpretable. The very complicated stuff, for example how GPT-4 writes code that's 1,000 lines long, is almost impossible to interpret. But how does a small language model keep the main character consistent? For basic questions like that, we can probably understand that there are some neurons associated with it. In GPT-4, out of the maybe 10,000 or even more neurons, there may also be some neurons dedicated to keeping the main character consistent, but they're just so hard to find, because it may be in the 25th layer, neuron 9,700, whatever. It's just so hard to locate. For smaller models, because they're so small, every neuron must be doing some basic stuff, because the complicated stuff, as we said with the consistency hierarchy or the loss hierarchy, only makes up a very tiny fraction of the loss. So the main fraction of the loss, the part that corresponds to basic consistency, grammar, and things like that, is probably what the neurons in the smaller models learn, and those things are more basic and more interpretable.

Ronen Eldan: (1:44:11) There are many ways for a neural network to solve a problem. Given a problem and an architecture of the neural network, there are many different configurations of the weights that would solve the same problem. Some configurations might be more interpretable to a human, and some are just one big mess, where every neuron is doing a little something of every possible task, and they're combined in very complicated ways. The network has no incentive in the loss function not to be one big mess. Most solutions to the same problem are one big mess. This is where the entropy is located, right? And when the model is small, it has no choice. The neurons have no choice but to align with meaningful tasks because the neurons are where you have the nonlinearities and you just don't have enough of them for the one big mess type of solution. Somehow the most efficient solution is the one that is not completely messy. If you have a large network, it'll just find a way to do it that does not align with the coordinate structure of neurons, whereas when the model is small, you just have no choice. So interpretability appears as a side effect.

Nathan Labenz: (1:45:56) So anything else that we didn't cover?

Ronen Eldan: (1:45:59) Yeah. One thing goes back to the initial motivation for creating the dataset, which is to have a small dataset that serves as a testing ground for ideas in LLMs. An open question here is: do we even have a reason to expect that behaviors we witness in this compact setting will translate to LLMs? We don't know the answer to that. If we find an architecture that works much better for the Tiny Stories dataset, do we actually have a reason to expect that this architecture will also be better for LLMs? I'm just posing this as a question; I think it's one of the most relevant questions that stem from this paper. It connects to a more general question. There are all those papers, like the Google Chinchilla paper and the OpenAI Scaling Laws paper, which suggest that there might be universal phenomena in LLMs. There is, for example, a tradeoff between width and depth. They don't suggest it explicitly, but a natural question arises: is this universal, in the sense that it does not depend on the exact mix you take in the dataset, the exact architecture, and the exact range of sizes? So the question is: are there universal phenomena which are common to the Tiny Stories dataset and to LLMs trained on these large corpora? Let me just say, we have only a few indications of some sorts of universality, and at this point it's completely open. But we really hope, for the sake of saving energy and also for opening the door for PhD students to actually do LLM research, that there is some universality going on, so that you could gain insights, not necessarily on Tiny Stories but on any small dataset, that would actually be relevant to LLMs.

Yuanzhi Li: (1:48:49) Yeah, our future work is mainly about extending the capability of Tiny Stories. If we can create a dataset that captures elementary school knowledge, I think that's already a really good dataset. If we train a language model, maybe 300 million parameters, and it's just good at everything at the elementary school level, or maybe even at the third-grade level, I think that's already a very good model. I think people would love to interact with it: it knows how to talk, it has the basic knowledge. Maybe the dataset will be diverse enough to capture every aspect of real language, just at a downscaled level. Once we have that dataset, I think it really opens the door for everyone to do natural language research, not just the people with 100 A100s in their hands, but the ones with just a laptop GPU. They can train the model in one or two days and gain some interesting observations.

Ronen Eldan: (1:50:01) I think what we witness in LLMs is kind of a mathematical miracle going on. What do I mean by that? You take this algorithm, which is pretty simple. It's gradient descent. I don't want to belittle all the really smart technical contributions that are inside that algorithm, but all in all, it's basically gradient descent with an architecture that's very clever, but still quite simple. And the miracle is you take all this huge training corpus, you fit it to the algorithm, and you don't just get a network that has memorized some text, you get a network that can actually genuinely create, synthesize new content, show signs of reasoning, understanding, and so on. And we think Tiny Stories is just a compact example where you observe the same type of miracle. Of course, it's not nearly as exciting as what happens in LLMs, but already there, at this size, you see that there is some very interesting generalization and emergence going on. And even if it doesn't give us a lot of insights about large language models, this is still a nice playground to try to develop maybe the mathematical foundations necessary to understand why neural networks are able to generalize so well.

Nathan Labenz: (1:51:58) So maybe then just one final question. I'll encourage people to get in there and try it out. What other interpretability type work have you guys seen that has inspired you that you would recommend folks in the audience go take a look at as well?

Yuanzhi Li: (1:52:14) Yeah, I think one of the works that inspired our research is an earlier work from our group called LEGO. It's a synthetic reasoning task that tries to understand what the attention mechanism of the model is doing. We identify several different types of attention in the transformer. Some heads just look at the token that appears right before, or match identical tokens, like an Alice that appeared earlier being associated with a later Alice. And then there are other, more advanced mechanisms, such as doing deduction. So I think this work is pretty inspiring, and it tells us that the transformer is at least doing something reasonable instead of pure, inscrutable math. That's why we also have the interpretability section, where we want to look at what the attention does, and we do see some very good behavior that corresponds to aspects of natural language.

Ronen Eldan: (1:53:22) Maybe there is one more work I want to mention. There is a paper called Transformer Feed-Forward Layers Are Key-Value Memories; that's another paper I like. It tries to interpret what the neurons are doing in, I think, basically BERT-sized transformers, and they're able to show that at least some of the neurons have meaningful roles. In general, the theory behind the interpretability of neural networks is at its very, very beginning right now. There are plenty of very clever works, but it just seems very difficult. So in spite of really nice works in the literature, I think we are still light years away from being able to actually understand what's going on inside the model. A priori, there's no reason to assume that we'll ever be able to really understand it, right? We have a very limited understanding of how the human brain works; it's not like we can point to a neuron and say this neuron has this role in that thought process. There's just no reason to assume we'll ever be able to do that in neural networks. There's also no reason to assume that the solution the neural network finds, the solution gradient descent finds to the problem, is not a very messy, uninterpretable one. So we'll probably be able to come up with some basic or small examples which are partially interpretable, and we might get some insights about big networks. But I personally am not very optimistic about being able to interpret what's happening inside those models to a satisfactory extent, one that might let us control them, manipulate them, make sure they have better alignment, and so on.

Yuanzhi Li: (1:56:11) Yeah, large scale model interpretability may need to take a different approach. I think it's impossible to look inside the neural networks and just pin down that the attention is doing something or the neuron is doing something. But maybe we have to take an approach more like our Sparks of AGI paper, where we just talk to the model and we try to interpret it more like interpreting other humans' intentions when we talk to them. Through a sequence of conversations, maybe we can understand what the model likes to do and what the model doesn't like to do, or what the model is good at, what are the typical cases of the model's failure. It's more towards psychology study, but really for large models, maybe we need to take that approach for interpretability.

Ronen Eldan: (1:57:01) You know, humanity has taken advantage of horseback riding for quite a long time. We have no idea what every neuron inside the horse's brain is doing. We can't really interpret how when we give some command to the horse, like a physical cue, the horse obeys and it's very useful and we can actually rely on it. Horseback riding is very reliable. There are very few cases where the horse has acted unexpectedly in a way that caused accidents. Humanity has profited from that vastly, maybe leaving animal rights aside here, and it works perfectly even without interpretability. We just figured out ways to align the behavior of the horse with our needs by taming the horse. We can tame it without understanding the exact process that's going on there, and this is a big success. I think it's just a good analogy, right? It's kind of horseback riding for the brain, those LLMs. They give us suddenly the ability to go much faster to much longer distances, even if we don't exactly understand what the horse is doing. Definitely, the Mongols didn't understand much about the biology of the horse, but they could still use the horse in a very reliable way. So even though I'm pessimistic about actually understanding the inner workings of the neural network, I'm very optimistic about the usefulness and the fact that we will be able to align it efficiently.

Nathan Labenz: (1:59:25) Ronen Eldan and Yuanzhi Li, thank you for being part of the Cognitive Revolution.

Yuanzhi Li: (1:59:31) Thank you very much. Yeah, thank you for the invitation. It's really my great pleasure.
