The Future of the Transformer Pt 2 with Trey Kollmer

Trey Kollmer and Nathan Labenz delve into AI research, discussing new techniques to reduce global compute and enhance LLM memory.

Video Description

Trey Kollmer returns to discuss the latest AI research revelations with Nathan Labenz. They explore how new techniques will shave 10% off global compute needs, how analogical prompting beats few-shot prompting, and how compressive historical records can increase LLM memory and retention abilities. If you need an ERP platform, check out our sponsor NetSuite: http://netsuite.com/cognitive.

SPONSORS: NetSuite | Omneky

NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.

LINKS:
🎬 The show outline: https://docs.google.com/document/d/1oiSu9X4EVNMf90aRnk4mrfogSmq3QUsRg4GtI95mMCw/edit
Think Before You Speak: https://browse.arxiv.org/pdf/2310.02226.pdf
SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking: https://arxiv.org/pdf/2306.05426.pdf
StreamingLLMs: https://arxiv.org/abs/2309.17453
Large Language Models as Analogical Reasoners: https://arxiv.org/abs/2310.01714
Ring Attention: https://arxiv.org/abs/2310.01889

TIMESTAMPS:
(00:00:00) - Episode Preview
(00:01:11) - Paper: Think Before You Speak
(00:03:13) - Multimodal models for combining vision and language
(00:04:19) - Backspace Paper
(00:06:25) - Chain of thought prompting for step-by-step reasoning
(00:09:14) - Backspacing in language models to correct mistakes
(00:12:05) - Attention sinks for expanding context length
(00:12:41) - Paper: Large Language Models as Analogical Reasoners
(00:15:24) - Pause tokens for language models to "think"
(00:18:23) - Analogical prompting to recall relevant examples
(00:20:52) - Long context windows for language models
(00:23:20) - Markdown works best for OpenAI
(00:24:23) - Ring attention to break memory constraints
(00:26:15) - Paper: StreamingLLMs
(00:27:46) - Potential for superhuman performance with longer contexts
(00:31:01) - Dynamic context window adjustment at runtime
(00:33:53) - Retention and memory capabilities for transformers
(00:37:12) - Planning algorithms combined with memory and scale
(00:39:49) - Paper: Ring Attention
(00:42:35) - Executive assistant prompting and critique
(00:45:23) - Self-RAG for language models to find own examples
(00:48:02) - Timelines and predictions for future capabilities
(00:50:37) - Applications like analyzing long texts and scripts
(00:53:15) - Local versus global attention in transformers
(00:55:59) - Architectural changes versus just training adjustments
(00:58:41) - Pre-training strategies like random start points
(01:01:16) - Representing transformers for intuition versus efficiency

The Cognitive Revolution is brought to you by the Turpentine Media network.
Producer: Vivian Meng
Executive Producers: Amelia Salyers, and Erik Torenberg
Editor: Graham Bessellieu
For inquiries about guests or sponsoring the podcast, please email vivian@turpentine.co



Full Transcript

Transcript

Trey Kollmer: 0:00 No less than Emad Mostaque from Stability said brilliant researchers like this literally knock 10% off of global training compute needs with these improvements, which are impossible to predict. 10,000,000 tokens starts to give you the opportunity to put whole bodies of literature into a single context. Right? I mean, The Great Gatsby famously fits into Claude's 100k. Now you're talking perhaps about 100 books with full attention considered for the next token generation. If this allows the models to make those connections at such huge length, this could be where you could start to see tipping into superhuman performance of learning things that experts don't know.

Nathan Labenz: 0:00 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz joined by my cohost, Erik Torenberg.

So then the next one: Think Before You Speak is the name of this paper, training language models with pause tokens. This comes out of a Google and, again, Carnegie Mellon collab. And this was a student from Carnegie Mellon who's interning at Google. It's amazing how many of these papers are couple month processes. And you can do this kind of stuff in the context of a summer internship these days. Amazing.

So what do they do here? They start off with an observation, which is a pretty simple one on some level. I'll quote this from the paper: language models generate responses by producing a series of tokens in immediate succession. The k plus 1 token is an outcome of manipulating k hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate, say, k plus 10 hidden vectors before it outputs the k plus 1 token?

So, basically, just in the early going, but kind of any time, you only have a certain amount of computational space to move information around. And maybe that's just not enough or maybe more could be beneficial. First thing that comes to mind for me when I hear something like that is I think that's a lot of what's happening in the chain of thought type prompting. Certainly, that's been hypothesized that by giving the model time to think, you give it time to kind of work its way through things and hopefully summon the right reasoning, and then you get better results. So certainly empirically, we see that you get better results.

Well, this is now saying, okay, what if we just gave it extra space, but we didn't make it do anything with that extra space? I'm not asking for reasoning. I'm just giving it a pause token that it can just literally put a pause in when it feels like it needs to, and how exactly that gets decided is a bit of a black box. And, you know, there are some trade-offs here, I think, for sure. But give it that opportunity to just take a pause when it needs to. Now it can just process information a little bit more. It could potentially do a couple pause tokens in a row if it needs to. They suggest, what about k plus 10? So 10 extra vectors that it can kind of manipulate and move information back and forth between. Does that give us the opportunity for better performance? And, obviously, this makes the research roundup because, indeed, they show that they are able to improve performance with it.

So first thing on this, I was like, boy, I saw that one coming. That was the first thing. We covered a little bit of the Backspace paper on a couple earlier episodes. That was a Stanford one, and this kind of combines some concepts from the last one too. In the backspace one, when they got out of distribution, as measured by not having high confidence on any next token, then they started to train the model to use the backspace to go back and be like, well, we must have gotten off the rails here a little bit because now we're not confident of what to do next. So let's instead go back, try that one again, maybe make a different prediction this time, and then maybe that will lead us towards something where we can feel more confident.
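To make the backspace idea a bit more concrete, here is a minimal inference-time sketch in that spirit, assuming a Hugging Face-style causal language model. The confidence threshold and the delete-then-resample loop are illustrative stand-ins for the learned backspace action the paper actually trains into the model; this is a sketch of the behavior, not the paper's method.

```python
import torch

def decode_with_backspace(model, tokenizer, prompt, max_new_tokens=64,
                          conf_threshold=0.2, max_backtracks=8):
    """Sampling loop that 'backspaces' when the model looks unsure.

    Illustrative only: SequenceMatch trains a learned backspace action;
    here we approximate it by deleting the last generated token whenever
    the next-token distribution is low-confidence, then resampling.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    backtracks = 0
    while ids.shape[1] - prompt_len < max_new_tokens:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        if (probs.max() < conf_threshold
                and backtracks < max_backtracks
                and ids.shape[1] > prompt_len):
            # Low confidence: treat it as "we got off the rails", delete the
            # previously generated token (the backspace), and try again.
            ids = ids[:, :-1]
            backtracks += 1
            continue
        # Sample rather than argmax so a retried position can come out
        # differently the second time around.
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0, prompt_len:])
```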

When I saw that, I was like, well, it seems like, if you can go back one, you could probably just add a padding one too and just kind of quietly think to yourself. Sure enough, 60 days or so later, here's the publication. Was this inspired by that? I'm not sure. It was kind of on the border where there was just enough time for them to have done it in response to that. But I would guess, honestly, they probably had the idea before. So this is probably independent, kind of parallel lines of thinking.

So I was like, okay, cool. That's good confirmation that I'm starting to build some intuition about this kind of stuff. But then as I was thinking about it more, I'm like, boy, do I like this? Do I not like this? I like chain of thought in that I can at least read what it's outputting. And when I can read what it's outputting, then I can be like, well, if the reasoning is wrong, then no wonder the answer is wrong. You know? So I can kind of maybe go back and coach it on the reasoning a little bit more.

Here, if you're just using these pause tokens, which is what they're kind of doing in the experiments, they show that it works, but you've kind of lost a step in terms of interpretability because now you don't have this reasoning that you can examine. Instead, you just have this pause and, like, yeah, it improves performance, but what's happening there? We don't really know. We're back to, well, we can try to look at the activations and figure it out that way. But it certainly is nice, as a practitioner, to see this kind of reasoning that you can audit and get comfortable that it seems to be approaching the problem in the right way.

Now, like many things, of course, these are not mutually exclusive. They do show that it works, and it works better if you incorporate it into pretraining as well. So there's kind of, you know, going back and synthesizing some data, adding pauses in to kind of show, at some scale, when pauses are appropriate. That helps it even more. And then I don't see any reason you can't use both of these techniques together. You know, you could have the pause and then the chain of thought. Right?

So one thing that I do sometimes worry about a little bit with chain of thought, there's actually multiple worries with chain of thought. One is that there has been some research that shows that it's not always super faithful, which is to say the answer that you ultimately get is not always as determined by the reasoning as it may seem or as you may wish. So that can be a little bit complicated. You can't just totally naively trust the reasoning output.

So maybe here you could do something like first pause and then reason. Because I've often kind of thought, well, jeez, if I force it to give an answer immediately and that's subpar, and then I can get better performance by reasoning, well, don't I still have a little bit of a problem in that it's immediately reasoning? You know? What if it's not reasoning right? And I think, again, I don't want to analogize too much between human and AI, but in this case, I do feel similarly. Right? If I'm, bam, forced to answer some question, okay, wait, I'd certainly be advantaged if I could think it out. But it's the same kind of thing if you're like, you must begin reasoning immediately. You know, I'd be like, can I just think quietly for a second, then explain my reasoning, and then get you an answer?

So now we've kind of got the ability for the AI to do something similar. I don't see any reason you couldn't train for first the thinking pause, then the chain of thought. Hopefully, now your reasoning becomes better. Hopefully, it becomes more relevant. Hopefully, it becomes more faithful. And then, you know, maybe you get the best of both worlds with kind of combined tactics. You get the, you know, the best possible accuracy.

Nathan Labenz: 8:47 How do they train the model to output pause tokens? Do they pick situations, like with the backspace paper, where it's not that confident, and then they say that's a situation where you should output a pause token? Or do they have some other method? Because that seems like a hard thing to put in the data. It's not automatically in the data.

Trey Kollmer: 9:09 Yeah. It's definitely not automatically in the data. So they do it with pretraining and also with fine tuning. It works best with both pretraining and fine tuning, although you do get some lift either way. Well, it varies, I guess, across datasets. When you do both, it's clearly the best. Depending on the dataset, in some cases just fine tuning makes it better, and in some cases it makes it worse. It makes it better more often than it makes it worse, but there are a couple of datasets where just doing it at the fine tuning stage does make it worse. I'm not immediately seeing how they're preparing the data.

Nathan Labenz: 9:48 I guess they could just train it to always start with 10 pause tokens.

Trey Kollmer: 9:53 Yeah. It looks like it might actually be as simple as that, because on the inference side, it says: during inference on the downstream task, we append m pause tokens to the prefix, and as always, we ignore the output of the model until the last pause token is seen. We term this pause inference. So at least when they're testing it, it looks like they're presenting these benchmark questions, and they've got grade school math and common sense and web QA. There are 8 here just in the one graph of results. It seems like for those, these are just straight up questions you're supposed to answer. And so they just say, okay, you're gonna definitely pause and use these extra tokens.
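A rough sketch of that pause-inference procedure, assuming a Hugging Face-style model and tokenizer where a <pause> token was already added during training; the token string and the default m=10 are illustrative choices, not the paper's exact setup.

```python
def pause_inference(model, tokenizer, question, m=10, max_new_tokens=128):
    """Append m pause tokens to the prefix, then read off the answer.

    Sketch of the paper's 'pause-inference': anything the model produces
    before the last pause token is ignored; the answer is whatever comes
    after. Assumes '<pause>' exists in the tokenizer/model vocabulary.
    """
    prefix = question + "<pause>" * m
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Everything up to the end of the prefix (question plus pauses) is skipped.
    answer_ids = output_ids[0, inputs.input_ids.shape[1]:]
    return tokenizer.decode(answer_ids, skip_special_tokens=True)
```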

And then it looks like it can add more pauses, and it's still not quite clear how it's learning when to do that or not. You could easily imagine kind of setting up, certainly at the fine tuning scale, a dataset that just shows when you would want to pause. There's more work to do, I would say, to generalize this beyond the current set of benchmarks that they're running it on. Because here, they're basically just saying, we're in benchmark land. We're in question answering land. We're just gonna give you some pause tokens at the start of your answer and see how you do and okay, cool, it helps. But when to pause doesn't seem like it is something that the model has fully learned here.

And so yeah, but you can imagine also scaling this up. You honestly could imagine probably GPT-4 helping you scale it up, annotate these examples with where thinking breaks would be inserted. It would do that perfectly well. And then, you know, next thing you know, it can kind of learn when to do the pause breaks. I would bet pretty well. Again, we've seen that with the backspace, so I would expect that something similar would be possible here too.

Yeah, there is still one section on appending versus prepending the pause tokens. So it does seem like this is still kind of in the manual manipulation realm as opposed to a fully learned tool or technique that the model can use at its own discretion. Also, again, this was presumably a PhD candidate working as a student researcher at Google who's the first author on this paper. So, you know, plenty of additional muscle there to take the ball forward a little further.

Okay. So that's Think Before You Speak. Next, analogical prompting. This one kinda surprised me, honestly, but the more I thought about it, the more it started to make sense. So analogical prompting is presented as yet another improvement in how to just get the most out of the current language models that we have. They compare this explicitly to few shot chain of thought and find that it can beat few shot chain of thought. So this might be the new best prompting technique, and it's also easier to do than some of these other prompting techniques.

What they do this time is: first, they present a problem, then they ask the model to recall relevant examples and then solve the initial problem. And that's in contrast to few shot chain of thought. Few shot would be, here's a couple of examples. Chain of thought would be, think step by step, etcetera. Few shot chain of thought would combine both of those, where you have examples, and the examples show the reasoning that you want. Here, you're able to just say, here's the question. It's on the model itself to recall the relevant examples and to then cycle back to solving the original problem.
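A minimal sketch of what such a prompt might look like; the wording is paraphrased for illustration rather than taken verbatim from the paper.

```python
def analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    """Single-turn analogical prompt: the model is asked to recall its own
    relevant exemplars before solving the actual problem, with no few-shot
    examples and no external retrieval."""
    return (
        f"Problem: {problem}\n\n"
        "Instructions:\n"
        f"1. Recall {n_exemplars} relevant and distinct example problems, "
        "and explain how each one is solved.\n"
        "2. Then solve the original problem step by step.\n"
    )
```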

So the fact that this gives better performance initially, I was like, wow, that seems weird because it seems like you're relying on the language model kind of a lot. Right? When I give it a few shots and I show the kind of reasoning I want, then I'm doing my part. I feel like I'm guiding it to where I want it to go and showing it how I want it to behave. Here, it's responsible for generating the examples. This is not a tool use. There's no database here. It's just generating the examples with the same kind of generation as always. And yet, at least, we'll see as this gets into the wild, but this is somehow better than chain of thought.

How would I understand that? So what I came up with was maybe the right way to think about this is heuristic recall. We have all these results, right, showing this kind of high level conceptual middle-layer sort of understanding. Maybe what it's able to do when you say find relevant examples is load this problem into some high level representation. It's maybe able to do a better job than you are, at least for this particular problem, based on these high level, quite decoupled conceptual representations. Based on that, it seems to be able to then zero in on a really relevant canonical example, and it's seen examples of so many things, obviously, right, that it has a lot there to draw on.

So it seems like it's maybe better able to pick the most useful example than your few shot example, especially as you then take that to the diversity of whatever you're trying to do. So if it can locate a better example, and it's kind of memorized or learned the heuristic that solves that canonical example that it's able to load into place, then it essentially could get a better shot loaded from its own memory for this particular challenge than the hard coded few shot that you try to cook up for it, try as you might. Right?

So I almost think of this as kind of self RAG, RAG being retrieval augmented generation. And I think I'm gonna do another episode on kind of the state of RAG maybe in the next week or 2 as well. But the classic RAG setup would be you have a database. The query, the question, whatever gets sent into the database to find relevant stuff, that gets loaded into context, and then you proceed from there with the benefit of whatever was retrieved out of the database. This is like treating the model itself as the database and saying, you know, you could embed that and go hit some vector database that has a ton of examples. But the model itself kind of represents all those relevant concepts and can generate relevant examples. And then once it has the most familiar kind of happy example, then it can apply the same heuristic to the question in hand. And it works better. So it's like, wow, okay.
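For contrast, here is a bare-bones sketch of the classic RAG loop described above. Every name in it (llm, embed, vector_store) is a hypothetical placeholder interface, not any particular library's API.

```python
def classic_rag_answer(llm, embed, vector_store, question, k=4):
    """Classic retrieval-augmented generation: embed the question, pull the
    k most similar stored examples, stuff them into the prompt, then answer.
    `llm`, `embed`, and `vector_store` are assumed interfaces, not real APIs.
    """
    query_vec = embed(question)
    examples = vector_store.search(query_vec, top_k=k)  # list of strings
    context = "\n\n".join(examples)
    prompt = (
        f"Relevant examples:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer step by step."
    )
    return llm(prompt)
```

Analogical prompting essentially deletes the embed and vector_store steps and asks the model to generate its own exemplars instead.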

This is across a number of different models. So, yeah, basically, it seems like even up to the frontier models, this still helps. It's not like a dramatic change relative to other prompting techniques. It is a pretty dramatic change relative to just zero shot, nothing. And then I wonder whether this is probably most competitive with a good RAG implementation. Because, you know, I've been building some of these types of systems. I built a first version of a prompt coach for executive assistants recently, and I just did a few shot chain of thought implementation there. This is kind of a meta prompt where the idea is the executive assistants don't necessarily have a ton of experience prompting. They may do wrong things. So can we identify things that they're doing that are suboptimal and provide feedback on how to better prompt the AI for better results?

It's a little bit like the improver, except we're trying to improve the humans' prompts to the AI. So my few shot chain of thought is just like, okay, here's a number of prompts, whatever. I kind of just grabbed some random ones. I wrote critiques of these prompts that had these very shortcomings and showed what kind of feedback I wanted. And the system kinda works. And then I've written out what the next things I would do are if I wanted to improve this further. And I was like, I think the best performance that we could probably get to would be with a RAG setup where, instead of using the same examples for every prompt, I would go get specific examples that are most relevant to this one. And then the question becomes, how big does that database have to be?

I mean, that's where it starts to get a little bit scary, because I don't wanna have to write 1000 of those by hand, and that might not even be enough. I don't know. My few shot examples were just 4 or 5. That was manageable enough. But 1000, that's a different scale of project for sure. Of course, I can have GPT-4 help me, but we'll actually use Claude Instant for that, interestingly enough, because it's quite a few tokens. We need the long context window, and Instant is really good for the responsiveness and seemed to be performing almost as well as Claude 2. At 10% of the cost and a lot faster, I decided to go with the cheap one.

But, anyway, this now starts to suggest maybe a different approach where maybe I don't have to do this whole RAG setup and have 1000 examples and go find the most relevant examples. Maybe I can just have the language model recall the most relevant examples and then have it go from there. If there's one thing that's probably not gonna work on quite yet, it might be meta prompting, because it certainly has seen the types of questions that are being asked in this study, which are grade school math again and that kind of thing.

When you flip over to improving prompts, depending on the training data cutoff or whatever, has it seen huge numbers of prompt coaching examples in its training data? Maybe it has, maybe it hasn't. If it has, it's probably because they've consciously baked that in as opposed to it being out there on the Internet as of the training data cutoff. That does seem like something that Anthropic might be a little ahead on, because they do have this kind of constitutional approach where there is this more kind of iterative internal sculpting. Obviously, OpenAI has a lot of that too.

But, anyway, that's my next test on my prompt coach: to see, hey, maybe I can skip this whole complexity of the RAG setup by just having the AI itself recall the most relevant example. Heuristic recall is kind of my own name for this. Because once it has that example and it knows how to solve that problem, then we kind of move pretty smoothly into applying that same heuristic to my particular challenge.

One of the examples that they show in the paper is finding the area of a square or something like that. And you think about this. It's fascinating. It's weird. But, again, it has a certain logic. You can kinda see how this would make sense for yourself. Right? If you're like, okay, I'm a middle school math student. I've been presented with: find the area of this square. And what am I gonna do? Am I going to look at 4 other math problems with reasoning and be like, now I understand what to do? Not exactly. Not if none of them are finding the area of a square. Right? What I'd really do is be like, okay, how did I find the area of a square? Let me recall that. Let me recall the simple example. Now I'm gonna apply that to the particular numbers and details of this particular problem.

And so it does seem like a little bit more human like behavior, because that is more like what I would imagine myself doing internally if I was trying to solve a similar problem. Right? I'm gonna look for the one that is most analogous, recall how I did it, and then proceed in the same way. That's available for free anywhere you prompt right now, by the way.

Nathan Labenz: 22:40 That's crazy. It is. I mean, it's interesting that it performs better than giving it a few shot examples. It just seems like another data point of just because you don't find something in the model with an IEP prompt doesn't mean it's not in there somewhere. And that there's some work to coaxing out the full capabilities of the models.

Trey Kollmer: 23:02 Yeah. Another little tiny detail of this is they seem to be using markdown for the instructions. And I just saw some chatter online the other day from a couple people that are definitely super knowledgeable, who said that Markdown works best for OpenAI because that's kind of how they tend to train stuff in their own processes. And XML tags seem to work best for Claude, and that that is something that Anthropic kind of officially recommends is use XML tags. So just another little footnote. Markdown for OpenAI, XML for Claude. Your mileage may vary on that, but this is showing the Markdown format. And even for GPT-4, you're squeezing out a couple more percentage points on these math benchmarks. So pretty incredible. It works almost across the board. I think they show one thing, basically, where it's not the very best. And in that case, it's 0.4 percentage points lower than the chain of thought. But in every other thing that they show, it's a couple percentage points as much as 5 percentage points higher. Yeah. Go improve your apps.
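To make that footnote concrete, here is the same toy instruction block in both styles; the content is invented for illustration, and as noted, your mileage may vary.

```python
# Markdown-style prompt (reportedly a good fit for OpenAI models):
markdown_prompt = """# Instructions
Recall three relevant examples, then solve the problem step by step.

# Problem
A square has a perimeter of 36 cm. What is its area?
"""

# XML-tag style (the format Anthropic recommends for Claude):
xml_prompt = """<instructions>
Recall three relevant examples, then solve the problem step by step.
</instructions>
<problem>
A square has a perimeter of 36 cm. What is its area?
</problem>
"""
```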

Hey. We'll continue our interview in a moment after a word from our sponsors.

All of these things, right, as I said kind of earlier, we're climbing a level of impact, conceptual importance perhaps, whatever. Now we're getting to a couple of things toward the end that I think are definitely pretty interesting, pretty notable. So there are 2 papers that specifically speak to the possibility of much longer context windows, and they do it in quite different ways. So the first one is called StreamingLLM, and this is a paper out of Meta. The supervisor, Mike Lewis, you'll see his name on quite a few of the big Meta papers.

Basically, what they do here is they manage to dramatically extend, not the context window exactly, but the length of text that the language models can handle, with just a relatively superficial change, in a way that applies to the existing open source models. So they're able to show that this technique can be quickly retrofit to, you know, not all, but a lot of them. They've shown multiple different major open source models that they apply this technique to, and hey, it works across these already existing trained models out there in the wild today.

So what are they doing? It starts with an observation, and it's another one of these things where you're like, wow, that is a pretty simple observation. Kind of surprised nobody noticed that before, but you made a lot of hay with it. So the observation is this thing they call attention sinks, but let's just start with the purely observational.

What they find is that as language models are making their predictions, the late tokens often attend to tokens that are very close to them. Just one of the words immediately prior, which makes total sense. Right? Because if you're thinking, okay, I need to predict the next word. Well, the most important words are gonna be the ones that just came immediately before. Just for even simple things like part of speech and just general continuity. So you're gonna see this intense attending to the immediately preceding tokens.

Then you see kind of random, but not that much attending to things that are farther back because, you know, most of the time and, again, just kind of sanity checking yourself. If you were gonna try to predict the next word in a paragraph, you'd be looking at those last few. You'd maybe read the whole thing, but in a lot of the details in the middle paragraphs, they're probably not super relevant for what that next word is gonna be. And so there's not a lot of attention into these middle things, although key things do matter.

And then there's a lot of attention. For many tokens, there's a lot of attention to just the very first tokens in the sample. And so why is that? It doesn't seem like it's super influential, but what's going on?

So the hypothesis that they come up with is that because of the mechanism where the attention for a given token across all earlier tokens has to sum to 1, which is a constraint of just the way everything is calculated in the computation, the attention has to go somewhere. So if there's nothing for a given token that is super relevant, still the attention has to go somewhere, because it's required to sum to 1.

They hypothesize that these intensive attentions to the very first tokens are what they call attention sinks. In other words, you've got to have this thing sum to 1, but there's not really much here that seems super relevant. So the model just kind of puts it at the beginning, and then somehow some other part of the network can not pay too much attention to that. Not to use attention in different ways, but I am. We're talking here about the proper attention mechanism, and I'm just saying downstream, it seems like that high level of attention paid to the first few tokens is somehow accommodated in such a way where it's kind of expecting that, and it's kind of fine downstream performance wise.
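The sum-to-1 constraint is just the softmax inside the attention computation; a toy illustration with made-up scores, not the paper's code.

```python
import torch

# One query position attending to five earlier tokens (arbitrary logits).
scores = torch.tensor([0.1, -2.0, -1.5, -1.8, 3.0])
weights = torch.softmax(scores, dim=-1)
print(weights, weights.sum())  # non-negative weights that sum to 1.0
# Even positions the model "doesn't care about" must absorb some mass,
# which is the constraint behind the attention-sink hypothesis.
```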

So what they then do is say, well, what if we... There have been a lot of different techniques over time that have tried to figure out, how can I have a long running conversation with a language model? Especially when it was, you know, a 2,000 token limit, you'd hit that limit pretty quick. Now with GPT-4 at 32k, with Claude at 100k, you don't reach it super quick, but it's still nowhere near enough to have a super long running dialogue. And so people tried various things.

One thing that folks have tried is kind of a sliding attention window where you basically just only look back n tokens, and everything before that, you just kind of forget about. And that doesn't work for some reason, and I don't think it was necessarily clear before this why that wasn't working. I'm not sure it's entirely clear even still with this. But as an empirical matter, what they find is that if they keep the first however many tokens, which they now call the attention sink tokens, and they experiment with some different things where they even just put some padding there, put some empty stuff, so that there is this kind of designated place. And, again, this is another thing where if you do it with pretraining, then it will work even better. You know, I don't think we've seen the end of this.

But keep those early tokens, those attention sink tokens, there at the beginning, and then do the sliding window. Now it works. And now you can basically keep essentially the same level of model performance way beyond the actual context window that you have. Somehow the ability to put this extra attention, which doesn't really have any other natural place to go, back onto these early starting attention sink tokens that don't change allows the models to kind of stay coherent.

Where, otherwise, what had been observed is as soon as you kind of get past and start to drop some of those early tokens, the thing kind of blows up and doesn't work anymore. Now it kind of stays coherent. As you get to a long enough thing, you're still dropping the middle stuff. So if you have a 32k model and you've got, you know, 40k worth of text, you've got the initial attention sink tokens, then you're skipping 8,000. And you're just looking at the last 32,000, but the ability to not have to force everything into that window and continue to put some stuff toward the beginning into the attention sink allows it to basically maintain a consistent perplexity score for super, super long texts.
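A schematic of the eviction policy being described, on a toy cache where integers stand in for per-token key/value entries. The real StreamingLLM code operates on per-layer KV tensors and handles positional details that this sketch ignores; the sink count and window size are illustrative.

```python
def evict_kv_cache(cache, n_sinks=4, window=32_768):
    """StreamingLLM-style eviction on a toy cache (a list of per-token
    entries): always keep the first n_sinks entries (the attention sinks)
    plus the most recent `window` entries, and drop everything in between.
    """
    if len(cache) <= n_sinks + window:
        return cache
    return cache[:n_sinks] + cache[-window:]

# Toy usage: token positions stand in for cached key/value pairs.
cache = list(range(40_000))
cache = evict_kv_cache(cache)
print(len(cache), cache[:5])  # 32772 [0, 1, 2, 3, 7232]
```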

And we're not talking about a small difference in length here. We're talking about going into millions of tokens. This goes a long way. So it is a little confusing. I mean, I think I sort of wanna understand this a little bit better. Notably, it seems like the perplexity is staying basically flat. It does not seem like it's getting better. And I think that's consistent with the general understanding that, okay, at any given time, we are only looking back at our 32k or our 8k or whatever our attention window is. So anything that is being dropped before that, we're not able to take advantage of. So we're still kind of operating at a, you know, whatever, a 32k or an 8k capacity, but we're able to slide that window out into the future.

This could create some odd experiences for users if you're like, well, I had some interaction about this particular topic. Is it still in window, or is it now out of window? That could be kind of weird. You could have some situations where it's remembering, remembering, remembering, forgetting. The perplexity score as measured on these super long texts doesn't show a pain point there. It's just kind of humming along at whatever consistent perplexity it can achieve with its 8k or its 32k look-back. That's what it can do. And it can just now continue to do it on a rolling basis.

But from a user standpoint, that might be somewhat weird if all of a sudden, something that you did just know about, now you no longer know about. So I don't think this is exactly the form in which this is going to hit production. But I do think it could be a pretty powerful aspect of something that I think could be kind of start to look more like a next generation system.

Nathan Labenz: 33:52 So why do you think the beginning tokens are so much more effective attention sinks than just whatever tokens are at the beginning of the sliding window?

Trey Kollmer: 34:03 I don't know. It's odd. Maybe they just don't look like a beginning. That's kind of the best thing I can come up with. I mean, I do think you could say their hypothesis was pretty minimalist. They were phrasing questions slightly differently. Why would the attention sinks be at the beginning? And the answer was, well, the beginning is the stuff that all downstream tokens can see, so that's a natural place for that to go. If you start with the notion that you need an attention sink, then it seems reasonably intuitive that it would be at the beginning. Why the attention sink behavior doesn't roll with you as you roll the context window through a long text, I don't have a great intuition for that, and I was trying to come up with something. And I think that's the best they came up with, just that maybe it doesn't look like a beginning and maybe things that look like beginnings are kind of what it's looking for when it's doing that. And if you're right in the middle, maybe it just gets straight up confused.

This may be a reflection of kind of how things are pre-trained. It does seem consistent across these models. They've published the code already. The code is out there on GitHub. So this is a framework that you can apply to existing language models. And they show that they are indeed applying it to Llama 2, LongChat, a few different ones in here. Seems like it's a lot of Llama 2. This is out of Facebook. But there were others that were not, yeah, Falcon, MPT. So quite a diverse set of different existing open source models that they've done this for. And it seems to work comparably well basically across all of them. The perplexity looks pretty similar. It looks pretty flat.

I was thinking, well, maybe it has something to do with how they're trained, the way that the text is run through them. If it's always kind of chunked in ways where it's starting with something that kind of looks like a beginning, then you could kind of understand this behavior where it needs the thing to look like a beginning in order to use it that way. I would be curious if there's a model where they specifically start in the middle in pretraining, and just kind of take random starting points that could be mid-paragraph, mid-sentence, whatever, and just try to chop it up very noisily like that. Maybe that could be a way in which the rolling window could work better, and then maybe you wouldn't even need the attention sink dedication tokens. Maybe that attention sink behavior could roll. But I don't know of any models like that, and I don't even know that we would know, even for these models that they study, because often that level of detail is just not disclosed. That's my best guess as to what is going on there.

And again, this works on existing models. Right? So now we're not far at all from people being like, oh hey, I'll take my Llama 2 whatever and just apply this framework. And now you can have running chat long-lived as long as you want. You don't hit that hard end. The kind of user experience trade-off would be: today you hit a hard end and it's the end, you gotta start over. Next generation, with just applying this, now you can run forever, and it'll at least stay coherent, but you're gonna have amnesia for stuff that has now rolled out of the window. But it's not catastrophic amnesia where you get totally insane right away, but you may have these kind of weirdnesses where you're like, when is that thing in or out? And how is that affecting me? And it definitely could create some strangeness.

But again, I think there's a synthesis of almost all of these things coming, keeping in mind that you can train a GPT-4 scale model in a week if you're Inflection, once you get all your stuff online. Combining a lot of these techniques into one model is, I think, definitely not too far into the future and certainly doesn't feel like science fiction at this point.

So one more paper, then I'll kind of sketch that out. So the last paper, Ring Attention. This one is the deepest tech. It's really at the intersection of hardware and algorithm. And no less than Emad Mostaque from Stability said that these are brilliant researchers and that this literally knocks 10% off of global training compute needs, with improvements like this being impossible to predict. And I honestly don't even think that's necessarily an exaggeration. But for one technique to potentially chop off 10% of global compute needs means that probably just a lot more is gonna happen. Right? I don't think we're gonna make any fewer H100s. Instead, there's just more that's gonna get done with them.

So how does this one work? I caveat this one by saying I'm definitely not a big expert here. But one thing that is interesting to know about transformers particularly, and about models more generally, is you can represent them in different ways. Obviously, you could represent them in code, which is how they're ultimately kind of represented. You can represent them in linear algebra notation. You can represent them in diagrams. And different representations have very different trade-offs in terms of the intuition that they help you to develop on the one hand and then the actual compute efficiency on the other hand.

So I think Anthropic has made great use of this in some of their research work where they're like, nobody would compute with this representation, but we find this representation to be the most intuitive for how we want to think about what is actually happening in the transformer. So we're kind of decoupling how the computation actually happens, subject to the hardware and the RAM and the layout and all that stuff, from the way we wanna organize our thinking about it in our heads for intuition building purposes. So just separating those, for one thing, is a pretty important conceptual move.

And what these guys have done is they have restructured the computation, seemingly identifying that the current bottleneck can be worked around. Now there's gonna be, of course, some next bottleneck, but the next bottleneck seems to be a much more appealing overall bottleneck. So they are restructuring the compute. Notably, this is not a shortcut. It's not an approximation. This is still fully literal attention computed to the same level of precision with no shortcuts. It's just a more efficient way of structuring that compute.

And basically, it amounts to passing different things back and forth between devices instead of other things that used to be passed back and forth between devices as all this kind of information is flowing. Right? Because you've got the parameters of the model itself. You've got the data that's being represented and flowing through. You've gotta, if you're doing training, you've gotta keep in mind all the gradients as well. Like, if I tweak these things, how is that gonna change so you can do the back propagation? So you've got huge memory requirements and also huge data passing requirements between devices. So by optimizing this in a new structure, basically, here's a couple of key quotes.

Ring attention lets you scale context length linearly with device count, breaking free from memory constraints. Scaling context doesn't mean a quadratic increase in flops per dataset. For the GPU rich, you can go from a 4,000 token context to a 10,000,000 token context on a 175 billion parameter model for 150 times the training compute. What is that? That's a 2,500x increase in the context length, which we've all been kind of taught, well, attention scales quadratically. But with this reorienting of how things get passed around, it costs you 150 times more compute, which is not even the 2,500x that you're getting in terms of the expansion of the window, and certainly nothing like 2,500 squared. So it does cost more to train a 10,000,000 token thing versus a 4,000 token thing, but only two orders of magnitude more, and it's not running away with a quadratic function.

Then they also say for the GPU poor, if you have just 8 GPUs, then you can expand your context by 8x at just 2 times the cost. So if you were gonna train a 4,000 token context window on 8 GPUs, now you could train 32k, and it would only take you twice as long. That's a pretty huge difference in terms of utility between a 4k and a 32k for just 2x the compute. And again, not an approximation. Right? This is the full attention every token to every token.
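To make the mechanics a bit more concrete, here is a toy, single-process NumPy simulation of the blockwise computation that ring attention builds on. Real ring attention shards the query blocks across devices and rotates the key/value blocks around the ring while overlapping communication with compute; none of that distribution is shown here, causal masking is omitted, and the block size is arbitrary. The point is just that exact attention can be accumulated block by block without ever materializing the full score matrix.

```python
import numpy as np

def ring_attention_sim(Q, K, V, block=128):
    """Exact (non-approximate) attention computed block by block, the way
    each device in the ring would: hold one query block, receive key/value
    blocks one at a time, and fold them in with an online softmax so the
    full n-by-n score matrix never has to exist in memory at once."""
    n, d = Q.shape
    out = np.empty_like(Q)
    for qs in range(0, n, block):
        q = Q[qs:qs + block]                  # this "device's" query block
        m = np.full(q.shape[0], -np.inf)      # running row max (for stability)
        l = np.zeros(q.shape[0])              # running softmax denominator
        acc = np.zeros_like(q)                # running weighted sum of values
        for ks in range(0, n, block):         # KV blocks arriving around the ring
            k, v = K[ks:ks + block], V[ks:ks + block]
            s = q @ k.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)         # rescale previous partial sums
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Sanity check against the naive quadratic computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(ring_attention_sim(Q, K, V), naive)
```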

So what I think is really incredible about this is you can only put so much into the current token limits that we have today, and that's important at runtime certainly because you can only pack so much in there. Beyond that, it just can't handle it. But I think it's also probably pretty important at the training layer. And I recall Ilya from OpenAI giving this example of the mystery novel where it's like, you have maybe hundreds of pages, thousands of pages, you know, it could be an epic. Right? You can have thousands of pages all leading up to "and the person who did it is blank." And if you could only read the last chapter, you're gonna struggle. If you'd read the whole book, you can do a lot better.

So the longer the window is, the more opportunity there is to, yes, infer effectively, but also to just learn connections between things that are potentially quite remote from each other. 10,000,000 tokens starts to give you the opportunity to put whole bodies of literature into a single context. Right? I mean, The Great Gatsby famously fits into Claude's 100k. Now you're talking perhaps about 100 books all with full attention considered for the next token generation.

Now would you run that at runtime for inference? Probably not. I think it's probably more impactful for training. If I really wanna get into science, if I really wanna start thinking about how can I use language models for understanding DNA interactions, things that are really kinda data heavy or where there's just so many possibilities for connections that are just not obvious, then the 32k or the 100k is still just not enough to do that. You need more than one Great Gatsby worth of stuff to start to draw these really far-flung non-obvious connections.

But at 10,000,000, and there's no rule here that says it stops at 10,000,000, it's just that that's the one example that they gave. At 10,000,000, you can start to load up really pretty serious amounts of data. And my guess is that of all the things we've talked about today, this would be the thing that would start to drive overall lower perplexities. Right? When we talked about the sliding one, even with these attention sinks, the perplexity is basically flat. That just means model confidence, the performance is basically flat, but that's rolling. Now you really allow it to learn from 10,000,000 tokens at a time. There's just so much more input there that it can learn from, so much more opportunity to make huge different long-distance connections between things.

It seems to me this could be a huge unlock. And again, to put that at two orders of magnitude, let's say you've got your GPT-4 today, and we know that maybe that takes you a week. Two orders of magnitude up from that would be 100 weeks. That would be two years. There are a lot of H100s getting shipped. We're not necessarily that far from being able to train a GPT-4 kind of model, and it starts to be a different class of thing, I think, at 10,000,000 tokens per example for it to learn from. That's just so ridiculous that this feels like the thing where maybe you could see your way into superhuman performance. Because I can handle the whole Great Gatsby. I can kinda remember what I read. I would put myself up against the language model for making that guess as to whodunit at the end of the murder mystery. But you now give me 10,000,000 tokens. I can't hold that. I can't do 100 books. I cannot do anything analogous to full attention for a 10,000,000 token thing.

And so if this allows the models to make those connections at such huge length, it seems like this could be where you could start to see tipping into superhuman performance of learning things that experts don't know. And if you're learning things that experts don't know, now you're really into a whole new era. I mean, that's probably the biggest threshold that everybody's kind of wondering if and when we might cross. And I would say there's a decent chance in my very subjective estimate that this could be the thing that could unlock the ability to learn things that experts don't know.

Nathan Labenz: 49:15 Yeah. The paper seemed really cool. And I mean, I know for me personally, it'd be amazing to be able to load in an entire script and ask it, are there any logical inconsistencies? What holes can you see? What should I work to improve? Which isn't possible right now with the current size of the context windows. I also wondered, I just glanced at the paper last night after you sent it. It seems like the key is it breaks the context up into these blocks. And then the key value pairs get passed around the full ring of all the different blocks. So you still calculate the full all-to-all attention. No shortcuts is my understanding. And I think that's what you just said.

Trey Kollmer: 50:00 Yeah. No. That's my understanding too. I think memory has been the current bottleneck with previous approaches, and now I'm not actually sure what the next bottleneck becomes, but they're getting around this memory bottleneck.

Nathan Labenz: 50:16 But I also do wonder if you really need to pass the key value pairs around the entire ring. If each block could only see its neighbor, then with each layer, each block's field of view of the input grows exponentially. So I do wonder how much you really do need to pass it all around the full ring and how much you could just do attention somewhat locally in blocks. Then at each layer, each block would see more. Because when I read a novel, I'm not thinking how much does this word compare to that word, the name Justin 18 chapters earlier. But you build up almost hierarchically each section of context.

Trey Kollmer: 51:04 And again, there might be a difference here between training and inference. If you're trying to get a model to learn things that experts don't know, then by definition, you have to crunch a lot of shit because you don't even really know what you're looking for. But at runtime, yeah, there's probably a lot of shortcuts that you could take.

So here's kind of my sketch for where this might be going, just based on all of these results. Like, what does the language model look like? And by the way, we've touched briefly on multimodality and vision toward the beginning, but this is not even to say everything, but just the stuff that we've marched through today. How does that end up looking if it's all combined into a single system that has all this stuff?

I mean, for one thing, it might be just way more capable if this 10,000,000 token or whatever type of learning does allow it to learn a more nuanced representation of the world. Then you might just have straight up higher capabilities. My guess is, though, you probably end up running it with a smaller, more manageable inference window most of the time. From the attention sinks paper, it seems like you can vary attention on the fly. They're padding a few more tokens here or whatever. It seems like we may be looking at something in the not too distant future where there's an ability to adjust the context window depending on exactly what you're doing. Sometimes you may need it to be longer, and sometimes you may be fine with it rolling, because if I'm having a single long running dialogue with my AI assistant, most of the time, I don't need the way old stuff, but occasionally I do. And that might be something that could be adjusted dynamically on the fly.

The thing that I think is most interesting, though, that isn't quite here yet, but really is, I think, strongly suggested, is some sort of retention built into the transformer. Trying to combine these ideas: you've got these attention sinks at the beginning, then you can skip a bunch of stuff. We've seen that you can have a pause token that allows you to think more and just represents more ability to process and store information. Then there's the backspace. We get out of distribution, we're not happy about something, we can back up. It seems like the rolling window probably is the way, but that there's some sort of highly compressed historical record that comes to represent and allows you to retain information from stuff that is no longer in the context window. Kind of like what you're describing, where it's not necessarily every word, but it's a higher level representation that you can carry forward with you. It doesn't take that many tokens, but it's beyond a token in the way that the pause is beyond a token, and just represents space to store stuff or to process stuff. A place to store stuff that is the historical record, that seems to be the next big thing that I'm really looking for, that all these other tools and techniques would start to be able to take advantage of.

And you could even imagine the real self-RAG would be like, okay, I've got these attention sink tokens at the beginning. I've got these sort of compressive representations of history. Now I've got the conversation we're currently having loaded into memory. This is starting to sound a lot more like what I feel like I experience on a day to day basis. And then I kind of imagine, oh, I'm attending now to this slice of history. There might just be one or a couple of tokens in terms of the size of its representation, but the real self-RAG would be to then call up the whole history that underlies that.

So we see the pause, and we've obviously seen many tool uses. But can you imagine something where, as this history gets compressed and added in on a padding basis toward the beginning, you could also start to identify, hey, I need that section of history in more detail for this particular thing. I can see that because I can see the meaning, not necessarily at the token granularity level, but then I can call up the token granularity level, load that into my current context, and get the granularity that I want, based on recognizing that some portion of previous history, which has been compressed into a few tokens worth of space, is what I need to draw on to do the current thing.

Honestly, that does not feel far off. It feels like, if we're sitting here 6 months from now and we haven't seen something like that, I would be quite surprised. A version of it sort of was the RetNet paper, but it notably was like, there's a change to the architecture there, and we'll see how well that generalizes. But what I'm describing here seems like something that you could have even without any kind of additional fundamental architectural changes, but just a training for this kind of compression for future recall that you could still attend to in the normal way, pluck it out as needed. Not even necessarily have to go to the database all the time to have the conceptual backing, but the ability to go to the database sometimes when that seems like it's super relevant.
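This is speculation rather than a published method, but a text-level sketch of that rolling compressed record might look something like the following. The llm callable, the one-line summaries, and the side archive are all hypothetical; the latent-space version discussed next would do this compression inside the model rather than in prompt text.

```python
def roll_with_memory(llm, chunks, max_live_chunks=8):
    """Speculative sketch of a 'compressed historical record': keep only the
    most recent chunks verbatim, replace older chunks with one-line summaries
    written by the model itself, and keep the full text in a side archive so
    a relevant memory can be re-expanded later. `llm` is any text-in/text-out
    callable; nothing here is a published technique.
    """
    live = list(chunks)
    memories, archive = [], {}
    while len(live) > max_live_chunks:
        chunk = live.pop(0)                                  # oldest chunk rolls out
        summary = llm("Compress this exchange into one dense line:\n" + chunk)
        memories.append(summary)
        archive[summary] = chunk                             # full text, on demand
    context = "\n".join(memories) + "\n\n" + "\n\n".join(live)
    return context, archive
```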

Nathan Labenz: 56:49 Okay, so you're saying a memory just in language that you could then treat as a database that you can recall certain slices from. You're not even saying some hidden state in latent space like the old LSTMs that it's also carrying forward.

Trey Kollmer: 57:08 Yeah. No. Sort of. Like that. So by analogy to the pause and by analogy to the backspace, every so often, I think you could train the model to produce something like a memory token that would represent the high level conceptual content of that memory. It could then just be included as context, but superficially, not having to take up a ton of tokens and not having token level resolution. It would still be in the same latent space the model is operating in, but not necessarily accessible via direct tokens, only sort of the kind of thing like the pause, where it gives you an extra space to put, manipulate, and do stuff with data. Here, the purpose of it would be to say, I want to summarize this, not in text, but in some representation in that space that I can come back to later super efficiently at the level of conceptual meaning, but low resolution in terms of the actual exact language that was used.

Nathan Labenz: 58:22 Very cool.

Trey Kollmer: 58:23 We know that we're not at the end of history, and if I had to guess what would be the next big thing that would unlock a ton of stuff, it would be that. It seems like we're so close to all those pieces being there, with the rolling, with the pauses, with the backspaces, with the version of self-RAG that we're already seeing. Imagine that self-RAG thing had these memories that it could go back to as well. So now it's not just looking at its own pre-training, but also the conversation that you had, and it's able to say, yeah, it looks like this slice that represents this whole exchange is where this was covered. And so maybe that's enough. Maybe I have to go deeper into the actual transcript. But either way, to be able to compress in a way that would start to look more like what we're doing. Right? Because as you said, you're not keeping every word of the novel in mind, but you have these vague, associative, high level, more representational, not token by token sort of things that are much easier to recall. And of course, you don't have the actual token by token at your command.

That's another reason I think superhuman performance definitely cannot be ruled out because if we can figure out the high level representational thing that's more like our loose, but really relevant memories, then it'll be way easier to connect the computer's version of that to the actual raw transcript than it is for us to go back to our raw transcripts, which obviously we just basically usually don't have. Yeah. That's where I think this is going. Not super confident, but I'd bet we see some of this.

Nathan Labenz: 1:00:09 It makes sense to me. And it feels like, I mean, if we figure out that recall element and then add in some planning algorithm and a little bit of scale, more scale, major breakthroughs.

Trey Kollmer: 1:00:23 Yeah. The timelines are not necessarily very long. The more I think about it, the more it does seem like next couple years could get really, really crazy. It's been fun. Any concluding thoughts on your end?

Nathan Labenz: 1:00:37 No. This is very fun. And, you know, I'm working all day, this is a great way to catch up on what's been happening.

Trey Kollmer: 1:00:44 Well, that's the goal. It's too much for any person, and this is by no means everything, but trying to give that kind of middle depth where hopefully people walk away with some real understanding and some real food for thought, but can hopefully do it in a compressed way that allows people to get a survey of an increasingly crowded and noisy landscape.

Nathan Labenz: 1:00:44 With that, I'll say, Trey Kollmer, thank you for being part of the Cognitive Revolution.

Nathan Labenz: 1:01:13 Alright. Have a great day, man.

Trey Kollmer: 1:01:14 It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
