Titans: Neural Long-Term Memory for LLMs, with author Ali Behrouz

In this episode of The Cognitive Revolution, Ali Behrouz, a PhD student at Cornell University, delves into his research on enhancing memory mechanisms in large language models through his latest paper titled Titans. Behrouz discusses the limitations of current models in maintaining long-term coherence and introduces the concept of a neural network as a memory module. Highlighting architectures such as memory as context and memory as gate, he explains how these innovative approaches can significantly improve long-term memory retention in AI systems. The discussion also touches upon challenges such as catastrophic forgetting and the need for more effective models in reinforcement learning and decision-making tasks. This insightful conversation sheds light on the future directions and potential applications of advanced memory mechanisms in AI.


Upcoming Major AI Events Featuring Nathan Labenz as a Keynote Speaker
https://www.imagineai.live/
https://adapta.org/adapta-summ...
https://itrevolution.com/produ...


SPONSORS:
ElevenLabs: ElevenLabs gives your app a natural voice. Pick from 5,000+ voices in 31 languages, or clone your own, and launch lifelike agents for support, scheduling, learning, and games. Full server and client SDKs, dynamic tools, and monitoring keep you in control. Start free at https://elevenlabs.io/cognitiv...

Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive

The AGNTCY: The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at https://agntcy.org/?utm_campai...

Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(07:09) Introduction to the Cognitive Revolution
(07:33) Exploring Memory in Large Language Models
(09:10) Ali Behrouz's Research Journey (Part 1)
(13:47) Sponsors: ElevenLabs | Oracle Cloud Infrastructure (OCI)
(16:15) Ali Behrouz's Research Journey (Part 2)
(18:37) Understanding RNNs and Linear Attention
(20:39) Human Memory and AI Architectures (Part 1)
(27:54) Sponsors: The AGNTCY | Shopify | NetSuite
(32:16) Human Memory and AI Architectures (Part 2)
(32:23) Designing Effective Memory Modules
(44:15) Persistent Memory and Attention Mechanisms
(52:00) Queries, Keys, and Values in Attention
(01:12:41) Understanding Context and Surprise in Language Models
(01:14:19) Introducing the Momentum Concept
(01:14:28) Defining the Surprise Metric
(01:14:53) Momentary and Past Surprise
(01:15:37) Decay Mechanism in Surprise Metrics
(01:16:08) Optimizers and Test Time Training
(01:17:52) Memory Module and Runtime Queries
(01:24:01) Scalability and Efficiency in Training
(01:29:39) Strategies for Memory Integration
(01:37:23) Hybrid Approaches and Their Benefits
(01:39:08) Micro Skills and Task Performance
(01:50:59) Long Context Modeling and Titans' Advantages
(02:08:22) Future Directions and Applications
(02:11:39) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


Full Transcript


Nathan Labenz (0:00) Hello, and welcome back to the Cognitive Revolution. Today, I am thrilled to share my conversation with Ali Behrouz, PhD student at Cornell and lead author of the fascinating paper on integrated large language model memory, Learning to Memorize at Test Time. This paper represents another significant step forward in addressing what I've often called the missing middle in memory for large language models. We've got world knowledge baked deeply into the model weights, and we've got attention-based working memory in the context window. But that missing middle layer, the integrated, persistent, and ever-evolving longer-term memory that humans use to maintain coherence and identity over time, still seems like a necessary piece for success in high-context environments, and has been a frontier to watch in AI for some time now. It was once famously said of computers that you could see them everywhere but the productivity statistics. And to a lesser degree that's been true of AI as well, at least so far. In part this is because the technology itself is really only recently useful, and people take time to adjust. Tyler Cowen recently went reverse Oprah, pointing out to an influential audience and saying, you are a bottleneck, you are a bottleneck. But I think often the bigger barrier is best summarized by another famous Tyler quote: context is that which is scarce. To be honest, I've never been entirely sure what Tyler meant by that in the context of humans, but in the context of AIs, it makes perfect sense. Today's AIs know a literally superhuman amount about the world at large. But out of the box, they know little to nothing about the individuals and businesses that they're meant to serve. Meanwhile, assembling and maintaining relevant context for them, especially since it's often spread out across Slack, email, documents, GitHub, meeting transcripts, task management systems, you name it, is tedious work. And the AIs have really only recently, I would say truly with Gemini 2.5 Pro, started to properly reward it. But for the sake of argument, imagine a world in which context is not scarce for AIs. A world in which an LLM trained specifically for a major company, say GE or 3M for example, knows as much about that company, including its products, its history, its team, its internal processes and debates, its finances. It knows as much about the company as it does about the world at large. Obviously such a model could be created. No company has more than 1% of the data on which the AIs are already trained. And if it were created, it would immediately know more about the company than any single person at the company. It might still be trained to search official records to ground its analysis and work, but unlike today's models, it would know, in a way very similar to how humans do, when it's actually found what it's looking for. And it would almost certainly pick up many of the subtle patterns that constitute what I call how we do things around here, which could make it relatively easy to manage, especially as compared to the overall process of hiring, onboarding, and retaining human knowledge workers. Startup costs for a model like this could reach into the millions or perhaps even tens of millions of dollars, which is notably roughly where OpenAI has publicly priced their custom models offering. 
But for a 100-year-old enterprise, that would still be a bargain for an AI that you can drop in and have do a significant portion of the work at the company, particularly since you'll be able to amortize those costs across as many copies as you need. For smaller businesses, meanwhile, which have much less data, using today's fine-tuning prices as an anchor, I would expect costs to be more like tens of thousands, maybe into the low hundreds of thousands of dollars. Obviously, still affordable. We're starting to get a glimpse of this future as individual ChatGPT users with their new memory features these days. But I'm not aware of a productized version of this that works well at scale. And I think that's ultimately because there's still a gap in the foundation models themselves which scaffolding isn't quite enough to fix. The bottom line then is that it seems plausible to me that the main thing between where we are today and a future full of drop-in knowledge workers that begin to very quickly and dramatically disrupt the labor market is simply a breakthrough in long-term memory. And that's why I think today's topic, Titans, is such a big deal. Unlike RAG-type systems, which store data explicitly and then make it searchable via a mix of traditional search, embedding similarity, or graph searches, or even more integrated strategies like Mamba and other state space models, which encode memory as a matrix of numbers that gets updated as part of each forward pass, the Titans architecture that Ali and his co-authors propose uses a neural network, which is itself updated via gradient descent at runtime, as the LLM's memory module. This is a qualitatively different approach and in my humble opinion represents a significant conceptual advance. So with all that in mind as motivation, in this conversation, Ali and I explore the technical details of Titans, including how he conceptualizes and takes inspiration from human memory systems, how the associative memory loss function works, the role of surprise and momentum in updating the neural memory module, and the various approaches they experimented with for integrating this long-term memory module with the standard attention mechanism. Beyond the technical details, this conversation also offers a fascinating glimpse into how a highly original and obviously quite brilliant researcher thinks about pushing forward in a relatively new architectural direction. I was really struck by how many times in this conversation I wanted to dig in and understand the reasons behind the decisions that Ali and his co-authors had made. But his response was basically that they see all of this as very early foundational work, and so they just did the simplest thing possible for now, trusting and expecting that others will come along to improve and refine it later. Overall, I really enjoyed this conversation, and while it is pretty technical at times, I think Titans is 1 of those relatively few papers that is worth taking some time to grok. The intuitions behind it are elegant, and this line of work might just produce the last major technical unlock needed for AI to hit an inflection point in economic value and impact. As always, if you're finding value in the show, I'd appreciate it if you take a moment to share it with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we welcome your feedback either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. And finally, a quick reminder. 
I'll be speaking at Imagine AI Live, May in Las Vegas, the Adapta Summit, August in São Paulo, Brazil, and the Enterprise Tech Leadership Summit, September, again in Las Vegas. If you'll be at any of those events, please send me a message, and let's meet up in person. For now, I hope you enjoy this technical deep dive into the frontier of large language model memory with Ali Behrouz, PhD student at Cornell and author of Titans.

Nathan Labenz (7:09) Ali Behrouz, PhD student at Cornell and lead author of the fascinating new paper on large language model memory, Titans. Welcome to the Cognitive Revolution.

Ali Behrouz (7:19) Thank you very much. Thanks for having me.

Nathan Labenz (7:22) I am really excited about this. I think every once in a while, 1 of these papers comes along that just has me thinking and thinking and thinking, and this is 1 of them. So I am really excited to get into it. I guess just for a little background, I have kind of had a little obsession with the memory aspect of large language models for a while now. So when I saw the Mamba paper drop roughly a year before the Titans paper, I thought, boy, this is really a big deal because I think everybody kinda knows that large language models have this incredible world knowledge that's super vast, but it's, like, fixed. Right? And, you know, the knowledge cutoff date can be somewhat recent or it can be more in the past and whatever it is, that's sort of a crystallized thing that doesn't really change too much between versions of a model. Then, of course, we've got, like, the runtime memory that is the context window, and the attention mechanism, you know, is great for, like, connecting all the different tokens to each other and making sense of what's going on there. But I've often called the sort of gap between those, like, the missing middle in memory. And, you know, you can fill the context window pretty quick, but it's really hard. It's proven really hard for the field to create a sort of sustained coherence where a model can, like, kind of know who it is, like, know what it's done, you know, know where it's trying to go. And I think that's gonna be a really important piece of the overall puzzle as we think about, like, trying to create effective agents, you know, that might be long running in the real world. So Mamba was a key moment where I was like, oh, it seems like that's a major step forward. And this is, you know, with a different approach, another 1 of these things that I think really kind of shines a lot of light on the path. With that preface, tell us just a little bit about your background, how you got interested in this work. What angle are you coming to it from? I think the paper is really interesting in that it's got some elegant math. It's also got some clear analysis of just kind of the recent history of people's attempts to bridge this gap, and it's even got a little bit of touching on just kind of human memory and takes inspiration in some ways from that. So love to hear your kind of inspiration for this work maybe for starters.

Ali Behrouz (9:38) Sure. And so, honestly, 1 thing that I need to say is, like, my path to this direction is very unusual. I started doing research during my master's, mainly, and so I started working on some graph algorithms and these kind of things, completely, like, very, very far from this area. So, like, when I did some research, I realized that some of the tasks that I'm trying to do are much easier when we use some graph neural networks and these kind of things. So I started learning more about graph neural networks and doing these kind of things for those kind of tasks. And so I mainly focused on some questions about anomaly detection and these kind of tasks in graph learning. And at the end of my master's, I realized that there are some interesting connections between something like anomaly detection in graph algorithms, in graph data sets, and also some neuroscience tasks. For example, let's say we want to detect some disease, or for example, we want to understand some of the disorders, brain disorders, or, yeah, generally these kind of tasks. And so I started learning more about some neuroscience concepts and also understanding how different parts of the brain work, and so that was the part that actually made me more interested in these kind of approaches, like neuroscience-inspired methods for deep learning. And, yeah, generally, my intuition is that humans are very effective. I mean, the learning process in humans is very effective, is very efficient. We can just learn with a small number of samples, and actually, that's the result of millions of years of evolution. So basically, it's really hard to believe that we are smart enough that we could beat millions of years of evolution and then come up with 1 architecture that is even more efficient and more effective than humans without passing those steps. We can come up with 1 architecture that is very similar to humans and then improve it to, for example, build that superintelligence or whatever you want to call it. But I really believe that before that, I mean, at this time at least, we need to, like, get inspired by how our brain works. So, actually, the start of my PhD somehow, like, coincided with so many great papers about, like, sequence modeling and some alternative architectures. I think at that time, there was, like, the S5 paper, which is, like, the first paper that, like, introduced the scan algorithm for state space models. And also, there was RetNet and all these kinds of great models at that time. So all of them were somehow, like, some motivations for me to try to understand what we can do for some alternative architectures rather than transformers. But personally, I think 1 thing that is different, I mean, you know, there are several people who are working on alternative architectures, and actually, each of them has their own perspective. And basically, I believe that all of them are really good and great. But there are some people who don't believe in attention, and they want to fully replace attention with RNNs. And there are some other people who believe in hybrid models. And I'm 1 of the people who believe in hybrid models because I really think that the attention part is really necessary for some accurate modeling of, like, dependencies between different tokens in the context that we have.

Nathan Labenz (13:43) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz (13:47) Let's talk about ElevenLabs, the company behind the AI voices that don't sound like AI voices. For developers building conversational experiences, voice quality makes all the difference. Their massive library includes over 5,000 options across 31 languages, giving you unprecedented creative flexibility.

Nathan Labenz (14:08) I've been an ElevenLabs customer

Nathan Labenz (14:10) at Waymark for more than a year now, and we've even used an ElevenLabs-powered clone of my voice to read episode intros when I'm traveling. But to show you how realistic their latest AI voices are, I'll let Mark, an AI voice from ElevenLabs, share the rest.

Nathan Labenz (14:25) ElevenLabs is powering human-like voice agents for customer support, scheduling, education, and gaming. With server and client side tools, knowledge bases, dynamic agent instantiation and overrides, plus built in monitoring, it's the complete developer toolkit. Experience what incredibly natural AI voices can do for your applications. Get started for free at elevenlabs.io/cognitive-revolution.

Nathan Labenz (15:01) In business, they say you can have better, cheaper, or faster, but you only get to pick 2. But what if you could have all 3 at the same time? That's exactly what Cohere, Thomson Reuters, and Specialized Bikes have since they upgraded to the next generation of the cloud, Oracle Cloud Infrastructure. OCI is the blazing fast platform for your infrastructure, database, application development, and AI needs, where you can run any workload in a high availability, consistently high performance environment, and spend less than you would with other clouds. How is it faster? OCI's block storage gives you more operations per second. Cheaper? OCI costs up to 50% less for compute, 70% less for storage, and 80% less for networking. And better? In test after test, OCI customers report lower latency and higher bandwidth versus other clouds. This is the cloud built for AI and all of your biggest workloads. Right now, with 0 commitment, try OCI for free. Head to oracle.com/cognitive. That's oracle.com/cognitive.

Nathan Labenz (16:11) Yeah. I'm with you on the instinct that the hybrid models are ultimately the way to go. So quick kind of review of just what is the problem with the attention mechanism. You know, it works great, but the memory requirements as you keep extending the sequence just get bigger and bigger because you are doing this all-token-to-all-token calculation. So the size of that memory footprint grows as the square of the sequence length. I think everybody watching this probably already knows that pretty well, but just in case, that's the kind of core problem that we're trying to solve. So that can't go on forever. Now I guess you could have, and we have seen also some schemes where it's, like, not necessarily part of the model itself, but you can have sort of maybe arbitrary disk space. But arbitrary disk space is, you know, quite different from arbitrary, like, in-memory, you know, computation. So it feels like intuitively we need something that is more like the human brain, which is, like, a finite size. Right? Our brains are not growing with every time step through our lives. And we need some, like, elegant updating mechanism for the memory that allows us to keep what's important. But knowing that it is a finite state, that means we also kinda have to let go of some stuff over time as well. And intuitively, we all know that our memories do this, but we haven't quite cracked the perfect way to do that in the context of an AI system. I kind of wanted to take just a couple steps for a little bit more foundation. 1 would just be to get your sense of kind of how you conceptualize the human system. Maybe get a few thoughts on, like, how you understand the linear approximations of attention, and then maybe a little bit on how you understand the state space models, you know, or kind of maybe slightly more generally, like, the prior attempts to create these sort of finite, you know, constant size, constant update time architectures like Mamba and RetNet and others that have come before. If you wanna take those 1 by 1, it doesn't need to be a full lecture, obviously, but just kind of, you know, if there's anything that you feel is sort of distinctive about your perspective on those 3 topics, I'd love to hear how you conceive of them and how they kinda motivate this work.

Ali Behrouz (18:33) Let me start with, like, some explanation about, like, traditional RNNs. So let's look at what RNNs are doing at a very high level. Basically, the data comes, we project the input or the data into the hidden space, and also we project our memory, which is like a vector. Usually people in the literature call it the hidden state, and we project that, and we just add these 2 together and update the memory, and we repeat this process. Anytime that we want to extract something from the memory, we just multiply the hidden state or our memory by a matrix like C, usually h times C, or C times s, depending on how you define the dimensionality, which equals the output, which is like y_t. So that's how a linear RNN works, or generally an RNN with nonlinearity, which we can simply apply here. Now, let's go to the next model. The next model is, like, linear attention. Linear attention in the causal setting can, again, be written as a recurrent neural network. Again, everything is similar. Data comes, we project the data, there is a hidden state, and we update the hidden state, and so on and so forth. But now this hidden state is a matrix-valued memory. So if we want to somehow say that there are some differences between the traditional RNN and the linear RNN, and also, on the other hand, the linear transformer, 1 thing that we can see is that traditional RNNs and also modern linear RNNs, for example, Mamba, are using vector-valued memory, but on the other hand, linear attention is using matrix-valued memory. Now, let's say that we know all these things. The question is, what is memory in our brain? Is it something like a set of neurons that we define, or is it something like a neural network, a larger set of neurons that are interconnected? And actually, I think it is the second 1. We cannot say that there are, like, these 3 different neurons that are encoding all the memories and all this stuff and everything. So, you know, usually when I talk about Titans, I start my talk with some explanation about different perspectives, so I think here it might be useful to, again, start from that 1. You know, there are some different perspectives that over time actually help us to design new architectures. When we are thinking about LSTMs and that kind of model, they are inspired by the brain, and actually that's long short-term memory. We wanted to, again, design something very similar to our brain, but the tools that we had at that time, our understanding of the brain, and all these things are changing over time, so basically, that's the main difference between the traditional perspective on human memory and the current perspective. Another thing, as time passes, we have the transformer era, in which most of the focus is on efficiency, matrix multiplication. We have a lot of models that are designed based on, for example, efficient matrix multiplication algorithms, like MLP-Mixer and its variants, like structured matrices, and all these things. So basically, that's another perspective that we have. And, you know, each of these perspectives gives us some new intuition, some new paths that we can, like, go down and find new architectures. Probably starting from 2020, we have the perspective of dynamical systems, treating the sequence as a dynamical system and using state space models for modeling sequences. So, again, we have, like, some new intuition, some new insights about how we can design efficient and effective sequence models. 
And I really think that now we need to go farther and go back to the human memory perspective, but with some novel understanding of how human memory works and how we can incorporate some novel techniques that we are learning. So in this perspective, the question is, what are we missing in previous architectures that we need to address in this perspective? And the first thing is that our memory is not something that we can break into pre-training and post-training. It's constantly learning, and basically that's a very important part. That's, for example, very good for RNNs, because now from the TTT paper, we know that most of these RNNs, or all of them, are, like, doing test-time training. So basically, it seems that we are, like, in good shape with modern RNNs, but we probably need to do additional things to make them more effective at test-time training. Another thing that we might miss in this, like, perspective is, as I mentioned, the shape of the memory. Is it like a vector? Is it a matrix? Or is it a neural network that might have different architectures, different, like, designs and all these things? So I believe in the last 1. I don't think that, for example, it's suitable to say we could design architectures with so much data and scale them, and all of them are done with only 1 vector, you know, as the memory of your model. So that's somehow, I think, an oversimplification of this design. So probably we need to take more complicated architectures as our memory. Another thing that we need to consider in this perspective is, let's say that we have some recurrent neural networks. These recurrent neural networks actually have a memory that is fading over time. On the other hand, we have this attention, which considers all the pairwise interactions between the tokens within the context window. And actually, that's very similar to, like, our short-term and long-term memory, because it's really hard or even impossible to hallucinate about the information that we just got in the past 30 seconds or 20 seconds, because we really have all the information in our memory. We don't, like, hallucinate about any details for a piece of information that we just got. But it's very likely that we hallucinate about information or 1 event from 20 years ago. There are so many details that we don't remember. There are so many details that are even changed. For example, we remember something that is not true at that time. You know? We are just hallucinating about that. So if we think about the RNN as a fading memory, we can see this decaying mechanism, this inherent decaying mechanism, not the forget gate part, because, you know, in any design, the RNNs have this decaying mechanism. We are always adding new data into the RNN, and we might ignore some of the data that we have. So this RNN is, like, very similar to our long-term memory. It's fading. It has the ability to manage the information and all these things. On the other hand, we have this attention part, which is very accurate, and it's very similar to our short-term memory. I think we need to use the combination of these 2 to design more powerful architectures. But the question here is, how can we do that? Because even in the neuroscience literature, there is some controversy about how we are passing memories from short-term to long-term, for example, how we are doing these things. 
So basically, that's a part where we do not have anything to get inspired by, so that's the challenging part, and probably an important research path in the future, because, you know, we might, like, design some models that are more effective in passing memories from short-term to long-term and all these things. You know? So, basically, that's the architecture that I believe in. And so, generally, this kind of thinking led us to, like, the design of Titans.
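To make the contrast Ali draws concrete, here is a minimal sketch (illustrative toy code, not the Titans implementation; all dimensions and parameter names are invented) of the two memory shapes: a linear RNN with a vector-valued hidden state versus causal linear attention with a matrix-valued state updated by outer products of keys and values.

```python
# Toy sketch, not from the paper: vector-valued RNN memory vs. matrix-valued
# linear-attention memory. Shapes and parameters are arbitrary illustrations.
import torch

def linear_rnn_step(h, x, A, B, C):
    """Vector-valued memory: h_t = A h_{t-1} + B x_t, output y_t = C h_t."""
    h = A @ h + B @ x
    return h, C @ h

def linear_attention_step(S, k, v, q):
    """Matrix-valued memory: S_t = S_{t-1} + v_t k_t^T, output y_t = S_t q_t."""
    S = S + torch.outer(v, k)   # "write": associate this key with this value
    return S, S @ q             # "read": query the accumulated associations

d_h, d_x, d_k = 8, 4, 4
h, S = torch.zeros(d_h), torch.zeros(d_k, d_k)
A = 0.9 * torch.eye(d_h)                     # fixed decay on the hidden state
B, C = torch.randn(d_h, d_x), torch.randn(d_x, d_h)
x, k, v, q = (torch.randn(d) for d in (d_x, d_k, d_k, d_k))
h, y_rnn = linear_rnn_step(h, x, A, B, C)
S, y_attn = linear_attention_step(S, k, v, q)
```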

Nathan Labenz (27:45) Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz (27:49) Build the future of multi-agent software with the AGNTCY, A-G-N-T-C-Y. The AGNTCY is an open source collective building the Internet of Agents. It's a collaboration layer where AI agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for inter-agent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The AGNTCY is dropping code, specs, and services, all with no strings attached. Build with other engineers who care about high quality multi-agent software. Visit agntcy.org and add your support. That's agntcy.org.

Nathan Labenz (28:41) Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just 1 of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right 1, and the technology can play important roles for you. Pick the wrong 1, and you might find yourself fighting fires alone. In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States, from household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive. Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive.

Nathan Labenz (30:37) It is an interesting time for business. Tariff and trade policies are dynamic, supply chains squeezed, and cash flow tighter than ever. If your business can't adapt in real time, you are in a world of hurt. You need total visibility from global shipments to tariff impacts to real time cash flow, and that's NetSuite by Oracle, your AI powered business management suite trusted by over 42,000 businesses. NetSuite is the number 1 cloud ERP for many reasons. It brings accounting, financial management, inventory, and HR all together into 1 suite. That gives you 1 source of truth, giving you visibility and the control you need to make quick decisions. And with real time forecasting, you're peering into the future with actionable data. Plus with AI embedded throughout, you can automate a lot of those everyday tasks, letting your teams stay strategic. NetSuite helps you know what's stuck, what it's costing you, and how to pivot fast. Because in the AI era, there is nothing more important than speed of execution. It's 1 system, giving you full control and the ability to tame the chaos. That is NetSuite by Oracle. If your revenues are at least in the 7 figures, download the free ebook, Navigating Global Trade, 3 Insights for Leaders at netsuite.com/cognitive. That's netsuite.com/cognitive.

Nathan Labenz (32:02) Cool. So I guess 1 thing just to really emphasize is, in some sense, the most, like, fundamental change here with this work as it compares to everything previously that I've seen is that the memory module in Titans is itself a neural network. You said this, but just to really hammer the point home: the earlier recurrent architectures had either, like, a vector or, you know, a 2D matrix or whatever, but basically just a bunch of numbers that information would be projected into and then also sort of projected out from at each inference time step. But within that, you just have these numbers and they don't interact with each other. There's not really much going on there. Right? It really is just sort of a place to store the results of these kind of projections in, projections out. In moving to a neural network as the memory module, you use an MLP, and I definitely noted the part in the paper where you said this opens up a whole new research direction in terms of, like, what is the best possible architecture for that submodule, because an MLP, you know, may be hard to beat, but, like, probably beatable. And we've certainly seen it beaten, you know, for many other purposes. But moving to this architecture creates a dynamism within the memory module and creates the potential for sort of information to interact within that piece that is qualitatively different and seems, as you said, like much closer to the way that our own memory systems work. 1 thing I don't have a great sense for, and I'm sometimes surprised by, is how much of this is all sort of reformulatable if you're smart enough about the math, such that things can become more equivalent. So this is not, like, intuitively obvious to me, but the linear approximations of attention, you noted, can be sort of reframed as a recurrent neural network. And, you know, 1 of the big things that we've seen through this whole progression of the linear approximations and then to RetNet and then to Mamba and Mamba 2 and many others besides along the way is a much more granular management of that internal state. And also with Mamba in particular, a jump to input-dependent ways of managing that internal state. So, like, early on, you would just have kind of, okay, here's the state. We have, like, a fixed sort of kernel that's gonna do this projection. And no matter what information comes in, we always kind of compress it in the same way, and that kind of is what it is. With Mamba in particular, you had an input-dependent and, like, highly granular way of updating. So if I recall correctly, each number in that matrix could be sort of updated with, like, a different strength. So you weren't just making sort of 1 kind of overarching single decision for how much to decay and how much to emphasize the new information, but you're kind of doing it in a much more, you know, granular way. And again, that depended on the inputs. So that was, like, quite interesting. Another element of this that also always comes up is just, like, how much of this is sort of dictated by first principles and how much of it is dictated by the hardware that we have available to run it. Mamba versus Mamba 2, I thought, was a really interesting illustration of that, where Mamba 2 was actually a less granular method for managing that internal state. It was a coarser update function. And you think, well, jeez. 
Like, how do they get a better model out of a less granular or more coarse way of managing the memory? And the answer, as I understand it, is basically in making that sacrifice of that super granular, you know, Mamba 1 structure, they were able to make the whole thing run a lot faster. And so for a certain number of GPUs, they were able to just train a lot more. And so you get a better ending model, although it does have in some sense this, like, coarser internal structure. I guess to, you know, put that in the form of a question, how much of all of this do you see as being, like, very dictated by the hardware that's available? Is there a fundamental break here when we move from a matrix-shaped memory module to a network as the memory module? Or is there some fundamental equivalence perhaps as well, where this could all be kind of, with enough insightful math, sort of understood as being in some sense the same thing?
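As a rough illustration of the input-dependent update Nathan is describing, here is a toy sketch (an illustrative simplification, not the Mamba or Titans code; the function names, gating form, and dimensions are invented) contrasting a fixed, input-independent decay with a selective update in which each state element gets its own gate computed from the current input.

```python
# Illustrative toy contrast: fixed decay vs. an input-dependent ("selective")
# elementwise gate on a recurrent state. Not actual Mamba code.
import torch

def fixed_decay_update(state, x, W_in, decay=0.95):
    # every element of the state decays by the same constant, regardless of input
    return decay * state + W_in @ x

def selective_update(state, x, W_in, W_gate):
    # the gate is computed from the current input and applied elementwise,
    # so each state dimension decides how much to keep vs. overwrite
    gate = torch.sigmoid(W_gate @ x)
    return gate * state + (1.0 - gate) * (W_in @ x)

d_state, d_in = 16, 8
state = torch.zeros(d_state)
W_in = 0.1 * torch.randn(d_state, d_in)
W_gate = 0.1 * torch.randn(d_state, d_in)
for _ in range(5):                       # stream a few toy inputs
    state = selective_update(state, torch.randn(d_in), W_in, W_gate)
```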

Ali Behrouz (36:58) So let me use the last part, I mean, the last page of the Titans paper, where we did some, like, comparison between, for example, Titans and some recent modern architectures like DeltaNet or, for example, TTT, Longhorn, and all these models. So what we did in that part is to, like, say how we can connect, for example, DeltaNet to Titans, and what is, like, changing in that sense. And now let's say we have the perspective of TTT, what is changing in that sense, and so on and so forth. So 1 thing that we need to consider is, you know, all these models are connected. You know? There are some mathematical formulations that we can use, I mean, at least in the sense of this test-time training framework, that, for example, unify all these things. And so basically, all of them are connected, and there are some small changes in each of them. But I don't see the contribution from that side. You know? I really believe that the value of the contribution is to show how you want to do some future stuff. Because probably no architecture is, like, the endgame, or nothing is the endgame. We are progressing over time, you know. It's not good even for science to say that this model is coming and it's, like, the end of the world, and we are not making any progress over time. Even transformers are really great, and even when we have transformers, we have better implementations of transformers, like FlashAttention, and all these things. We have better additional components for transformers to make them Transformer++ and all these things. You know, we always are making progress, so the important part, I really believe in that, the important part is not just to deliver a model, because probably there are a lot of things that we can do to make a model better, but to deliver a new perspective, deliver something that leads to future studies. So personally, if I want to say what is the impact of 1 work, for example, from a very old time, I would say that the impact is something that shows the future work, and those future studies led us to this state that we have right now. So I think that's a really important part. So yes, we can look at all these models that are very connected, but most of them are from different perspectives, and each perspective can lead us to a different type of thinking about how we want to improve current architectures and all these things. And the question is, we don't know actually the answer to this question, but the question is, which of these directions will, for example, lead to a good architecture that is, like, better than other methods that we know at that time. So, for example, let's say that we use a matrix-valued memory in Titans, and, for example, let's say we remove some layer normalization that we are doing inside the memory, which makes the memory linear, but let's say that we don't do all those things. So basically, in that case, our approach can be very similar to, for example, Gated DeltaNet or DeltaNet. So if you look at these things, you can say that there are all small differences between these models. For example, as I mentioned, you can remove some of the components that we have, and also consider the special case of our approach where we use a simple matrix format for the memory, then there are some connections to these models. Similarly for TTT, similarly for Longhorn, and all these models. But even a small change in these kinds of approaches can lead to completely different architectures. 
I mean, regarding the perspective that it can bring and all these things. So it's really important to see what is the connection between these models, and also which type of approaches can help us to design future architectures. And in this direction, I can say that, you know, we have a fixed-size memory, and this fixed size is something like lifelong memory for the model; you know, we are adding data into the memory. And so the question here is which type of approaches do we want to take? 1 approach is to make memory management better over time, and what does memory management here mean? It's just the recurrent formulation that we have. So different recurrent formulations help us to better manage the memory and better understand which information is, like, worth memorizing and which information we just need to ignore. So those are the important parts, in my opinion. But, also, there's another perspective here that says we need to do some stuff to have better memory management, but on the other hand, we need to see how we can use different architectures for our memory to make it more powerful. So we started from a vector, then went to a linear layer or a matrix, and now we can have an MLP with 2, 3, or 4 layers. So what's next? Do we want to use convolutional neural networks inside the memory? Do we want to make deeper architectures in the memory and all these things? So that's another way that we can approach this problem. And I really think that these 2 different approaches are very separate, so it's very hard to connect these 2, because they are answering different questions, and we need to see which 1 is more promising, and definitely we can use both in future architectures. We can have better memory management with better memory architectures. But, you know, 1 thing that I want to emphasize is that these 2 paths are separate, and they are trying to answer different questions in this domain. So that's, I think, very important.

Nathan Labenz (44:00) Cool. Well, let's get a little bit deeper then into just the nature of the architecture itself, and there are a lot of little detailed choices that you've made that I understand are, you know, not the end of history. I always say transformers are not the end of history, and I totally understand that this paper, you know, represents more of the beginning probably of a new direction, certainly not the end. If I could try to summarize the overall architecture simply, it's akin to a transformer in the sense that you still have, like, attention as a pretty core mechanism. But now instead of being, you know, truly all tokens that the model has ever seen in the episode, we now have essentially a sliding attention window, which is something that we've seen, you know, in the past with various schemes. Then there's also this bit that I didn't have a great sense for, which is the persistent memory, I think is how it's called in the paper, and it's also called learnable data-independent weights. What I see in the diagrams is that there's this sort of persistent memory layer that kind of gets seemingly, like, always put at the beginning of the sequence at every time step, to sort of have the persistent memory followed by the output from the long-term memory module, which we'll describe in more detail in a minute, and then followed by, like, the current sequence, that is, the sliding attention window. And then all of that kinda goes into the attention mechanism and gets processed through the larger super network, you know, in a more or less transformer-like way. Tell me if I am missing anything important there. I didn't quite have any great intuition for why you thought the persistent memory was needed, like, where did that idea come from? I have seen some things in the past. I remember a paper called attention sinks, or something along those lines, where it seemed like they had found that there was basically just an overdependence, or sort of, like, the initial tokens were sort of too important in some ways, and having these extra kind of early tokens, that could sort of not be very important but also not allow the model to overweight the actual runtime first tokens, seemed to be a performance benefit. But I don't know. Deconfuse me on that. Like, where does that persistent memory bit come from? What role should we understand it is playing?

Ali Behrouz (46:29) So the persistent memory part is the part that might not be very, like, necessary, but that's the part that completes our architecture. You know? Because if I want to, like, explain why we are using that persistent memory, there are different reasons for that. So the first reason is, basically, we are motivating the paper as how we can design something similar to human memory. Basically, there are some long-term memories. There are some short-term memories, and also, we have something like persistent memory that encodes the knowledge about the task that we are doing, somehow. Like, it's not related to the data. It's about general knowledge of the task that we are doing, and so that's something that can somehow make this design complete, and so we just put it there. And, actually, in practice, there are some improvements. You know? When we use that persistent memory, we get a very, very slight improvement in the architecture, so there is no harm in that. So that's the main reason, from the technical perspective, why we are, like, doing these things. But, actually, on the other hand, 1 reason that we are doing, like, persistent memory is because we have the concatenation of long-term memory and the current context. And, actually, as you mentioned, attention can focus more on the initial tokens. The causal attention can focus more on the initial tokens. So that's challenging. That might result in, like, dropping the performance and all these things. So having some data-independent parameters that are learnable, at the beginning of the sequence, can help us to improve, like, the performance. But, actually, 1 thing that I need to say is that this idea of adding additional persistent memory to the sequence comes from an earlier paper than attention sinks, which we already discussed in the paper. Basically, what they are saying is that we have this attention in the transformers, and after that, we have this MLP. So 1 way to see this MLP is to see it as the attention mechanism. So let's say that we have, like, a 2-layer MLP, and so the formulation would be something like W1 times W2 times x, where we have a nonlinearity between W2 and W1. Now if you look at that, you can see that W times x is very similar to, for example, Q times K that we have in transformers, or in attention, and then the other W is basically our value matrix that we have. So that's very similar to the attention mechanism. But what is the main difference here? The main difference comes from the nonlinearity part. We usually use, for example, ReLU in the MLP, or, for example, other nonlinearities, but in attention it's softmax. So what they say is that we can use softmax here. In that case, basically, what we are doing is just concatenating some additional learnable parameters to the beginning of the sequence, and basically, when we apply the attention on top of that, it seems that we are already applying that MLP part, so basically we don't need that part anymore. So that's the main intuition from that paper. So here again, as I mentioned, we can see that there are different perspectives. There is 1 concept, 1 mathematical formulation, but there are different perspectives that all of them, like, lead us to this mathematical formulation. And, yeah, that's generally the main motivation for the persistent memory. 
I understand that, you know, some people might not use that persistent memory when they want to use Titans, and that's understandable because it can save some, like, parameters and also give a simpler design. But on the other hand, we can gain a, like, slight improvement in the performance.
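To make the persistent memory idea concrete, here is a minimal sketch (an illustrative toy, not the paper's code; the class name, sizes, and initialization are invented) of data-independent learnable vectors concatenated to the front of the sequence before it enters an attention block.

```python
# Toy sketch of "persistent memory": a few learnable, data-independent vectors
# prepended to every sequence before it goes into an attention block.
import torch
import torch.nn as nn

class PersistentMemoryPrefix(nn.Module):
    def __init__(self, num_persistent: int, d_model: int):
        super().__init__()
        # learnable parameters that encode task knowledge, independent of the input
        self.persistent = nn.Parameter(0.02 * torch.randn(num_persistent, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, num_persistent + seq_len, d_model)
        prefix = self.persistent.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([prefix, x], dim=1)

prefix_layer = PersistentMemoryPrefix(num_persistent=4, d_model=64)
out = prefix_layer(torch.randn(2, 10, 64))   # any attention block would then run on `out`
```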

Nathan Labenz (51:01) Gotcha. Yeah. That's really interesting. That is, in a way, also why I have been just so fascinated with studying AI broadly, because there's just so many different angles on it, so many different perspectives. And it feels like in some ways, you know, we're all trying to get at the same ground truth. But in some ways, like, the different perspectives do have quite different value to bring, even if there is some underlying mathematical equivalence. I take, like, a big part of what you're saying to be that the different perspectives sort of get people thinking about different future directions in ways that underlying mathematical equivalence may, like, not actually be so useful for. That's definitely true for me. So okay. That's interesting. Let's talk about then the long-term memory module. Obviously, that is again the core thing. Right? And we are now moving into the world of having a network that is the memory. I don't know if you wanna take a minute and talk about the sort of write and read paradigm. We kind of already covered that, but is there more to say about the sort of modern paradigm of writing to and reading from memory before we get into the, like, specific details of what you've implemented here?

Ali Behrouz (52:14) Basically, the intuition here is very similar to what I mentioned about, like, the simple paradigm of RNNs and linear attention and these kinds of things. So, basically, as I mentioned, like, data comes. We project the data into hidden space and project the memory, and then just consider their summation and, like, update the memory, and so on and so forth. We repeat this process. So generally, what we are doing here is just writing to our memory. The data comes, and how do we want to write this data into our memory? The way we do it in a linear RNN is to project that and add it to the memory. There is another part. Let's say that we have this memory, and how do we want to extract information from that? The way we can think about that is sending a query to the memory and asking it to, like, give me the corresponding information. So how we can model that is to multiply your query by the memory or, like, pass it to the memory. If your memory is a simple, like, vector, you can simply just multiply that. It's a simple linear layer. You can just simply multiply that, multiply your query by the memory. But on the other hand, if your memory is, like, a neural architecture, this read from memory becomes something like a forward pass. You need to pass your data into your memory, and the output of your memory is the corresponding information about your input. So that's how this write and read, like, intuition works.
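A tiny sketch of the read operation being described (illustrative shapes only, not the Titans code): with a matrix-valued memory, reading is a matrix-vector product, while with a neural-network memory, reading is a forward pass of the query through that network.

```python
# Toy illustration of "read from memory" for the two memory shapes discussed.
import torch
import torch.nn as nn

d = 64
matrix_memory = torch.zeros(d, d)      # linear-attention-style matrix memory
mlp_memory = nn.Sequential(            # neural-network memory (toy 2-layer MLP)
    nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d)
)

query = torch.randn(d)
read_from_matrix = matrix_memory @ query   # retrieval = matrix multiplication
read_from_mlp = mlp_memory(query)          # retrieval = forward pass M(q)
```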

Nathan Labenz (54:10) Cool. So let's get into the details of this. Because I found this really super interesting, and I have to give a shout out to Grok 3 for helping me work through some of the math and develop my intuition for it. I would say Grok 3 definitely has been shown to have a number of interesting properties and some interesting issues, but it does perform quite well when given a paper like this and asked just conceptual questions about it. 1 fundamental decision that you made here is to create memory that you describe as associative memory. So basically, the memory module, it's an MLP. Right? So it can take some input and then it's gonna give you some output. And the loss that you're minimizing there is the difference between what the memory outputs for a given input's key vector and the value that same input actually ends up with on the other side of the attention mechanism. So I kind of interpreted this as being like an approximation of the attention mechanism, where what we're doing is saying, okay, we want this finite MLP, given the key vector for some input, to be able to predict what the value output would be for that same input. And we'll do our runtime updates of the MLP. And again, it's, you know, probably, hopefully obvious to people at this point, but, like, 1 major, you know, change here from previous architectures is this MLP is actually undergoing a gradient descent process at runtime, such that the weights of the MLP are what is changing. Right? So in previous things, like in Mamba, you have this matrix of numbers that is changing with every time step. Here we have an MLP that is changing with every time step, and it's changing with gradient descent as it usually does. And the loss function that is defining those gradients is this prediction: given a new input and the keys for that input, we want the memory to be able to predict the values for that input. Maybe you could even just take a second and talk about, like, how you think about the queries, keys, and values structure of attention. I mean, people have encountered that in the past. The simple shorthand, you know, that I can kind of recite procedurally is the query portion is, like, what a given token is sort of looking for. The keys are what information it has, and then the values are sort of the payload of, like, what information, you know, then gets fed into the rest of the network. I've always kind of held that understanding, like, relatively loosely, because these, you know, architectures are weird. Like, interpretability is an immature science and shows us a lot of weird stuff. And, like, even if that was sort of the idea that people had in their heads when they set up this architecture, does that mean it's really working that way? But you seem to really embrace that intuition, or that understanding of what's going on, and reuse that understanding in the design of this memory module. So, yeah, tell me how you understand the queries, keys, and values, and tell me if I'm right about how you apply that to the design of this.

Ali Behrouz (57:35) I think 1 way that we can, like, describe the attention mechanism, as you mentioned, is we can see attention as an associative memory. And, basically, we have, like, keys and values. They're connected. So we want to, like, pass keys into the memory, and the memory is responsible to, like, find the value corresponding to that key and, like, pass it as the output. So that's how the memory should work. Now the main difference here is that we might not have the exact values of the keys when we are doing some inference and these kinds of things. So that's where we use, like, the query. So we have some stored pairs of keys and values in our memory. So then we want to send the query to the memory, and the memory, as I mentioned, is responsible to, like, find information relevant to the query and, like, pass it as the output. So how should we do that? Because let's say our memory has k1 to kl as our keys, and we have another query that is somewhat, not so much, but generally different from k1 to kl. How can we find the relevant information from this memory? And 1 way to do that is to see how similar this query is to the keys that we already have in the memory, and how we can describe this query as a combination of the keys that we have in the memory. So how can we do that? The way we can do that is using matrix multiplication, because the dot product can help us to understand the similarities between any 2 vectors. So we have a query, and we consider the dot product of this query with the keys to understand how we can describe this query using the keys that we already have in the memory. And based on that similarity, we can extract information from the memory, because we already know the connection between keys and values. When we can describe this query as a combination of some keys, we can find the output of this query, the corresponding output to this query, as the combination of those values that we already have in the memory. That's exactly what we are doing in attention. So we consider the multiplication of Q and K to find the similarities and how these things are similar, and then multiply that 1 with the value to extract the information from the memory. So that's 1 way to think about attention. But if we write attention as it's done in the TTT paper, if we write attention as test-time training, 1 thing that we can see is that attention is the nonparametric solution to this, like, loss function that we have. Basically, these keys and values are interconnected, assigned to each other. The attention is the nonparametric solution to this assignment. But when we are talking about RNNs, we are talking about something like applying an optimization algorithm, gradient descent, or gradient descent with momentum, on top of this 1. So basically, the RNN might probably result in weaker performance, because the attention part is the nonparametric solution to that 1. But on the other hand, we have some efficiency gains that we can use.
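Here is a toy sketch of the test-time write and read being described (a deliberate simplification, not the Titans code: the class, sizes, and plain SGD step are invented, and the paper's momentum and decay terms are omitted). The memory is a small MLP trained at inference time, by gradient descent on an associative loss ||M(k_t) - v_t||^2, to map each token's key to its value; reading later is a forward pass of a query.

```python
# Toy neural memory updated at test time with an associative-memory loss.
# Illustrative only; momentum ("surprise") and weight decay from the paper are omitted.
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.SiLU(), nn.Linear(hidden, d))

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.net(q)                      # read = forward pass M(q)

    def write(self, k: torch.Tensor, v: torch.Tensor, lr: float = 0.01) -> None:
        loss = ((self(k) - v) ** 2).sum()       # associative loss ||M(k) - v||^2
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():                   # one gradient step per incoming token
            for p, g in zip(self.net.parameters(), grads):
                p -= lr * g                     # larger loss ("surprise") moves weights more

d = 32
memory = NeuralMemory(d)
for _ in range(100):                            # stream of (key, value) pairs at test time
    memory.write(torch.randn(d), torch.randn(d))
retrieved = memory(torch.randn(d))              # later: read with a new token's query
```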

Nathan Labenz (1:01:57) So let's just dwell on this for a little bit longer, because I feel like this is maybe the most important thing for people to develop an intuition for. And if they can grok this, a lot of other things I think will make relatively intuitive sense pretty naturally. So we want to have a memory module that says: okay, for this given input, feed it into the memory module and get the most relevant historical information from all of time. And to do that, we say, okay, we've got the sort of query aspect of a given input, which is understood to be, or was maybe originally conceptualized as, what that token is looking for. And we can feed that in, and we want it to come out with all the right information. So how do we update that memory module over time so that it contains that information, so that the query of the latest input at runtime actually gets the stuff that we need? At every previous step, we need to have encoded it. How do we encode it? We say, well, let's take the keys of all those previous steps and train this memory module to output the values associated with those same steps. And this builds on the fact that in the attention mechanism, it's the similarity, the dot product, of the new token's query versus all the previous tokens' keys that determines what portions of the values will actually be used in the downstream calculations. So at every runtime step, we wanna make sure that, given something similar to the keys of this input, we can output the values, so that we have that payload information to pass into the rest of the network. We'll update so as to be able to do that in anticipation of actually getting the query that says, this is what I'm looking for. So essentially, we want to get the network to be able to predict what each token has. That's how we're updating it as we go, so that we can then come along later, when a new token is looking for what those previous tokens had, and return something similar to at least the values that would have been produced had we done the full explicit attention mechanism. Anything wrong about that?

Ali Behrouz (1:04:44) I think that that's the correct way of, like, thinking about this mechanism. Yeah.
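
A minimal PyTorch-style sketch of the mechanism just confirmed: a small MLP memory that takes one plain gradient step per token to map that token's key to its value, and is later read with a query. The class and hyperparameters are illustrative, not the paper's reference implementation, and this version omits the momentum and forgetting terms discussed next.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy associative memory: an MLP trained at test time to map keys -> values."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def write(self, k: torch.Tensor, v: torch.Tensor, lr: float = 0.1) -> None:
        """One inner-loop gradient step so that M(k) better predicts v."""
        loss = (self.net(k) - v).pow(2).sum()
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= lr * g  # plain gradient descent, applied at inference time

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Retrieval: feed a query through the memory to get an approximate value payload."""
        with torch.no_grad():
            return self.net(q)

# Usage sketch: write each token's (key, value) pair, later read with a new token's query.
mem = NeuralMemory(dim=64)
k, v, q = torch.randn(64), torch.randn(64), torch.randn(64)
mem.write(k, v)
out = mem.read(q)
```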

Nathan Labenz (1:04:50) Okay. Cool. So yeah, I think that's really super interesting. I mean, it took me a little while to get there, but it was worth the work. And, again, I appreciate Grok for helping me work through some of the notation to make sure I was understanding it correctly. I think this probably also will help a lot of people understand the attention mechanism itself a little bit better, and it does give me a little bit more confidence. The fact that this all works gives me a little bit more confidence in the idea that the attention mechanism sort of actually is maybe working the way people have described it as working, because it's really easy, I feel like, to fool yourself into thinking that what's going on in these massive machinations and number-crunching processes is what you intuitively think it should be. But if you can actually make predictions based on that understanding and design new things and they work, that certainly gives you a reason to be more confident. So just to say it 1 more time: at each step we have this memory module, and we want to make sure that, given a new token in the future, we can pull out the payload that we would have gotten if we were doing a full attention process. And we do that with an approximation that is facilitated by this MLP that says, I'm going to make sure that, given the key of the current token, I can return the value of the current token, because that's what these tokens have that leads to a certain payload. Now, at runtime, I can take the query, what the new token is looking for. Those are, by definition, meant to be similar. Right? The keys of the previous tokens and the query of the new token. So it's that similarity that then ensures you get the right value from the retrieval process and can carry on, approximately, in a fuzzy sort of way, in a more human-memory sort of way, with not exactly what you would have had with full attention, but at least with some close approximation of it. So that is, again, super, super interesting. For folks who wanna develop their intuition on how these things all work and what's really going on inside, this has been 1 of the best things to really take time to understand for me in quite a while. So let's talk about some details of that. It sounds like these memory modules aren't super big at this point. What do we know about how much information you can compress into an MLP? And how do you think about how big this memory module should be versus how long you want episodes to be able to run versus how long the sliding attention window should be? So far, how have you been thinking about how to size these things relative to each other?

Ali Behrouz (1:07:48) So, honestly, I think that, you know, the number of tokens that we are using for the segmented attention part, or, for example, the number of tokens that we are using for the memory and all these things, are not something that's really... I mean, the way that we want to tune these hyperparameters is not very challenging, because that's directly about how much we want to spend on these things. You know? For example, do we want to train a model with hundreds of millions of parameters, or do we want to train a model with billions of parameters? 1 thing that is important here is there are some equivalences that we can use to understand how to set these parameters for our design. So, basically, let's say that we have the MAC or MAG architecture, basically memory as a context or memory as a gate. In that case, the memory part is running in parallel to the attention, in some sense. So, basically, it's very similar to a head. When we have a multi-head design, we can see that branch as a different head than the attention part. So 1 thing that we can do is, once we know what resources we have, how many parameters we want to use, and all these things, just use half of the heads as the memory part and the other half as the attention part. That's 1 way we can do it. Or, for example, in the memory as a layer architecture, we can simply use whatever configuration we are already using, but make half of the layers memory and the other half transformer, or something like that. As you mentioned, it's not very clear, if we change these things, what the performance of the model would be. There might be a sweet spot where, for example, using 3 memory heads and 7 attention heads works much better than 5 memory heads and 5 attention heads, and so on. So basically, I think there is a lot of space to explore all these combinations and see which 1 is better. But generally, if we're talking about just using Titans instead of whatever attention-based model someone is using, the configuration would be very similar. We can just use half of the heads as the memory and the other half as the attention. Or if they are using something like memory as a layer, we can simply use half of the layers as memory, like 1 memory layer, 1 attention layer, 1 memory layer, 1 attention layer, and so on and so forth. So that's another way you can do it.
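
As a rough illustration of the sizing heuristic described here, a hypothetical configuration object might just split an existing budget of heads or layers between the two components; the names and numbers below are made up for illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class HybridConfig:
    n_heads: int = 16    # total head budget of the original attention-only model
    n_layers: int = 24   # total layer budget

    # MAC / MAG style: the memory runs in parallel with attention, like extra heads.
    @property
    def attention_heads(self) -> int:
        return self.n_heads // 2                    # half the heads stay as attention

    @property
    def memory_heads(self) -> int:
        return self.n_heads - self.attention_heads  # the other half become memory branches

    # MAL style: alternate layer types instead of splitting heads.
    def layer_plan(self) -> list:
        return ["memory" if i % 2 == 0 else "attention" for i in range(self.n_layers)]

cfg = HybridConfig()
print(cfg.attention_heads, cfg.memory_heads)  # 8 8
print(cfg.layer_plan()[:4])                   # ['memory', 'attention', 'memory', 'attention']
```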

Nathan Labenz (1:11:04) Do you wanna just take a little more time and just describe the 3 different ways that this works? I guess I was gonna definitely make sure we talk about the momentum piece or the surprise slash momentum component to the update.

Ali Behrouz (1:11:18) Yeah. Let's talk about the momentum part. I think that comes before how we want to incorporate the memory into the architecture and combine it with attention. So regarding the momentum part, we try to explain the intuition in the paper. Basically, the intuition is that everything that is surprising is probably worth memorizing. But here, 1 thing that we need to consider is, for example, there is a specific token that is very surprising for us, and the consecutive tokens are describing what was in that surprising token. So basically, all of them are important, so we cannot just ignore them. But those tokens might not be surprising to us on their own. You know? For example, let's say that I'm saying something like, I'm leaving. This sentence can have a lot of meanings depending on the context, so without hearing the other parts, I cannot say whether it is surprising or not. Maybe I'm talking to a colleague, now it's the end of the shift, and I am saying, well, I'm leaving. It's not surprising at all, you know? I really need to understand what's going on around that specific sentence, around that specific token, and so on. But, for example, assume that I want to leave the company; that might be surprising. You know? When I'm saying that I'm leaving, depending on the context, there are some situations where it's surprising and some other situations where it's not. That's 1 way to think about it. Another thing we need to consider is that the other part of the context, the part we need to pay attention to, is also somehow important, because those tokens might not be surprising on their own; they are just a simple description of what we are saying about those specific surprising tokens. So how can we model this process? How can we say that it's effective to not only model the momentary surprise of the token, but also to account for how surprising the past tokens are? That's the part where we introduce the momentum. The way we can think about that is to break up the surprise metric. The surprise metric is just a metric that says what information should be remembered and what information should be ignored. 1 simple way to define this surprise metric is to define it based on gradient descent. But another way, to make it more powerful, is to break it into 2 parts. The first part is the momentary surprise: how surprising is this exact token? The second part somehow describes how surprising the other tokens were. So that's the past part. And actually, that's decaying, you know. Let's say that there was a surprising token, for example, 64 tokens ago. We actually need to forget about that token over time, and that's different from the forgetting part that we have in the memory module. That's the forgetting part for the surprise metric. The surprise metric also needs to decay over time, and the reason is that the context might change, or, for example, as time passes, those tokens might not be that relevant anymore, so basically we also need this decay mechanism. But again, that's 1 way to motivate these kinds of approaches. If you want to discuss it more technically, if you want to focus on the mathematical formulation of that part,
1 thing that we can say is that we have this design of test time training, and now, instead of gradient descent, we can use a more powerful optimizer. For example, we can use gradient descent with momentum. Basically, these more powerful optimizers can lead to more powerful architectures. And another way, for example, 1 can say, let's use Adam. Using the Adam optimizer as the update rule would lead to a new architecture, potentially more powerful than what we have. So that's another way to think about this specific formulation. But 1 thing that I found is that there's a trade-off between whether you want to focus more on the math side, or whether you want to motivate everything in the sense that everything is compatible with how humans work. So basically, I think the momentum part was a good spot where we could say that everything just makes sense in terms of how human memory works, and also how, mathematically, we can get better results. But, definitely, yes, you can use other optimizers to make it work. And, again, that's a different perspective on this matter.

Nathan Labenz (1:17:32) Okay. So let me try to summarize this back, and tell me if I get anything wrong. So again, we have this memory module, the purpose of which is to allow, at runtime, a new token to come in, take the query vector from that token, put it into the module, and output the relevant payloads that we had from earlier tokens. That module is finite in size, so it can't just hold more and more information forever. So how do we update and manage this? This is looking at equations 13 and 14 in the paper. It's actually relatively simple. It's basically just saying we first decay the current memory state by a bit and then we add an update term, and that update term is determined by the loss from the current token plus a momentum term from the previous token. It's the loss on the current token that constitutes the surprise. The bigger that loss is, the more we realize we need to update. Right? Because that token, we did not predict well. So we realize, okay, jeez, we really need to update significantly to be able to do a better job on this particular thing. And then the momentum term says we need to maintain a significant update for some tokens to come. We wanna not just update on this 1 token; this highly surprising token signals the beginning of a partial episode where we wanna make sure we gather the information from that entire period of time. That upcoming sequence, we wanna make sure we pay extra attention to all of that. Attention not in the technical sense there, but in the weight given to updating the memory state. That intuitively feels quite right. And it is interesting that it also looks so similar to other optimization algorithms generally, but just on kind of an introspective basis, it does feel like I can think back on moments in life where I've been very surprised. And there's this sort of modified mental state where I'm a little bit dizzied, off center, but you often come away from those moments with very clear memories. You know, I remember where I was when a certain thing happened. Right? And I remember that morning. The classic 1, of course, at least for people my age, is: I was in high school on 9/11. I remember who told me that it happened. I remember where I was standing. I remember the class I went to next. I remember what we talked about in the next hour. And so it wasn't just that 1 token. I remember talking to my now wife, then girlfriend, later in that day. I remember my dad came home from work. That whole episode, that whole day, is way more salient in memory than the day before and a few days after. And it seems like you're essentially capturing or creating a similar process here with these update rules. 1 question I did have about the parameters: there's the rate at which the current memory state decays. And then there are the weights, the sort of strength of update. There's a free parameter. You know, you have the loss, and then you have a free parameter that you multiply by to determine how much to update. And then there's also another free parameter on the momentum term. Are those all learned but fixed? Am I understanding that right? Or did you just pick them?
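
For reference, the update being summarized here, as equations 13 and 14 are described in this conversation, can be written as below. This is a reconstruction from the discussion, so treat the exact symbols and sign conventions as the paper's, not a definitive restatement: alpha controls forgetting, theta weighs the momentary surprise (the loss gradient), and eta carries past surprise forward.

```latex
% Surprise: decayed past surprise plus the momentary surprise (gradient of the memory loss)
S_t = \eta_t\, S_{t-1} \;-\; \theta_t\, \nabla \ell\big(\mathcal{M}_{t-1};\, x_t\big)
% Memory update: partially forget the old state, then apply the surprise-driven update
\mathcal{M}_t = (1 - \alpha_t)\, \mathcal{M}_{t-1} + S_t
```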

Ali Behrouz (1:21:07) So just to make sure that I understand it correctly, do you mean parameters like alpha t and, for example, theta and

Nathan Labenz (1:21:15) Yeah. Alpha, nu, and theta. Yes.

Ali Behrouz (1:21:17) Actually, they are learnable parameters, and they are input-dependent. So, basically, we project the input, and the model is learning how to project the input to parameters like alpha or theta. So, based on the token, we will decide which part of the past information is important; based on the token, we will decide whether we want to use the surprise from the past or not; and again, based on the token, we will decide whether we want, for example, to consider this specific momentary surprise or not. So all of them are input-dependent.
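
A small sketch of what input-dependent could look like in code: each token is projected to its own forgetting, momentum, and step-size coefficients, which then drive the update above. The layer names and the sigmoid/softplus squashing choices here are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class DataDependentGates(nn.Module):
    """Project each token x_t to its own alpha_t (forgetting), eta_t (surprise decay),
    and theta_t (step size). All three are read off the input rather than being fixed."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_alpha = nn.Linear(dim, 1)  # how much of the old memory state to forget
        self.to_eta = nn.Linear(dim, 1)    # how much past surprise to carry forward
        self.to_theta = nn.Linear(dim, 1)  # how strongly to react to the momentary surprise

    def forward(self, x_t: torch.Tensor):
        alpha_t = torch.sigmoid(self.to_alpha(x_t))          # squashed into (0, 1)
        eta_t = torch.sigmoid(self.to_eta(x_t))               # squashed into (0, 1)
        theta_t = nn.functional.softplus(self.to_theta(x_t))  # positive learning rate
        return alpha_t, eta_t, theta_t

# These per-token gates would then be plugged into:
#   S_t = eta_t * S_{t-1} - theta_t * grad_of_memory_loss
#   M_t = (1 - alpha_t) * M_{t-1} + S_t
```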

Nathan Labenz (1:22:04) Yeah. Gotcha. I should have got that from a sub t in the notation, but I'm not always as great with notation as I'd like to be.

Nathan Labenz (1:22:11) So there really is, like, a lot

Nathan Labenz (1:22:13) of (and this is kind of a Mamba callback too) where even something as basic as the decay of the previous state of the memory is dependent on the input at that particular time step. So you really have a lot of ways for the input to influence how the ultimate state of memory is being updated at each time step. So it's a super expressive, I guess you might say, setup in that sense. Cool. I think that is, again, really interesting and hopefully pretty intuitive. How hard is it to make all this stuff work in a scalable, efficient way on the given hardware? Because you are doing something that hasn't been done much, right? Where you're doing a gradient descent at each time step, making these updates. I really don't have a deep sense of how hard that would be to manage, and I don't know if you're using GPUs or TPUs, and I'm also not even entirely sure how much difference that would make in terms of how hard it might be. But how much wrangling at a low level, of manipulating all the relevant vectors and matrices, did it take to make this scalable? And how much scalability did you ultimately achieve in this portion of the work?

Ali Behrouz (1:23:47) Yes. So, actually, we have some experiments in the paper. 1 thing that we can see is that if we implement these kinds of approaches naively, like applying the gradient descent update at each step and so on, then the model becomes a recurrent model, and this recurrence can be very slow in practice when we want to train the model. The good thing is that this process of calculating the gradient and updating the model and so on can be reformulated in a matrix multiplication notation. So, basically, this matrix multiplication can be very fast on GPUs and TPUs, and so the good part is that we can make this process faster and parallelizable. The TTT paper discussed the dual form of how we can write this recurrent formula for gradient descent in the matrix multiplication format. But we have additional modules that we need to consider. The first 1 is the weight decay part. When we add the weight decay, we need to multiply the memory by a constant, by a scalar, or, for example, by a vector if we want to do channel-wise decay. The interesting part is that even with this decay formulation, we can again have a matrix multiplication format, but we need to add an additional Hadamard multiplication or, for example, construct a diagonal matrix and do the matrix multiplication with that. So that's generally the way we can make the process parallelizable when we use the decay part. But we also have the momentum part, which is much more challenging to handle. But the interesting part, if we look at the formulation in the paper for the momentum: as you mentioned, the momentum is something like S t, the surprise state, equals a decay parameter times the previous state minus the gradient term. And interestingly, this is again a recurrent neural network, a linear recurrent neural network. So there are different ways that we can make this process parallelizable. The first 1 is using a parallel scan algorithm, because this formulation is very similar to, for example, Mamba. It's a linear recurrent model, and we can use a parallel scan to calculate all the surprise states in 1 chunk. That's 1 way we can do it. Another way is to use the matrix multiplication format for this linear recurrence as well, because if we just expand the recurrence of the surprise state, we can see that the gradient part can simply be reformulated as matrix multiplication, and the coefficient of theta can again be formulated as a diagonal matrix multiplication. And that's another formulation we can use to make the process faster. So there are different ways that we can incorporate these techniques into the training part. And, actually, these techniques are very effective, because in the experiments that we have in the paper, we compare the training time of Titans with some modern recurrent neural networks and also attention. And we can see that when we increase the context length, attention definitely drops off, and that's 1 motivation for using these linear models. And comparing to other models, we can see that, for example, Titans is faster than Mamba, but there are some other modern linear models that are somewhat faster.
But 1 thing that we need to emphasize here is that in the Titans paper, we focused on delivering a new perspective, delivering a new architecture, but we didn't spend much time optimizing the implementation, for example, writing custom kernels to make the process very fast and all these things. Basically, our focus was on the architecture side, on designing new memory modules and also delivering a new perspective in that sense. Definitely, in the future, it would be very interesting to see how we can design different kernels to make the training process faster. In that case, I really believe that we can achieve efficiency comparable even to simple linear RNNs that are very fast. So, yeah, that's generally about the efficiency part.
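
To illustrate why the momentum recurrence parallelizes, here is a toy version of the idea in plain PyTorch: because S_t = eta_t * S_{t-1} + u_t is linear, all the S_t in a chunk can be computed at once from cumulative products of the decay terms instead of step by step. Real implementations would use chunked matrix multiplications or a fused associative-scan kernel, and this closed-form-ratio version is numerically naive for long chunks; it is only meant to show the structure being exploited.

```python
import torch

def surprise_scan(eta: torch.Tensor, u: torch.Tensor, s0: torch.Tensor) -> torch.Tensor:
    """Compute S_t = eta_t * S_{t-1} + u_t for every t in a chunk, without a Python-level loop.

    eta: (T,) per-step decay, u: (T,) update term (e.g. -theta_t * gradient), s0: initial state.
    Uses S_t = (prod_{j<=t} eta_j) * s0 + sum_{i<=t} (prod_{i<j<=t} eta_j) * u_i.
    """
    T = eta.shape[0]
    P = torch.cumprod(eta, dim=0)            # P[t] = eta_1 * ... * eta_t
    ratio = P.unsqueeze(0) / P.unsqueeze(1)  # ratio[i, t] = P[t] / P[i]
    mask = torch.tril(torch.ones(T, T)).T    # keep only contributions with i <= t
    return (ratio * mask * u.unsqueeze(1)).sum(dim=0) + P * s0

# Sanity check against the sequential recurrence.
eta, u, s0 = torch.rand(8) * 0.9, torch.randn(8), torch.tensor(1.0)
seq, s = [], s0
for e, x in zip(eta, u):
    s = e * s + x
    seq.append(s)
print(torch.allclose(surprise_scan(eta, u, s0), torch.stack(seq), atol=1e-5))  # True
```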

Nathan Labenz (1:29:16) Cool. Yeah. I mean, it's off to a great start, I would say. And, yeah, no doubt there's still room for further optimization. Let's talk for a minute, I guess, about the 3 strategies for integration. You've kind of alluded to this a little bit. There's memory as context, memory as gate, memory as layer. I think memory as context and memory as layer are probably the 2 most intuitive. Context is basically: do the retrieval, get the information out of the memory module, put that into context, do essentially normal attention. And we've seen many things like this, even including multimodal setups, right? Where, if you have a fusion of a vision model and a language model into a vision-language model, sometimes there's separate encoding of the image into its space, and then the text can go into its space, and then later they can sort of have an attention that makes sense of them together. That's kind of how I read the memory as context. The memory as layer is also pretty interesting. A little less intuitive in the sense that, especially as you were talking about interweaving the layers, it's a little odd to think about accessing memory multiple times through the forward pass. And especially as you get to the very late stages of the forward pass, still going back to the memory, it's not super intuitive to me how that would work or why it would work that way. And then there's also the memory as gate, and I probably have the least intuition for that. But, you know, with that prompt for you, you can help me develop intuition for all 3.

Ali Behrouz (1:30:54) So let's say that we have this long term memory, and we are using attention as the short term memory. Let's just ignore the persistent 1, because the way that we are treating the persistent memory is always the same, so let's just ignore that part. So we have this short term memory, and we also have this long term memory. As I mentioned, it's not very clear how we should connect short term and long term memory, even from the neuroscience side; there are different ideas in the literature. So, for example, the memory as a layer is the most common way that people do these hybrid models of RNNs and attention. What they are doing is simply using each of the memory, or the RNN, and also the attention as a small layer, and so on and so forth. And I think 1 of the important messages that we have in the paper is that most of the literature is focusing on 1 configuration that is not great. We have 2 different configurations, or 2 different types of architecture, that we can use, and they are actually more powerful than the architecture people are using in the literature. So that's an important message here, and, actually, there might be some other configurations that outperform these. So, generally, 1 thing that I can say is that these kinds of approaches are worth exploring. That's, I think, an important message here. But what is the intuition behind each of these designs? Let's say we have something like short term and long term memory. The data comes. 1 way to model this process is to say that our short term and long term memory are interconnected. So what does that mean? It means that the short term memory will help the long term memory, and the long term memory will help the short term memory. So how can they do that? The data comes, goes to our long term memory, we extract the relevant information from the long term memory, and then use that as the context. When we use that as context for the attention, 1 thing that we have here is that the attention part will decide whether it wants to focus on the current context or whether it wants to focus on the past information from long term memory. And then the output would be something that is somehow combined or compressed by the attention; it's a piece of information that has gone through the attention, so, basically, we know what information is important and what information is not important. So the output of attention goes to the long term memory and lets it know what information should be stored in the long term memory. That's how we can describe this MAC architecture. As I mentioned, at the beginning of the process, the long term memory is helping the short term memory by extracting the past information that can help the short term memory, the attention part, decide how to combine the information or, for example, learn from the data. And on the other hand, the output of the short term memory will help the long term memory understand what information should be stored and what information should not be stored. So that's the MAC architecture. Another way that we can think about short term and long term memory is to treat them as 2 different modules, 2 different branches of memory.
So basically, in this design, the data comes and then goes to the long term and short term memory at the same time. We have the information from the long term memory that is related to this specific input, and we also have the information that comes from the short term memory, which understands the pairwise interactions of all these things in the context. And then we just concatenate these 2, or multiply these 2, at the end. So we are using the long term information and the short term information, and then concatenating them or multiplying them to use both types of memory that we have. And that's basically the main intuition for how we want to do these things. And finally, the memory as a layer: the main reason that we included memory as a layer is as a unified way to combine the memory and attention parts, because most people in the research community are using this configuration, and it can help us understand which of these types of configurations is more helpful. The way that we can think about it is: let's say that data comes, and what we are saying is that short term memory and long term memory are basically modules that are sequentially connected. So the data comes and goes to, for example, our short term memory or the previous state of the long term memory, depending on how you want to order these layers. But, for example, the data comes, goes to your short term memory, and, basically, the short term memory will decide which information should go to the long term memory, and we repeat this process. The output of the long term memory, again, can help the next layer of short term memory, and so on and so forth. So in this case, each layer of our short term and long term memory is helping the next layer of long term and short term memory. That's the main intuition for this memory as a layer.
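
A rough pseudocode sketch of the two parallel-branch designs as just described: in MAC the memory's retrieval is prepended as context for attention and the attention output is written back, while in MAG the two branches run side by side and are mixed. The function names, the concatenation order, and the sigmoid gate are simplifying assumptions for illustration; the paper's exact gating and its handling of persistent memory may differ.

```python
import torch

def mac_block(x, memory, attention, persistent):
    """Memory-as-Context (sketch): retrieve from the long-term memory, let attention read the
    retrieved tokens alongside the current segment, then write attention's output back."""
    h = memory.read(x)                          # past information relevant to this segment
    ctx = torch.cat([persistent, h, x], dim=1)  # [persistent | retrieved | current segment]
    y = attention(ctx)                          # short-term memory decides what matters now
    memory.write(y)                             # its output tells the memory what to store
    return y

def mag_block(x, memory, sliding_attention, gate_proj):
    """Memory-as-Gate (sketch): run memory and windowed attention as parallel branches,
    then mix their outputs with a learned, input-dependent gate."""
    long_term = memory.read_and_write(x)        # recurrent long-term memory branch
    short_term = sliding_attention(x)           # local (sliding-window) attention branch
    g = torch.sigmoid(gate_proj(x))             # mixing weight in (0, 1)
    return g * long_term + (1 - g) * short_term
```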

Nathan Labenz (1:37:09) Yeah. So it is interesting to observe, just looking at all the results in the paper, that, as you said, of the 3 sort of hybrid approaches, memory as a layer is winning very few categories. And that, I guess, also kind of speaks to the sort of motivation, or the higher level of thinking, being an important way to approach this, because memory as a layer is very similar to a lot of the hybrids that we've seen between attention and Mamba-style state space models in the past. But I think the way that came about was like, well, the original Mamba was just all Mamba. There was no attention in it at all. And then people said, well, jeez, what if we can get the best of both worlds by just interleaving these things together? And that also seemed to improve on the base, but it wasn't necessarily a super principled thing. It was just like, well, I see that I can stack a bunch of attention layers, and I can also stack a bunch of Mamba layers, and now let's shuffle them in together. And lo and behold, it gives me, in some sense, the best of both worlds. But it's interesting to hear that memory as context and memory as gate both have a more principled, higher order rationale motivating them. And indeed, they do outperform the layer approach almost across the board, maybe not entirely. I'm just roughly counting: it looks like 9 different categories and 3 different scales at which these experiments were run. Memory as layer wins maybe 2 of roughly 30 of those different categories, and then between context and gate, it's maybe 50/50 across all the rest. So 1 big question I have, and 1 of the things that got me most interested in the Mamba architecture and these sorts of hybrids in general, is that it seems like there are different, what I have started to call, micro skills across these different architectures. So transformers, for example, are much better than Mamba was when it comes to retrieving, like, repeating past patterns. Right? It can see the past pattern, which makes sense, because it can see the past pattern explicitly, and so it can repeat that past pattern as needed. Mamba has some other micro skills, though, that attention kinda struggles with, including learning really sparse signals, or signals in really noisy environments. Sometimes the transformer can kind of struggle to learn those, but the Mamba architecture seemed to do better. Do you see any, like, micro skills?

Nathan Labenz (1:39:53) I don't know if you've had any ability to study this, but are there sort of things that this new mechanism can do qualitatively differently or better, or perhaps things that it can't do as well? I mean, I guess we probably would know if there's anything it can't do as well; we'd probably have a pretty good sense for that being related to just not having the full thing in explicit context anymore. But have you seen any sort of micro skills that are new, different, exciting, or even just, you know, informative?

Ali Behrouz (1:40:27) So, 1 thing I'd say first: if I want to answer that question, 1 thing we need to consider is whether we are talking about the neural memory module part of Titans or the entire Titans architecture. Because if we just focus on the neural memory module, 1 thing we need to consider is that it's similar to linear models; it's very similar to RNNs. It probably has the properties of what we know about RNNs and these kinds of approaches. But it's generally more expressive than the other RNNs that we have, because the memory is more expressive: the architecture that we are using for the memory is more expressive, and also the recurrent formula is more expressive. So basically, that's the main advantage of the neural memory. But the entire architecture of Titans, let's say the MAC architecture or the MAG architecture, these 2 are hybrid approaches. So they somehow have the ability to get the best of both worlds; they can use the advantages of transformers and RNNs, at least on paper, because we can just ignore the output of the attention when the task is RNN-specific, or vice versa. So that's 1 thing that we need to consider. On paper, as I mentioned, or theoretically, these hybrid approaches can have the best of both worlds. But there are some cases where we cannot get better results than a pure RNN architecture, and similarly for attention, there might be some cases where we cannot achieve better results than a pure transformer architecture. And so the question here is: are there any specific tasks where hybrid approaches are better than transformers and RNNs? We have another paper, actually, called Best of Both Worlds, where we look at this process through the lens of graph algorithms. We wanted to see whether, for example, a hybrid approach can have better results on some specific tasks, and it turns out that there are some specific tasks where using a hybrid approach is better than a pure RNN and a pure transformer-based model. And so 1 thing that we can say is that Titans-style architectures can be connected to the RNN models, and, actually, theoretically, they are more powerful and expressive than existing models. For example, 1 thing that we can say is that models like Mamba or, for example, RetNet, S4, S5, and all these models are limited to the TC0 class of problems. There is a paper called The Illusion of State in State-Space Models, I think. Basically, they have shown that state space models, or, more accurately, diagonal state space models, and also transformers, are limited to the TC0 class of problems. On the other hand, if we go beyond diagonal linear RNNs such as, for example, Mamba, we can see that we have a more expressive architecture, and we can go beyond the TC0 class of problems. And so, for example, I think that

Nathan Labenz (1:44:22) To to say more about the what it is what a t c 0 plaster problem is?

Ali Behrouz (1:44:26) The example that I can give is the state tracking problem. Let's say that we are at, like, the 0 point, and there is a string of actions like go left, go right, and so on. And at the end, the question is: where are you right now? The model is expected to understand these things. But, for example, diagonal RNNs, including the state space models, and also transformers, are limited and cannot do that task. But, for example, a nonlinear RNN can do that with only 1 layer, so that's a very simple task for them. On the other hand, diagonal linear RNNs and transformers cannot do that. Some models like Titans, I mean just the neural memory without the full architecture, are also on the nonlinear RNN side, and they can do this state tracking problem. And so, basically, on these kinds of tasks, they are more expressive and powerful than the other approaches that we know. 1 thing that we need to specifically say here is that there are some other RNNs that are also capable of doing state tracking. For example, DeltaNet is also capable of doing that, because it's not diagonal. But some models like Mamba or Mamba 2 or, for example, RetNet, S5, S4, these kinds of approaches are limited, and they cannot do this state tracking problem. But this nonlinear RNN approach, similar to what we have in Titans, basically can do that. And also linear RNNs with non-diagonal transition matrices, similar to DeltaNet, can also do that.
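
For readers who want a concrete picture of the task being described, a trivial instance of the left/right example looks like the snippet below. The theoretical claims above are about which architectures can provably compute this kind of running state within a single forward pass, not about what a system can do with extra tools or chain-of-thought; the function here only illustrates the task, not any model.

```python
def track_position(moves: list) -> int:
    """Toy state-tracking task: start at 0, apply 'left'/'right' moves, report the final position."""
    pos = 0
    for m in moves:
        pos += 1 if m == "right" else -1
    return pos

print(track_position(["left", "right", "right", "left", "right"]))  # 1
```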

Nathan Labenz (1:46:20) Quick interjection: is that analysis limited to, like, a single forward pass? Because I would assume that I could give o1 or similar reasoning models a problem like that, and I would be pretty surprised if they couldn't talk their way through it.

Ali Behrouz (1:46:41) So I think 1 thing that we need to clarify is that some of these approaches might work well in practice, but when we want to understand them from the theoretical side, with some theoretical perspective, it's really hard to model everything that we have in the real world. Basically, we need to make some assumptions that simplify the process, so we can use the theoretical understanding that we have and use those theoretical frameworks to develop some results. So, yes, I think in practice, definitely, large models are capable of doing some of the tasks that I mentioned. But when we are talking about some theoretical results, that's not completely about how the models work in practice. Actually, the paper that I mentioned, The Illusion of State in State-Space Models, also has some experimental results that support that claim. I think that's the main thing we need to consider. Definitely, there are other combinations of models, or, for example, additional techniques, reasoning models, and all these things, to make the model more powerful so it can do a lot of stuff. But, you know, these theoretical frameworks are just talking about the simple model that we have. So, basically, that's about the RNN side, the recurrent neural network side, which is our neural memory. The attention part is also well studied in the literature, so we don't have any specific contribution on that side. You know, the attention is attention. As I mentioned, in the Best of Both Worlds paper we also have some results that show hybrid models are better at doing some tasks that are hard for both RNNs and attention, and their combination can be more effective. So that can also be another motivation to consider these kinds of hybrid approaches.

Nathan Labenz (1:48:53) Cool. Yeah. I'll need to go check that 1 out a little bit more deeply. It sounds like, if I'm understanding correctly, in some of these theoretical contexts, and maybe this is a good thing for people to be more aware of in general if I'm understanding it right, some of these papers that say the transformer architecture can't do X are maybe better understood as saying it can't provably, reliably do X under certain conditions all the time, but maybe in practice it can still do it much of the time, just without a guarantee. But, you know, maybe with enough inference tokens to burn, you don't necessarily need a guarantee. You just need consistent success, even if it's not fully theoretically proven that you always get that success. So, yeah, those different frames are maybe why there's a decent amount of people talking past each other online about what exactly can and can't happen. Because it is often strange when you see these things where it's like, we've proven that a transformer can't do X, and then somebody goes to ChatGPT and does it. And you're like, wait a second, these 2 things can't both be right. And it seems like it's probably often a question of different assumptions and different levels of what they are considering to be proof; the robust-guarantee level versus the most-of-the-time-it-works level can be quite different. I guess just to follow up real quick on the micro skills concept: when you look at all the different things that you tested the new Titans architectures on, what jumps out to you as the most exciting thing that it is doing better than other architectures?

Ali Behrouz (1:50:45) I think the most exciting part is the long context. Actually, I need to say that the long context part is on some synthetic datasets, so it might not be as effective as what we are showing in the paper when we are talking about general tasks and a general model. But, basically, what we are doing in the paper is comparing the model with some other counterpart approaches to see how they perform on specific tasks, so we can compare them. But the interesting part, I think, is that Titans can outperform other models in long context, and this is very impressive, in my opinion, when we can see that it can scale to, like, 2 million tokens. And that's something that some models like GPT-4 cannot do; the performance on the benchmark for GPT-4 drops very fast. But, for example, Titans, with a very small number of parameters, can scale to millions of tokens and even go to 10 million tokens with accuracy around, like, 70%. So I think that's the most impressive part, in my opinion. And, actually, for that long context modeling, the MAC architecture is very important. It's very important to have 2 different branches of long term and short term memory, and then these 2 help each other to understand the context. When we are using some other approaches, like MAG or, for example, memory as a layer, we get much, much shorter context lengths, I mean, with high accuracy. So, basically, the MAC architecture is very important for long context modeling. I think that's the most interesting part. When we are talking about some language modeling tasks, the behavior of all these models can be different when we use different sizes of models. For example, there is a very great plot in the paper called Mixture-of-Mamba, I think, if I recall it correctly. The plot shows the perplexity of a transformer, Mamba, and the Mixture-of-Mamba. And 1 thing that we can see is that the RNN approach, or, for example, in this case, Mamba, scales better when we have less data. But when we increase the number of tokens that we are using for training, we can see that transformers scale better and then just outperform when we have a lot of data. So that's a very important point to consider. And 1 thing that we can see about Titans is that, at least at the scales we experimented with, we couldn't see that pattern. It actually consistently has good results compared to transformers and also other linear RNNs. It's definitely a good thing to converge faster, but when we are talking about using additional data for training the model, we might see some up and down in the ordering of models with respect to perplexity. And so, basically, I think that's also very important, and it might not be very clear from academic papers, because we need to have a fixed number of tokens that we use to train our model, and it's generally infeasible to train the model on larger and larger numbers of tokens. And, yeah, I think that's another good result that we can see with Titans: it is very consistent in outperforming the other models when we increase the number of tokens that we use in training.

Nathan Labenz (1:55:23) So as we look ahead, how much do you think data is going to become the limiting factor on long term memory? I mean, very few of the architectures can even attempt to get out to the 10,000,000 token context, and of those that can, Titans is, like, blowing away the other 2 that you show in the experiments. It strikes me that we don't really have many sources of data that are really such long episodes. And so I guess I'm wondering, because the Internet is full of short and medium and even long blog posts, but they're not reaching into the millions of tokens very often at all. If we think about wanting to have long running agents, it seems like an architecture like this could be really key to getting a lot of the sort of behavior that we might want out of a longer running agent, but we still don't necessarily have the long horizon datasets to be able to train on. What's your take on whether we have the data that we need, or we don't and have to go create it, if we wanna really take full advantage of these sorts of architectures at scale?

Ali Behrouz (1:56:50) I think this is related to 2 concepts in the community that are usually important for RNNs, and that's length generalization and length extrapolation. The question is whether our model can somehow work better, or at least not show any performance drop, when we increase the sequence length. As you mentioned, most of the data that we have might not be very long context, so it might be challenging to train the model on millions of tokens. But we actually don't need to do that, so length generalizability in some models can be important for effective training. Let's say we train our model on a dataset that might not have very long context, but we expect our model to understand how to generalize to longer context. When we have a single model like an RNN, or, for example, our neural memory, we can see that these models struggle with length generalization. When we increase the sequence length, we can see that the performance drops, and that happens very soon, at, like, some thousands of tokens or something like that; it's not, like, 1 million tokens. Again, with transformers, we can see similar results if we don't do any tricks to make them better at length generalization. But there are some different architectures that can somehow mitigate these issues and make the model more generalizable with respect to sequence length. And there are also some other tricks that we can use, for example, different positional encodings, or, for example, different normalization techniques that people use for length generalization and all these things. On the architectural side, I mean, regarding the architecture design, we can see that this MAC architecture is very good for length generalization, based on my personal experience with it. My understanding is that when we have something like the MAC architecture, the memory module is only responsible for summarizing the data for the attention into a fixed-size number of tokens, so it doesn't have a very hard time compressing the information and learning things. On the other hand, the attention doesn't need to attend to so many tokens; we can just use a fixed-size local attention. So, again, it doesn't have a very hard time learning from that part and understanding how it can use the information from the long term memory. So, in general, 1 thing that I can say is that having more data definitely is very good, right? It can help to have a better model, but it's not the end of the story. We need to have better architectures and better techniques that we can use in training the model, so we can make them more generalizable, for example, to different sequence lengths, and make them better at longer sequences, and so on and so forth. So I think the architecture and the model itself are very important.

Nathan Labenz (2:00:41) I guess 1 question I have is how retrofittable do you think today's open source models are with this technique? If I wanted to take an off-the-shelf Llama model or R1 or what have you and integrate a long term memory module of the sort that you have. I don't wanna make it sound simple, but based on what you just said, it seems like I could probably hack the Llama architecture a bit, bring in the memory module as you've designed it, and then do some amount of continued training to get the thing to actually effectively pay attention to the memory as context, and I kind of think that that would work. Are you expecting that to happen?

Ali Behrouz (2:01:38) I think potentially it's possible. You know, the main thing that I like about the MAC architecture is that no matter how large you want to make your context length for training your attention, you can always go farther with the long term memory part. So you can use attention with a context length of 32. You can use 128, 512, and go beyond that, like 8,192, and so on and so forth. No matter how large you want to make the context length for training the transformer part, you can always go farther using the additional memory. So with the memory as context design, the Titans-style model, I really hope that these kinds of approaches can be used in the future and basically become something that we can use, because they are actually working, and they can help with performance in long context without, for example, a performance drop or damaging the performance of pure attention. So I really think these kinds of future directions are really promising.

Nathan Labenz (2:03:06) kind of conceptual question I had? This goes back to sort of how the memory module is predicting the keys. It's it sort of struck me that, like, if I wanted well, you know, 1 of the promises of long term memory, at least in humans, right, is, like, we can learn whole new domains of stuff that was just, like, we've never encountered before. If we spend time on it, it can sort of become part of our like background world knowledge. Here it seems like we still have kind of a the memory module output something that is kind of by definition in the latent space of the model as it was kind of originally pre trained. And because those, like, main weights don't get updated at runtime, the memory module is always outputting something that's in the space of that pre trained model. So that got me thinking like, would this ultimately be something that people might want to combine with sort of a continued pre training? If for example, you're like an enterprise and you're like, you know, let's say I'm 3 ms or I'm GE, I'm some company with like long history and you know, millions of products and you know, just some super deep history that I want to get my models to learn. It seems like I might want to do like continued pre training so that the model itself like has more representations of the things that I care about and then that might be needed to get that long term memory to be able to like output things in that sort of modified space. Does that I guess the the key question there is like, that disconnect between sort of the model's fixed latent space and that and like the the fact that the memory module is outputting into that space, does that suggest sort of a a frontier for, like, future elaboration or, you know, perhaps that that continued pre training approach would solve it? But I'm interested in your your thoughts on, you know, just how far could you push this. Could you get the model itself to, like, learn whole new domains of knowledge this way, or would you expect that you'd kind of bump into some limits and have to do something to overcome that?

Ali Behrouz (2:05:18) I think that if we want to train the model for a long period of time, and by training I mean even test time training, you know, just updating the parameters of the model over time and using that for different tasks or different types of things, we can face a problem which is very well studied in the literature, called catastrophic forgetting. Basically, 1 thing that we can see is that the model will learn a specific task or, for example, a specific context, and when we want to go to the next task and learn that next task, the model might completely forget about the previous task in order to get adapted to the current configuration and task. So I think the part that you mentioned is very promising for future study, but there are actually some challenges in doing that, and this catastrophic forgetting is 1 of them. Because if you want to have a very, very long time in the test time training part, in that case, the model might forget about the initial tokens or, for example, the previous tasks it was trained on, and that's really challenging. If we could solve that, actually, we would solve all the, like, robotics stuff and all the reinforcement learning stuff and all these things. So that's a really challenging question. But, you know, I think we might face this issue, and it needs to be addressed. Yeah.

Nathan Labenz (2:07:13) Well, I know we're just about out of time. Any closing thoughts about where you are most excited to see all this stuff go next? To me, this feels like a pretty notable step on the path to really figuring out that kind of continuity of memory, and I do think that's gonna unlock some of the most exciting applications that people are interested in. You know, whether it's the drop-in knowledge worker that has the full context of your enterprise's full history and everything that it's ever done, or the long running agents that can go out and kind of make mistakes and maybe learn from those mistakes and not make the same mistakes twice. I don't know that this gets us all the way there, but it feels like it does take us a meaningful step forward. So what, to you, are the most exciting or promising directions that you, or you hope others, will go in from here?

Ali Behrouz (2:08:08) Honestly, 1 thing that I really think is interesting is using these kinds of approaches for other tasks and modalities, because, you know, we are trying to do this long context for text, but sometimes there are other modalities or other tasks that are more suitable for long context. For example, as you mentioned, when we are talking about agents or, for example, reinforcement learning and decision making, all these things require some long term memory to understand different types of patterns in the data. I think it's promising to see how these long term memory style architectures like Titans can work in those domains. As I mentioned: decision making, reinforcement learning, and, for example, other data modalities. I think that's a very promising future direction, to see whether these designs that we just talked about are actually effective outside of language modeling, and whether they are just good architectures for language modeling or good architectures across all areas of deep learning. You know, that's the important part, because I think 1 of the main reasons that transformers were really successful is that they were successful in different domains, and I think it's really important to explore and see whether, for example, these additional modules that we are adding, and the new architectures like Titans that we are designing, can also be effective in other domains, other data modalities, and all these things. And definitely, there is so much room to modify them for different tasks, because, for example, the design of the MAC architecture might be very suitable for language modeling, but, for example, the MAG architecture, using gated short term and long term memory, might be very effective for decision making. Yeah, we don't know that. All of them are really interesting for future work, to see which 1 works better and which 1 is effective.

Nathan Labenz (2:10:41) Well, it never ends. There's always plenty more to do, at least until the AGI takes over and starts doing it all for us. But for now, this is a really fascinating development. I think it's a great combination of high level, principled, intuitive motivation for the work and obviously some really deep work in terms of making it work on the mathematical and even the computational level. So it's a pretty impressive piece of work. I have really enjoyed studying it, and I'll certainly be looking forward to what you guys come up with next. For now, I will just say: Ali Behrouz, thank you for being part of the cognitive revolution.

Ali Behrouz (2:11:23) Thank you very much for having me. Thank you.

Nathan Labenz (2:11:25) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
