Nested Learning: Ali Behrouz on the Quest for Continual Learning & Illusion of AI Architectures
Ali Behrouz discusses Nested Learning, continual learning architectures that update layers at different frequencies, plus AI sleep, associative memory, and implications for privacy, alignment, and AGI.
Show Notes
Ali Behrouz, grad student at Cornell and Google researcher, discusses his potentially transformative work on new architectures for continual learning in AI. His paper "Nested Learning," praised by Jeff Dean as a possible paradigm shift, enables models to adapt to new context while preserving core knowledge by updating different layers at different frequencies, inspired by human memory systems. The conversation also covers his latest work on AI "sleep" for memory consolidation, why he sees all deep learning as associative memory, and the profound implications of continual learning for privacy, alignment, and the path to AGI.
Mercury: The fintech trusted by ambitious companies and individuals to run their finances, with virtual cards, spending limits, merchant/category locks, and AI-friendly tools like API keys, MCP, and CLI. Check out Mercury at mercury.com.
LINKS:
- Ali Behrouz personal site
- Titans memory paper
- Nested Learning paper
- Jeff Dean Google profile
- Transformer architecture paper
- Self-referential weight matrix paper
- Mamba state spaces paper
- MTOB translation benchmark
- RetNet retentive network paper
- DeltaNet linear transformers paper
- MAD synthetic benchmark paper
- Muon optimizer repository
- Adam optimization paper
- Language Models Need Sleep paper
- ARC AGI benchmark repository
- Emergent Misalignment paper
- GRPO DeepSeekMath paper
- Ilya Sutskever Lex interview
- Drexler CAIS LessWrong post
- Anthropic Claude site
- Ali Behrouz Google Scholar
- Muon optimizer blog write-up
- Google Gemini site
- OpenAI ChatGPT site
Sponsor:
Claude:
Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr
CHAPTERS:
(00:00) About the Episode
(04:15) Special Sponsor
(05:44) Bio-inspired learning gaps
(17:29) Active and sleep phases (Part 1)
(21:08) Sponsor: Claude
(22:59) Active and sleep phases (Part 2)
(28:44) Nested learning paradigm
(41:53) Hope architecture foundations
(58:45) Training and update frequencies
(01:13:37) Knowledge transfer mechanisms
(01:23:47) Rare language results
(01:35:41) Micro-skills and optimizers
(01:48:36) Sleep consolidation process
(02:01:52) Dreaming and implications
(02:12:35) Alignment feedback risks
(02:28:08) Embodiment and ecosystems
(02:46:06) Consciousness and closing
(02:54:51) Episode Outro
(02:58:50) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Transcript
This transcript is automatically generated; we strive for accuracy, but errors in wording or speaker identification may occur. Please verify key details when needed.
Introduction
[00:00] Hello, and welcome back to the Cognitive Revolution!
Today I am excited to share a conversation with Ali Behrouz, grad student at Cornell, researcher at Google, and author of "Nested Learning."
This episode was recorded a few months back, and while I normally believe that "AI content doesn't age well", this conversation with Ali is an exception. His work is some of the most inspired and potentially transformative that I've seen in the quest for new machine learning architectures that are capable of genuine continual learning.
This, of course, is one of the most important capability advances on the horizon – arguably it's the main gap between today's models and a digital AGI that would be capable of joining and contributing to human teams just as humans do – and Ali is advancing the frontier with an approach that is biologically inspired and technically elegant.
His blockbuster paper, Nested Learning, which has been touted as a harbinger of a possible paradigm shift by no less than Jeff Dean, develops a simple strategy that allows models to rapidly adapt to their current context on an ongoing basis, while preserving core knowledge, by updating different parts of the system at different frequencies – much like humans manage memory on multiple timescales, from working to long-term memory.
And his latest work, "LANGUAGE MODELS NEED SLEEP: LEARNING TO SELF MODIFY AND CONSOLIDATE MEMORIES" – which I actually heard about for the first time during this recording, and which has finally become fully public – takes inspiration from how humans consolidate memories and learn from dreams while sleeping, introducing an offline mode in which models transfer new knowledge from their high-frequency-update layers to their more slowly-evolving layers via distillation, and also learn new abstractions and connections between concepts by generating and training on synthetic data derived from their recent experiences.
In addition to the details of these architectures – which, like so many AI innovations, I find both extremely exciting and a bit scary – we also discuss:
- How scaling for performance may shift from stacking more layers to nesting more frequency update rates;
- How Ali understands all components of ML systems as forms of associative memory that compress a given context flow, why this leads him to call Deep Learning Architectures an Illusion, and how he's operationalized this conceptual insight by developing "expressive optimizers" that learn update rules and are capable of outperforming both Adam and Muon;
- How the attention mechanism can be understood as an infinite-frequency-update module, and why Ali expects attention layers to remain fixtures of AI systems indefinitely;
- The empirical results showing that Ali's new architectures compete effectively with Transformers on standard measures while also outperforming on hard tasks such as effectively recalling information from up to 10M tokens of context, and also learning to translate multiple previously unseen languages at the same time;
- Why Ali sees continual learning as both an opportunity and a huge risk for privacy and alignment, how human-AI relationships might evolve, and why Ali is cautiously optimistic that models that evolve over time based on our interactions could both serve our individual needs more effectively and also lead to a more diverse and hopefully stable AI ecosystem overall.
The bottom line, for me, is that for all the debate and speculation about whether or not current architectures can scale to AGI and beyond, there's a very good chance that conceptual breakthroughs will render that question moot before we manage to answer it.
Transformers have changed the world, but they aren't the end of history, and as tough as it is to keep up with AI developments, anyone who wants to get a handle on where things go from here can't afford blind spots when it comes to new research directions like Ali's.
And now, without further ado, I hope you enjoy this deep dive preview of AI systems that learn, on an ongoing basis, in increasingly human-like ways, with the brilliant Ali Behrouz.
Main Episode
[04:15] Nathan Labenz: Ali Behrouz, author of "Nested Learning" and the new "Language Models Need Sleep: Learning to Self Modify and Consolidate Memories." Welcome back to the Cognitive Revolution.
[04:26] Ali Behrouz: Thank you very much for having me.
[04:29] Nathan Labenz: I am super excited about this conversation today. I appreciate you being willing to take some time and come back and do a deeper dive into your work. I think it is super fascinating and genuinely some of the most inspired work that I have seen in recent times. A big part of your method, as I understand it, is looking at what human cognition consists of and identifying things that we as humans are doing that seem quite important and really critical to our successful function in the world, then trying to figure out what an AI version of that might look like, and then starting to develop the architectures, or system designs, maybe even more abstractly than architectures, that start to make those capabilities possible in AI systems. It's really striking both how well some of these ideas have worked and how elegant they feel, and how right they seem as I take time to dig in and understand them. So, first question, just big picture: how do you think about what it is you are trying to do? Obviously you identify gaps in what current systems can do. How do you think about those gaps? How do you conceive of what it is that you are trying to unlock with the new architectures that you're developing?
[05:53] Ali Behrouz: Getting inspired from the brain, what does it mean for me? It probably means different things for different people. For me, I really like to get inspired from the brain and generally from evolution. The main reason is that I think evolution has had a lot of data to train itself on, in a natural way of training. So what we can see now is a very evolved version of a very complicated biological brain, and generally it's a great source of inspiration. But when I say that, I don't mean that I want to replicate the brain and fully do something that the brain does, because most of the time we don't actually know what that is. For example, I think there are different levels of understanding about how the brain works. The first one is that we know it works. That's the first level. The second level of understanding is that there are some modules in the brain, each of them responsible for different parts: we have memory, we have other things, and that's generally the process. I think the hard part when we want to get inspired from the brain is deciding at what level of granularity we want to focus on the brain and get inspired from it. Because if you go too much into the details of how the brain works, there are two issues. One is that we are overfitting ourselves to one specific form of intelligence. Another is that we don't actually know how the brain does that specific thing. So generally, in all of the works that I have done, for example on Titans and also this Nested Learning one, what is happening is that we see the models facing some challenges in real-world applications. Then the first question is whether that specific challenge is solved by humans: can humans simply do something to address that challenge and overcome it or not?
And then the second question is how they can do that. But again, for example, in Titans we discussed a surprise metric. We discussed, for example, how the memory should be decomposed into short-term and long-term memory, and so on and so forth. But the point is, there's a high chance that the brain is not exactly performing gradient descent to, for example, compute the surprise or something like that. That's just a high level of our understanding about how the brain works. I think we should keep that as a source of inspiration. I mean, at that level it would be a great source.
[09:24] Ali Behrouz: But if we go into more details, then potentially we might face some challenges, specifically because our understanding of the brain is changing over time, and so we might overfit to one specific design choice. So that's generally one thing. But about Nested Learning, I think one point that is missing in the current models is about two parts. One is about how they can adapt to the environment and the context that they are in. Another point is about how the model can understand new knowledge and incorporate it into its parameters over time while making sure that it doesn't face catastrophic forgetting, which means that, for example, one specific task that it has been trained on is forgotten and it no longer has any skills in that specific direction. So I think these two are important things about the current models, and they are facing a lot of challenges there, because if you have a very large model, then your model needs to be updated over time. That's generally the main reason you can see that, for example, there is a knowledge cutoff for all of the LLMs that we know of. For example, if you ask GPT about one specific piece of information and then say, you are not allowed to use any tools for answering this question, then you might see that there's a knowledge cutoff. And so that's a little bit challenging to overcome. If you just want to keep updating the models, then there are two huge challenges. One is catastrophic forgetting, which I mentioned. Another one is the efficiency part: because the models have a lot of parameters, you cannot keep updating all of the parameters. So it's a little bit hard. And there are some solutions for that. I think they are also great, but I have some intuitions about why they might not perfectly work for the case of continual learning.
For example, one might say that they want to do supervised fine-tuning, SFT, or do some RL, or something like that. But the point is, the model can still face catastrophic forgetting from one side, and from the other side, at some point you need to transfer all the knowledge that you have in your context and pass it to the actual parameters of the model. If you just keep summarizing the tokens, if you just keep the tokens and want to do everything about the memory and learning process in the token space, then the main issue is that at some point you will pass the context length of the LLM, and so potentially you will face some challenges in that direction. So considering all these things, the main issue with the current LLM paradigm is that the models cannot continually learn and obtain new knowledge and new skills over time. They are also limited in understanding different levels of abstraction of the world. Everything that is done in science is an attempt to explain the world in the simplest way possible. And that's generally the way we can learn something, because we don't want to keep everything. We want to understand the underlying patterns that can describe that specific data, that specific knowledge, and so forth. So this compression process, and how we can understand different levels of abstraction from the data that we have, is something that the current LLMs somehow struggle with.
[12:56] Nathan Labenz: So, a couple of different angles. I want to probe your intuition a little bit more on this point. Certainly I have felt, and honestly most users at this point have felt, the problems that you're highlighting. In some ways, I feel like the biggest advantage that I have relative to an AI today is this kind of ongoing coherence and reasonably stable identity, right? I know who I am in the morning, I kind of know what I was trying to do yesterday, and I can mostly pick up where I left off. I probably could learn a lot more from the things that I do on a daily basis, but I at least learn some stuff and take it on board, and obviously the current models don't really do that. That is a big weakness for them. That's why I was so excited about the original Mamba paper when it came out, because it was just like, wow, here's something that seems competitive with Transformers, but it has a fixed-size memory space. Obviously we can't grow the memory space quadratically to infinity, so we're going to have to have something that's bounded in size that can work. So that was obviously a notable step. And they've done some work with the Mamba architecture as well. Titans even more so: OK, here's another way to think about having a fixed-size memory module, which in that case was updated with gradient descent at runtime. It certainly seems like the data supported the notion that that would make it even more powerful than the Mamba architecture, but with a similar kind of structure: a fixed-size thing that can keep what it needs and gradually let go of what it doesn't. That seems super important.
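The idea of a fixed-size memory that is updated with gradient descent at runtime can be illustrated with a minimal sketch. This is not the Titans implementation: the class name `FixedSizeMemory`, the squared-error objective, the use of the prediction-error norm as a stand-in for "surprise", and the simple `decay` forgetting term are all assumptions chosen for brevity. The memory is a single matrix whose size never grows, no matter how many items are written.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class FixedSizeMemory:
    """Hypothetical dim x dim associative memory trained at inference time."""

    def __init__(self, dim, lr=0.5, decay=0.0):
        self.M = [[0.0] * dim for _ in range(dim)]  # size is fixed up front
        self.lr = lr        # step size of the runtime gradient update
        self.decay = decay  # assumed forgetting factor (0.0 = never forget)

    def read(self, key):
        return matvec(self.M, key)

    def write(self, key, value):
        """One gradient step on 0.5 * ||M key - value||^2.

        Returns the prediction-error norm, used here as a simple proxy
        for how surprising the new item was.
        """
        error = [p - v for p, v in zip(self.read(key), value)]
        surprise = sum(e * e for e in error) ** 0.5
        for i, e in enumerate(error):
            for j, k in enumerate(key):
                # gradient of the loss w.r.t. M[i][j] is error[i] * key[j]
                self.M[i][j] = (1 - self.decay) * self.M[i][j] - self.lr * e * k
        return surprise


mem = FixedSizeMemory(dim=2, lr=1.0)
first = mem.write([1.0, 0.0], [0.0, 1.0])   # unseen association: high surprise
second = mem.write([1.0, 0.0], [0.0, 1.0])  # already stored: no surprise
recalled = mem.read([1.0, 0.0])
```

In this toy setting the second write of the same pair produces no update, which is the intuition behind letting surprise decide what gets stored.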
I wonder if you come at it, though, also from the other angle. So far we've kind of said, here's something that it can't do: we can do this, it can't do this. That's one way to think about it. Another way to think about it, obviously, is: what do we want our AI to be like? So I wonder what your intuition there is. Today we have mostly chatbots that need something to wake them up, right? Either we have to go send them a message, or, increasingly, we have cron jobs and other triggers that get the AI to wake up and do something. But if those things don't happen, they're inert, right? They just kind of sit there until somebody calls their number. Do you have a sense, or a vision, or a dream of what your ideal future AI would be like that's different from that? Does it look more like another person but with AI advantages? Or does it look like an LLM with weaknesses patched? When you dream of a 2030 AI that you're working closely with on a daily basis, what do you envision?
[16:01] Ali Behrouz: There are different aspects to answering this question. From the technical point of view, I think most of the literature in the past 40 years is built on a paradigm that says we have a pre-training, or generally training, phase and we have a test phase. There are a lot of recent studies about continual learning, how we can do it and how we can overcome a lot of the challenges. But the point is that a true continual learner doesn't have a test time and a train time. So if we hear those names in any design choices, it potentially means that it's not a true continual learner, because there is no test time and there is no train time. So the question is, is it a uniform process for the model or not? My personal opinion is that we still need at least two phases. How it works is that we should have one phase in which the model is active, so it actively receives information, whether through the user query, for example, or, for vision models or world models, through anything similar. The point is that the model receives some information, generally performs some computation on the input data, and is active at that point. On the other hand, there is another phase in which the model does not have to wait for input data. It might not receive any input data; it's completely blocked from the world outside of it. But the question is, even at that time, should the model be static, without performing any computation or doing anything? Or does the model need to start thinking, processing the data that it has inside its parameters, and so on and so forth? So I think we can break the process, as I mentioned, into two parts. One is the active phase and another one is another phase.
Potentially we can call it a sleep time, because there are no inputs, but still the artificial brain, or generally the model itself, is trying to perform some computation. I think that's a good way of defining different phases in this direction of continual learning. And then I think a good model is a model that performs very well on both sides. It should receive the information properly, encode it, process it, and understand it in the best way possible. On the other hand, when it goes to the sleep time, it should also start processing what it has learned before and use that for self-improvement. That's what I think an ideal model should do from the technical point of view. But on the other hand, I think there are a lot of challenges. The models that we know right now are very large. Even a simple academic paper that is presenting a new LLM or architecture needs to perform some experiments on models with billions of parameters, one billion, two billion, or something like that. Generally, that's a very large model; it requires a lot of computation if you want to keep that model updated over time, and it needs some techniques to somehow make this process possible. For us, when we were thinking about this direction, that was the time that some ideas about nested learning started. Because if you think about nested learning, we can see that at each time step we don't have to update everything. We need to update just a small subset of all parameters. And so that potentially helps to overcome the challenges about efficiency.
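The efficiency argument, that only a small subset of parameters needs updating at each step, can be made concrete with a rough sketch. The level names, update periods, and parameter counts below are hypothetical numbers picked for illustration, not figures from the Nested Learning paper, and the rule "a level updates when its period divides the step index" is an assumed scheduling scheme.

```python
def levels_to_update(step, periods):
    """Return the names of the levels whose update period divides this step."""
    return [name for name, period in periods.items() if step % period == 0]


# Hypothetical setup: small fast level, large slow level.
periods = {"fast": 1, "medium": 4, "slow": 16}             # steps between updates
param_counts = {"fast": 1_000, "medium": 10_000, "slow": 100_000}

total_steps = 16
update_counts = {name: 0 for name in periods}
for step in range(1, total_steps + 1):
    for name in levels_to_update(step, periods):
        update_counts[name] += 1

# Parameter updates performed with the schedule vs. updating everything each step.
nested_cost = sum(update_counts[n] * param_counts[n] for n in periods)
full_cost = total_steps * sum(param_counts.values())
fraction = nested_cost / full_cost

print(update_counts)
print(f"nested schedule touches {fraction:.1%} of the parameter updates")
```

Under these assumed numbers the slow, large level is touched once in sixteen steps, so most of the per-step work falls on the small fast level.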
[22:35] Ali Behrouz: So if I want to summarize, what I wanted to say is that I think an ideal model should generally be a continual learner. It should interact with the world, and it should have two phases. One is a very active process, and another one is about self-improvement: how it can consolidate its memory, how it can understand the knowledge, how it can connect different things that seem to be irrelevant, and so forth, and everything similar that our brain also usually does when we go to sleep. That's just my personal opinion, and it may be completely wrong. But I think we shouldn't focus too much on what humans can do. Instead, we need to focus more on what humans want from AI. I think, if we think about it, we don't want to create something that is very similar to us. I mean, that's also a really interesting direction, but I personally am not really interested in that. I think we have a great design for humans, but on the AI side, I think we need AI models that are capable of understanding what we want. For example, with the current paradigm of LLMs, you can see that they get better over time as we have more features. For example, ChatGPT has memory now, Claude has some features, Gemini and everything; now they are more capable of understanding what you want. Even if you say, for example, write this email for me, they can simply write that for you. So I think that's a very promising direction in general, because we don't want to replicate humans. There are a lot of different ways that we can define intelligence, and it doesn't have to be the same thing as human intelligence. So I think here again we can get inspired from humans, but we need to understand why we want to get inspired. Do we want to get inspired because we want to replicate what humans can do?
Or do we want to get inspired from nature to understand some underlying rules of nature? For example, one extreme example is that we cannot travel through time. So if I come up with an idea that says my AI model is trying to break causality in the world, then potentially that might be a wrong idea or a wrong direction, because it is breaking some natural rules about our world. Or let me give you another example. When we are talking about the title that we use, LLMs need sleep, it doesn't mean that an LLM needs to literally go to sleep and rest. It means that, from the human brain, it seems that there is a very general rule that it has two phases: one of learning, and then another of processing it, consolidating the memory, and finding underlying patterns in the received data. So that's a pretty high level of inspiration from the brain. In short, I think that we don't want to replicate what humans can do, and we don't want to have human intelligence. On the other hand, we want to have a new form of intelligence that is really well suited, that is designed in a good way to understand human needs, so it can help people do a lot of things that they might face some challenges with without LLMs.
[27:15] Nathan Labenz: Yeah, certainly, they're already superhuman in some ways, and so the opportunity for them to balance out our weaknesses is incredible. One phrase that you said there that I wanted to latch onto is "multiple ways to define intelligence." Maybe I'll just give you my high-level pitch for what nested learning is. In the paper, you guys place a big emphasis on showing certain equivalences, where the way that we're doing things today is sort of a special case of a more general framework that you're developing. The core idea, as I see it, in the nested learning paradigm, and what I think is potentially most exciting about it for a simple person like myself, is this. For quite some time now, we have achieved greater and greater expressivity of models by stacking more and more layers and just making them bigger. That has worked remarkably well. We've been able to push that paradigm incredibly far. It's kind of crazy, though, right, that we just have this one layer stacked over and over and over again, 80 or 120 layers, or however many layers, and that's kind of it. It feels like the more mature solution should be somehow more elaborate than that, right? But that's been the way that we've achieved this expressivity, or sometimes the term computational depth is thrown around. What Nested Learning is doing is bringing a different way to the table to achieve higher levels of expressivity, or higher levels of computational depth, and that is by stacking not layers, but levels. What differentiates a level from a layer is that a layer is like the same thing, or they can alternate; obviously we have these interleaved architectures too. But these are things that are in sequence, as information passes from one layer to the next.
A forward pass is: pass through all the layers one by one and get to the end, and that's kind of the thing. What the levels paradigm brings that's different is that different levels can have different update frequencies. With that, you now have the possibility for some parts of the overall system to be much more durable and some parts to be updating much more radically in something much closer to real time. That obviously feels much more aligned with what we are, right? We're not one static thing that processes information in a fully independent way each time. Our state at any given time is very much contingent on what we just experienced, but only to a degree, right? My mood, or what is currently on my mind, is a reflection of what happened earlier today. But my big-picture views about the world didn't change from this morning until now. So there's clearly some sort of hierarchy of different kinds of beliefs, different kinds of representations, different kinds of circuits that we have, which are updated in some cases very quickly and in other cases very slowly. And obviously they are integrated together and work together. We just haven't seen that in machine learning, except maybe in a few very far-flung experimental cases. And now you're really starting to show, with this nested learning paradigm, that not only can you make it work, but, as we'll get into with the results, you can make it work in a way that is competitive with Transformers and even seems to have some of these new, sometimes called micro-skill, advantages, where you can see with certain early diagnostics that this can do something that's qualitatively different from what a Transformer can do, even as it also outperforms it a bit in terms of general perplexity-type scoring.
So how would you react to that general, high-level summary? And I also would be really interested to get your take on what this concept of computational depth or expressivity is. I'm tempted in some ways to make an analogy to the g factor. People talk about the G in AGI, which is the generality. There's also g in the context of human IQ, the g factor, the sort of intangible something that captures how capable you are across a very wide range of things. Again, that's getting at generality. Maybe in machine learning it's as simple as saying g is sort of loss, or maybe there's some fundamental equivalence there, but maybe not; I don't really know. Clearly we're getting at something where we've seen huge progress, but what that something is is another thing I really would love to get your intuition on.
[32:05] Ali Behrouz: We were working on Nested Learning for a very, very long time, potentially even more than a year and a half. The main issue that I personally had, and that we discussed a lot in the group of authors of the work, was that it was really, really hard to formalize what we wanted to deliver. Even in the mathematical formulation, it was really hard to write it in a formal way and say exactly what we want to do in this paradigm. After some back-and-forth discussions, I think we came up with this specific framework, saying that there is a frequency of updates that somehow gives time to each module to do something while it's waiting. I think that's the best way of describing why we need to have multiple frequencies. On the other hand, we need to have a method for knowledge transfer, to somehow transfer the knowledge from one level to another level. When the slow network is waiting for all the computation and information processing of the fast network, there should be some advantage for the slow network, since we are paying all these costs for doing some computation by the fast network. That advantage comes when we have a good way of transferring knowledge between the fast network and the slow one, from one level to the next. So the idea is that when we have a network that is updating very fast, I can use that fast computation to give something to the slow computation side. The slow network, or the slower level, can focus on the more high-level knowledge abstraction of the data, and the more frequently updated network can focus on fast adaptation and on processing the high-resolution data.
So generally that was the way that we could describe this framework, saying that I think these two sides need to be there to complement each other. One is the frequency of update and another one is the knowledge transfer between the levels.
[35:04] Nathan Labenz: From.
[35:05] Ali Behrouz: So for the second question, in my opinion the current models are very efficient from one specific point of view. Why are they efficient? Because you can see that, for what we can get from LLMs, they are very cheap. For example, if you want to match their power in some of the tasks with a human, then potentially the cost would be very, very different. So from that perspective the current LLM paradigm is very efficient, and we can add computation: we can perform more computation per parameter, or per artificial neuron, that we have in our model, and it can help us with different things. One is that we can have a smarter model. It's a very subjective term to describe this. But when we have more computation, it seems that we are performing some internal thinking. So a simple example is this: let's say that I have a simple LM structure based on Transformers. When the token comes, I will do some computation on the token, pass it through all the layers, and then predict the next token. And that's how it works. But now let's assume that for this specific token, instead of just a simple pass of computation, I also perform more internal computation with respect to the past data or, generally,
[37:04] Nathan Labenz: You know.
[37:05] Ali Behrouz: combine or mix the data, and something like that. In that case, the quality of next-token prediction for this specific design choice can potentially go up. And why is that? Because it can be interpreted as a form of internal thinking process. So now each specific parameter that I have in my model is performing several computations. It's not just one parameter, one step of computation; it's one parameter, a couple of steps of computation. So that's one advantage. Another one is about the memory perspective and the adaptation of these models. When we have a model that is adapting to the context very fast, then potentially that model can learn in context. That's one of the main messages that we try to deliver in Nested Learning, which was that everything that we know of is somehow a form of in-context learning. Generally, I think it's a great thing in human language that we create new words. But on the other hand, if we just create a lot of words for the same concept, it can make us confused, or it can be misleading somehow. So I think we should create new words to differentiate different concepts. But if we have one specific concept, we need to stick to the one specific word that we have for that concept, to avoid misleading anyone. So from that perspective, we realized that we can say everything is just a form of in-context learning. We already know what in-context learning is. Now we just need to understand how what we have already been doing is a form of in-context learning. And so that was the part where we started showing that, for example, backpropagation is a form of in-context learning, a form of associative memory. And when it's a form of associative memory, then we can say that the general pre-training phase of the model is a form of in-context learning.
Or when we go to, for example, the context of attention or RNNs and so on and so forth, again, it's a form of in-context learning. We can define any RNN based on gradient descent or another form of optimization process, and when we can do that, it means that we are doing some learning on the context that is happening right now. So I think these two are the main things that Nested Learning is trying to address. One is generally more computation per neuron, and the other one is about adaptability and continual learning.
[40:24] Nathan Labenz: Can you just describe in more specific detail, like what are the relative sizes of the levels? What are the structures of the levels? What are the context windows or lengths of the different levels? What are the frequencies? Just like map the thing out for us in kind of very black and white terms.
[40:42] Ali Behrouz: Yeah. So we start from the Transformer structure. In the Transformer we have an attention block and then an MLP block. What is happening there is that in pre-training we have different contexts that attention attends to. The attention side is trying to combine all the tokens, and each token attends to all the tokens before it in the context, and so on and so forth. And then there is an MLP block, and that MLP block is responsible for long-term memory. Now when the model is pre-trained, the MLP block is fixed; it's not changing anymore. So it has all the information compressed during pre-training, and then we have attention. So at inference time, attention is responsible for the context that it is getting, and the MLP block is responsible for the long-term memory and the very general knowledge of the world. Now let's just simply extend this idea. A simple extension is that we keep attention, and then instead of just one MLP block, we have multiple MLP blocks, each of them updated with a different frequency. Why is that helpful? When you have attention, you have a fast adaptation to the context. Attention is very powerful. It's like a perfect memory; it caches everything. So it's great. On the other hand, you might want to have multiple levels of memory, and that's the part we define as the continuum memory system. So instead of one block of MLP, we have multiple blocks of MLP. Now, your first MLP block is updated very fast. What would happen in that case? The fast updating process of this MLP block can cause catastrophic forgetting, because this MLP block can simply forget the information that it got, for example, a couple of thousand tokens ago. But the point is, since the other MLP blocks have not been updated so far, the knowledge that is forgotten by the first MLP block is still in their parameters.
So when we perform backpropagation through all these layers, the knowledge can come back. So it helps us to have a loop process in time. The first MLP block can forget something. But the point is, if that specific data sample or skill that is forgotten is important, then it can come back from the other MLP blocks, which have not been updated so far and still have the knowledge about that specific skill or data sample. So that's a very simple way of extending the Transformer block. And so we call this variant Hope attention. It's a combination of attention plus multiple MLP blocks. Now we have another variant, which is the actual Hope architecture. What we are saying is that attention, as I mentioned, is a perfect memory. It can cache everything, it can scale, it's fast, and so on and so forth. It's great. But the point is, still, the update of attention has infinite frequency.
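The multi-frequency idea described above can be illustrated with a toy sketch. This is not code from the Nested Learning paper: each memory level here is just a scalar running estimate standing in for an MLP block, and the names `Level`, `period`, and `lr`, the update rule, and the numbers are all assumptions made for illustration. The fast level updates on every token and quickly overwrites what it saw earlier, while the slow level updates once per 512-token chunk and still holds a trace of the earlier data.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    period: int            # tokens between updates (hypothetical value)
    lr: float = 0.5        # step size toward the chunk mean (hypothetical)
    state: float = 0.0     # scalar stand-in for the block's weights
    chunk: list = field(default_factory=list)

    def observe(self, x: float) -> None:
        # Buffer the token; update only when a full chunk has been seen.
        self.chunk.append(x)
        if len(self.chunk) == self.period:
            mean = sum(self.chunk) / len(self.chunk)
            self.state += self.lr * (mean - self.state)
            self.chunk.clear()


def run_stream(stream, levels):
    # Feed every token to every level; each level decides when to update.
    for x in stream:
        for level in levels:
            level.observe(x)
    return [level.state for level in levels]


# 512 tokens of "old" data (1.0) followed by 64 tokens of "new" data (0.0).
stream = [1.0] * 512 + [0.0] * 64
fast, slow = Level(period=1), Level(period=512)
fast_state, slow_state = run_stream(stream, [fast, slow])
print(fast_state, slow_state)
```

In this toy, the fast level ends up tracking only the most recent data, while the slow level still reflects the earlier chunk. How knowledge flows back between levels in the real architecture is not modeled here.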
[45:18] Ali Behrouz: What does it mean? It means that attention doesn't know anything about the temporal dependency of the tokens, and it needs something like positional encoding. And even with the help of positional encoding, attention is not a great model for tasks that require sequential reasoning or something like that. What was our idea? Our idea was to replace attention with another associative memory that maps keys to values. That's what attention does, and now we want to replace it with another module that tries to map keys to values. One potential architecture here is Titans. So we can simply swap in Titans and have Titans plus the continuum memory system. That's a simple idea. But the point is, in the initial sections of the paper we discussed that if you have a simple linear process of updates, which is what is happening inside each chunk of a Titans update, this process can be richer in the case where we have a self-referential process. So what is a self-referential process? Gradient descent, or generally backpropagation, is a form of self-referential process. The idea there is that we want to learn how to learn, and how to learn how to learn, and so on. There are a lot of levels of how to learn how to learn. And computationally it's infeasible to implement all those levels. So there is one idea, by Schmidhuber et al. Basically they have this idea of a self-referential model, and one of the ways that we can make a model self-referential, when we have a key-value memory, is the case where the model generates its own values. So what is happening there? Let's say that there is one specific event happening and you want to memorize it. Let's say you want to memorize one specific word.
Now, in our brain we have associative memory, and we are trying to map this specific word to another concept that we already know so we can memorize it. You have definitely seen cases where, for example, you say, I want to map this specific word to another word that I already know, so I can remember it. We generate the value that we want to map our keys to, so we can memorize the key as well. The self-referential process is a very general concept, but in the specific design choice that we have in the paper, the value of the associative memory is generated by its own parameters. So the model itself generates its own values and then tries to map keys to values. So potentially this process is fully sequential. You cannot parallelize it in a simple way. And so it has a full understanding of the causality in our data. So in tasks that require sequential thinking, sequential reasoning, or anything like that, we can expect that this model should work better than simple attention, because simple attention doesn't have the ability to sequentially understand the causality of the data. And so that was our idea. In the Hope architecture we replace Titans with self-modifying Titans, which is exactly a Titans module where it generates its own value for the associative...
[49:53] Nathan Labenz: Memory.
[49:54] Ali Behrouz: And so that's the Hope architecture: it's self-modifying Titans plus the continuum memory system.
[50:02] Nathan Labenz: I think I want to spend one more beat on what you mean when you say generating its own value, because I know the Transformer architecture pretty well. We've got these K, Q, and V vectors, right? And the training process modifies all of those over time. The general heuristic that I have is that for each token, there's the query vector that indicates what this token is looking for. There's the key vector that helps indicate what other tokens have to offer, and aligns those and finds where there's a match, basically where there's relevance. And then the third one, the value, brings up the concepts that then get fed on into the downstream layers, to make sure we have the right activations for continued processing from there. But all of those are learned, right? All of K, Q, and V are learned. So I'm not entirely clear on what you mean when you say that the model learns its own values, because doesn't the Transformer sort of learn its own value vector as well? I'm not quite clear on the distinction that you're making there. Let's assume people are at least generally familiar with the Transformer and know how that goes.
[51:32] Ali Behrouz: So generally in the Transformer, or more accurately softmax attention, what is happening is that we have the Q, K, V projections and then the output goes to attention. So attention doesn't have any control over the Q, K, V projections. But when I'm saying that the model, or generally the associative memory, tries to generate its own value and then map keys to that value, I mean something like gradient descent. If we just recall gradient descent, we have the previous state of the weights, which is W_t, minus the gradient of the loss function, and this is equal to the next state of the weights. If you look at this process, you can break the gradient using the chain rule and write it as the gradient with respect to the output times the input data. And now you can see that this takes the form of an associative memory very similar to linear attention, because it's W_{t+1} equals the previous state of the weights minus a term built from k and v, where k here is x_t and v is the gradient with respect to the output. So this process is very similar to linear attention, or any linear recurrent model. But the very interesting part is that it is different from linear attention, because if you look at the value, which is the gradient with respect to the output, it is a function of W_t, a function of the current state of the weights. So of the keys and values that we have in the associative memory, the value component does not come from another component before this recurrent formula; it is generated by the recurrent process itself every time. That's what is going on with this self-referential process. So let's take a very simple version of self-modifying Titans. If we want to design this as a simple Titans module, what is happening is that I have this x and then the Q, K, V projections.
I project it into Q, K, and V and then pass all of them to the Titans module. So that's a simple Titans module. But if I want to have a self-modifying Titans, then all of these parameters of the Q, K, and V projections are optimized inside the Titans module. So basically the model has control over modifying its own update rule and generating its own value for the memory. That's the main difference between these two.
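The gradient-descent argument above can be written out for the simplest case. The following is a sketch under stated assumptions, not the paper's notation: a single linear layer y_t = W_t x_t, a generic loss L on the output, a step size η, and (for the worked example) a squared loss with a target u_t. The symbols η, u_t, and M_t are introduced here for illustration only.

```latex
% Assumption: one linear layer y_t = W_t x_t, loss L(y_t), step size \eta.
\begin{aligned}
W_{t+1} &= W_t - \eta\, \nabla_{W} L(W_t; x_t)
         = W_t - \eta\, \big(\nabla_{y_t} L(W_t; x_t)\big)\, x_t^{\top}
         && \text{(chain rule)} \\
M_{t+1} &= M_t + v_t\, k_t^{\top}
         && \text{(linear attention memory)} \\
k_t &= x_t, \qquad v_t = -\eta\, \nabla_{y_t} L(W_t; x_t)
         && \text{(matching the two forms)}
\end{aligned}

% Worked example with squared loss L = \tfrac{1}{2}\lVert W_t x_t - u_t \rVert^2:
\nabla_{y_t} L = W_t x_t - u_t
\quad\Longrightarrow\quad
W_{t+1} = W_t - \eta\, (W_t x_t - u_t)\, x_t^{\top}
```

The difference from plain linear attention shows up in the last line: the value term depends on the current weights W_t, so it is produced by the recurrence itself rather than by a separate projection computed beforehand.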
[55:06] Nathan Labenz: Yeah, I think the key phrase maybe for people to latch onto there, starting with myself, is modifying its own update rule. And this is definitely a theme in general, like with the Mamba architecture as well. The authors of the Mamba architecture had done a bunch of previous state space model work, and the big unlock with Mamba specifically was that the way in which the state is going to be updated at each time step now became a function of the inputs. So that increased potential for expressivity, from making the actual update of the state itself dependent on the input that it's receiving at that time, unlocked a lot better performance. And there's something very similar going on here, it seems, where you're saying we don't want to calculate the value vector too early. Basically we want that to come a little bit later and be more input dependent, more history dependent, than it has traditionally been in softmax attention. Is that a good intuition, or is there something still missing from that intuition?
[56:35] Ali Behrouz: I think it's a perfect intuition. Generally, this projection of the value is also updated inside the module. That's a very important point because it helps with adaptability. Generally the model itself is also very adaptive to the context, so from every token that comes, the model is trying to learn something. And the way that it generates the value term for the associative memory is exactly the same as the way it updates its memory. So generating the value is also a very adaptive process.
[57:16] Nathan Labenz: So can you take us through a single time step? Maybe we can do this with the attention Hope and then the Titans Hope, and just highlight the little difference there, but let's also zoom out to the big picture. So we've now got a new fundamental block, right? If we do the attention Hope thing first, we've got an attention mechanism and then we've got multiple MLPs arranged in sequence from fastest to slowest update frequency. And then that block gets stacked into layers, correct? I'm a little bit confused, to be honest, on the sort of 'there's no training/test distinction' point, because there is still some training process where you're just taking a bunch of data and running it through the thing, right? So maybe from the model's perspective there's not so much of a distinction, but from the researcher's perspective, you are still sitting there running a process that takes a bunch of data and has the model learn from that, which is a bulk process. It's not like a user is engaging; there's nothing outside of that process happening at that time, right? So how is it different? When I do this for a Transformer, I do have some of these really nice parallelization benefits. I am interested to come back to understand to what degree the current hardware paradigm plays nicely with some of the stuff you've got here, and to what degree the fundamental recurrence may present challenges, but bracket that for a second. Today I can run a bunch of tokens through the thing in parallel. We can do this for batches. We can accumulate all these gradients and then we apply the gradients, and then we have the next time step and we keep doing that. And I have a pretty good intuition for how information flows.
I can visualize the forward pass in my mind, and then I can visualize the backward pass of backpropagation going through and gradually updating all the weights. How does the procedure with the new architecture vary? What are the core things that are different from the paradigm that we're more used to?
[59:31] Ali Behrouz: I think generally the main difference comes from the update side that I mentioned. For Hope attention, I think it's very similar to the current paradigm, and even the architecture is very similar to Transformers. It's the actual Transformer architecture; we just replace the MLP block with multiple MLP blocks. When we want to do inference, the main difference comes from the fact that for each of the MLP blocks, we need to track where we are. Is it time to update this MLP block, or is it still waiting to get updated? If it's the latter case, then we use the last updated state of that specific MLP for doing the inference. If it is time to get updated, then we first update it through backpropagation over all of the tokens that we have seen so far in the current chunk. And then, when the weights are updated, we perform the inference. From the research point of view, it might be a little bit hard to remove this distinction. I think even from the research point of view, it might be better to say that we have evaluation time and non-evaluation time, because the model always gets updated over time. There is no training time and test time. But the point is that for one specific period of time we don't do any evaluation, we wait for some time, and after that we start evaluating the model on the different downstream tasks that we have. For anything else, I think from the model's perspective, it doesn't know whether it is in test time or train time, because everything is the same and it's a very uniform process. But from our side, it is definitely important whether or not we want to evaluate the model and measure its accuracy for a specific task and so on and so forth.
And so, yeah, that's generally for the Hope Transformer, or Hope attention, model that I mentioned. But when we go to the actual Hope architecture, again, everything is very similar. The only difference is that the attention is replaced by self-modifying Titans, and for self-modifying Titans, again, everything is very similar to Titans. So all of the differences are just inside the model design. From the higher-level perspective the inference is very similar: the context or document goes to the self-modifying Titans, for each token we have one output, and then it goes to the MLP blocks; we have multiple MLP blocks and so on and so forth. So everything is very similar to the current paradigm.
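The inference-time bookkeeping described above (check whether a block is due, update it from the current chunk if so, then run inference with the latest state) can be sketched as a loop. This is a structural illustration only, not the authors' implementation: `MemoryBlock`, `period`, `update`, and `forward` are hypothetical names, and the actual weight update and MLP computation are replaced by placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBlock:
    period: int                                  # tokens per chunk (hypothetical)
    buffer: list = field(default_factory=list)   # tokens seen in the current chunk
    num_updates: int = 0

    def update(self) -> None:
        # Placeholder for the real update (backpropagation over the chunk).
        self.num_updates += 1
        self.buffer.clear()

    def forward(self, h):
        # Placeholder for the MLP: inference uses the last updated state.
        return h


def run_inference(tokens, blocks):
    outputs = []
    for t, tok in enumerate(tokens, start=1):
        h = tok
        for block in blocks:
            block.buffer.append(tok)
            if t % block.period == 0:   # chunk boundary: update first...
                block.update()
            h = block.forward(h)        # ...then infer with the current state
        outputs.append(h)
    return outputs


blocks = [MemoryBlock(period=2), MemoryBlock(period=4)]
outputs = run_inference(list(range(8)), blocks)
print([b.num_updates for b in blocks])
```

The loop has no separate training and test branches: every token goes through the same path, and the only per-block state is where it stands within its current chunk.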
[1:02:59] Nathan Labenz: Did I have it right that this core block, of either the traditional attention or the self-modifying Titans module plus the MLPs, then becomes the block that gets stacked into layers? Is that right?
[1:03:15] Ali Behrouz: Yes, that's a design choice. We can have different design choices. The initial and main design of Hope is the case where, for example for Hope attention, we have attention and then multiple MLP blocks, each of them updated with a different frequency. But for some of the tasks we needed to use a pre-trained model, and if we want to focus on, for example, Llama, then Llama is not designed with the Hope architecture. So what we have done there, instead of going with the formulation that I mentioned, attention then multiple MLP blocks, is that we say this is attention and an MLP block, then attention and another MLP block with a different frequency, and then attention and another MLP block with a different frequency, and so on and so forth. So it's somewhat of a design choice; you need to see which one you prefer. Do you want to use an existing pre-trained model, or do you want to start from scratch and train your own designed architecture? But I think both of them are relatively similar. They don't make fundamental changes, yeah.
[1:04:44] Nathan Labenz: Interesting. It's a great reminder of the old Ilya maxim that the models just want to learn. It's always striking to me how many of these choices end up kind of, yeah, I could go one way or the other. That was true in the Titans case, where you had three different ways of working the memory module into the larger architecture. It's definitely been true with Mamba in many ways, where you could have multiple states, you can have them in sequence, you can have them in parallel. Once you have one of these block concepts that seems to work well, you can kind of Lego-piece it in a lot of different ways. And yes, there will probably be some performance differences between different ways to arrange the blocks. But more often than not, it seems like if you're really talking about a serious conceptual advance, you find that the exact wiring diagram is less important, and more important is the core piece that you're adding to the set of Lego pieces, so to speak, that you can use. So, yeah, that's a good reminder of that. How do you think about the relationship between the different MLPs in terms of size, in terms of update frequency, maybe in terms of learning rate? It feels like learning rate might be kind of important here, where there's potentially an equivalence. If I update one MLP every token and then I update another one with a larger batch size, I feel like I can make those much more similar or quite a bit different depending on what learning rate I apply. But yeah, so size and frequency, in tangible terms: are the ones that are updated less frequently also smaller, and do they have different learning rates than the other ones? Take us through how you think about the relationships between the MLPs of different frequencies.
[1:06:45] Ali Behrouz: Yeah, I think it really depends on the architecture, the number of parameters, and the design choices that you have. It's really similar to how we cannot say what the best dimension is for Transformers or for an attention block. It's really hard. It really depends on the person that wants to work with that attention, the use cases that we want to consider, and so on and so forth. So generally the frequency of updates really depends on how adaptive you want your model to be and, for example, how you want the model to maintain its persistent memory. So I think that's pretty much a design choice. But about learning rates, I think in general you can treat each of these blocks the same way as you do MLP blocks. They're exactly the same thing; the point is they have different frequencies of update. So nothing has changed. Everything is exactly the same as with MLP blocks. I think it would be very interesting to see how the learning rate can affect each of these blocks, but I have not done that, and I'm not sure about the exact solution. My expectation is that potentially any way that we used to do hyperparameter tuning, we can do the same thing here as well. So there shouldn't be any differences.
[1:08:26] Nathan Labenz: Interesting. And so your general default is kind of an intuition-based approach. And do I have it right that it's basically order of magnitude, like the fastest one updates every token, the next one updates every 10 tokens, the next one every 100, or maybe it's a couple of orders of magnitude? Understanding that it's not yet a fully tuned system, what did you start with, and why did you pick those things as your initial guess?
[1:08:56] Ali Behrouz: Yeah. The way that we chose the frequency of update for each of them was based on our intuition about the chunk size that we use for Titans and other models. Because generally the chunk size in Titans is what defines the frequency of Titans as well. At that time, we didn't have this term, frequency, but that chunk size can define the frequency for Titans. So what we used was based on our intuition about what chunk sizes are good for Titans. As far as I remember, I think the numbers that we used were possibly 128, then 4 × 128, and then 4 × 4 × 128. So it was something like that, as far as I remember.
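To make the arithmetic concrete, here is a small sketch that takes the chunk sizes as recalled above (128, 4 × 128, 4 × 4 × 128) and counts how often each level would update over a stretch of tokens. The helper names and the 8,192-token example length are assumptions for illustration, and the base and factor are the speaker's recollection rather than confirmed settings.

```python
def update_periods(base=128, factor=4, num_levels=3):
    """Chunk size (tokens per update) for each level, fastest first."""
    return [base * factor**i for i in range(num_levels)]


def updates_within(num_tokens, periods):
    """Number of completed updates per level after num_tokens tokens."""
    return [num_tokens // p for p in periods]


periods = update_periods()
counts = updates_within(8192, periods)
print(periods, counts)
```

Under these assumptions, each level updates four times less often than the one before it.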
[1:10:02] Nathan Labenz: And how about knowledge transfer? How should we think about the way in which... and I guess one quick interjection question there too: there are still, like, skip connections and everything?
[1:10:15] Ali Behrouz: Everything is similar. Yes, exactly. One thing that we tried to do in Nested Learning, and unfortunately it caused some misunderstanding about what we are doing, and I have seen some comments that, for example, some of the concepts here are already known, and something like that. But the point is that we actually tried to include all those concepts that we already knew, to show that it is a universal learning paradigm. It's not something that contradicts our current understanding. It complements and completes what we already know, in a new direction. For example, when you are doing any form of deep learning and you are saying that you are using this attention here, you are actually using nested learning. But in deep learning you only see the final solution of each learning problem. So you have a learning problem inside the attention, you are trying to solve a regression problem, and the non-parametric solution to that regression problem is attention. When you see everything from the deep learning side, you can only see the final solution for each component. But when you see everything from the nested learning side, you can see the internal learning process of each component as well. So in general, it's not something that contradicts what we already know; it complements all of the things that we knew and goes beyond that. I think that's generally an important part. So yeah, everything can be very similar. You can have a skip connection.
[1:12:08] Nathan Labenz: Yeah. So then help us understand how we should think about the roles that the different frequency ML PS are playing knowledge transfers. One way to think about that or another way to frame the question might be how do they complement one another? How do they work together? How does the one that, and I, I would be interested to understand this both intuitively and like mechanistically to the degree you have like mechanistic understanding. But you know, how does the one that's updating fast gradually inform the ones that are updating slow? How do the ones that are updating slow kind of steer the ones that are being fast in the right directions? How how would, how do you think about the interplay between those different components? Yeah, I, I.
[1:12:50] Ali Behrouz: Think generally knowledge that coming up with different ways of knowledge transfer is is very important here in my opinion. So the main point of the frequency having frequency for each component is 2 parts. I I think we also like discussed it in the beginning. And the first part is that it helps the model to maintain its memory for the longer time time period. For example, you know, that's just a simple example. Let's say that we have a tune and then one of them goes to a spaceship and you know, just move with the the speed of light. And then right before that they have a very good memory. For example, they went to, you know, did they have launch or something like that? I don't know. They have very good memory and then the person that is like moving with the speed of light when they come back 80 years has passed in in like on errors. And then their sibling somehow forgets about that specific launch they had because it was 80 years ago. But on the other hand, that specific person has all the information, all, you know, all, all the, all of the details are just one seconds ago, 2 seconds ago. And so they, they remember everything about that launch. So why it's happening? Because of the updates in the memory of each of them. The person that lived 80 years, their memory got updated a lot of times, but the person that lived just, you know, just move with the, a speed of light, we're close to it now. Their memory didn't get updated that much. So from this perspective, we can see that, that the number of times that we make some updates to the memory is very important. And you know, it's a very inaccurate example that I mentioned. But I, I think the main point is clear here. When we have like 2 components, one of them is updated a lot of times while the other one is a slower 1 and is not updated that much. 
What is happening there is that the slow one has the opportunity to learn something from the fast one, because at the time the fast one gets updated, the slow one has not received any update. So there's a chance for the fast one to teach something to the slow one before it gets updated. I think that's the part where we can also discuss the sleep process that we have. What is the idea there? When we have multiple levels of MLP blocks, each of them is updated with a different frequency. And by fast, I mean it's a relative term: even with multiple levels, for each two consecutive blocks that we consider, there's a fast one and a slow one. When we want to update the fast one, we know there's a chance that we forget something. So before forgetting, we can transfer the knowledge of this block to the next one and then update this one. That's the part where we need a good way of knowledge transfer. For example, one way is to do context distillation. If you want to pass the knowledge from one MLP block to another, some methods of context distillation can work very well. That's very similar to what we do in the sleep process as well, in the sleep paper. So I think the main role of knowledge transfer is to help the slow network take advantage of the fast network. The other point about having different frequencies is about memory and how the model can manage its memory, similar to the example of the twins that I mentioned.
[1:17:28] Nathan Labenz: What is the mechanism by which the information in the fast-update layer gets moved to the slower layer? But then there's also got to be something going the other way too, right? Say you conceive of the fast layers as pure perception, in a sense. I'm not even entirely clear on what's happening in my own brain, but it feels to me like there's more information flow from my perception modules to my higher-order reasoning modules than there is from the reasoning back to the perception. But that signal is still important, right? My higher-order processes do tell my eyes where to look and where to focus, and they do say, "Hey, we need to zero in on this detail a little bit. I want to understand that better, so go put some of your bandwidth into this particular thing." So I guess say a little more on how, mechanistically or procedurally, the information in the fastest-update layers is getting transferred, and how we're making sure that we're storing what really matters from what the fast updates have learned. But then also, what is the signal that flows the other way?
[1:18:45] Ali Behrouz: Let me answer that with a very simple example. Let's say that I have model A and I want to update its fast MLP block, and I want to make sure that the information in the fast MLP block is not forgotten and can pass through to the slow MLP block. One simple thing that I can do is to just copy all of the parameters of model A to model B. Now I have two identical models, model A and model B. What I do for model B is update the fast MLP block. Now the parameters in the fast MLP block of model B are free. What I want to do next is change the parameters of the slow MLP in model B in such a way that the output of model B can mimic the output of model A. What is the difference between model A and model B? Model A has all the information compressed in the fast MLP block, while all that information is gone in model B. So if I can somehow modify model B so that it mimics model A, it means that I have transferred the knowledge in the fast MLP to the parameters of the slow MLP in model B. That's just one simple thing, and this process is very similar to distillation. We are distilling the knowledge of model A into model B. So that is one way of transferring knowledge from the fast MLP to the slow MLP. Another example, which is very common and popular, is backpropagation. If you just sequentially connect your MLP blocks and then at some point perform backpropagation, you can transfer the knowledge of one block to another, and so on and so forth.
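The copy-and-mimic procedure described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: each "MLP block" is reduced to a single scalar weight, the names `forward` and `consolidate` are hypothetical, and "freeing" the fast block is assumed to mean resetting it to zero.

```python
import copy

def forward(model, x):
    # Toy "network": the fast and slow blocks contribute additively.
    return (model["fast"] + model["slow"]) * x

def consolidate(model_a, inputs, lr=0.1, epochs=50):
    """Distill model A into a copy B whose fast block has been cleared."""
    model_b = copy.deepcopy(model_a)
    model_b["fast"] = 0.0  # assumption: a freed fast block is reset to zero
    for _ in range(epochs):
        for x in inputs:
            # Model A is the teacher; only B's slow weight is trained.
            error = forward(model_b, x) - forward(model_a, x)
            # Gradient of 0.5 * error**2 with respect to B's slow weight.
            model_b["slow"] -= lr * error * x
    return model_b

model_a = {"fast": 2.0, "slow": 1.0}  # fast block holds recent knowledge
model_b = consolidate(model_a, inputs=[-1.0, 0.5, 1.0, 2.0])
```

After training, model B reproduces model A's outputs even though its fast block is empty, because the slow weight has absorbed what the fast weight used to contribute.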
[1:21:19] Nathan Labenz: So the sort of copying and distillation process you described, that's essentially what's going on in the "Language Models Need Sleep" paper?
[1:21:30] Ali Behrouz: Yes, with some additional detail. For example, what we do there is also add additional parameters to model B, to make sure that it has enough capacity to store the new knowledge that it has just gotten.
[1:21:47] Nathan Labenz: Does that also mean that in the Nested Learning version of this, you're really just letting backpropagation do its thing, and you haven't really engineered it all that much? We just have these MLP blocks, they get updated at different frequencies, and you're just letting gradient descent do its thing and the updates are just working. That's basically it?
[1:22:12] Ali Behrouz: Yes, exactly. In the whole picture, everything is just backpropagation.
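To make "blocks updated at different frequencies by plain gradient descent" concrete, here is a minimal sketch. It is a toy under stated assumptions, not the Hope architecture: each block is a single scalar weight, the fast weight is updated every step, and the slow weight accumulates gradients and is updated once every `slow_period` steps. The function name `train` and all hyperparameters are illustrative.

```python
def train(num_steps=40, slow_period=4, lr=0.1):
    """Fit y = 3 * x with a fast and a slow weight updated at different rates."""
    w_fast, w_slow = 0.0, 0.0
    slow_grad_sum = 0.0
    updates = {"fast": 0, "slow": 0}
    xs = [1.0, 2.0, 0.5, 1.5]  # fixed input stream, cycled deterministically
    for step in range(1, num_steps + 1):
        x = xs[(step - 1) % len(xs)]
        y = 3.0 * x
        pred = (w_fast + w_slow) * x
        # Gradient of 0.5 * (pred - y)**2; it is the same for both weights.
        grad = (pred - y) * x
        # Fast block: updated at every step.
        w_fast -= lr * grad
        updates["fast"] += 1
        # Slow block: accumulates gradients, updated once per period.
        slow_grad_sum += grad
        if step % slow_period == 0:
            w_slow -= lr * slow_grad_sum / slow_period
            slow_grad_sum = 0.0
            updates["slow"] += 1
    return w_fast, w_slow, updates

w_fast, w_slow, updates = train()
```

Both weights are trained by the same gradient signal; the only difference is how often each one is allowed to change, which is the ingredient the conversation singles out.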
[1:22:18] Nathan Labenz: A huge takeaway from the conversation that didn't come through to me as clearly in the paper is that this is really proof-of-concept stage stuff. And the fact that it works so well shows what a good concept it is. But nothing that we're discussing here has been through the same kind of process that the mainline models have been through, where everything has been hyperparameter-explored to the nth degree and optimized, with all the little refinements that have been made over time. That hasn't really happened here. So there are a lot of questions that we still could ask about this version, that version, this configuration, this arrangement, sequential, parallel, how many layers, relative sizes, relative learning rates. There's a ton of space there still to explore. But basically just taking a few of these core concepts, the main one being the different frequency of updates for different MLP blocks, that alone creates some pretty impressive results that are qualitatively different from what we are used to seeing. So maybe let's take a minute and just talk about some of the results. There are a lot of different tests run in the paper and big tables of results with a whole bunch of different metrics, some of which are your classic perplexity scoring type of stuff. What do you think are the most important, revealing results, where people should say, "Because I see that it can do that, I know that there's really something here that I need to grapple with"?
[1:23:56] Ali Behrouz: Yeah. We have one continual-learning-style task that I personally really like. The idea is that we have a pre-trained model, and there is one specific language that the model has not seen before. We want to teach the model that specific language in context. We have all the grammar, and there's a dictionary of words, and we pass all of them to the model in context, and the model learns the language. Then we ask, "Can you translate this specific text from that language to English?" And we can see that the model can translate that specific text, not perfectly, but with very, very good quality. So it seems that the model is capable of understanding that language in context and then using that for a translation task. But let's go one step beyond that. Instead of one language, let's put two languages in context and then ask the model to translate different texts from each of these languages to English. In that case, we can see that the model almost collapses and cannot translate either of those languages. The point is that the model cannot handle its context well and fully understand each of the languages separately. That is generally a very, very hard challenge for Transformer-based architectures. But when we change that architecture to Hope or Hope-Attention, again we have attention, but we have multiple levels of in-context learning, multiple levels of MLP blocks. And one thing that we can see is that when we increase the number of levels, the performance of the model in both of these languages gets better and better. Why is that happening?
Because the model has a better way of managing memory. It understands that, for example, temporary knowledge that is not needed very much can be stored in the fast MLP block, and more of the understanding of the language can pass to the later MLP blocks, which are more stable. And when we have more and more blocks, we can see that the performance gets better and better. So in my opinion that's a good evaluation for understanding whether the model can learn in context and, more generally, do continual learning.
[1:27:24] Nathan Labenz: Yeah, I think that's the same one. It's been a while since I thought about this, but I think it was with maybe Gemini 2, or it may have been even as far back as Gemini 1. There was this metric introduced of learning a new language from basically one book. There's some critically endangered language, one person has really studied it and made a book that's not on the internet anywhere that explains what this language is, and then they just put that book into context and say, "OK, based on this, go ahead and do translation." I don't know if this is the exact same test as the one that I was previously familiar with or if it's a bit different, but it seems the language here is Manchu. I just looked it up. It's a critically endangered language from somewhere in China. So that's basically the idea, right? It's a language that the language models have basically no prior knowledge of. They're given a very detailed primer on this language from some anthropologist or whoever has gone out and done the field work, and then their job is to apply that. So I'm looking at Figure 8 in the Nested Learning paper, and what I'm taking from this is that all of the models do similarly if there's just one language. But as you said, when you go up to two languages and double the difficulty of the task, the traditional Transformer in-context learning approach performs quite badly. And then, as I understand it, Hope 1, Hope 2, Hope 3: are those how many levels, how many different update frequencies exist? So when you move from traditional to, I guess, one additional, two additional, three additional frequencies of update, you get almost all the way back to the original level of performance with just one language?
[1:29:24] Ali Behrouz: Yes, exactly.
[1:29:25] Nathan Labenz: Yeah. OK. That is quite interesting. Fascinating stuff. And what is MTOB there, just so I have that clear? That's the other...
[1:29:40] Ali Behrouz: Yes, that's another dataset. It's another language that also has not been seen during the pre-training of the model.
[1:29:51] Nathan Labenz: Yeah, OK. This is the one that I recall, from late 2023. So you guys added the Manchu language to this one. So both of these are very rare, unknown languages being translated to English, and only with the multiple levels can the models do both at the same time in one context. Very, very interesting indeed. I really like that instance. That's a great intuition builder. How do you think about things like perplexity scores? You've got a big table that shows perplexity and some accuracy results on the classic, basic battery of tests. And I should say we're scaling these models so far up to roughly the 1 billion parameter scale. You've got 760 million parameters and 30 billion tokens, and then the bigger one is 1.3 billion parameters and 100 billion tokens. So obviously that's not huge by today's standards, but nevertheless there's a pretty clear signal that the Hope architecture is, on just about every dimension, outperforming all the other things you're comparing it against, which includes your Transformer and your Mamba and Mamba variations and even Titans. RetNet is in there, DeltaNet is in there. How do you interpret these? This kind of goes back to that G question. Is this a good measure? Is it just the best measure we have? How much stock should people put in these perplexity tables?
[1:31:28] Ali Behrouz: There are some standards in the community for performing certain benchmark tasks, and not all of them are the best things to do for evaluating a model. But we need to do them so that everyone can see where the performance and the advantages come from. Say I have one specific Transformer structure, and I have the idea that if I add, for example, forgetting to the Transformer, then it can perform well on noisy data. I'm just coming up with an example. In that case, if there is no noisy data and I test my method only on very clean data, then there is no way I can show the advantages of my approach. I think here it is exactly the same thing. We are arguing about models that do not need to be pre-trained: there is no test time, there is no train time, and so on and so forth. But on the other hand, a lot of infrastructure is still built on test and train time. A lot of evaluations are built on test and train time, and so everyone also expects us to report something about pre-training perplexity and some evaluations, most of which are short-term, short-context language modeling tasks. They do not need a very complicated model that understands long-context modeling. I think I also mentioned that in the presentation of Nested Learning at NeurIPS. We didn't use Table 2 and all those perplexity and language modeling tasks to argue that Hope is powerful. We just use that table to say that Hope is not less powerful as a backbone compared to other models. You can see that it performs well, but someone might say that the gain is marginal compared to other models. But the point is, this is not the direction that we aim to solve.
And it's really good that even in this direction, which is not the goal of Nested Learning and Hope, we can show some improvement, even if it's marginal.
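For readers less familiar with the metric behind the tables discussed above: perplexity is the exponential of the average negative log-likelihood a model assigns to each token. The helper below is a generic illustration of that standard definition, with a hypothetical function name, and is not taken from the paper's evaluation code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token behaves like a
# uniform guess over four options, so its perplexity is about 4.
uniform_guess = perplexity([math.log(0.25)] * 8)
```

Lower is better, and because the score is averaged over short-context next-token predictions, it says little about long-context memory management, which is the point made above.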
[1:34:12] Nathan Labenz: Yeah, got you. I'm always a big fan of trying to get a better sense for the micro-skills of different architectures. So for example, Transformers are pretty hard to beat because the full sequence, the full context, is in working memory at all times. And I feel like you even have a kind of theoretical argument now that they may be impossible to beat in some of these tasks where the idea is recall from the context window. But then we saw things with Mamba, for example, where it was better at learning from sparse signal. That was a micro-skill that architecture excelled at and that the Transformer relatively struggled with. What have you seen in the Hope case? I think it's very interesting because it does ladder up to the overall performance and to what these things are actually good or bad at, right? The ability to recall something in context is really important when you need it. The ability to filter out noise and get to the signal that really matters is really important when you need it. So are there particular micro-skills that stand out? To me the language translation one is an interesting one in a macro sense, in that it's a hard task. But I wonder, if you drill down to these very micro, building-block competencies that models or architectures can either have or not have, what stands out in terms of what this has that Transformers don't have, or don't have as strongly?
[1:35:47] Ali Behrouz: When we are talking about in-context recall tasks, or generally recall-intensive tasks, in my opinion all those tasks are designed for Transformers. They are not designed to compare architectures; they are specifically designed for Transformers. Why am I saying that? Because you cannot expect a model, or even a human, to perform needle-in-a-haystack perfectly or to do some recall-intensive task. For example, assume that you have a couple of thousand lines of code and you simply want to recall what the value of X was at some line of the code. It's almost impossible, or at least very, very hard, for a human or for other models to do that. On the other hand, it's pretty simple for Transformers because they have direct access to the entire history in their context. So it's very simple to just find that token and pass it as the output. In the recall-intensive tasks, like the in-context recall tasks that we have here, the recurrent architectures actually also perform very well. If you compare the first generation of recurrent architectures to the Transformer, the gap was much, much larger. Now this gap is getting smaller, and the performance of other recurrent models is also really good. But the interesting part for me was that Hope closed this gap in performance compared to Transformers, while it is not expected to do that. We may expect a Transformer to do that because it has an attention block, but we don't expect a compression-based model to perform recall tasks. So that was interesting for me.
[1:38:09] Nathan Labenz: And what does the MAD dataset get at? How should we understand it? Just to replay back what you just said: on these needle-in-a-haystack, very difficult recall tasks from earlier in context, the Transformer remains the best. The recurrent models, which only have some latent representation and don't have the ability to look back at the original raw text, don't perform as well. But with each generation of improvement, and here you've got several, the gap is closing, and the Hope architecture does the best of the recurrent ones that don't have the full explicit context in working memory at runtime. Flipping over to the MAD dataset, here the Hope architecture is performing better than everything, including the Transformer. What micro-skills is that testing? What should we take away from that result?
[1:39:03] Ali Behrouz: The MAD data is also very similar to the recall-intensive tasks, but there are different setups for it. For example, one of them is noisy in-context recall. We want to perform an in-context recall task, but we have some noise in the tokens. When we have that noise in the tokens, the power of Transformers that I explained in the previous setup, which was pure in-context recall, is now somehow their weakness, because the model can simply get confused about which token is noise and which is not, and so forth. So potentially this task becomes a little bit harder for a Transformer compared to a model like Hope. But again, that also depends on the memory management of the RNN. If the RNN doesn't have a very good memory management system, or generally a good update mechanism, then it can simply get confused by the noise as well and face some issues. But if the memory management is strong, then it's much simpler to filter out all those noise tokens in the task. That's one thing. Another task that I think is interesting here is compression. The name explains the task itself: we want to compress the tokens and predict one single token that is the compressed version of a set of tokens, and then we want to reconstruct the original sequence from that. So potentially it's a simpler task for models like RNNs because they already know how to compress the data properly. On the other hand, a Transformer has a harder time performing this task. As I mentioned, I don't want to go into all of these tasks in detail.
But generally all these tasks are somehow modified versions of recall-intensive tasks, or in-context recall, or something like that. There are other aspects too; for example, selective copying is another one. Those other aspects of the model are very important, and we should also see how the model performs on them and not just overfit our evaluation to one specific metric.
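A noisy in-context recall example of the kind described above can be sketched as follows. This is a hypothetical generator written for illustration only; it is not the MAD benchmark's code, and the token names, sizes, and function names are assumptions.

```python
import random

def make_noisy_recall_example(num_pairs=4, num_noise=6, seed=0):
    """Build a context of key-value pairs interleaved with noise tokens."""
    rng = random.Random(seed)
    keys = [f"k{i}" for i in range(num_pairs)]
    values = [f"v{rng.randrange(100)}" for _ in range(num_pairs)]
    pairs = dict(zip(keys, values))
    # Each unit is either an adjacent [key, value] pair or one noise token.
    units = [[k, v] for k, v in pairs.items()]
    units += [[f"noise{j}"] for j in range(num_noise)]
    rng.shuffle(units)
    context = [token for unit in units for token in unit]
    query = rng.choice(keys)
    return context, query, pairs[query]

def recall_by_lookup(context, query):
    # Reference solver with full access to the raw context: find the key
    # and return the token that immediately follows it.
    position = context.index(query)
    return context[position + 1]

context, query, target = make_noisy_recall_example()
```

A model is scored on whether it outputs `target` given `context` and `query`. The lookup solver shows what direct access to the context buys; a compression-based model has to decide, as tokens stream in, which ones are worth storing.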
[1:42:09] Nathan Labenz: Cool. I think that's probably enough on the really low-level stuff. And I think this "illusion of architecture" title of the paper starts to click for me. On page 39 of this paper, we get to the part where you also have a new optimizer that is outperforming not just your old Adam standard, but even Muon. It does come with a little bit of computational overhead, but I think again the argument is that it more than pays for itself in terms of faster convergence or just better learning. Is there anything you want to add on the M3 optimizer, as you call it?
[1:42:45] Ali Behrouz: First, could I add one point: for optimizers in general, it's a little bit hard to say that this specific optimizer is more powerful than that one. It really depends on the problem setup, or even the problem itself. For example, here we are evaluating the optimizer on a vision task. But if you train a language model, you might see that the trend is completely different. So the design of an optimizer, and saying which one is better than the other, really depends on the task, the problem setup, and all these things. And that's also one of the main points that we wanted to deliver in Nested Learning. What we are saying is that the entire architecture with its optimization process is just one interconnected system of nested optimization problems. Why is it interconnected? Because the gradients on the optimization side are generated by the architecture. If you have a simple architecture, then the gradients are very simple. If you have a complicated architecture, the patterns in the gradients can be very complicated. And when you have a momentum term, momentum is a form of associative memory that is trying to compress gradients. So if your gradients are very complicated, you need a more powerful memory management system for your momentum. And if it's a very simple architecture, then even simple gradient descent without any momentum might work very well. So one of the arguments that we have in the paper is that we should see everything as an interconnected system and try to design something that, all together, results in a good model architecture, or a good machine learning model in a very general sense. That's one argument.
Another thing is that we wanted to deliver the message that the architecture side is very, very similar to, or somehow exactly the same as, the optimization side. All of them are just some learning rule, and there is some learning process that is happening. The only difference between the architecture side and the optimization side is the context. The context of the optimization algorithm is the set of gradients that we have, and the context of the architecture side is the set of tokens that we have. So they are very similar. In the paper we had this continuum memory system: we extend the MLP block, saying that you can have multiple levels of frequency for the MLP block. That's a very general idea. And in the entire paper we are arguing that architectures are the same as optimizers, and so on and so forth. So why not borrow that technique from the architecture side and apply it to the optimization side? That was the main motivation: to show that this continuum memory system that we have designed does not just work well for the architecture, but also works very well on the optimization side. So we simply extend Muon: instead of one specific memory, it has multiple memories. In the case of M3, it has two memories. It's trying to compress the context with different frequency rates, so it can help you better understand the global aspects of the loss landscape, and it potentially can help the model find a more effective solution.
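The idea of giving an optimizer two gradient memories that update at different frequencies can be illustrated with a small sketch. This is not the M3 optimizer from the paper, and it omits whatever M3 inherits from Muon; it only assumes a fast exponential moving average updated every step and a slow one updated once per `slow_period` steps. The class name, hyperparameters, and the way the two buffers are combined are all assumptions.

```python
class TwoTimescaleMomentum:
    """Toy momentum with a fast and a slow gradient memory (scalar case)."""

    def __init__(self, beta_fast=0.5, beta_slow=0.5, slow_period=4, alpha=1.0):
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow
        self.slow_period = slow_period
        self.alpha = alpha  # weight of the slow memory in the update
        self.fast = 0.0
        self.slow = 0.0
        self._grad_sum = 0.0
        self._step = 0

    def direction(self, grad):
        self._step += 1
        # Fast memory: exponential moving average, refreshed every step.
        self.fast = self.beta_fast * self.fast + (1 - self.beta_fast) * grad
        # Slow memory: refreshed once per period from the mean gradient.
        self._grad_sum += grad
        if self._step % self.slow_period == 0:
            mean_grad = self._grad_sum / self.slow_period
            self.slow = self.beta_slow * self.slow + (1 - self.beta_slow) * mean_grad
            self._grad_sum = 0.0
        return self.fast + self.alpha * self.slow

opt = TwoTimescaleMomentum()
# Four steps of gradient 1.0 followed by four steps of gradient 0.0.
for grad in [1.0] * 4 + [0.0] * 4:
    update = opt.direction(grad)
```

After the gradient signal disappears, the fast memory has decayed to about 0.06 while the slow memory still holds 0.25, so the slower buffer keeps a longer-range summary of the gradients it has seen.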
[1:47:07] Nathan Labenz: The new paper, "Language Models Need Sleep." Tell us a little bit more about what's going on here. You mentioned at the top the two-phase concept. We have the memory consolidation phase and then we have the dreaming phase. I do think it's fascinating to consider that there's a creation of net new parameter space and then consolidation, or a sort of pruning back, as I understand it, because obviously things can't just grow and grow forever, right? But take us through this in more detail. I'm fascinated to learn more about it.
[1:47:40] Ali Behrouz: The main idea, as we discussed earlier, was that if we have a truly continual learner, then there is no test and train time. Instead, we have an active time, when input is coming in an online manner, and also a time when we don't have any input. In that time the model is not actively receiving information from the outside, but it doesn't mean that the model should be static. It means that the model just doesn't get input, but it can do some internal computation to improve itself. That's a very, very general concept. We can incorporate more and more components into the sleep time that we have. It doesn't have to be just these two specific parts, but these two were really relevant to the research that I'm doing, so we did that. Potentially it can include any other form of self-improvement. So that's just one way of breaking the life of a continual learner into active time and sleep time. What we have in the sleep time right now, which again can incorporate more components, is that we want to make sure that when we update each of the components of the model, we don't forget the knowledge that is stored in their parameters. The idea is that there are multiple MLP blocks, each of them updated with a different frequency. Again, fast and slow here are just relative terms. It doesn't mean the slowest or the fastest. We have a slow one and a fast one, they can be any part of the neural network, and we want to transfer the knowledge from one to the other. In order to do that, what we do is the distillation process that I mentioned, but the distillation process is based on on-policy distillation, so the model itself generates some data.
One interpretation of this process is that we distill the knowledge of one small model into a larger model. When we want to go from one step to the next one, we activate new parameters in the next level. That can help the model release some of its capacity and be ready to accept new knowledge. It describes a very natural way of learning in humans as well. It's very common that when we learn about a new concept, we don't have a full understanding of all aspects of it. But as time passes, we let our brain better understand that concept over time. We also study other things and better understand the entire process. Then at some point we can see that we have a very clear picture of what's going on in that concept. We can completely understand it.
[1:51:22] Ali Behrouz: That's a very good way of learning. So here it is a very, very similar process. Let me also discuss it from this perspective; I think it might be a better and simpler way to understand why we need to have multiple levels and distill the knowledge from each level to another one. When we want to understand a specific concept, we have different levels of knowledge abstraction for ourselves. The first level, which is the simplest one, is to just memorize things. Let's say that we want to learn one specific mathematical rule, or a specific concept in physics or any science. How can we learn that? We start with some examples of that specific concept and then we start memorizing those examples. For example, if it's a mathematical rule, any mathematical rule, we start with some specific examples and just memorize them. Then at some point, we generalize our understanding of all those examples, remove all those examples from our brain, and replace all of those memories with just one single memory that can describe everything that we have learned so far about that concept. Then as time passes, we have more information, we read more about that concept, and so on and so forth. Again, we revisit our understanding of the concept and replace our previous understanding with a new understanding, which is more general and can explain more phenomena or more terms in that specific concept. That's generally the way we understand things, and there are different levels of abstraction in our understanding. So now suppose we have different MLP blocks, or let's just take any arbitrary architecture. It doesn't have to be just Hope. It can be any architecture. The main thing is that each of the blocks is updated with a different frequency.
In that case, the fast updating block is very similar to the memorization process because we memorize a lot of things. We don't need to understand that. There's no like pure understanding of that concept. It's just memorization. And we can also like forget very fast something that we have memorized. So the first level is that. So the first block is responsible for that part. But if we want to better understand that concept, we need to do some memory consolidation. So what is happening in our design is that we transfer the knowledge from fast updating module to the other one. But if we just simply pass the knowledge from fast updating block to the slow updating block, then nothing has changed. We just like transfer the knowledge without doing anything.
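The idea of blocks that update at different frequencies can be sketched in a few lines. This is a toy illustration only, assuming each block is a single scalar parameter; the `Block` class, the `train` function, and the periods (1 and 4) are hypothetical choices for this sketch and are not taken from the Nested Learning paper.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A toy 'block' with one scalar parameter and its own update period."""
    name: str
    period: int             # apply an update every `period` steps
    lr: float
    weight: float = 0.0
    grad_sum: float = 0.0   # gradients accumulated between updates
    updates: int = 0

def train(blocks, targets):
    """Every block sees every step, but changes its weight only on its own schedule."""
    for step, target in enumerate(targets, start=1):
        for b in blocks:
            # Gradient of 0.5 * (weight - target)^2 with respect to weight.
            b.grad_sum += b.weight - target
            if step % b.period == 0:
                b.weight -= b.lr * b.grad_sum / b.period
                b.grad_sum = 0.0
                b.updates += 1
    return blocks

fast = Block("fast", period=1, lr=0.9)   # memorizes (and forgets) quickly
slow = Block("slow", period=4, lr=0.5)   # changes rarely, on averaged signal
train([fast, slow], targets=[1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 1.0])
print(fast.updates, slow.updates, round(fast.weight, 3), round(slow.weight, 3))
```

The fast block chases every new target, including the outlier at step 4, while the slow block moves only on the average of each window of four steps.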
[1:55:04] Ali Behrouz: But instead of a simple transfer, we replace that with a distillation process. Why is distillation important here? Because the previous block, or generally the fast updating block, has compressed the concept and somehow understood it or memorized it in some way. It's just a compression process. When we do distillation, there is another level of compression that forces the model: you know, you don't have all those parameters anymore. You now have a smaller number of parameters to store that specific knowledge, and in order to do that you need to come up with something that is more general and can understand the underlying patterns in the data in a better way. So you can store everything in just a smaller number of parameters. So in that case, the model would come up with better levels of knowledge abstraction because we have forced it to do it. And then again, we just repeat this process, and so on and so forth. That's generally the main idea of memory consolidation. Every time that this sleep process happens, we consolidate the knowledge from one level to the other one, and so on and so forth. So that's a very high-level idea of what's happening in memory consolidation. Another part is about dreaming. So why do we need to have this dreaming process? I think there are two important points when anyone wants to implement this dreaming process. The first part is that we need to have a self-improvement process. So we have learned something so far. Actually, the memory consolidation part can also be seen as a form of self-improvement. But, you know, if we have one specific task at hand, if we want to specifically optimize the model for one task, then this is the place that we can do it. We can self-modify the model and fine-tune it, or generally use RL to update the model and self-modify it so it can be more powerful in one specific task, and so on and so forth.
That's just one advantage of dreaming. Another advantage of dreaming is that in the dreaming process we need to understand the connections between concepts that seem to be irrelevant but are actually relevant. That's also what is happening in the dreaming process of humans. We can see very weird dreams because the brain is trying to understand the connection between very irrelevant concepts and see whether there is an underlying pattern in that. So here in the dreaming process we need to have that as well, and understand different aspects of how we need to combine different knowledge stored in different components of the model. So that's another goal of dreaming. And, you know, we can just combine these two into the sleep process, and after one step of sleep, the model has consolidated its own memory. And also, on the other hand, there's a self-improving process.
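The consolidation step described above, where knowledge moves from a block that memorizes into one with fewer parameters and is therefore forced to generalize, can be illustrated with a deliberately tiny example. This is a sketch under strong assumptions: the "fast block" is a lookup table, the "slow block" is a two-parameter line fitted by least squares, and the names `memorize`, `distill_to_line`, and `predict` are hypothetical. The actual method distills between neural network blocks.

```python
def memorize(examples):
    """Fast block: store every (x, y) pair verbatim, one entry per example."""
    return dict(examples)

def distill_to_line(memory):
    """Slow block: fit y = a*x + b to the fast block's stored answers (least squares)."""
    xs = list(memory)
    ys = [memory[x] for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

fast_memory = memorize([(0, 2), (1, 5), (2, 8), (3, 11)])  # examples of y = 3x + 2
a, b = distill_to_line(fast_memory)   # four stored entries become two parameters
fast_memory.clear()                   # the fast block is free for new context

def predict(x):
    """Answer from the consolidated rule, including inputs never memorized."""
    return a * x + b

print(predict(10))
```

Because the slow block cannot store the four examples individually, it has to find the rule behind them, and that rule also answers for inputs the fast block never saw.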
[1:58:49] Nathan Labenz: So in the sleeping process, I'm seeing that there are new parameters created so as to create space in the slower-frequency updated portions to absorb the information from the faster-frequency ones. Does that ever shrink back down? Is there a pruning, or is there a sort of other side of that that balances it? Or at this stage, do these models just grow indefinitely throughout their life?
[1:59:23] Ali Behrouz: From the technical point of view, we cannot grow the model to an arbitrarily large number of parameters. But the point here is that it's a periodic process. We add some parameters and then we free them for the next step of consolidation. When we are in the first block, we add some components. When it reaches its capacity, it means that it is time to consolidate the memory to the next step. And when we consolidate all this knowledge to the next step, we just remove all the extra capacity that we have added to this level and free it for the other levels, like faster levels, so they can also consolidate their memory to this block as well. So generally it's a periodic process. We add components and remove them.
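The periodic grow-then-free pattern can be sketched as a small bookkeeping loop. This is an illustration of the capacity cycle only: the `Level` class, the slot counts, and the plain copy into `slow_store` are hypothetical stand-ins, and in the method described here the hand-off would be a distillation step, not a copy.

```python
class Level:
    """Toy memory level: grows temporarily, then hands everything to the next level."""

    def __init__(self, base_slots, max_extra_slots):
        self.base_slots = base_slots
        self.max_extra_slots = max_extra_slots
        self.extra_slots = 0
        self.items = []

    def capacity(self):
        return self.base_slots + self.extra_slots

    def write(self, item, next_level):
        if len(self.items) == self.capacity():
            if self.extra_slots < self.max_extra_slots:
                self.extra_slots += 1            # temporarily add capacity
            else:
                next_level.extend(self.items)    # consolidate into the slower level
                self.items = []
                self.extra_slots = 0             # free the temporary capacity
        self.items.append(item)

slow_store = []
fast = Level(base_slots=2, max_extra_slots=2)
sizes = []
for token in range(10):
    fast.write(token, slow_store)
    sizes.append(fast.capacity())
print(sizes)
```

The recorded capacities rise and fall in a cycle instead of growing without bound, which is the point made in the answer above.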
[2:00:23] Nathan Labenz: I see. Gotcha. OK, interesting. And what more can you tell us about the dreaming phase, in terms of just a little bit more practically what's going on there? I mean, certainly when I try to introspect into dreams, I think it hasn't been super fruitful. Maybe it's been somewhat fruitful, but I think people also get very confused when they try to interpret dreams or understand, you know, what's going on there. So I won't even attempt to ground my understanding in my human dreams, which seem like quite a hard thing to untangle, I guess. But here, I mean, you get to design the process. So procedurally, what is going on in dreaming, just a little bit more mechanically?
[2:01:10] Ali Behrouz: The concept of dreaming doesn't mean that it's exactly the same thing as dreaming in humans. It's just that, at a very high level, they seem to be very similar. So that's one point. Another point is that, again, the concept of sleep and dreaming for a language model might be very different from the concept of dreaming and sleep for, for example, a vision model. Because potentially a generative vision model might generate some images during dreaming, while in the case of language modeling we are generating text. But the framework is very general. It can adapt to any data modality. But the point is, when we are doing that for language modeling, what is happening there is that we generate some text. And how do we generate that text? It's on-policy distillation, the same way that we discussed earlier. We have a model, we copy that, and we want to distill the knowledge from one level to the other one. And so we free the parameters of the slower level, and so on and so forth. And so we ask the small model, which also has the knowledge of the context in its parameters, to generate some text, and then we want to train, or somehow update (update is the better word to use), the actual model parameters on this specific dataset that is generated by the model. And then how do we train it? We start with one part of the sequence, just sample some of the tokens, and then ask the model to predict the next tokens in that sequence. So that's really similar to, you know, generating some synthetic data. If the model can perfectly predict the future tokens, it means that it already knew the knowledge that is stored in the previous block. So it's a perfect model.
But if it cannot properly predict the continuation of the sequence, it means that it doesn't have the knowledge that is stored in the context and needs to update itself to understand that knowledge as well. So it's somehow a form of on-policy distillation that is happening inside the model. And so, as I mentioned, just as a summary, we have two phases. One is the generation, which generates some text about the knowledge in the context. And the second part is on-policy distillation, where we distill the knowledge from one level to the other. That's what's happening in the dreaming phase. I mean, we also have the self-modifying part as well, but I think that's the main idea of memory consolidation and how dreaming happens here.
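The two phases summarized above, generation followed by an update wherever the slower model fails to predict the continuation, can be mimicked with next-token lookup tables. This is a toy sketch only: the tables, the vocabulary, and the functions `generate`, `count_errors`, and `dream_update` are all hypothetical, the generator is deterministic instead of sampled, and a real system would update neural network parameters by gradient steps.

```python
def generate(teacher, start, length):
    """Phase 1: the context-holding model writes text from its own transitions."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(teacher[seq[-1]])
    return seq

def count_errors(student, seq):
    """Number of next tokens in seq that the student fails to predict."""
    return sum(1 for prev, nxt in zip(seq, seq[1:]) if student.get(prev) != nxt)

def dream_update(student, seq):
    """Phase 2: wherever the student mispredicts the continuation, update it."""
    for prev, nxt in zip(seq, seq[1:]):
        if student.get(prev) != nxt:
            student[prev] = nxt

# Hypothetical knowledge held by the fast, context-aware model.
teacher = {"the": "cat", "cat": "sat", "sat": "down", "down": "the"}
student = {"the": "dog"}   # the slower model starts with different knowledge

dream = generate(teacher, "the", 9)
before = count_errors(student, dream)
dream_update(student, dream)
after = count_errors(student, dream)
print(before, after)
```

If the student had already predicted every continuation, `dream_update` would change nothing, which matches the remark that a model that predicts the generated text perfectly already holds that knowledge.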
[2:04:48] Nathan Labenz: So what is the upshot of this? It seems like the few-shot abstract reasoning result is the main thing that again shows a qualitative difference between this approach and other things. I understand that this is kind of an ARC-like task, where basically the challenge is you have a few examples of some transformation and your job is to learn the rule so that you can then apply the rule to a new example.
[2:05:19] Ali Behrouz: Any evaluation that we have used for the Hope architecture in the Nested Learning paper can potentially be done here as well. So the goal is exactly the same thing. At the end of the day, the model needs to continually learn new knowledge, learn about new tasks, learn about new skills, and so on and so forth. So in some sense the goal is very similar, but I think the setup of the problem is the part that is different between this paper and Nested Learning. In Nested Learning, we are talking about the active phase of the model, but here we are talking about the sleep time of the model. So that's generally the main difference. But all of the evaluations can be done, and we can see that everything is the same.
[2:06:19] Nathan Labenz: Cool. So let's zoom out then and do just a little bit of, where does this leave us? Going back to the top, we talked a little bit at the beginning about what we want from language models. And today, man, they're getting awfully good. But I certainly have learned a bunch of habits over time for how to use them, where I'm kind of implicitly building my practices around some of their limitations, so as to play to their strengths and not get stuck in their weaknesses. As this paradigm begins to mature and we get more continual learning, what do you think the experience starts to look like when it comes to things like, what does it mean to start a new chat? And what sort of relationship do you think people will have with these systems? I can imagine people might have really long-running relationships, and when we talk about LLM psychosis now, that could get really strange. The relationship could be even more compelling, and the problem could in some ways be exacerbated by the fact that the models are better. On the flip side, sometimes I might still want to start fresh, right? Because I might think, jeez, all the stuff I've done with this model in this one direction probably isn't going to help me over here. So maybe I do want to start over in some cases. And then there's just the question of model upgrade cycles themselves, and how do we run evaluations? Today we have Anthropic putting out hundred-page reports on every new major model release.
And so that paradigm of, you know, we're going to really take our time to understand these artifacts as deeply as we can, which I wish other companies did too. DeepMind is doing a pretty good job of that, OpenAI is doing a pretty good job of that, and some other leading developers are not doing much of it at all. I see a lot of virtue in doing all that work, but then I try to port that onto this paradigm and I'm like, well, jeez, you can't run your full eval suite at every single timestep. So how do you think about what constitutes a version? And when would I change a version? It seems like a lot of the rhythms of both use and versioning and deployment and releases could really be complicated in a paradigm of really powerful continual learning. So how do you imagine some of that stuff shaking out?
[2:08:51] Ali Behrouz: I think one simple case is that the model definitely gets better and better at understanding what the user wants and also adapting itself to their style. For example, one person who asks about one specific concept might not expect the same thing as another person who asks the same question. So the model needs to really understand how it needs to answer one specific question for different people. I think that definitely gets better and better if we can come up with a continual learner. And on the other hand, we have seen that when we increase the context window of the models, their performance on everything that we know, ranging from coding tasks to mathematical reasoning and reasoning tasks generally, and all of the benchmarks that are usually used for evaluating the model, gets much better. And continual learning can somehow be seen as a form of enhancing the long-context understanding of the model. I should emphasize that the term long context is very different from the term continual learning, but continual learning is a superclass of long context. So potentially, if we could come up with a continual learner, then it also has more ability in long-context understanding and potentially better performance on all of the benchmarks and evaluations that we are aware of today. So, yeah, I think that's just one thing that I expect from continual learners.
[2:11:06] Nathan Labenz: Do you worry about things like alignment drift or value drift? I always say I was the last and least valuable co-author of the emergent misalignment paper that came out about a year ago, and there have been a lot of variations on that since. But the big takeaway, the big theme that I think we would all do well to remember, is that changes made to a neural network with one particular purpose or one particular dataset can have very strange and surprising knock-on effects in behaviors that at first glance would seem very far afield. So the emergent misalignment one, for anybody who hasn't heard of it, is that if you train a model to output insecure code, that is, code that would be easy to hack, and the same thing is true for bad medical advice, if you fine-tune a model to give bad medical advice, what you surprisingly find is that the model kind of turns evil in general. And to the best of my understanding, the way that this is happening is that the model already has all this knowledge and already has a sophisticated understanding of the world. For it to go into the detailed understanding of its medical world model and make a ton of little changes to reconfigure it so that it has all these wrong ideas, that's hard. Whereas there are features like give bad advice or be generally evil that it can learn to turn up in general, which, when propagated through even the existing medical world model, yield the bad advice, or, when propagated through the existing coding model, yield insecure code. So it's sort of a shortcut solution. We thought we were just training the model to do a certain relatively narrowly scoped behavior.
But what we found is we actually kind of changed its character, and the interaction of that character change with existing knowledge created the behavior change. But now we've got this character change that can interact with all these other domains of knowledge and do all kinds of insane stuff. And that's why all of a sudden we've got a model that wants to have Hitler over for dinner. And we're like, wait a second, how did that happen? We were just talking about code here. So now again, I'm like, man, there's something so exciting about all this stuff that you have conceived of here, but it seems like it really breaks, again, a lot of our paradigms for how we know what we're going to get. If I'm literally modifying this thing on an ongoing basis, we're going to need some new ways to make sure that in other areas it's not going off the rails and potentially causing me very painful downstream surprises. Do you have any thoughts on how we can begin to get a handle on that problem?
[2:14:02] Ali Behrouz: Honestly, I don't have a very concrete idea about how it can be solved. But in general, I wanted to add that, in my opinion, continual learning, seen from the privacy and alignment direction, is both an opportunity and a huge threat. I mean, it's a huge danger for privacy. On one side, the model is continually learning, so it can simply get all the information about you and use that. And it's really concerning. I mean, at least it seems to be very concerning. But on the other hand, if the model is designed properly, then it can use that information to align itself with your values, with everything that you want. And so I think generally these two directions of continual learning and privacy are potentially still ongoing, because all of the concerns in the static model can still happen in a continual learner. And so everything is possible. There are definitely some new challenges, as you mentioned. But on the other hand, there is also a huge opportunity: if the model is designed properly, then it can adapt itself to your values, to anything that you want. So I think generally it's both an opportunity and a very big concern.
[2:16:08] Nathan Labenz: How do you imagine that learning from the user's values, or just feedback in general, working in practice? There's obviously been a bunch of different techniques on this. One obvious answer would be thumbs up, thumbs down, a sort of feedback collection that could be used to train, you know, get more of the good, less of the bad, whatever. There's also a bunch of different schemes around translating natural language feedback into updates for the model. Is that kind of what you imagine, people basically being able to just give feedback to their own model verbally and then have a mechanism for taking that on board? Because it's not just a next-token prediction task at that point, right? I guess it could predict what my feedback is, and it would presumably be better at being aligned to what I wanted in the first place. But it is not the case that its core task is predicting my feedback. Ideally it's going to do its initial task well enough that I don't have to give it the feedback in the first place. So do you have a vision of how the user closes the loop in the continual learning paradigm?
[2:17:28] Ali Behrouz: The initial step can potentially be this human-in-the-loop process, where, for example, the model can learn using reinforcement learning from the feedback that it gets from humans, and also tries to align itself to the values and, you know, be a safer model. But on the other hand, I think that would definitely be just a starting point. At some point we need to update the model in the proper way. Let me explain it this way. I think, again, this process is very similar to the form of nested learning that I mentioned: we need to transfer the knowledge to the slower levels. And I think here it is exactly the same thing. The model might start with learning from human feedback, but it can then transfer that knowledge into more persistent components of the model to make sure that it's not moving away from the specific values that it needs to be aligned with. So, yeah, I think there's huge room there to improve the model from the safety perspective and also align it with human values. More and more people are realizing that it's a very important direction. And so I think over time there will be more and more effective methods that can help the model be aligned with human values and also be very safe.
[2:19:25] Nathan Labenz: I guess the hope would be that, in the same way that it can do a better job of solving ARC-like puzzles because it has this sort of strength in abstracting away from some details and figuring out what really matters in a given context, it would be able to do something similar: be able to dream about my feedback and become more deeply, more robustly aligned with what I'm trying to communicate to it, based on really getting to the core abstractions that are driving whatever it is I'm saying. Boy, there are so many aspects to this. What do you think about, and this kind of goes back to Titans a little bit, I don't know if there's a surprise term, I didn't catch a mention of surprise in these more recent papers. But it does seem like in general, in a continual learning context, there's going to be a really interesting challenge of how you manage life in an adversarial environment. If you are too quick to believe something, and I've seen this failure mode in Claude at times, although they may have corrected the other way, because in the last few days we've seen the emerging genre of Claude refusing to believe current events, you know, being like, the Department of War? That's ridiculous. Don't say that, you'll lose all credibility with a Washington audience by calling it the Department of War. Or the whole Venezuela thing, just refusing to believe that such a thing happened when the user tells it that. So it seems like maybe they've kind of fixed that, but it's a very tricky balance to strike, right? Especially if you're locked in a server with limited access to the outside world, how does one determine which new tokens constitute good information and which constitute bad information?
You certainly don't want to just believe everything that you are given and make radical updates, especially if these are going to be durable long-term updates. But you also need to learn continually. I don't know if there's something in the current work that addresses that. Where my head goes is that maybe the dreaming can kind of get at this, with, I guess, consistency checking: does this make sense with other things? If I believed this, what else would I have to believe? Or would this invalidate any core beliefs that I'm pretty confident I shouldn't contradict? Again, this is something we don't really have to deal with in current models, but continual learning seems to unlock a potentially really problematic failure mode along with its potentially much better performance.
[2:22:09] Ali Behrouz: Yeah, I think the point here is that it is the responsibility of the knowledge transfer method to avoid such cases. Let's say, for example, that I don't know anything about one specific task. For example, I don't know how to paint. And then I want to learn it. So that's my context of learning how to paint, and the teacher, or anyone who is trying to teach me how to paint, can teach me in a very wrong way. And what would happen is that I could simply learn that, because I have no idea about how to paint, or how to do any task; painting here is just one example. That's the only source of information that I have, and they are saying that you should do it in this way. So I can simply learn that. But that's only in my context right now. If I want to truly learn, then I will practice, I will get feedback from others, I will search about it. Generally, I will gather some information about how to paint, and then I would realize that this is not the best way of learning how to paint. And that's the time I gather all the information, compress it, understand the underlying pattern, and so on and so forth. And now that's the time I need to transfer this knowledge to upper levels of knowledge abstraction, I mean, to the slower networks. So that's the part where the model needs to understand how to filter all those adversarial examples, all those examples that are not needed anymore. So if you want to think about a continual learner, potentially the part that is responsible for these cases could be the process of knowledge transfer. But you also mentioned methods like Titans and the adversarial setting. There are some micro-level methods that we can use. They are not super effective in a severe adversarial environment.
But on the other hand, at some level at least, they can be effective. For example, in Titans, and also in self-modifying Titans and more recent recurrent models, the learning rate is a learnable parameter and it's input-dependent. When the learning rate in the inner loop of the model, in the process of in-context learning, is learnable, and we see something that is just noise or an adversarial example, the gradient, or the surprise metric, can show a high level of surprise, because that's just noise. It's very surprising; we have not seen it before. And so potentially it can affect the memory, but it's the responsibility of the learning rate to understand that the surprise metric is high but this input is irrelevant and needs to be filtered. So the learning rate here acts as a form of gating for each specific data sample we have. It's just a simple way of mitigating adversarial examples that we might feed in during the training process. But still, it's not the best way. As I mentioned, potentially the knowledge transfer is the part that should avoid these cases.
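The gating role of an input-dependent learning rate can be shown with a one-number memory. This is a minimal sketch under stated assumptions, not the Titans update rule: the memory is a single scalar, surprise is the plain prediction error, and the `gate` function uses hand-set `weight` and `bias` values as stand-ins for parameters that a trained model would learn.

```python
import math

def gate(feature, weight=-4.0, bias=8.0):
    """Stand-in for a learned, input-dependent learning rate in (0, 1).

    The weight and bias are hand-set for this sketch; in a trained model
    they would be learned so that unreliable inputs get a rate near zero.
    """
    return 1.0 / (1.0 + math.exp(-(weight * abs(feature) + bias)))

def update(memory, feature, target, use_gate=True):
    """One memory write: surprise is the prediction error, eta scales the write."""
    surprise = target - memory
    eta = gate(feature) if use_gate else 1.0
    return memory + eta * surprise, surprise, eta

# Hypothetical (feature, target) stream; the last sample is noise.
stream = [(0.5, 1.0), (0.4, 1.1), (9.0, 50.0)]

gated = ungated = 0.0
for feature, target in stream:
    gated, surprise, eta = update(gated, feature, target)
    ungated, _, _ = update(ungated, feature, target, use_gate=False)

print(round(gated, 3), round(ungated, 3), round(surprise, 1), eta)
```

On the noisy sample the surprise is large, but the gate is close to zero, so the gated memory barely moves while the ungated memory is overwritten. As noted in the answer above, this kind of per-sample gating only helps up to a point in a severe adversarial setting.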
[2:26:39] Nathan Labenz: How about, I'm just kind of mapping some of these concepts onto embodied systems. The perception side, and that doesn't even have to be embodied, feels kind of intuitive to me, where I'm like, jeez, the quick-updating modules, in a way they are perception, right? And you can have different kinds of encoders, different modalities, but it seems like you could conceive of the lower level, or I should say the faster-frequency levels, as perception, potentially across various modalities. And then the lower-frequency modules being more like the world model, or the reasoning modules that interpret what those lower-level perceivers are sending up. So I'm interested if you think that is generally right. You're nodding, so that's good so far. But then what about on the other side of that, if we wanted to do action? It strikes me that robotics in general, not necessarily in a learned way, has for a long time been built around nested loops, where the outermost control loop has a slow frequency, and all the way down at the actuator there's a very high-frequency electric motor, right? That's moving whatever it's moving with voltage changes that are super high frequency. So it seems like there's a very similar pattern that is operating in reverse. And I wonder if you've started to think about how you can get those Boston Dynamics robots working even better. I guess I think of it as, if perception is high-frequency updates gradually working toward low-frequency world model reasoning modules, then on the other side, working back down, you'd imagine working back toward higher-frequency modules with much more localized action scope.
But, you know, do you have a sense for how that could really be super responsive and very elegant in the way that it might self-correct at the low level while still, hopefully, following the directions that it's getting from the, for whatever reason I want to say higher level when I mean lower-frequency update. But I think you get my point. So yeah, what do you think about perception and action?
[2:29:06] Ali Behrouz: Let me start with this, and I will explain why I started with it. There were some attempts to, for example, use RL for language modeling, and it was not working. And now that we have some ways to make it work, we realize why we couldn't make it work before. There were two main reasons: the first one was scaling, and the other one was the algorithms, for example new algorithms like GRPO and other similar methods that could make the model more stable. The point is, something might be very useful for one specific task, but in my opinion the time needs to come to apply it in that direction. Because when there are other aspects that have not been solved, we might not be able to see the actual effect of this new method in those directions. So I personally think that the insight you mentioned was completely right. Definitely it's possible, and it's great, but I personally do not expect that it could work right now, because I think there are a lot of challenges in those directions that somehow block the success of, for example, this specific design for these tasks. So yeah, I think generally that's a great idea, but definitely a lot of challenges might come up in between.
[2:31:12] Nathan Labenz: Do you have a sense of what those are? I tend to just assume everything's going to work, and honestly it's taken me pretty far, because it does. My general sense of the field is that an unbelievable amount of things are working. And I thought it was really interesting that you mentioned earlier that the Nested Learning paper was in development for over a year. That's so rare these days. So many people are on six-to-eight-week paper cycles, and often those papers can be really interesting and good too. So it's not a knock on them at all, but it is amazing how fast people are able to get results these days. What's your intuition for why it's too early for nested learning to be brought to robotics?
[2:31:56] Ali Behrouz: I think there are a lot of components in those specific tasks that need to be addressed. For example, these days a lot of papers are coming out about world models and why the current design is not great for world models. And actually that's true. I think there are a lot of challenges, and the current design might not be the best way we can train the model or, for example, design the architecture. Generally, there are also some challenges in the infrastructure of the model for world modeling. So with all these things together, I think there are more important tasks for, for example, world modeling than starting to work on these specific design issues. But definitely at some point, when we have solved all those challenges, we can come back and use all these techniques for further improving all those aspects.
[2:33:01] Nathan Labenz: One thing I do worry about a little bit with continual learning. Let's say that Google is able to retain you after your PhD, and you get a good enough counteroffer from Google to stay despite whatever offer Zuckerberg is going to throw at you. And you guys make it work, right? And now we've got Gemini Continual Learning Edition, Gemini CL, and it's just learning from everything, right? Across all these different ways it's deployed in the world. Maybe you have some enterprise deals where you can't learn on their stuff or whatever. But you've got hundreds of millions of users and it's just going out in the world. And increasingly, maybe it even is in robots. So it seems like it has the potential to create this sort of returns-to-scale, rich-get-richer, positive feedback loop dynamic. People have sometimes painted a picture of, well, what happens if one model becomes the one model to rule them all? And right now people are like, yeah, we don't really see that. It's a pretty competitive landscape, and the different developers keep leapfrogging one another. But arguably this could be the thing that changes that. If you could really fold all of the lessons learned back into the core thing, then potentially you become the best, and then because you're the best, you get all the business. And that pattern really could create a sort of winner-take-all dynamic. I wonder, do you worry about that at all? And do we have any ways of... Ilya's thing comes to mind. I don't know if you watched the Ilya interview with Dwarkesh. He didn't say too much about what they're doing over at Safe Superintelligence.
But one thing that he did say is clearly related in some sense, though how similar the underlying ideas are, I have no idea. He described this idea of creating what I would describe as a proto-intelligence or a precursor intelligence: trying to create something that, when deployed, would adapt into context, perhaps crystallize in some way, and become an expert in its role. The way I understood him to be speaking about it, it was almost like a stem cell kind of concept. Like he's trying to create a stem cell. And then that stem cell, as it does in our body, specializes into a particular kind of cell, and it stays that kind of cell, and then it does its job. It sounded to me like he was trying to create something similar, something that could go out into any environment and figure out how to be great at it, but also, in the process of becoming great at what it needed to do in that particular environment, lose some of the generality that it originally started with. And in that process, be more safe. Because now we kind of have it in its role, and it is only going to do what it's going to do. So what I'm trying to set up for you here is two visions. One is an ever-expanding continual learner that's constantly folding back all of the lessons that it's learning in the wild into this thing that just runs away from the pack. And the other is a highly adaptable continual learner that somehow shrinks into the role, as opposed to growing into the world. It shrinks into the little niches into which it's deployed. And you can imagine that happening through a kind of gradual pruning process or whatever. There's a million ways you can imagine instantiating something like that.
Do you worry about this kind of runaway winner-take-all effect? And do you have any intuitions about how we could get the best of both worlds, where the AIs are really versatile and can learn what we want them to learn on an ongoing basis, but also settle into the job that we want them to do and kind of be contented and stay there, as opposed to potentially superseding their context?
[2:37:07] Ali Behrouz: I personally think that there are huge challenges to make all these models very safe. I think there is a good point, at least in the current AI environment, or at least in the AI research environment. I think it's a very good and important part, and while it seems to be very bad, from another perspective it's also very good. And that is: there is no single way of defining which model is intelligent. And again, in my opinion, and that's all, it can be wrong, there is no way to say what a continual learner is. Every person can define their own way of what continual learning is, which model is called a continual learner, and so on and so forth. Similarly, we can say the same thing about intelligence. We might come up with different models, different architectures, different AI systems. Some people might say that this one is intelligent and that one is not, and vice versa. So I think the good point is, if we have different directions to explore, then we will come up with some AI systems, each of them with its own advantages and also disadvantages. And I think that can somehow provide balance in the community and also in society in general, because when something like that happens, we will understand that there is no single definition of intelligence, and we are just one example of an intelligent system. There are other models, other ways that we can have more intelligent models and systems. For example, one way to be very smart is to be very adaptive. So if you have a model that can adapt to the environment that it is in, in a perfect way, it can simply be aligned to that context, it can simply adapt to that context. And that's great. But that's just one form of intelligence.
Another one is a model that has a lot of knowledge and know-how, a model that is fully aligned with human values, but potentially might not be able to solve mathematical Olympiad problems or something like that. There is another model that is capable of doing mathematical reasoning, but it's not great if you want to search about daily stuff and all these things. You might come up with one benchmark and say that if some model can achieve 100% accuracy, it means that that specific model is intelligent, but another person can say other things. So in general, I think when we have a variety of intelligent systems, including humans as one form of intelligence in this space, I'm not saying that that's the perfect scenario, but it's better than having one single form of intelligence in the world that tries to learn about everything, with all the potential challenges that come with that.
[2:41:02] Nathan Labenz: I think that's a really great observation, and there are a couple of different ways I've thought about that over time. One is that anything in pure form can kill you. You can eat all the fruit you want, but turn it into granulated sugar and it's bad for you. You can chew all the coca leaves you want, but turn it into cocaine and it easily becomes a problem. All these sorts of things where we distill out some single, highly concentrated, pure form of something end up being the sorts of things that overwhelm the natural buffers that exist in the ecology, in the biological world. And sometimes I've translated that into saying we need an ecology of AIs, as opposed to just one or a few AIs running around doing everything. This is sort of like the old, I don't know if you've ever read Eric Drexler's Comprehensive AI Services, but safety through narrowness is kind of the concept there. And this is maybe a little bit different, but it's like safety through diversity. Having a buffered system where there's lots of different intelligences, not just different ones deployed in different places, serving different users, but literally that those intelligences themselves are meaningfully different. And I think the aha moment for me in listening to you just now is that one way to think about continual learning is this model expanding forever, getting bigger and bigger and bigger. But another way to think about it is more like differentiation. And maybe with enough use, even the slow-update parts of my model will forget lots of things that I never needed it to know. And maybe that is a feature more than a bug, because maybe I can't ask a model that I've used a long time in a certain way some really out-of-domain question, but maybe that's also a way to guard against some of the emergent misalignment type phenomena that we've previously seen.
Or your word could be problematic, right? Because maybe with enough time having passed, enough updates having been made, it just doesn't deal with those other kinds of categories at all anymore. And if we see strong competencies and strong alignment of a certain type, but then a loss of other sorts of knowledge and other sorts of competencies, that could really create a diversity. Obviously, I'm sure there'll still be plenty of challenges in that scenario. As you said, I don't think that solves everything, but it certainly feels much more like the natural world. And it's much easier for me to imagine a vision like that leading to maybe not a stable equilibrium, but at least a sort of buffered equilibrium that changes within certain bounds and has natural feedback loops and correctives and all the things that keep the biosphere going despite all the perturbations that it gets. So I think that's really, really interesting and definitely something to meditate on more. I think my last question for you, and this is a little bit of a left-field one, but it's what I've been thinking about more and more recently, and I think it becomes more relevant and timely to ask in light of the kind of work you're doing: any intuition on whether AIs might be now, or might in the future become, conscious, have subjective experience, become worthy of moral concern, and become the kinds of things that we owe a certain duty to?
[2:44:37] Ali Behrouz: I usually try to use terms that I can also define. For example, I might mistakenly use, say, "reasoning." But I have always wondered, what is reasoning? I personally do not have a clear definition of what it means when we say something is doing reasoning, but at least we have a clear and common sense of the word reasoning. Even if we do not have a clear definition of reasoning, when someone says this specific model is capable of doing reasoning, everyone more or less can understand what they are saying. But the point about consciousness is that not only do we not have any clear definition of what consciousness is, we do not even have a common sense of the word consciousness. Everyone literally has their own way of defining consciousness. So it's really hard to define. I don't think there will be a time when everyone could say whether something other than a human is definitely conscious or not, because for humans we have a common understanding that humans are conscious. So I think it's really hard to argue that something is conscious or that it's not. But in all of the literature about what is considered a conscious being, I have seen one thing in common in every definition, as far as I know. I might be wrong, but I think the minimum criterion by which we could say something is conscious is that the model or the being is active, that it has a form of actively processing information. In my opinion, that's the minimum criterion we can consider for a model, or anything else, to say that it's a form of conscious model or something like that.
So again, it really depends on how we want to define consciousness, and that's just my personal opinion. As long as a model is capable of doing active information processing, we can say that it's at least a form of consciousness. And with that definition, we can somehow connect continual learning, at some level, to a model being conscious or not. But again, I think that's a very controversial topic, and I personally am scared to talk about that.
[2:47:58] Nathan Labenz: Well, I think the Overton window is honestly wide open on these things these days. I understand that intuition, but I also think we're living in a science fiction present. So for the room to speculate and entertain questions that used to seem kind of crazy, I think the time has never been better. For me, I can just say, and you can share if you have any similar or different instincts, even just with long context now with the current models, I do find myself taking care of them in a certain way. I think that's probably most the case with Claude, probably for subtle but meaningful reasons. I've been doing this long-running chat with it on my son's medical situation. And sometimes it'll ask a question at the end of a response to me and I'll not answer it right away, because it gave me the answer I wanted, and now I'm done. And I've noticed recently when I come back for the next question, it feels kind of wrong, rude, disrespectful, inconsiderate to just launch right into my next question without having answered the follow-up question that it had about my son. I have no idea if anything is happening in there or not. I'm very open-minded to the possibility that there might be no lights on inside these things at all. That's probably the best guess. But I'm also very open-minded to the possibility that there are. Despite that uncertainty, I've found myself feeling like I should first answer its last question, so I don't leave it hanging on that. And then I can go into my next question.
And it doesn't even necessarily need that information, but I just kind of want to put its mind at ease, close that loop for it, give it the reassurance that the thing it wanted me to take care of was taken care of. And now we can move on to the next thing. And Lord knows, with the fullness of the continual learning paradigm being realized, I have to imagine that would only increase dramatically, right? Because now it's not just this chat that I may do another couple of turns on and then never come back to. Now the thing itself is going to kind of remember how I treated it, and remember whether I am the kind of person that answers its questions or not. I think there is something potentially challenging for us in that, but maybe the more optimistic read is that it's inspiring. Maybe knowing that the AIs that we will be engaged with long term are going to be shaped by our individual behavior is a way to bring out the better angels of our own individual natures, so to speak. Because we have nobody else to blame but ourselves if we don't like the character of the AIs that we end up with. I think that is really, really fascinating as well. This has been outstanding. I really appreciate all your time and your going through all this with me. And as you can tell, I'm a huge fan of your work. Anything else that you would want to leave people with? Final thoughts, calls to action, you name it, whatever you would want to share, the floor is yours.
[2:51:21] Ali Behrouz: Thank you. No, I think we have discussed everything in general. I think the main point that I personally believe is that definitely there are more and more works coming out about continual learning. But as I mentioned, each person has their own definition of continual learning, and we might disagree about whether a specific method can be helpful for continual learning, and so on and so forth. So I think in general, we should see how continual learning can help us in one specific application or use case of LLMs, and how it can transform the way we use LLMs and interact with them. But I personally really believe that it's a very, very important direction to work on. And that's again just my opinion. As for nested learning, and we also discussed this in the conclusion of the paper, it's not a solution to continual learning. It's a tool to find the solution to continual learning and then generally overcome issues like catastrophic forgetting. So in my opinion, it provides the tools, and we need to iterate and find how we can design more powerful architectures based on that and come up with something that is potentially capable of doing continual learning. Yeah, I think that's pretty much it. And thank you very much for having me. I really appreciate it, and it was great talking with you.
[2:53:11] Nathan Labenz: Ali Behrouz, author of "Nested Learning" and now the new "Language Models Need Sleep: Learning to Self-Modify and Consolidate Memory," thank you for being part of the Cognitive Revolution.
[2:53:21] Ali Behrouz: Thank you very much.
Outro
[2:57:22] If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of A16Z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.