Mamba-Palooza: 90 Days of Mamba-Inspired Research with Jason Meaux: Part 1

Nathan and AI scout Jason Meaux explore the first 90 days of Mamba-inspired research, from its architecture to multi-modal applications. A comprehensive look at this groundbreaking AI technology.


In this first part of a two-episode series, Nathan and AI scout Jason Meaux provide a sweeping overview of the first 90 days of Mamba-inspired research. They discuss the mechanistic underpinnings of the Mamba architecture, Mamba's context capabilities, multi-modal applications in image segmentation and computer vision tasks, and much more. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api

RECOMMENDED PODCAST: Autopilot explores the adoption and rollout of AI in the industries that drive the economy and the dynamic founders bringing rapid change to slow-moving industries. From law, to hardware, to aviation, Will Summerlin interviews founders backed by Benchmark, Greylock, and more to learn how they're automating at the frontiers in entrenched industries.

Watch first episode on automating circuit board design here: @AutopilotwithWillSummerlin

LINKS:
Show Notes and Paper Links: https://docs.google.com/document/d/1NK_a3deVL_aczORmSRw8LyujNPotpO7Kd90sIRj9Qx0/edit?usp=sharing

Nathan's First Mamba Deep Dive: https://www.youtube.com/watch?v=X5F2X4tF9iM

StateSpace.info: https://www.statespace.info/


X/SOCIAL:
@labenz (Nathan)

SPONSORS:

Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price instead of variable regional pricing, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off at www.omneky.com

The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference, all while remaining affordable with developer-first pricing. Integrating the Brave search API into your workflow translates to more ethical data sourcing and more human-representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api

ODF is where top founders get their start. Apply to join the next cohort and go from idea to conviction, fast. ODF has helped over 1000 companies like Traba, Levels and Finch get their start. Is it your turn? Go to http://beondeck.com/revolution to learn more.


TIMESTAMPS:
(00:00) - Episode Start
(00:05:16) - Intro to Jason Meaux
(00:15:08) - Sponsors: Oracle | Omneky
(00:26:42) - Mamba Learning Theory
(00:29:42) - Othello Mamba
(00:30:05) - Sponsors: Brave | On Deck
(00:46:25) - Can Mamba Learn How to Learn
(01:08:20) - MoE Mamba and Black Mamba


Full Transcript

Nathan Labenz: (0:01) Hello, and welcome back to the Cognitive Revolution. Today, we're doing something a bit different, which I am really excited about and hope will become a regular part of the show. With everything in AI going exponential all at once, I've been challenging myself to find new ways to keep my AI worldview as accurate and up to date as possible and also to better help all of you to stay on top of the most important trends. I really love doing the deep dive interviews with researchers and entrepreneurs and absolutely will continue to do them. But I increasingly feel that what is scarcest and therefore most valuable in today's world is a zoomed out perspective that attempts to make sense of whole research subfields and new emerging market sectors. And so today, that's exactly what we're going to try to do. My copilot on this adventure is Jason Meaux, fellow AI scout and creator of the website statespace.info, where he tracks Mamba and other state space model research. Together, in this two-part episode, we'll cover the first 30 Mamba-inspired research publications, which Jason identified in just the first 90 days after the original Mamba paper. If you haven't heard my original Mamba monologue episode from December, I would recommend listening to that one first, as that episode presents a high level description of how transformers function, what capabilities they are missing, and why I believe the new selective state space mechanism introduced in the Mamba paper marks the beginning of what I'm calling the mixture of architectures era. All that is really important background information for the developments that we'll be discussing today. And indeed, there have been many interesting developments.
The work we'll cover today explores the mechanistic function of the Mamba architecture, the relative strengths and weaknesses of the selective state space and attention mechanisms, application of mixture of experts strategies to the Mamba architecture, use of Mamba models for image segmentation and other computer vision tasks, attempts to realize the promise of much longer context windows, and lots more. Along the way, we identify a number of important themes, including the use of multiple internal states, which was one of my big predictions from the December episode. The need to cast all input data types as sequences and the use of multiple different scans over the input data to accomplish this. The emerging dominance of hybrid architectures, another big prediction from December, and the many open questions around how best to handle and what more we might be able to do with those all important hidden states. This was a fun and intellectually invigorating conversation, and I am truly grateful to Jason for all his hard work collecting the research and joining me to break it down. He even contributed to the editing process. This was definitely going above and beyond, but he has created a clearer, more information dense listening experience for you as a result. As part of that, he even added key figures to the video version of this episode. If you wanna see those as you watch, and I do think it can be quite helpful at times, please visit our YouTube channel. We packed as many key topics as we could into these two hours and change. But even so, we were not able to cover everything. And on top of that, in just the short time since we recorded, there have already been a bunch more papers, including several that seem quite important. With that in mind, we see this as just the start of an ongoing series, and we look forward to bringing you another update on this dynamic area in the near future. For now, as always, we invite your feedback on the show.
If you enjoyed this format of surveying a whole research literature, please do share it online. If there are other fast moving AI research areas you'd like to see us cover in a similar way, we welcome your suggestions. And if you personally are in a position to do a similar project, I would love to work with you on it. I've got projects in progress exploring the scale of data that will be required, as well as the scale of data that's available, to produce next generation models, the application of generative AI to biology, and the latest developments in brain computer interfaces. And, really, I would love to do so many more like this. As always, you can contact us on our website, cognitiverevolution.ai, or you can DM me on your favorite social network. And with that, I hope you enjoy this sweeping overview of the first 90 days of Mamba related research with AI scout Jason Meaux, creator of statespace.info. Welcome to the Cognitive Revolution.

Jason Meaux: (4:29) Good to be here, Nathan.

Nathan Labenz: (4:30) So I'm excited for this. We are kindred spirits, as both of us have become obsessed with the Mamba architecture, really as a representative of the state space moment and state space revolution, as Albert Gu calls it. So what we're gonna do today is take a super deep dive into all of the Mamba-related and state space literature that has been published over the last 90 days since the original paper. And this is something I'm really glad to have a partner in crime on, because it is an unbelievable amount of information. Very hard for one person just to keep up with this in a comprehensive way. And so I was very excited to meet you online and see that you were also deep down this rabbit hole. For starters, you wanna just tell the audience a little bit about yourself and how you got interested in this and how you've approached it?

Jason Meaux: (5:22) Yeah. Happy to. So for the last decade, I've been working as an engineer in the world of atoms, not in the world of bits. So primarily in large scale energy infrastructure projects. I followed some of DeepMind's early work, GPT-3, but as for many, I imagine, serious pursuit of machine learning for me started in November 2022. My interest in state space models started with a problem solving mindset. I had been working on use cases that could really benefit from sub-quadratic models and had previously followed S4 a little bit. That's in the same lineage, but I soon realized that I didn't know enough. So I started a website as a learning exercise, a way to assess this new class of models to see if they were really capable of doing the things I was hoping that they would be able to do.

Nathan Labenz: (6:12) Yeah. Cool. So just to revisit briefly the original Mamba paper, I was motivated to do that forced-march monologue from December because in this role of AI scout, it's really important to keep an eye on the big questions. Right? It's really easy to get lost in the details of the latest paper, the latest product. But one of the big questions I've been keeping in mind for a long time is: what are the chances that somebody is going to invent something better than the transformer? And it seemed like we were getting close a few times over the course of the last year. And then with Mamba, it felt to me like, yeah, this is clicking, and it probably is going to be not a successor to the transformer, but really what we now have is two core mechanisms, the attention mechanism and the selective state space mechanism, that are both really powerful unto themselves. The super headline from that original paper is that the Mamba architecture was beating the transformer at the core loss metric for text modeling. Right? Just super aggregate. What's the loss? Look at the curves. It's a lower curve. Okay. That's interesting, obviously. I haven't really seen anything else so simple and attention-free beating transformers. There was a theoretical reason to believe that this could be the real deal, which is the idea of dynamic computation. That is, essentially, processing each token differently depending on context. That is something that transformers do. The attention mechanism forks into the K, Q, and V vectors, and then those get recombined. State space models had not really previously done that. They just had a sort of single set of weights that would process every input in the same way. The selective mechanism now allows for this additional path for the input to influence the nature of the computation. And so it looks a little bit more attention-like. This was, of course, their motivation for doing it as well. Right?
To be a little bit more attention-like in that way of having dynamic computation as opposed to fixed weight computation. And theoretically, that seemed like it was gonna work. The loss curve suggested it was gonna work. It also has super attractive properties in terms of how it scales. Linear scaling as opposed to quadratic scaling. That is to say, each time step, each token prediction, is constant time. And it's constant time because the state, unlike the attention matrix, does not grow over time. You've got a fixed size state that persists from one token generation to the next and which gets transformed but doesn't actually grow. So that allows you to have a constant time inference step, and that means your inference and also your training can be linear time, whereas, obviously, everybody knows that the transformer is quadratic. So those are the main observations that I think probably attracted both of us to this. Then the Mamba architecture demonstrates that you can get really far without attention, but there was also enough evidence already, even in December, to suggest that probably different hybrid forms would be the way that things would ultimately be performance maximized. Today, and it's about 90 days since the original Mamba paper, we are going to look at all the updates to say, jeez, how is this playing out just 90 days in? I'm sure somebody could come up with something else that has kicked off a similar flourishing of, you know, downstream research. Certainly when Stable Diffusion was released, people went nuts. But this is definitely up there in terms of the flurry of activity. And you've been keeping up with it better than just about anyone else. Do you want to start off by giving us a little bit of an overview of what we're gonna cover today? We've got new modalities. We've got architectural variants. We've got just a ton of interesting findings, but give us a preview of what's to come.

Jason Meaux: (10:12) Yeah. Happy to. Yeah. So flurry of activity is the right description since the original Mamba paper came out. Well over 30 new papers and significant projects. Especially recently, it seems you can't go one or two days without seeing another Mamba repo or paper come out on arXiv. By the numbers, 60%, well over half, are addressing vision and/or image processing. So that was a little bit of a surprise, because that's not a use case that the original Mamba paper really addresses. They focused on natural language, genomics, DNA sequencing, and audio. So very encouraging to see Mamba having an impact in that area. Of the 15 or so vision papers, 9 were image segmentation papers. So this is very applicable to the biomedical field. But other things like image restoration, dehazing papers, and all kinds of other general vision tasks. Mamba seems to be having an impact. I think going out from there, natural language was about a quarter of the papers. So we see some new architectural modifications, some interesting hybrid models. That seems to be an enduring theme that we'll explore. And then just many edge cases that people are applying Mamba to: work with graph neural networks, work with audio, work with stock picking. People are throwing a lot at the wall, seeing what sticks. And of those 30 papers, 80% of them modified the original Mamba architecture. So they had some variation. They didn't just apply the vanilla Mamba model to a new modality or a new domain. So work so far has been somewhat innovative. It's worth thinking about how to use the selection mechanism, for example, in Mamba in the most effective way. 73% of the papers reported state of the art results. Those have, for full transparency, not been independently verified by you and me. But these are the reported results. And a lot of these papers do come with code repos. You can go try things out for yourself.

Nathan Labenz: (12:25) Yeah. There are a lot of caveats that apply to all of this analysis. One of the biggest ones is just that all this stuff has obviously been done very quickly. Right? The first papers started to appear as soon as 30 days after the original. And even the ones that have just come out in the last couple days are, you know, still just 3 months in. So that's not a long time to go do state of the art work. And the papers themselves also, in some cases, feel relatively quickly drafted. I definitely felt in reading through these that there were a lot of questions that I had that I could not necessarily answer. And so we'll ground everything we can as much as possible in the literature, but there are definitely gaps. And, you know, I would say reading this widely in this area does give me an appreciation for the peer review process and just putting people a little bit more through their paces of follow-up questions and, hey, can you run this experiment? And, like, how good was your transformer baseline, really? That's always a question I have here. When it's state of the art, that's one thing. When it's, like, a head to head on a toy example, how much did you optimize that baseline? At the same time, though, it's become somewhat well known that the Mamba paper itself got at least one confident reject in its peer review process. I certainly don't think peer review has all the answers for us, but all this work will continue to mature, and I think we'll get more answers. And hopefully, some of the questions that we can't answer here will make their way back to the authors and maybe inspire either a little bit more experimentation or a little bit more clarity. One of the things that I think has definitely seemed true is that it's pretty easy to work with. One of the very first things that came out was a blog post by a guy named Lucas Nell who took Andrej Karpathy's nanoGPT and basically made a Mamba version.
So I actually haven't spent a ton of time with nanoGPT, but it's by Andrej Karpathy, so, self-recommending, and really meant to be just the simplest, most bare bones, most easy to understand on-ramp to building and training your own GPT at a small scale, something that you could even do in a Colab notebook. And so he works with that project. The reason I think it's definitely worth highlighting is it is accessible. He's got Colab notebooks out there that you can also go hack on. And a couple comments that he made really stood out to me. This is a quote from the blog post. He says, Mamba beats out transformers for this speech synthesis task, which is one of the experiments that he did. Beyond that, it was also really memory efficient, which is awesome for playing around on Colab notebooks, where you continually run out of space on the GPUs with transformers. I've certainly experienced that. Still quoting: I think it's a game changer in terms of performance and quality, and it's a super simple switch. A Mamba block is literally a drop-in replacement for a self attention block.
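To make that last claim concrete, here's a minimal, shape-level sketch of why a block swap like this is possible: both kinds of blocks map a (sequence length, model dim) input to an output of the same shape, so the surrounding model never has to change. The classes and math below are toy stand-ins made up for illustration, not real attention or Mamba.

```python
import numpy as np

# Both block types map (seq_len, d_model) -> (seq_len, d_model),
# which is what makes them interchangeable inside the same backbone.
# These are shape-level stand-ins, not real attention or Mamba math.

class SelfAttentionBlock:
    def __call__(self, x):
        # O(L^2) token-to-token interaction (toy: uniform mixing weights)
        L = x.shape[0]
        weights = np.full((L, L), 1.0 / L)
        return weights @ x

class MambaBlock:
    def __call__(self, x):
        # O(L) recurrent pass over the sequence with a fixed-size state
        h = np.zeros(x.shape[1])
        out = np.empty_like(x)
        for t in range(x.shape[0]):
            h = 0.9 * h + 0.1 * x[t]   # toy state update
            out[t] = h
        return out

def backbone(block_cls, n_layers, x):
    # The backbone stacks blocks without caring which kind they are.
    for _ in range(n_layers):
        x = block_cls()(x)
    return x

x = np.ones((8, 4))
# Swapping architectures is a one-line change:
y_attn = backbone(SelfAttentionBlock, 2, x)
y_mamba = backbone(MambaBlock, 2, x)
assert y_attn.shape == y_mamba.shape == (8, 4)
```

Note how `backbone()` never needs to know which block it is running; that interface compatibility is a big part of why so much architectural experimentation appeared so quickly.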


📢 Sponsor Message (15:12)

Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: (15:16) So just the ease of making these kinds of switches and doing this sort of architectural experimentation is obviously a key to seeing so many papers over the last 90 days. I do have some questions still around how easy it is at scale. I think a lot of the papers that we will go through are working on pretty small datasets, pretty small models. It's foundational work in a lot of cases. Okay, can this work in a new modality? How does this compare to the state of the art for, in the grand scheme of Internet scale data, a very small dataset? And so there may be a degree to which it is more forgiving at small scale, because things can still be fast enough without necessarily taking as much care with the hardware aware algorithm as the original paper puts in. But I'd love to hear your comments, because you've actually done some of this coding with the Mamba architecture as well. And I've mostly stayed at the analytical level, reading code, but not really writing code or, you know, not training any of these from scratch myself. Could you comment a little bit on your experience of working with the code itself?

Jason Meaux: (16:27) Yeah. Absolutely. Yeah. Mamba has, I would say, overall been a very pleasant experience to work with, and certainly credit to the original creators of the architecture for thinking about that. It does seem to have the feel of the kind of drop-in replacement for transformers in certain parts. I will say that it's always encouraging to see when a brand new model architecture like this can work with the existing libraries that are popular and that train models. Some of those are things like the Hugging Face training library. There's something called FSDP, which is this idea of: I wanna do a big training run, I have a cluster, so I wanna shard it. So you have an option to do fully sharded training. And the good news is Mamba seems to be compatible with that library. And so I've successfully been able to train quite a few Mamba models for little experiments, and it works great.

Nathan Labenz: (17:25) In your work so far, have you had to get down to the CUDA layer and tinker with the kernels at all, or has it been all at the Python orchestration layer?

Jason Meaux: (17:34) Yeah. Good question. No. My investigations into the CUDA layer have been simply curiosity. It's worked out of the box. I have not touched that. One of the original authors of Mamba, Tri Dao, is known for writing incredibly performant CUDA code. And so I would never want to accidentally modify some of his work in a way that would cause some of my work to go sideways. So I'm, for the time being, trusting the CUDA code that's been written. It's been great.

Nathan Labenz: (18:05) Cool. So actually, one question. One thing I've been realizing in general through reading all this literature is that it's often not super obvious how the state is being handled. Again, to remind everyone, the state is this substitute for the attention matrix, in that it's an encoding of everything that has happened in context up until now. But with the Mamba architecture, you don't have the ability to look back at all of the previous tokens and figure out the interaction with each token. That's how attention works. In the Mamba situation, the state gradually gets modified with new information. In the earlier versions, this was a fully deterministic encoding that was designed from first principles. Now we've got this selective mechanism, so there's a learned aspect to the encoding as well. This is super exciting because it has the promise of creating a long-lived memory, essentially. And they demonstrate in the original paper, particularly on DNA more so than language, which is interesting, that it can handle super long sequences. And for anybody who's been frustrated by the fundamentally episodic nature of transformers, where you're always trying to figure out what context does it need and how do I efficiently manage that (context windows are getting longer, but that's still a challenge), the promise here was that you could build up a state over time, potentially save that state, potentially fork that state, and all sorts of interesting things open up when you have this state. One thing that I did get wrong in my original episode was an analysis of how much SRAM there is on the A100, which was used to do this training. I had said that it was a couple hundred kilobytes. And I had inferred that about the state because the hardware-aware design here is about keeping that state in the SRAM, which is the form of RAM that is closest to where the actual computation happens.
It's much smaller, but it's immediately, or functionally immediately, accessible for the purposes of computation. Whereas the high bandwidth memory is, in a typical transformer structure, where, like, all the parameters sit, but you have this need to load these things in and out of the high bandwidth memory into the SRAM to do the computation. So here, the state can stay in the SRAM. And I had thought that, because one of the ways that the specs are presented by NVIDIA is that it's a couple hundred kilobytes. Actually, a listener reached out to me and said, that's actually per computation core. So in aggregate, over the entire A100, you're looking at more like a few tens of megabytes of SRAM. But this has me realizing a little bit that I don't have a super crisp understanding as to exactly what is the state and how is it handled. The Mamba architecture has layers in the same way that a transformer has layers. Right? And the layer in the original Mamba architecture basically replaces attention with Mamba. You still have an MLP block, but instead of attention and MLP, you now have selective state space and MLP. We'll get into some interpretability work that looks at how information is processed through the layers a little bit later. But how is the state handled through those layers? Should I understand that each selective state space mechanism has its own state? And would that, in practice, mean that if there are 20 layers, there are 20 states, each one for its own selective state space mechanism? Or is it, like, one state that is shared across all of those and is being modified? For something as fundamental a question as that, I feel like I should have command of the answer, but I don't feel like that's been stated very clearly in any of the papers. So we haven't discussed this actually, so I don't know if you'll have the answer, but maybe it's clear from your deeper dive into the code.

Jason Meaux: (22:00) Yeah, it's a very good question. So I'll think about that to the best of my ability. You covered really well the key differentiation with Mamba versus the previous state space models, which is that it is input-dependent and time-varying. If we could pull up the original paper, there's an excellent diagram comparing the S4 mechanism to the S6. The main difference being, in S6, there are a few of the parameters that are chosen to be dependent on the input. Those parameters are B, C, and delta. But essentially, what is happening at each step is that this mechanism is running, and it's constantly getting feedback and updates from the input sequence, and it is updating the hidden state. So the hidden state does get updated through time. But you bring up a good question. So in terms of what is the state, where is it located? We can confidently say the state is located in the SRAM, for the reasons you just mentioned. That's part of the hardware aware algorithm. It needs to live there. Yeah. Some of the specifics for me personally are definitely still a little bit hazy.
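For what it's worth, a common way to picture the open question above (and the way the inference-time cache in the public reference implementation appears to be organized, with one state per layer) is a stack of layers where each layer carries its own fixed-size state. Here's a toy sketch of that reading; the update rule and shapes are made up, just to show the bookkeeping:

```python
import numpy as np

# Sketch: a stack of toy recurrent layers, each with its OWN hidden
# state -- under this reading, 20 layers means 20 separate states,
# not one state shared across the stack. (Illustrative math only.)

class ToyRecurrentLayer:
    def __init__(self, d_state):
        self.h = np.zeros(d_state)       # this layer's private state

    def step(self, x):
        self.h = 0.9 * self.h + 0.1 * x  # update the private state
        return self.h.copy()             # output feeds the next layer

n_layers, d = 20, 8
layers = [ToyRecurrentLayer(d) for _ in range(n_layers)]

x = np.ones(d)
for _ in range(5):                       # five tokens
    out = x
    for layer in layers:                 # each layer updates its own state
        out = layer.step(out)

states = [layer.h for layer in layers]
assert len(states) == n_layers           # one state per layer
```

Treat this as a plausible interpretation rather than a settled answer, per the discussion here.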

Nathan Labenz: (23:13) Yeah. It's interesting. It may not even matter all that much. A lot of times these things that seem like they may be fundamental, it turns out you can do it either way, or maybe you do it one way with a skip connection and it ends up working very similarly. So I'm not sure how critical that distinction even is, but it is an interesting reflection of: for as much time as we've spent studying this, there are some things that are just, like, very low level implementation details that the papers don't really make super clear and which I think everybody's still collectively in the process of figuring out. So keep that level of uncertainty in mind as we go through a lot of different angles on this. So, basically, from here, I think we're gonna organize this into a series of deep dives. The first one being Mamba learning theory. That is to say, what can we say at this point about the relative strengths and weaknesses of the architecture as compared to transformers? When we see something that has this memory advantage in theory, then we wonder, maybe it has some weaknesses. Sure enough, as people are digging in, we are starting to see some differences, some things that the transformer doesn't do well that Mamba could do much better, and vice versa. So the competing strengths and weaknesses, and that naturally leads back to the hybrid forms outcome as well. So that'll be one section. Another will be looking into mixture of experts variations. Obviously, mixture of experts is a huge trend in general. And we've got a couple papers already showing how that can be applied to Mamba. Then as you mentioned, vision. Just a ton of work in vision and some definite trends that I think give a good flavor for how this thing works, and develop a little intuition for how Mamba systems see the world and how you have to massage your modality to fit the way that a Mamba system works.
We'll also have a bit of interpretability along the way, and we'll also get to some experiments on actually using the super long context. The opportunity to do super long context is a huge part of what makes it exciting. So how is that actually shaping up in practice? And then at the very end, we'll get into just a little bit more speculation and discussion before breaking. So let's do the first bit on learning theory. There have been a number of papers that have looked into what the Mamba architecture is capable of and what it's not. The first one is simply: is Mamba capable of in-context learning? Spoiler: yeah, the short answer is yes. I think this one was pretty interesting because it did a layer by layer study. We've seen many of these for transformers. Right? In the original Mamba episode, I went on a long digression as to, first of all, it's just weird that the transformer is, like, the same at every layer, because we now have a pretty good sense that the layers are doing different things. Again, this is a summary of a lot of different bits of research coming together, but it's pretty clear that, especially with a sufficiently large transformer that has some of these more emergent proto world model type of capabilities, there is a gradual working up from inputs to higher order concepts. That tends to happen in the middle to late middle layers. And then the last couple layers seem to be dedicated to taking that higher order understanding and cashing it out to a particular prediction. And sure enough, we basically see the same thing in this paper. And by the way, the authorship of all these papers is incredibly global. We are seeing folks from literally all over the world. I would say it is probably a minority that are from the United States, definitely a minority from California, and folks from Europe, folks from Asia, a lot from China, Hong Kong, some from Korea.
It really has been remarkable to see how far and how fast this idea has traveled. So this one in particular is from the Istituto Italiano di Tecnologia and also the University of Freiburg. So I guess that would be Italy and Germany, a collaboration. They're basically showing that, yes, the Mamba architecture can do in-context learning. At a high level, yes, it can. And that we are seeing a gradual refinement of its understanding as it goes through the layers. This work reminded me a lot of Logit Lens, which is one where you look at the activations at every layer and say: if I had to make a prediction from the activations right now, if I just jumped from whatever layer I'm in to the decoder and immediately had to decode these activations, how well would I be doing? This is pretty similar, and it basically finds that in the early layers you're not doing very well, and pretty gradually, pretty consistently as you go through the layers, you see an evolution of the activations such that your predictions get better and better all the way to the end. There are a little bit of bends in some of these curves. I wouldn't say it's super linear or super consistent through the process, but definitely the trend is pretty clear there. Any other comments or reflections on in-context learning?

Jason Meaux: (28:27) Yeah. It was very interesting work. We had never seen a Mamba paper up to that point that did that type of analysis. So good to see.
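The logit-lens-style measurement described above can be sketched in a few lines. This is a deliberately rigged toy (an identity decoder and hand-written "layers" that nudge the activation toward the right answer), just to show what is being measured: decode the activations at every layer and watch the prediction margin improve.

```python
import numpy as np

# Logit-lens-style probe (toy): at every layer, decode the current
# activations with the final decoder and check how good the
# prediction would be if we stopped there. Here each "layer" simply
# moves the activation vector toward the right answer, standing in
# for the gradual refinement the probing papers report.
d = 10                                  # toy: activation dim == vocab size
decoder = np.eye(d)                     # toy decoder: identity readout
target = 3
target_vec = decoder[target]

h = np.zeros(d)
margins = []
n_layers = 6
for _ in range(n_layers):
    h = h + (target_vec - h) * 0.5      # each layer refines the activation
    logits = decoder @ h                # "jump straight to the decoder"
    others = np.delete(logits, target)
    margins.append(logits[target] - others.max())

# Prediction quality improves layer by layer, as in the probing work.
assert all(b > a for a, b in zip(margins, margins[1:]))
assert np.argmax(logits) == target
```

Real probes replace the rigged layers with a trained model's residual stream, but the measurement, decoding every layer with the final readout, is the same.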

Nathan Labenz: (28:35) The next one, I would say, is pretty similar. It's Othello Mamba. And, again, this is directly inspired by earlier work on a transformer. There was a project called Othello-GPT, which I believe is a Neel Nanda solo project, or possibly one that he supervised. Why don't you take it? Give us the kind of original Othello-GPT and now the new Othello Mamba.


📢 Sponsor Message (28:59)

Hey. We'll continue our interview in a moment after a word from our sponsors.

Jason Meaux: (29:03) Yeah. Sure. Basically, it was a repo that came out where we realized that these models had the capacity to play board games. Anyone who's been following machine learning, of course, knows the heavyweight work there has been a lot of DeepMind work, as well as other labs like OpenAI. But games have been an interesting way to stress test models. It's not just, are they good at the game, do they have some percent win rate, but do they understand the rules of the game? How often do they try to play illegal moves, for example? So you can imagine how this would extend to other board games and sort of stress test whether these models are really capable of reasoning and understanding.

Nathan Labenz: (29:49) Yeah. I would emphasize too, in this particular project, these are pretty small models. The sizes of the transformers under consideration are 11 million and 21 million parameters, and the Mamba versions are 9 million and 17 million. So these are not really intended to be state of the art in terms of their performance at the game. They're not an AlphaZero competitor. But what they are really useful for is demonstrating the reasoning, as you said, but also allowing a window into how it's doing what it's doing. And the thing that really was remarkable to me about the original Othello GPT, which is basically almost exactly reproduced, except even a little bit better, with the new Othello Mamba, is that the model only sees a sequence of moves. So in Othello, it's an 8x8 board. It's a little bit like Go in that you have black stones and white stones, and you're jockeying for position. And when you surround somebody else's position, you can capture pieces. So the sequence of moves is all that the model sees. But to play effectively, it has to figure out somehow what the actual state of the board is. It knew that pieces moved here and here, but the capturing dynamics and the changes that happen are not explicitly presented to it in the data that it is able to learn from. So the question then, and this is a toy question, obviously, relative to the super big philosophical questions on the state-of-the-art frontier models today, where people are like, does it have a world model? What does that mean?
One way to try to get at that is to say, let's look at this much reduced world, right, where if we think that we might be getting some sort of world model in frontier models purely from next token prediction, can we create something like that in a small model, where still all we're doing is just a sequence of tokens, but we know that there is a higher order board state in this case that you really would presumably want and need to understand if you're gonna play the game effectively. So what is really amazing about this is that the model is found, pretty clearly in my view, to develop, again purely from a linear sequence of moves, a state of the board that does take into account these capture dynamics. And it essentially learns the rules of the game and a 2 dimensional representation based on just this pure sequence of moves. So I think that's a pretty remarkable finding. The way that's done in both the Othello GPT and the Othello Mamba is by just a simple linear probe. So, basically, you are taking the activations and training a single vector that you will then use to project that set of activations onto a 2 dimensional board state. And so you are training this little thing, but you're training it to essentially interpret the information that's there. The hypothesis is that the information might be there, but we can't read it natively, just with human intuition. So instead, we'll train this little thing to help us read it, map it onto the board state, and then we can see how accurate that representation of the board state is. So it's not perfect, but it was definite news and clearly of interest when the Othello GPT results came out, and the board accuracy there was in the mid-fifties, 55 to 57% board accuracy. Whereas with the new Mamba version, we are seeing 67 to 71% board accuracy for the two different model sizes that were tested. So significantly more accurate representation of the board state. Possible room for artifacts in there.
Maybe there are idiosyncratic reasons. We're not exactly measuring the internal state and the accuracy of the internal state. What we are measuring is the ability of a linear probe to decode that successfully. So presumably, that's not perfect, and the internal state, it seems likely to me, is a bit better than this measured result. Nevertheless, it's getting reasonably accurate. Definitely a hard challenge. Right? If you just said to yourself, here's a linear sequence of moves, and keeping in mind that there's this kind of capturing dynamic that goes on the board, could you look at a sequence of moves and immediately tell what the board state is? Not easy. Right? You'd really need to play it out and actually have a board and implement the rules of capturing to see what the board state is. At least that's what I would have to do. And this thing doesn't really have the luxury of that. Right? It has to learn to do that all with its internal representations. So to me, this is a really notable result, because the original result is so notable, and it works basically the exact same way, and the accuracy is higher. And, as you noted, the way they do this is they'll train one linear probe for every layer, the idea being that the representation is different in the different layers. You maybe could just have the same probe decode all the layers, but for best performance, they train a dedicated probe for each layer. These things are so quick to train that that's not really a problem. And you see the exact same curve that we're accustomed to seeing, where the performance, or, let's say, the accuracy of the board representation, goes up and up as the information is processed through the layers. You can see, like, a peak where that seems to be the highest order processing that the system is doing, and then the collapse at the end as it caches out to a final prediction. This is actually one of my favorites.
I love these little toy problems where you can, with relatively minimal compute, figure out that this thing can actually solve a structured problem and develop a representation that it was not incentivized to develop. Right? If you were a purely stochastic parrot, you wouldn't do this, right? You would just purely associate certain prior moves with future moves, and you wouldn't expect to have a board state that actually tracks all the capture changes and all the things that happen in a sequence of moves. So I thought this one was super cool. There are some interesting limits. I don't know if you have any intuition for this, but it definitely gets less accurate as you get into really long games. The maximum number of moves you can have in this game is 64, because there are only 64 squares on the board. And they do show that the accuracy drops over time as the game progresses. And presumably, that's just that it's getting more complicated, but that's something I would like to understand a little bit better. Because this is not like long context. It's just complicated. Maybe a bigger model would do it better. More training would presumably help. But at least in this version, with the resources devoted to it, you do see accuracy decline as you go. From the early moves, it's very good, and then if you actually play all 64 moves, by the very end, it's not able to do it anymore. So it is still definitely finite in its capabilities at this scale. And I wonder, what's your gut say, if we took Othello GPT, or Mamba for that matter, and gave it more data to work with? How do you think that would play out? Would it be able to handle the full, up-to-64-move game?
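As a concrete, and entirely hypothetical, illustration of the linear-probe setup Nathan describes: one weight tensor per layer maps that layer's activations to per-square logits over {empty, black, white}, and "board accuracy" is just the fraction of the 64 squares decoded correctly. Random data stands in for real model activations and board labels here; only the shape of the computation is meant to match the papers:

```python
# Toy sketch of a per-layer linear probe for Othello board state.
# All sizes and the random "activations" are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n_games, d_model, n_squares, n_classes = 200, 16, 64, 3  # 3 = empty/black/white

# Pretend activations (one layer, one position per game) and board labels;
# a real probe would record these from forward passes over move sequences.
acts = rng.normal(size=(n_games, d_model))
boards = rng.integers(0, n_classes, size=(n_games, n_squares))

# One probe = one weight tensor projecting activations to per-square logits.
W = rng.normal(size=(d_model, n_squares, n_classes)) * 0.01

def probe_accuracy(W, acts, boards):
    """Fraction of squares whose argmax class matches the label."""
    logits = np.einsum('nd,dsc->nsc', acts, W)  # (games, squares, classes)
    preds = logits.argmax(axis=-1)
    return (preds == boards).mean()

print(f"untrained board accuracy: {probe_accuracy(W, acts, boards):.2f}")
```

With random labels, an untrained probe sits near 1/3 (chance for three classes); fitting each layer's probe with a per-square cross-entropy loss is what pushes real models toward the 55 to 71% numbers quoted above.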

Jason Meaux: (37:19) My gut would say certainly sizing up the model, parameter-wise, would be important. You know, you always want to A/B test things to try to understand it. It would be interesting to freeze that result and simply size up the dimensionality of the state, and then try to repeat the experiment. I think this is like the new way of working with these state space models like Mamba. What is the mechanism underlying this model that's creating this effect? It's hard to say. And I'm very excited by the game work that's going to be coming. So far, we've seen one chess model, an 11 million parameter chess model with Mamba, trained on only 18,800,000 games. It was done by Haley storm c on Twitter. And although it does make illegal moves and it isn't perfect, it did have a 37% win rate versus Stockfish level 0.

Nathan Labenz: (38:18) I'm just asking Perplexity how strong Stockfish level 0 is, and it says that it has an Elo rating of around 1,200. Generally considered to be novice level in the chess community, it is a significant milestone for beginners who have improved from lower ratings. So a milestone for early players is what we're able to compete with effectively, with just an 11 million parameter Mamba model.

Jason Meaux: (38:42) That work actually is gonna get scaled up, and we should see a 50 million parameter Mamba model play chess relatively soon, which is good, because there's quite a bit of work by Adam Karvonen, who's trained transformers on chess. So we should see some really good apples-to-apples comparisons in chess as well, just like the Othello GPT work you just cited. So that's coming.

Nathan Labenz: (39:06) Yeah. Interesting. One other little note on Othello Mamba, too, is, and of course their plan is to scale this up and look at other games as well, they report that it is more data efficient, meaning, like, it achieves the same performance with fewer examples, which is interesting. But they did report that it took longer to train than a transformer of the same size. They are working on an A100, and they said it was 7.6 times longer to train than a transformer of the same size. And they also note that messing a little bit with the batch size changed that significantly. They moved the batch size from 256 to 64, and after that adjustment, it was still slower, but 3 times slower as opposed to the 7.6 times slower. So this, to me, does suggest some interesting nuance to the speed. I don't know what's going on there, and it doesn't sound like they really know either at this point, but there are some definite subtleties to making sure that you actually achieve the speed of the original Mamba. You can't just port that everywhere for free, it would seem. This one also highlights a little bit the sort of confusing nature of the state. When I first saw this, I was like, oh, they're gonna be looking at the state itself and trying to decode that. But it doesn't seem like that's the case. Instead, it is still the activations that sit between the layers, as opposed to the state itself. So I would love to see, as an extension of this, some sort of way to try to decode what is in the state. That to me is really interesting. And in general, it does highlight another sort of philosophical mystery about this for me. Maybe I'm just not bright enough. But when I think about, like, why does this work? It is really interesting that the state of the state, if you will, is not directly subject to optimization pressure.
All this stuff is still trained on next token prediction, or classification or whatever, as we'll get into with different modalities. But in this case, it's trained on making the next move in the series of Othello moves. And there's not, as far as I'm aware, and I don't think we've seen any examples of this yet, this is definitely something I would be looking out for, there's not anything yet that says, what about the state? Is there any sort of pressure that I wanna put on that, to perhaps represent certain information in a way that is gonna be super usable? It seems like that is happening just because the next token prediction requires the parameters to move in the right direction, and so the state is taken care of. But I really would expect an explicit optimization target for the state itself to be coming to a loss function near you in the not too distant future. Did you see anything like that in this whole review, where people are actually trying to optimize the state itself?

Jason Meaux: (42:06) No. I haven't seen that. I think it's a very interesting line of research. You've alluded a little bit to the interpretability angle for something like this. What does the state look like at each time step? It's sort of designed to be a black box in some way. The loss you're trying to minimize is ultimately a difference of just input and output. We've had attention for a long time. There are all sorts of things you can do with attention, obviously. I imagine we are gonna peel apart this Mamba layer and find out all kinds of things about it over time.

Nathan Labenz: (42:36) Yeah. I think this sort of toy setting of a game with a board state would be a pretty natural place to try to do this kind of research. It's obviously a lot harder when you're thinking of the true actual world state. How do I represent that? It's not like we have a great way to say whether it's right or wrong, or to project it onto anything super simple. But starting with a game, the state, in some sense, has to represent the state of the board, because there's nowhere else for that information to hide. And so it's gotta be in there, presumably at least as rich as the activations. And if the way I'm thinking about this is right, you could imagine a scenario where you optimize not only for making the next move, but also for an accurate representation of the board state, perhaps accelerating training, perhaps just helping you get to higher levels. Because right now, it doesn't seem like there's as much emphasis on building up that accurate world model as there could be. But, yeah, more to come on that. Okay. Cool. Next, still within the Mamba learning and interpretability section, the question is, can Mamba learn how to learn? And the solution that they present is a hybrid model called the Mambaformer. You can get a sense for what that might entail just from the name. This comes out of a South Korean group, South Korea and the United States. Actually, one of the authors is a professor at the University of Michigan, not too far from me. So that was cool to see, across 12 time zones, including right in my own backyard. But this really gets at the question of, what can transformers do? What can state space models do? What can Mamba do? Is there anything that can do everything under consideration? And what can we come up with that can get the best of all these different architectures? So why don't you take us through some of the tasks that they looked at, what worked, what didn't, and how they ultimately resolved it?

Jason Meaux: (44:28) Yeah. I'd love to. This was a fascinating paper. It came out a month ago, which in machine learning time feels like quite a while ago, actually. Just before we talk about the hybrid model they came up with, I think it's interesting to look at the evaluations they made on existing architectures. So they evaluated a transformer, a Mamba model, S4, and something they call S4 Mamba, which is essentially just Mamba as it is defined in the original paper with the selection mechanism removed, that is, with the input-dependent, time-varying feature removed. State space models and transformers both showed relative strengths in certain categories. There are some places where the transformer succeeds and Mamba seems to falter, and the opposite is also true. So one of those would be a task called sparse parity learning. At a high level, you have a binary classification, and there's data that has a lot of different features. Out of all these features, only a few are actually important for making the classification. That's why it's called sparse. So it's a challenging task, because you have a lot of data being thrown at you, and only certain features are ones that you should actually be paying attention to. And so the model has to figure that out. The transformer really struggles. It was not able to do better than random guessing, no matter how much the authors trained it. They used an embedding dimension of 768 in a 24 layer model, which was much bigger than the other models they trained, and it still could not do better than random guessing. The state space models did much better. Mamba in particular, along with the variant S4 Mamba, was able to basically solve the task with complete ease. It took a network of only 2 layers. Another one where Mamba did well was the many-outlier linear in-context learning task. This one's similar to the last one, except it's trying to learn a simple linear regression.
So you take a linear equation, and you have a mixture of what we could call clean inputs and outputs that are associated with a function like that. And the trick is to actually mix those clean values with noise. So the algorithm they chose introduced noise in the form of ones and zeros at a 90% probability. So 90% of the time, the model is having to deal with what is essentially noise. And this is an area where transformers actually were able to perform the task to a certain degree, but Mamba, at pretty much every step of compute, is performing clearly better than the transformer. And so the question for those two evaluations is, what can we glean? What does it really mean for Mamba to be performing better than what has really been the state of the art for a very long time? It is the capacity to deal with irrelevant context. And it's also surprising, though, because really the model that has the superior memory is going to have a better chance of solving these problems. Transformers, with attention, are typically thought to have a more precise, better, fuller memory. But that doesn't really play out. Mamba is clearly getting an advantage here. I do wanna also cover the tasks where Mamba really struggles. So on the flip side, there are some tasks that transformers really excelled at and Mamba struggled with, like something called multi-query associative recall. So this is kind of an interesting eval. It's not necessarily in-context learning, but it is a very good stress test of the capability to recall information. And the way that this evaluation is created is you have these series of key-value pairs that get fed into a sequence. The example they give, you have a letter and a number. So a 4, b 3, c 6. And then you can introduce a query that now tests the model to remember, what was the value for a, b, or c? And you can push that as far as you want in terms of context length. And Mamba, it's not completely a fail at this task.
In the paper, in fact, they do something interesting, which is they control for recurrent state size. So they try to remove the advantage a transformer will have when it uses full attention, and they control for that. At very small recurrent state sizes, Mamba way outperforms transformers at being able to recall these key-value pairs. But up to a certain point, when attention's really able to fully maximize what it's capturing, transformers easily solve this task. There's no variation at all, whereas Mamba sort of plateaus close to 90%. This is a task that's somewhat memory intensive. It requires high precision. You never know how these synthetic tasks actually translate to the real world. I think that's important to remember, both for the positives of what we just discussed with Mamba and state space models, but also some of the negatives, like this one. But it is certainly telling us something about what's going on.
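To make the multi-query associative recall setup concrete, here is a hypothetical generator for the kind of sequence Jason describes ("a 4, b 3, c 6" followed by queries). The exact tokenization and vocabulary used in the paper may differ; this just shows the shape of the task the model has to solve:

```python
# Hypothetical generator for a multi-query associative recall example:
# a context of key-value pairs, then queries whose answers must be
# recalled from that context. Format is illustrative, not the paper's.
import random

def make_mqar_example(n_pairs=3, n_queries=2, seed=0):
    """Build one recall example: context tokens, query keys, gold answers."""
    rng = random.Random(seed)
    keys = rng.sample("abcdefghij", n_pairs)
    kv = {k: rng.randint(0, 9) for k in keys}          # pairs to memorize
    context = [tok for k in keys for tok in (k, str(kv[k]))]
    queries = rng.sample(keys, n_queries)              # keys asked about later
    answers = [str(kv[q]) for q in queries]            # ground-truth recalls
    return context, queries, answers

ctx, qs, ans = make_mqar_example()
print("context:", " ".join(ctx))
print("queries:", qs, "answers:", ans)
```

Stretching `n_pairs` out is how the task stresses context length: every key-value binding must survive in memory until its query arrives, which is exactly where a fixed-size recurrent state gets squeezed and full attention does not.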

Nathan Labenz: (49:44) Yeah. So first of all, let me just make sure I can describe the tasks, and then I'll offer my working theory. So in the first one that you described, which is where the transformer fails but Mamba succeeds, the sparse parity, there seems to be a lot of noise, a lot of irrelevant information. You said in the one case, it was 90% just random stuff, and then the signal is all in the other 10%. And then, in contrast, the one where the Mamba architecture struggles and the transformer succeeds is when you have an established pattern earlier in the context, and now the challenge is basically to be able to recall that earlier established pattern and reuse it now to complete the task. I guess my theory is that this seems to be a reflection of the high level of parallelization of the transformer versus the fundamentally sequential nature of the Mamba architecture. When I think about this super noisy environment, I think, okay, I'm a transformer. This is the opposite of anthropomorphizing; now I'm modelizing myself. I'm able to connect every token to every other token. And when I do my backpropagation, I'm essentially saying, okay, I obviously want to improve my prediction accuracy, and I can go change every parameter, and that includes the relationships between all of these tokens. However, when there's so much noise, perhaps the problem that we're seeing is just that the noise dominates. Because I'm considering everything at once, I'm saying, okay, I'm gonna adjust everything that goes into the attention matrix, but I'm not really zeroing in effectively right away on the stuff that actually matters. Because I can, in a sense, maybe overfit to the noise, or just thrash around by continually overfitting to the noise. My update is dominated by updating on things that were irrelevant. That seems to be what's happening.
And that's certainly reflected in the fact that even a transformer of some significant size is not really able to do that task basically at all. Now, in contrast, if I'm thinking about this sequence where there's a pattern that's established earlier, that's obviously great for the transformer, because I can look back and say, oh, last time there was an a, a b followed, and now I've got another a, so perhaps a b follows. And so it finds those patterns and updates on taking advantage of those patterns quite efficiently and effectively. Now, if I take it from the Mamba standpoint, I'm like, okay. In all this noisy stuff, because I'm considering one thing at a time and updating one thing at a time, I can look and say, does tweaking the way in which I process this particular input help me much or not? And because I'm confronted with noise noise noise noise signal noise noise noise signal, I can perhaps say, tweaking this thing that is noise doesn't really seem to be helping that much, whereas a more narrowly focused consideration of, should I process this token that happens to be signal? Oh, hey, I'm actually getting a lot of value from this. And relatively quickly, it seems to lock in on which ones are the signal and which ones are the noise. I'm guessing because it's handling this in a fundamentally sequential way, there's an opportunity for it to update more meaningfully on the ones that actually give it signal versus the rest, which are obviously just distracting. And then, taking that to the sequence task where you need to be able to look back at the pattern, now I could be in trouble. Right? Because at the time that the sequence came in, it wasn't yet obvious that I should be paying attention to the relationship between a and b as it happened.
And only later, when I see another a, do I wish that I had updated accordingly, or had maybe allowed that relationship of a and b to make a bigger impact on the state, which is encoding all the information that I have. But once that has passed, it's gone. Right? That was one of the big things that I talked about a bit in the original Mamba episode. There's gotta be some weird stuff that's gonna happen based on the fact that you get one chance to encode each token into the state. And then, if you didn't take your opportunity to update on that information, then later there's not really much you can do about it. So that is my working theory. It's that it's failing to update as the tokens go by, because it's not yet clear that it needs to, and therefore, when the challenge actually comes, it's not prepared for it. Whereas the transformer can look back and see all this stuff, and it can figure that out still at the runtime of the token that it needs to predict, which depends on that earlier established relationship.

Jason Meaux: (54:39) Yeah. I can agree with your framing of that. You can imagine the advantage attention has, in the sense that it almost can recover: now I can reference information that I, for lack of a better word, previously passed over. Mamba is not quite working that way. I think your framing of the problem leads elegantly into the solution the authors had for this paper, which is acknowledging that there's something about attention that is advantageous in certain tasks, and there's something about some of Mamba's properties too. And so they decided to try a hybrid architecture: in one model, can we take the advantages of both and combine them to get through these evaluations successfully? And so that's what they did. The transformer block consists of multi-head attention and a feed-forward layer, and the Mamba block consists of two Mamba layers. And so they define two hybrid ideas. The first one they define is this idea of a standard hybrid, which simply takes the transformer and swaps the feed-forward layer for a Mamba layer. They test that a little bit, but it's not quite as promising as the one they ultimately land on, which is what they call Mambaformer. That's the same thing, keep multi-head attention and add the Mamba layer afterwards, but also take out the positional embedding and put a Mamba layer in its place. The results they got were great. They were able to go through every one of those evaluation tasks and solve them. So here's a great indication of a theme that you and I have talked about, which is, when in doubt, think about what you can do with hybrid models. How can you bring these different mechanisms together in a single model? Now, they don't necessarily test whether we still have inference efficiency, and they don't necessarily test whether we could train a model like this as quickly and efficiently as the base architectures. But it does at least show that the final results, the final outputs, are promising.
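The block ordering Jason describes can be sketched structurally. The layers below are toy stand-ins, not real attention or state-space math; the point is purely the wiring: a Mamba layer where the positional embedding would normally go, then attention followed by a Mamba layer inside each block, with no feed-forward MLP:

```python
# Toy wiring sketch of the Mambaformer block order described above.
# These layer functions are trivial stand-ins just to show data flow.

def mamba_layer(x):
    """Stand-in for a selective state-space (Mamba) layer."""
    return [2 * v for v in x]

def attention_layer(x):
    """Stand-in for multi-head self-attention (mixes across positions)."""
    mean = sum(x) / len(x)
    return [v + mean for v in x]

def mambaformer(x, n_blocks=2):
    x = mamba_layer(x)  # input Mamba layer REPLACES positional embeddings
    for _ in range(n_blocks):
        x = attention_layer(x)  # attention sub-layer, as in a transformer
        x = mamba_layer(x)      # Mamba sub-layer replaces the feed-forward MLP
    return x

print(mambaformer([1.0, 0.0, -1.0]))
```

One design note this makes visible: because the first Mamba layer already processes the sequence order-dependently, the model gets positional information "for free," which is the intuition for dropping explicit positional embeddings, even if, as Nathan notes next, it's not obvious a priori that this should work as well as it does.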

Nathan Labenz: (56:45) Yeah. The nature of this hybrid, too, is pretty interesting. And, again, just to re-emphasize the headline, the key figure in the paper is a lineup of tasks. These are pretty gnarly synthetic tasks that honestly take a little bit of wrapping your head around just to figure out, like, what is the task? What are we even trying to get at here? Because they're all a little bizarro. But you've got multiple columns, and it's, okay, the transformer can do all of these, but it can't do this, this, and this very well. Mamba, notably, can do all the ones the transformer can't do, but has its own weaknesses. And then, sure enough, the Mambaformer can do all of them. So it's a best-of-both-worlds sort of solution. What was really odd to me is that they take out the feed-forward portion entirely. In the past, we've seen the original model take attention away and bring the selective state space in, but still keep a classic feed-forward. Here, the feed-forward goes away, and now you have both the attention and the state space. And I would guess that's not gonna be the final form. It feels like, if anything has stood the test of time in machine learning, it would be the MLP layer. And it seems like that is where facts are really typically stored in language models. So I would not expect that you could scale this particular hybrid up to large scale language model tests. Instead, I would guess that you'd probably have some hybrid like this, still with an MLP block, because otherwise, I just don't see how you would recover all the factual storage that seems to happen there. This is toy problems. It's definitely really interesting for kind of a study of mechanism, and definitely strengths and weaknesses. My guess is this gets built on in a bigger model, bigger dataset, more realistic task setting, with the MLP layer coming back. And it is another instance of it seeming like everything is working.
When they replaced the positional embeddings with another Mamba block, to me, that was like flashbacks to the original attention paper, where the original positional embeddings seemed insane to me from the get go. And it's like, wait, you do what? You have, like, some sine function thing about the position, and you just superimpose that on everything, and that kinda works? And it's, yep, that's how it works. I don't know which positional embeddings they were trying here, but the original one was the sort of trigonometric function, and more principled ones have been developed as well, like ALiBi, which is a much more intuitive, I would say, positional embedding. So I'm not sure which positional embedding they were using as the baseline here. But it is, again, striking when it's, oh, yeah, that's not quite working, why don't we just try throwing a Mamba layer in there and see what happens? This wasn't really developed for that purpose. At least to me, it's not at all obvious why this first sequential processing would replace the positional embeddings in a way that would make a ton of sense, or why that would work. But sure enough, it works. Okay. Great. We'll just leave that to future research. It's like the number of things that are working that we don't really understand, or where it's pretty clearly just somebody being like, ah, shit, why don't we just try stapling this there and see how that goes? Maybe I don't give them enough credit. Maybe they have a more principled reason for doing this. But the general phenomenon is definitely real, where people are just trying some stuff, and sure enough, it works. I often say the whole industry slash field is the dog that caught the car, in that very few expected to be as far along as we are at this point in time.
And this feels like another one of those moments where it was like, yeah, we just try it, and sure enough it works, and we'll figure out why later. And for now, here it is. Pretty crazy. Okay. Summarizing all of this learning theory: we see that the Mamba architecture can do in-context learning. We see that it does have a similar tendency to build up these representations of the world, or in the case of the Othello toy example, the board state, that are critical for, certainly, the way a human would think about predicting the next move. It's notable that, much like in the transformer case, this was not something it was explicitly trained to do, but something that happened in the course of just training it on these linear sequences. We also see a similar pattern to transformers, where it seems like there's this gradual elevation of concepts and gradually more accurate activations as you go through the layers. And, again, you see that collapse at the end, where the higher order understanding that has been worked up and crunched on then collapses as it's making its final prediction. And then we see these interesting cases of one can do this, the other one can't. And it seems like the strength of the transformer, which is its ability to go back and look at the entire sequence and realize later that, oh, that pattern is the pattern that I need to be paying attention to now, has driven so much of its power. But at least in a synthetic environment, we're able to show that if you put just a ton of noise in there, and it has to look back at all this noise, it can have a hard time finding the relevant data in a super noisy environment. And then, in contrast, the Mamba architecture can do that, according to my hypothesis, because it's able to evaluate and update for each token one at a time. But it struggles to identify and make use of some of these patterns, because as the tokens are rolling by, it may not be obvious, and so later on, it just didn't necessarily encode that information.
And then we end up in the hybrid, and we end up with something that maybe doesn't look as principled as one might wish for, but which is working and which can do all of these little micro-synthetic skills. Anything else to add about Mamba learning theory before we move on to mixture of experts?

Jason Meaux: (1:02:36) Great recap. Fascinating paper. Fascinating results. Great job by the team that put that together.

Nathan Labenz: (1:02:41) Okay. Cool. Then let's move on to mixture of experts. Honestly, this is a pretty short section in the grand scheme of things. Mixture of experts probably should be its own full deep dive, because at this point it's very clearly a major force driving frontier models. It has been credibly leaked, and I would say it's the consensus accepted understanding of GPT-4 at this point, that it is a mixture-of-experts architecture. And certainly, we've seen really good results from Mistral as well. And then most recently, and arguably most notably, Gemini 1.5 was also described as a mixture-of-experts architecture. So we've got multiple models at the frontier that are apparently built on the mixture-of-experts structure. What that typically is: instead of the single MLP block, it is multiple MLP blocks. That's the baseline mixture-of-experts architecture. Instead of one MLP block, we'll have multiple, but we'll only activate a subset of those MLP blocks for any given token. Again, this could be its own deep dive: how many MLP blocks? How do you think about their relative sizes? How do you think about load balancing across them? That's a really interesting situation. What kind of specialization do the different experts have? How many experts should you be loading in at a given time? I've seen some interesting work that looks at, is there a single expert that should always be enabled, and then other experts that get chosen at runtime, versus are they all chosen dynamically? It's its own can of worms. But basically, I'd say what we've seen here with the application of that paradigm to Mamba is pretty basic, pretty straightforward, but it definitely shows that it also works. 
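The routing idea described here can be sketched in a few lines. This is a minimal NumPy illustration of switch-style top-1 routing, not code from either paper: all layer sizes, weight names, and the ReLU expert MLP are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (not from any paper).
d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8

# One up/down projection pair per expert MLP.
W_up = rng.normal(size=(n_experts, d_model, d_ff)) * 0.1
W_down = rng.normal(size=(n_experts, d_ff, d_model)) * 0.1
# The router is just a linear layer scoring each token against each expert.
W_router = rng.normal(size=(d_model, n_experts)) * 0.1

def switch_moe(tokens):
    """Top-1 (switch-style) MoE: each token is sent to exactly one expert."""
    scores = softmax(tokens @ W_router)      # (n_tokens, n_experts)
    choice = scores.argmax(axis=-1)          # chosen expert per token
    out = np.zeros_like(tokens)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            h = np.maximum(tokens[mask] @ W_up[e], 0.0)  # expert MLP, ReLU
            # Scaling by the router probability keeps routing differentiable
            # in a real autograd implementation.
            out[mask] = (h @ W_down[e]) * scores[mask, e][:, None]
    return out, choice

tokens = rng.normal(size=(n_tokens, d_model))
out, choice = switch_moe(tokens)
print(out.shape, choice)  # each token was processed by exactly one expert
```

Note how only one expert's weights are touched per token, which is where the compute savings come from, while all experts' weights still have to be resident in memory, which is the deployment cost discussed below.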
What they have done in two different papers here, mixture-of-experts Mamba (MoE-Mamba) and then also BlackMamba, is basically this: if you started with the transformer mixture of experts, you could think of it as the same thing the original Mamba paper did, which is take the attention mechanism out, put the Mamba selective state space mechanism in, and find that, yep, okay, it still works. Or if you wanted to think about it as starting with Mamba, you could think, okay, instead of my one MLP block that every token gets processed through, now I'll have multiple, and again you have all these different trade-offs. So two papers on this. They both show pretty positive results. I have one point of speculation that I'm really interested in that I don't think we've seen yet. But first, you wanna just give the headlines in terms of the results?

Jason Meaux: (1:05:20) Yeah. The headlines are interesting. The initial paper thinks about Mamba as a starting point and makes modifications to it, taking out a Mamba layer and adding in a switch-based MoE layer. They scaled up to 32 experts. They found that at that level, MoE-Mamba was able to achieve the same loss as the original Mamba in 2.2 times fewer training steps, so you get a sense of some training efficiency there. And then BlackMamba, also interesting results. They scaled it up: whereas the first model used 416 million parameters, BlackMamba uses 2.8 billion parameters with 8 experts. They replaced the feed-forward block with an MoE block and had a Mamba layer in there to replace attention. Their most interesting evaluation was comparing inference, which the other paper didn't emphasize as much. If you look at their chart, you can clearly see the lines for generation latency are below the transformer, the transformer MoE, and Mamba. If you do the math, it looks like 82 tokens per second for the 1.5 billion parameter model and 68 tokens per second for the 2.8 billion. The hardware that was running isn't explicitly stated in the paper, but that's still pretty decent throughput, assuming their setup wasn't too crazy. Yeah, two very strong data points. Any thoughts on this? I know you had talked about some background on the BlackMamba team.

Nathan Labenz: (1:06:55) Yeah. Not too much is known about them. Almost all this work is academic. There are a few papers that have an author from Together AI, which is affiliated with the original authors of the Mamba paper: it's Tri Dao that's at Together, and Albert Gu that's at Cartesia. So you do see a couple of Together folks sprinkled into the authors throughout this whole body of work, but mostly it's academic-type collaborations. This one stood out because it's from a company called Zyphra, Z-Y-P-H-R-A, a stealth startup about which basically nothing is known. They do have a website, but as far as I can tell, this paper is really the only content that's on there. And the company also appears to be very new. One of the co-founders left Conjecture just in the last few months. So they left Conjecture, started a company, immediately did this, and published it really just a handful of weeks later. That's definitely one to watch. There are two other comments I think are interesting. One is the trade-off with mixture-of-experts architectures in general: more efficient training and inference, but that does come at the cost of many more parameters total. Right? You've got a ton of parameters, but only a subset are being activated at any one time. So your facts are more spread out, maybe more accurately accessible; perhaps that's why it works better. But you do have to set all that up to actually productionize this on an increasingly complicated hardware foundation. You can't just have the experts on a hard disk somewhere; they need to be in high-bandwidth memory so that they can run quickly when they're called upon. And to do that, you need to have arrays of GPUs, each with experts sitting on them waiting. It definitely seems to suggest an advantage for your big tech incumbents with this kind of work, because who has that sort of infrastructure sitting around? The likes of Google. Right? 
They've been building out cloud infrastructure and figuring out how to route things and how to load balance since long before most of these other companies were even started. And just the capital that they have, and the software layer that they have to manage that complexity, definitely seem like a huge advantage. Even if you had the weights for GPT-4, it would not be easy to set that up. If you just think, okay, there's maybe 80 gigabytes of high-bandwidth memory on a single GPU, then if I have 2 trillion parameters, I'm gonna need well more than 10 GPUs just to set the thing up. And then I have to get into the load balancing and all these other interesting challenges. So I think that is just a really interesting observation: open source is gonna have a hard time catching up on that front. Right? You could maybe have open-source models, but you're still gonna need infrastructure providers to support that kind of work. So that's one observation. And then another thing I'm looking for is swapping out the state space portion of the architecture, as opposed to the MLP layer. It seems like it probably should work, or at least there is some version of it that should probably work, but I don't think we've seen anything like that so far. In the broader mixture-of-experts literature, there is some work on swapping out the attention portion. Again, it seems relatively not that deeply explored compared to just swapping out the MLPs, but there is some work there, and it does seem to be viable. So this is one where I think, jeez, if the selective state space is, in the big picture, not so much a replacement for attention as the analog to attention, the thing that figures out how we're gonna compute on this particular input in a special way that's not uniform across all inputs, then it would seem natural that it would also perhaps be swappable. 
And maybe you would need some kind of, again, I'm getting very speculative here, but thinking back to the last one, where the question was what works best if we put a Mamba layer first and then get into our hybrid, I could also imagine something similar here: process the sequence through an initial Mamba layer, or some initial kind of workup, and then start swapping perhaps both the selective state space and/or the MLP layers as well. So I don't think we've seen the end of mixture of experts. Certainly, writ large, it seems like it's gonna be a huge force, and I would definitely expect to see more work looking at more intricate sorts of swapping. What we have right now is the switch transformer, your vanilla MoE standard, which would be the baseline for many points of comparison in the MoE work from what I've seen, basically just showing that, okay, you can bring this over here, and it works. You can scale it up to a decent degree, and it works. And you get some of the same benefits, in terms of it taking less compute to get to a similar level of performance. What comes next, though, I think is probably a lot more intricate, and perhaps unexpected, forms of figuring out what to swap out under what conditions, what has to be preprocessed before that'll work, etcetera, etcetera. Okay, that's the end of part one of Mamba-Palooza. In part two, which should be out by the time you're hearing this, we get into all of the image segmentation and other computer vision applications of Mamba, as well as various projects that begin to realize the dream of super long context windows. Plus, we get into the potential problem of rotting internal states and even explore a little bit of how these hybrid state space attention models are already beginning to be used in the context of biology. There's a lot more to come. Can't wait to see you there. It is both energizing and enlightening to hear why people listen and learn what they value about the show. 
So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.
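The GPU-memory point from the discussion above can be checked with a quick back-of-envelope calculation. This is a rough sketch assuming fp16/bf16 weights (2 bytes per parameter) and 80 GB of HBM per device; a real deployment would also need memory for the KV cache, activations, and framework overhead.

```python
# How many 80 GB GPUs just to hold the weights of a 2-trillion-parameter model?
params = 2e12            # 2 trillion parameters (illustrative GPT-4-scale figure)
bytes_per_param = 2      # fp16/bf16
hbm_per_gpu_gb = 80      # e.g. an 80 GB accelerator

total_gb = params * bytes_per_param / 1e9
gpus_needed = total_gb / hbm_per_gpu_gb
print(f"{total_gb:.0f} GB of weights -> at least {gpus_needed:.0f} GPUs")
# -> 4000 GB of weights -> at least 50 GPUs
```

So "more than 10 GPUs" is, if anything, an understatement under these assumptions, which is the point: serving a frontier-scale MoE model is an infrastructure problem before it is a modeling problem.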


Turpentine Network Message (1:12:55)

Turpentine is a network of podcasts, newsletters, and more covering tech, business, and culture, all from the perspective of industry insiders and experts. We're the network behind the show you're listening to right now. At Turpentine, we're building the first media outlet for tech people by tech people. We have a slate of hit shows across a range of topics and industries, from AI with Cognitive Revolution to Econ 102 with Noah Smith. Our other shows drive the conversation in tech with the most interesting thinkers, founders, and investors, like Moment of Zen and my show Upstream. We're looking for industry-leading hosts and shows, along with sponsors. If you think that might be you or your company, email me at erik@turpentine.co. That's erik@turpentine.co.
