Mamba-Palooza: 90 Days of Mamba-Inspired Research with Jason Meaux: Part 2

Nathan and Jason Meaux continue their deep dive into Mamba-inspired research, exploring its impact on computer vision, context length, and hybrid SSMs in biology. Part two of a comprehensive look at this groundbreaking AI technology.



In this second part of a two-episode series, Nathan and AI scout Jason Meaux provide a sweeping overview of the first 90 days of Mamba-inspired research. They discuss Mamba's application to computer vision, experiments in extending effective context length, the potential problem of rotting internal states, and the use of hybrid SSMs in biology. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api

LINKS:
Show Notes and Paper Links: https://docs.google.com/document/d/1NK_a3deVL_aczORmSRw8LyujNPotpO7Kd90sIRj9Qx0/edit?usp=sharing

Nathan's original Mamba Deep Dive: https://www.youtube.com/watch?v=X5F2X4tF9iM

Part 1 of Mamba-Palooza: https://youtu.be/Bg1LQ_jWliU

Statespace.info: https://www.statespace.info/

X/SOCIAL:
@labenz (Nathan)

SPONSORS:
Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, instead of variable regional pricing; and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive

Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off www.omneky.com

The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://brave.com/api

ODF is where top founders get their start. Apply to join the next cohort and go from idea to conviction, fast. ODF has helped over 1000 companies like Traba, Levels and Finch get their start. Is it your turn? Go to http://beondeck.com/revolution to learn more.

TIMESTAMPS:
(00:00) - Episode Start
(00:01:14) - Nathan and Jason’s Bet
(00:04:51) - U-Mamba
(00:06:08) - Swin U-Mamba
(00:15:10) - Sponsors: Oracle | Omneky
(00:18:59) - Vision Mamba
(00:24:30) - VMamba
(00:27:26) - VM-UNet
(00:30:37) - Sponsors: Brave | On Deck
(00:35:15) - SegMamba
(00:36:59) - Vivim


Full Transcript


Nathan Labenz: 0:00 Hello, and welcome back to the Cognitive Revolution. This episode is part 2 of my Mamba Palooza with fellow AI scout Jason Meaux. If you missed part 1, you might wanna start there. And if you haven't heard my original Mamba episode from last December, I'd recommend starting with that one for important foundational context. In this episode, we'll be covering Mamba's application to computer vision, experiments in extending the effective context window, and a bit of biology as well. Once again, special thanks to Jason for putting in a ton of work to make this happen, and an open call to all of you to suggest additional topics for us to explore on the show, and especially to volunteer to go on one of these adventures with me. You do not need to have a PhD in machine learning to do this work. You just need extreme curiosity for the subject and a relentless drive to understand what's going on. As always, we appreciate it when listeners share the show online. A tweet is worth a lot, and a review on the major platforms is especially valuable. With that, here's part 2 with Mamba scout Jason Meaux, beginning with a discussion of a friendly bet that we made about just how many Mamba papers we each expected to see. Enjoy. Okay. Cool. So we are back. First, let's talk about the bet. Yeah. We got an interlude.

Jason Meaux: 1:16 The bet was between you and me. Maybe it was a few weeks ago. And so we decided, okay. In this 2 week period in February, how many Mamba papers will be published? I think I'm pleased with the results.

Nathan Labenz: 1:27 Yeah. This is definitely an object lesson in exponentials being crazy. In the process of setting this over under, I was thinking, okay, where are we on this curve, and how fast is it gonna bend? I said 5 and a half. So you wisely took the over. It quickly became clear that the over was going to win, and then we revisited. Did I reset, or did you reset the over under?

Jason Meaux: 1:51 Yeah. You reset it. I thought, wow, 14. That's gonna be a tough get for the reset. But we got fairly close. Right? 13 papers in 2 weeks?

Nathan Labenz: 2:01 Yeah. So 13 was the final answer. My instinct is to say, what's the over under for the next 90 days? It seems crazy to say that it would be more than 100 that would come out over the next 90 days. But then again, maybe not.

Jason Meaux: 2:14 I'll I'll happily set the line, and then I'll let you take the over under.

Nathan Labenz: 2:17 Okay. Sure.

Jason Meaux: 2:18 So, yeah, 90 days from now, I'll set the line at 55.

Nathan Labenz: 2:24 And that's 55 more new ones over the next 90 days?

Jason Meaux: 2:27 55 new ones.

Nathan Labenz: 2:27 Yeah. I think I have to take the over. Right? Because that would be less than a doubling relative to the first 90 days. Is the second 90 days of Mamba gonna have more or fewer papers than the first? I think I have to say more. You can check back on June 5 to resolve our wager. So let's get into vision. This is the next big deep dive section for us. And this one, I think, is interesting because, as you noted, it is the majority of the papers. It's been striking to see this much work on the vision modality, particularly because there was no aspect of vision or image processing in the original paper. I don't know why that is, but my theory would be that images are not sequential in the same way that all the other things are sequential. Right? In language, it's token to token. In music, it's proceeding through time, and you're predicting waveforms or whatever. In DNA, it's one base pair at a time. But in an image, you have this kind of different sort of challenge where it's not like there's a single order to the pixels. Right? They're in, at least usually, a 2-dimensional configuration. We could have additional dimensions, whether that's going from image to video, adding a time dimension, or even going into 3D imaging. So all of those are represented in this Mamba literature so far. Now transformers also have a little bit of a challenge with this, and the way that it's typically been handled is by moving things into patches, just chunking images down into small bits and then treating those bits as tokens. And that's worked pretty well in transformers. But, again, with the transformer, it's not recurrent. Right? It's not, like, processing just one at a time. It's processing everything at once. Even if there is a sequence to it, everything can talk to everything. That patch approach hasn't just been basically fine; it has worked really well. And, of course, we have lots of vision transformers.
So how did folks handle this in the context of image understanding? Let's do a rundown.

Jason Meaux: 4:37 Sure. I guess if you look at the literature of Mamba, over half of the papers deal with vision. By far, the most common vision task that's tackled is biomedical image segmentation. This was kicked off in early January by researchers at the University of Toronto, a heavyweight for a lot of machine learning work. Their paper is called U-Mamba. They used a convolutional neural network (CNN) block that they then interleave with a Mamba block. Their motivation was thinking the convolutional layers could be useful for local feature extraction, so really focusing on what is at the moment in hand, and then using the Mamba SSM for tracking long-range dependencies. In the paper, there's a really useful diagram of that. It has this U shape of what's going on. And you can see there's a chart on the left where we first enter this sort of convolutional layer. It's flattened, and it goes through a layer normalization. And then there's this block that includes the Mamba SSM layer. The combination of these two features helps them get positive benchmarks for organs, CT scans, MRIs. This architecture was able to perform right there at the state of the art, sometimes beating the state of the art, at least as they measured the competition. Very promising paper, and it kick-started a flurry of image segmentation papers that came later. I guess, quickly, I could name one. There was also Swin U-Mamba, and they actually cited beating out U-Mamba. So we now have Mamba variants not just comparing themselves to legacy transformer or other vision models, but comparing themselves to other Mamba variants. And so each paper is slightly innovating. Just one more thing to say about that use case, because it's one that is somewhat new to me. You can think of a doctor who's trying to monitor the disease progression of a patient.
They may see hundreds of grayscale images, which creates a huge burden for the doctor to separate organs from tumors from different types of tissue. The better these models can be, the better that feedback cycle is for the biomedical field. If you can really quantify in very precise ways what's happening in a certain organ, that can lead to better outcomes.

Nathan Labenz: 7:11 Yeah. It seems like in this section, it is definitely worth emphasizing that this is unlike the language tasks throughout this whole body of literature, which mostly seem to be attempting to compare whatever Mamba version they've cooked up against some sort of best transformer recipe we know, which is actually the language that the original Mamba paper uses. Right? They talk about, is it Transformer++ or something like that that they call it in the original one? But, basically, they describe it as the best transformer recipe that we know, which is interesting, and I trust those guys because they're obviously leaders in the field. I trust them to know a pretty good transformer recipe, but there could be a better one out there. And, obviously, a lot of these things are now kept secret. So we do have a bit of a challenge in those language ones where it's, like, smaller datasets, not huge parameters, not really trying for state of the art, but instead trying to compare on a somewhat like-for-like basis. But you always do have to keep in mind how hard they really worked to optimize the other one. Was it really an awesome transformer version that they used, or perhaps not the best? I think that is a lurking caveat in some of these papers for sure. Here in the image segmentation portion, it does seem like we are really talking, like, actual state-of-the-art type performance. These datasets, first of all, are, like, hard to create. It might even be worth taking just one more step back and talking about what the image segmentation task is. At least as it's done with the U-Net, in a way, it really is image generation. It's like image to image. Essentially, you put an image in. It processes it through this U-Net.
And the reason it's called a U-Net is because it gradually works from more bits of input and fewer dimensions for that input to a different shape where there are fewer channels, you might say, but greater depth in each channel. And the whole point of that, and this is a fairly classic approach for a convolutional network structure, right, is that you're working your way up from the lowest-level information to these, like, higher-level features through these blocks. And so it makes sense that you're considering more of the image, and things are mixing together as you go through these layers. And there's more information for each of these bits of the image that you're considering, but you're abstracting away from the very low-level pixel details up to some sort of higher-order, still positionally related, but more kind of blurry portions of the image. And there's more parameters or more numbers to describe each of those positions within the image. This is, like, pretty classic U-Net type stuff, and it's analogous to how the transformer goes through layers and works up to these sort of high-order concepts. This is the vision equivalent of that, where at the bottom of the U-Net, you have the fewest locations to consider because you've blurred and, you know, zoomed out a little bit from the lowest-level details, but you've got the greatest depth of meaning for each of those positions. And then the up side of the U-Net is working back the other way and now moving back toward a new image. So we start with an image, work down the U-Net toward high-order concepts, and then work up the other side of the U-Net toward a new image to output, where all of that high-order stuff can now be cashed out into a new prediction. And there are also skip connections. The way these are typically drawn, there's, like, the U and then there's lines across the U. And this is basically saying, okay.
We're also going to, at each stage, take that earlier representation, as we were working down the U-Net and working up concepts, and pass it over to the corresponding same-size portion on the generation upswing, so that we are grounded by that original, most concrete representation. And then what's output is basically a color overlay that goes on top of the original image. I actually don't know if they're generating a pure overlay that then gets merged onto the image or if they're generating, like, the image with color all kind of in one. What you end up seeing is this black and white grayscale image, as you said, and then a version of that where shapes have been drawn on in different colors, and the different colors correspond to different things. And this is useful, as you said, for all sorts of reasons. I think it is interesting to realize that the image segmentation is essentially an image generation task grounded in the original image, and to make the analogy between the bottom of the U-Net and the middle layers of the transformer. And now you might also say, okay, with all that said, what's the Mamba doing here? And there, the traditional way to do this has been convolutions. And the challenge with convolutions is that they are inherently local. So you're going patch by patch and convolving and mixing the information in each of these local patches, working your way through the image, and that's how you progress from one block to the next: doing all these local convolutions and passing them on. You could eventually have everything convolved into one, perhaps, but there's no way early on to start to understand the relationship between distant parts of the image. And that is where the Mamba portion of the architecture really shines. Now I can take a full pass through the entire thing, or perhaps bidirectional or multidirectional passes, and I can get that kind of comprehensive view at the same time.
And then combining that with the convolution is where this thing gets its very best results.
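To make the U-Net shape described above concrete, here is a toy NumPy sketch of the data flow: downsampling toward fewer positions with greater depth, then upsampling back toward an image, with skip connections across the U. This is illustrative only; average pooling and nearest-neighbor upsampling stand in for the learned convolutions, and all function names are made up for the example.

```python
import numpy as np

def downsample(x):
    # Halve spatial resolution by 2x2 average pooling (stand-in for a strided conv).
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    # Double spatial resolution by nearest-neighbor repetition.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(img, depth=2):
    skips = []
    x = img
    for _ in range(depth):               # encoder: fewer positions, saved for skips
        skips.append(x)
        x = downsample(x)
    for _ in range(depth):               # decoder: back up toward image resolution
        x = upsample(x)
        skip = skips.pop()
        x = np.concatenate([x, skip], axis=-1)  # skip connection across the "U"
    return x

img = np.random.rand(8, 8, 3)            # an 8x8 "image" with 3 channels
out = toy_unet(img)
print(out.shape)  # (8, 8, 9): spatial size restored, channels grown by the skips
```

The concatenation at each decoder stage is the "lines across the U": the upswing stays grounded in the concrete representation from the matching encoder stage.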

Jason Meaux: 13:05 That was an excellent summary of how everything is working. I think your framing of it as an image generation task moves my values a little bit. That's maybe the one area I have underappreciated, having focused more on language. It's not just, make a picture of a group of puppies. It's, perform an image generation task that segments this image and could ultimately save lives. Yeah. Your framing has somewhat moved my priors on the importance of image generation.

Nathan Labenz: 13:35 Definitely check out the old episode that we did with Tanishq Matthew Abraham too. He did a project on virtual staining of tissue. A typical tissue stain that you might use to look for various pathologies in the tissue involves, first of all, cutting the tissue out of your body, which is not to be taken lightly, and then slicing it on a glorified meat slicer, and then plating that on a piece of glass and then dropping a chemical on it to stain it so it turns colors so that you can actually see it effectively, and then having a person look at that under a microscope. And these days, you might also take a picture of that, and you can have a classifier trained on those images. But to actually get your tissue into even an AI classifier, you have to go through that process of cutting it out, slicing it, staining it, and it takes a lot of time, hours for somebody to work that thing up. So that means you can't do it while in surgery. His new technique was building in part on a probe, and this was not his particular work, but there's a new kind of probe that allows you to essentially get these kind of 2D images of tissue while the tissue is still in your body, so you don't have to cut it out to get a look at it, which is a great start. But then you also can't stain it, because you don't actually have it, so you can't drop the chemicals on it. So now how do you actually bring this to visual accessibility? Because it's all, again, grayscale and hard to see. Virtual staining was the answer for that.

15:06 Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: 15:10 And this now takes the time down to get a read on what is up with this particular bit of tissue, from a slight carve out and a slice and whatever and however many hours that takes to finally get something back, to using the probe to get the image, sending this into the virtual stain, and boom, you've got a stain on screen. That whole round trip can be seconds. Now it may still be years before they actually get this into the clinic at any scale, because that's a whole process, but that was definitely a really impressive piece of work. And one of these last papers, the Swin U-Mamba, incorporated pre-training. They did a very similar thing to, you know, some of the other papers in terms of having a 4-way scan, but they boosted the performance even further just by doing pre-training on just, like, general images. Starting from scratch, because the data is so scarce in these medical settings, can be really tough. And that was something that Tanishq also had to overcome. I think with the stains you would have more data, but there was another thing that he did called Reconstructing the Mind's Eye, where they basically take the result of a brain scan and try to reconstruct what you were looking at at the time based on the activity observed in the brain. That data is super hard to get. You have to be in an fMRI. How many people wanna sit there and look at images and have their brains scanned just to create datasets? It's expensive. There's not a lot of data. But where he had a significant result was using large-scale foundation models for vision and then fine-tuning them to this particular context where there's not a lot of data, but there was enough to fine-tune. And I think this is a similar thing that they did with this Swin U-Mamba paper, where they basically said, hey, we only have so much of the various kinds of medical imagery in these off-the-shelf datasets.
But maybe if we just train this thing on general imagery first, it will already have a lot of at least the low-level features. Right? It maybe should already have edge detection. It should already have these kind of primitives, if you will, of vision, and then we can do that later training and adapt it to this particular situation. And that seemed to work. I don't think it was, like, a huge advance in terms of the number of percentage points of improvement that they got. And it turns out there's a lot of medical image segmentation benchmarks out there. So from what I could tell, also, they seem to be referring to somewhat different benchmarks for their evaluation. So it's hard to say exactly how much improvement they got out of that, but at least, you know, a few points. And that's just from pre-training on ImageNet. So you could definitely imagine a bigger, better version of that too, if you wanted to take something like this up to Internet scale. You wanna take Vision Mamba first?

Jason Meaux: 17:47 Yeah. Absolutely. We'll jump into Vision Mamba. This was a work that came out mid January from some groups in China: Huazhong University of Science and Technology and an outfit called Horizon Robotics. They use bidirectionality to learn visual representations. So the model consists of a forward and backward pass. And the idea is when you allow information to flow in both directions, you can get a better representation of what's going on visually. They mentioned that Vim, the short acronym for Vision Mamba, is 2.8 times faster than the transformer that they compared to. It saved over 85% in GPU memory when they performed batch inference. Those images had resolutions of 1248 by 1248. And just to underscore the level of enthusiasm they have, I'll just quote their paper: "The results demonstrate that Vim is capable of overcoming prior computation and memory constraints of transformer-style understanding for high resolution images. It has great potential to become the next generation backbone for vision foundation models."

Nathan Labenz: 18:59 Yeah. A couple of things jump out to me about this one. First of all, going back to the original Mamba monologue, one of the things I had speculated about is that we should start to see variations with multiple states, multiple state space models working together. And this, I think, was the first one that actually hit the public. And just to clarify, when you say a forward pass and a backward pass, maybe we should use the terms forward scan and backward scan, so as to differentiate. That's funny, differentiate is another overloaded word in this case. We wanna distinguish between the forward pass of, like, proceeding through the model and the backward pass of backpropagation and actually updating the weights. Here, what we have is, within a Mamba layer, there is a fork where the image has been turned into a sequence, and you can take an image and just put all the pixels in a sequence or put patches of the image in a sequence. But as we mentioned, you have the problem that the Mamba structure can only process that in order, and it can only decide how to handle certain information based on what it has already seen. What if there is a surprise later, a pattern that wasn't obviously important the first time you saw it, that maybe later becomes important, but you didn't really remember it? So how do you handle that? What they do within a given Mamba block is just take that order, reverse it, and then run two different selective state space models in parallel. And then, at the end of that, they do one of my favorite things in all of machine learning, which is just add them together, just superimpose them on one another, and proceed. And in that way, you now have basically two angles, two views of the same data, and they call it forward and backward. And as we'll see, there are gonna be other variations on this as well. Where my mind went to was the Silicon Valley center-out compression.
Why are we starting from one end and going to the other? What if we could do something more radial? Right? It seems like the important stuff is at the center of the image. Who knows if that will really be needed to have a further unlock? But this definitely feels like one of those things where it's early going: we know that images can't fully be represented in a single sweep, a single scan through the image, so what if we try two? We'll take one from either end and hope for the best, and sure enough, it seems to work pretty well. Another thing that definitely jumps out is the fact that there's a robotics institute. I don't know what Horizon Robotics actually is, whether that's a company or an institute or whatever. But, clearly, when you are working with robotics, you're interested in doing stuff on the edge, and the resources that you can bring to bear are quite different. And so when they compare the transformer to the Vision Mamba architecture, they show that you actually get to a memory requirement of greater than 80 gigabytes to process an image of 1,200-plus by 1,200-plus pixels. That's a lot to carry around on the edge. The Vision Mamba version drops that memory requirement down to 11 gigabytes of memory. And that's, like, under what my several-year-old MacBook Air has. So you go from exceeding what a pretty high-end GPU has to fitting onto a laptop level of memory. And that's obviously a big deal if you're trying to build robots and have them do stuff and have them be responsive to their environments.
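The bidirectional trick discussed above, a forward scan plus a backward scan whose outputs are simply summed, can be sketched with a toy causal recurrence standing in for the selective SSM. This is an illustration of the idea, not the actual Vision Mamba code, and the decay constant is arbitrary.

```python
import numpy as np

def causal_scan(x, a=0.9):
    # Toy stand-in for an SSM scan: h_t = a * h_{t-1} + x_t,
    # so each output only sees tokens at or before position t.
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = a * h + x_t
        out.append(h.copy())
    return np.stack(out)

def bidirectional_block(x):
    fwd = causal_scan(x)                 # forward scan over the sequence
    bwd = causal_scan(x[::-1])[::-1]     # scan the reversed sequence, then re-reverse
    return fwd + bwd                     # superimpose the two views

seq = np.random.rand(16, 4)              # 16 "patch tokens", 4 channels each
y = bidirectional_block(seq)
print(y.shape)  # (16, 4)
```

With the backward scan added, even the first position's output depends on tokens that come later in the sequence, which a single forward scan could never achieve.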

Jason Meaux: 22:52 Related to this, VMamba came out. So do you want to tell us a little bit about VMamba?

Nathan Labenz: 22:59 Yeah. It's gonna be hard to keep up with all these names. So the first one was Vision Mamba. The second one is VMamba. And going through this, I felt good about my intuition, or perhaps I should feel like my ideas are obvious, because just as I was looking at the forward scan and the backward scan and thinking, jeez, I bet you could do a lot of different scans, what would a center-out scan look like? VMamba introduces what they call the cross-scan. It is actually a 4-way scan where they start in each corner, and they strike the image 4 different ways. That creates 4 different representations; they run those in parallel and then just superimpose them all right back into the same space again. From the diagram in this paper, it looks like there are 4 independent parallel computations through the state space mechanism. But that's not entirely clear from the paper, and there are some other examples that show it seemingly differently. My best guess right now would be that you can do it multiple ways. And, you know, we may not know what is optimal, but they may all kind of work. But in this case, it looks like they are running 4 copies of the state space mechanism. Now this is relatively small. The parameter range for this is, like, 20 million up through 70 million parameters. For the last one, it was similarly in the, like, low tens of millions of parameters. So this is small to the point where you could run 4 in parallel, and you probably don't have to worry about it that much because it's all gonna be pretty fast. Definitely a question starts to arise if you think, hey, what would it look like to try to do this at 7 billion or 70 billion? Then can you just run four state space processes in parallel and merge them? You could conceptually, but how much would that start to erode the hardware-aware design? My guess is it probably can be made to work, but it's just complicated.
If the internal hidden state is big enough that you can only fit one on the available SRAM, then presumably you have to actually start parallelizing this across different hardware, which is certainly done. But those are the kinds of things that a lot of these early papers seem to be skirting around and not really worrying about, just because the datasets and the models that they're working with are so small that they just don't really have to worry about it.
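One simplified way to picture the 4-way cross-scan idea is as four different traversal orders of the same patch grid, each fed through the same toy scan and then superimposed back into the original layout. This is a hedged sketch, not VMamba's implementation: here the four orders are row-major and column-major, each forward and reversed, and the scan is a toy decay recurrence.

```python
import numpy as np

def causal_scan(x, a=0.9):
    # Toy stand-in for a selective SSM scan: h_t = a * h_{t-1} + x_t.
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = a * h + x_t
        out.append(h.copy())
    return np.stack(out)

def cross_scan_orders(h, w):
    # Four traversal orders of an h x w patch grid, in the spirit of a
    # 4-way cross-scan: row-major and column-major, each forward and reversed.
    idx = np.arange(h * w).reshape(h, w)
    row = idx.flatten()       # sweep rows left to right
    col = idx.T.flatten()     # sweep columns top to bottom
    return [row, row[::-1].copy(), col, col[::-1].copy()]

def four_way_merge(tokens, h, w):
    # Run the same scan along each order, un-permute, and superimpose.
    out = np.zeros_like(tokens)
    for order in cross_scan_orders(h, w):
        out[order] += causal_scan(tokens[order])
    return out

tokens = np.random.rand(12, 4)            # a 3 x 4 grid of patch tokens, 4 channels
merged = four_way_merge(tokens, 3, 4)
print(merged.shape)  # (12, 4)
```

Because each order visits every patch exactly once, the fancy-indexed `out[order] +=` puts each scanned result back at its original grid position before summing the four views.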

Jason Meaux: 25:30 Yeah. I'm with you on that. I think you and I have talked about SRAM constraints. There's a clear signal being sent to the market that there are architectures, including transformers, that benefit from increasing the size of SRAM. There are some companies that have already released some chips that do that. And, yeah, if you run things on a distributed compute cluster, how does that affect the hardware-aware algorithms? A lot of that we have yet to really answer.

Nathan Labenz: 25:59 Already, we have two related works. It's not always super obvious in going through this literature how much is directly based on earlier stuff and how much people are coming to the same conclusions independently. But one adaptation of this is called VM-UNet. That's Vision Mamba UNet for medical image segmentation. And they take very direct inspiration from the 4-way scan and essentially apply it exactly as developed in the Vision Mamba paper, but they're applying it for medical image segmentation. But I also wanted to highlight this other one called Mamba-ND, and this one came out a little bit later, February 8. They, again, do this 4-way scan, and they also generalize a little bit past this to consider the video modality, and that involves adding a temporal dimension as well. So they describe the image as 2D, and they talk about basically the height and the width dimension, and then they add on a temporal dimension. They also do some interesting things. They, like, look at weather data as well. So a lot going on here. But what I thought was really interesting about this paper is they take this multi-way scan, and then they look at this question of how we should wire up these different state space processings of these different scans. And they've got quite some elaborate wiring diagrams in the paper that basically show, okay, here's what the basic one looks like, just as a reminder. Here's what the bidirectional one looks like, where you've got 2 running in parallel. That's what we referred to a minute ago. Then there are the 4 running in parallel for the 4-way scan, and that is, again, I think, what the Vision Mamba paper does. But then they also start to look at doing things partially in series. What if we put all 4 or all 6 of our things in a series? What if we put 4 of them in parallel and then recombine and then have 2 more in a series? Obviously, the kinds of permutations on this are endless.
What they report in this Mamba-ND paper is that the sequence actually was the highest performing. So they literally just put one selective state space mechanism after another, each handling a different version of the scan as input, but happening one at a time and just passing their results on to the next one. So you've got the skip connection where it's, here's what came out of the last state space processing, and here is your particular version of the scan. So what would I rather do? Again, this is me model-izing myself, right? Would I rather clone myself 4 ways and have each version of myself look at something from a particular perspective and then come back and put our heads together, merge our thoughts, so to speak, and figure out what we can say about it? Or would I rather, which is really the only thing I actually can do in practice, just run one scan, learn what I can from that, and then scan a different way, learn what I can from that on top of what I already learned, and then scan a different way again? And that sequential processing of the different scans, taking into account what was already learned from the earlier scans, is what they report as giving the best results in this Mamba-ND paper.
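The parallel-versus-sequential wiring question discussed above can be sketched in a few lines. This is a hedged toy, not the Mamba-ND code: a simple decay recurrence stands in for the selective SSM, and the two wirings differ only in whether each scan sees the previous scans' outputs.

```python
import numpy as np

def causal_scan(x, a=0.9):
    # Toy stand-in for a selective SSM scan: h_t = a * h_{t-1} + x_t.
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = a * h + x_t
        out.append(h.copy())
    return np.stack(out)

def parallel_combine(x, orders):
    # Independent scans over each order, un-permuted and superimposed at the end
    # (the Vision Mamba / VMamba style of wiring).
    return sum(causal_scan(x[o])[np.argsort(o)] for o in orders)

def sequential_combine(x, orders):
    # Each scan conditions on what the previous scans produced, with a
    # residual (skip) connection at every step (the wiring Mamba-ND reports
    # works best): later scans can react to what earlier scans extracted.
    for o in orders:
        x = x + causal_scan(x[o])[np.argsort(o)]
    return x

seq = np.random.rand(16, 4)
orders = [np.arange(16), np.arange(16)[::-1]]   # forward and backward orders
p = parallel_combine(seq, orders)
s = sequential_combine(seq, orders)
print(p.shape, s.shape)  # both (16, 4)
```

Since each order is a permutation, `np.argsort(o)` is its inverse, which puts each scanned output back at its original position; the only structural difference between the two functions is that `sequential_combine` feeds each scan the running result rather than the raw input.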

29:31 Hey. We'll continue our interview in a moment after a word from our sponsors.

Nathan Labenz: 29:35 I guess I wanna say my intuition is that as I talk through it, that does make sense, because now I'm not just scanning independently in isolation and then merging later. Now I actually can take into account the results of the first scan as I do my second scan. I already looked at it one way. Now I'm looking at it another way, but I can keep that first view in mind, so to speak. I'm mindful that I might just be post hoc rationalizing that, but it makes more sense now as I talk through it. What do you think? Do you feel like I'm talking myself into something here, or does that resonate?

Jason Meaux: 30:10 No. I think it resonates. Doing things in parallel is great for efficiency and distributing workloads. But I think your comment is correct. The whole point of the selection mechanism is that these features are input dependent. So whatever bits of information you picked up on the first scan, the only way you could incorporate that extra input dependence would be doing it in sequence. It's interesting to think about, though, because there may be ways around this.

Nathan Labenz: 30:40 I think that wraps it up. Toward the conclusion, they have a nice little figure that shows what they call the effective receptive field of various designs. They do show a unidirectional, just a single scan, versus a bidirectional, both directions, versus the multidirectional, the 4 way or however many way scan. And you can see that the receptive field, that is, the way that an individual bit of processing can be informed by the rest of the image, is clearly maximized by the multidirectional approach. But it's just fairly intuitive really to say, again, everything has to be a sequence. Right? This is 1 of those key points to emphasize for intuition. This thing can only handle sequences. Everything has to be a sequence. So you have to turn whatever it is you're working with into a sequence, and then you also have to think, if I do turn it into a sequence in that way, what might I miss? Or how could I maybe represent it as multiple sequences in order to get a robust view of this particular data type? So, yeah, that's definitely a big theme. And I think that basically is the core, that's my biggest takeaway of this entire vision section. So we can maybe move a little bit faster through some of the subsequent papers with that in mind and just see how it's gonna play out in a number of different ways. Let's run down the, I guess it's 4 or 5 more still that we have in the vision section here. I think we've got the core themes established. So, really, at this point, it's just noting that this same technique is working in all these different contexts with, like, subtly different modalities. The next 1 is called SegMamba, and it goes now to three-dimensional data. And it definitely also shows the importance of being able to handle long sequences, because they're looking at data that is in a 64 by 64 by 64 grid. And you're like, that doesn't sound that big. That's 260,000-plus mini cubes within the 64 by 64 by 64 cube.
So that's a 260,000 length sequence if you're just going 1 cube at a time. It's also amazing, by the way. I don't know exactly with this 1, but some of these things are actually quite crude measures. The 1 that Tanishq was working with, Reconstructing the Mind's Eye, was blood flow to the region. And it was a region of, I think, 0.2 mm cubed, something like that. A small region, but, like, you can fit a lot of cells in there, and they were literally just measuring blood flow. We're not talking, like, measuring neuron spikes here. Nothing that precise. Literally just blood flow to a little region; I remember it being, like, almost a grain of rice kind of size. Really not big, and looking at such a crude measure as blood flow. Something probably fairly similar happening here. If you're breaking any substantial tissue into 64 by 64 by 64, you're getting little chunks. But, again, that's already a long sequence. What they do in this paper is a 3 way scan. They do slices. They do a forward and a backward on each slice, and then they do 1 that cuts through the slices. The whole thing was run on just 4 A100s.
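A quick sketch of what those three traversals look like on a 64-cubed volume. This is just index bookkeeping for illustration, not SegMamba's actual code:

```python
import numpy as np

d = 64
idx = np.arange(d ** 3).reshape(d, d, d)      # axes: (slice, height, width)

# forward raster scan, slice by slice, and its reverse
forward = idx.ravel()
backward = forward[::-1]
# "cutting through the slices": make the slice axis the fastest-varying one,
# so consecutive tokens sit at the same (height, width) in adjacent slices
through = idx.transpose(1, 2, 0).ravel()

# one token per mini cube: a 262,144-step sequence for a 64^3 volume
assert len(forward) == 64 ** 3 == 262_144
# each traversal visits every cube exactly once
for order in (forward, backward, through):
    assert len(np.unique(order)) == d ** 3
```

So "260,000-plus mini cubes" is exactly 64 cubed, 262,144 tokens per traversal, which is why long-sequence handling matters even for a volume that sounds small.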

Jason Meaux: 33:57 Yeah. That's pretty promising. You can only imagine how much compute you could get yourself into when you start talking about 3D medical image segmentation. 4 A100s is almost nothing, so that's a positive sign. Anything else on vision we wanna cover?

Nathan Labenz: 34:13 I think this next 1, Vivim, now moves to video. So instead of having a three-dimensional object in terms of space, it's now a 2 dimensional object that changes through time. Again, it's another 3 way scan. So here, they are scanning forward in time and backward in time, and then the third dimension is within single frames from the video. So generalizing to video seems again to work pretty well. And then 2 more that I lumped into this vision section, 1 called MambaMorph. I thought this 1 was maybe most interesting just because of the nature of the task. I hadn't really seen this before. They have 2 different kinds of scans, a magnetic resonance scan, an MRI, and a CT scan, a computed tomography scan. Now you take these 2 different scans, it's 2 different machines. Right? It's the same body of the person, but you're gonna have disagreement between these 2 scans in terms of the shape that they are seeing. What do you do about that? This task is called registration. You have each of the scans, and then the output, and this is another segmentation-like sort of task, but instead of generating a new image, the output is a deformation field, which describes how the 1 image needs to be deformed to fit with the other image. Because your body was in different positions or whatever as the scans were taken, they just don't align, and so it becomes very hard to read them together. This deformation field that this thing can create can then be applied to 1 of the original scans, snapping them essentially together in alignment so that you can see the results all in 1 view. Just another interesting use case where, you know, imagine how hard that would be to do. Right? You basically just can't do that stuff manually. But there's actually 2 Graph Mambas. You wanna take Graph Mamba?

Jason Meaux: 36:13 Yeah. Graph Mamba. These are 2 papers I'm probably a little bit light on. The first 1 is titled Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces. What they observed is that attention mechanisms really have been the state of the art here for tracking long range dependencies among nodes, and they wanted to see if state space models such as Mamba could enhance the long range context modeling in a more efficient way. And they go up to 1,400 nodes. The average GPU memory usage for Graph-Mamba is very low, sub 200 megabytes. And the transformer based model they compared against basically errors out at 800 nodes, a bit over half that, and you can just see memory kind of top out there. That was probably the biggest thing that jumped out at me in this paper. Anything else, Nathan, on this 1?

Nathan Labenz: 37:09 Yeah. I can't say I'm an expert in graphs either. The sort of problem seems to be like predicting what amino acid might appear in a given position within a protein sequence, for example. And that's a linear sequence in some sense, because the DNA sequence is linear, but it's also like a three-dimensional thing, and there's different residues that interact with each other, and it just gets extremely complicated. So they end up being represented as graphs. I think the key thing to know here is that graphs don't necessarily correspond to a single sequence. Right? And, again, I'm not an expert in the math of graphs. But you can just imagine a graph, and let's say you have a, and then you have b points to a, and c also points to a. What's the order of that? There isn't really an obvious order. You have 2 things that both point to the same thing. Whether b or c should come first, or maybe even a should come first, that is not obvious. And so they essentially have to develop heuristics, and there are 2 of these graph papers, to flatten the graph structure into some sequence. And that's a deterministic process that uses the best judgment that they have to figure out how to cluster these and what should come before what. And then 1 of the papers also does a reversal again, so you have whatever their best heuristic guess was and the reverse of that. But also, as you noted, there's the performance. Trying to handle all these interactions is extremely memory intensive. But with these scans, it's just way easier to handle it at a reasonable amount of memory. The flattening, to me, again, we're in the Seeing Like a State era. I think of this as seeing like a Mamba. It's about what is the sequence representation that I can bring to all these different things. And I may need deep expertise to figure out how to flatten this graph structure into a sequence.
And then for good measure, I might need to just go ahead and reverse that as well. But if I can do that, then I can cast all these different kinds of problems as something that the new architecture can handle.
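The a/b/c example above can be sketched with a toy flattening heuristic. The actual papers use more elaborate clustering and priority rules; alphabetical depth-first order here is just a stand-in:

```python
def flatten_graph(edges):
    """Deterministically flatten a DAG into one node sequence.
    Post-order DFS puts a node after everything it points to, so
    dependencies appear before dependents; ties break alphabetically."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for target in sorted(edges.get(node, [])):
            visit(target)
        order.append(node)

    for node in sorted(edges):
        visit(node)
    return order

# b -> a and c -> a: no single obvious order, so pick one heuristically...
edges = {"a": [], "b": ["a"], "c": ["a"]}
seq = flatten_graph(edges)       # ['a', 'b', 'c']
# ...and, as one of the papers does, also feed the model the reverse
rev = seq[::-1]                  # ['c', 'b', 'a']
```

The point is that any such flattening is a judgment call: b-before-c here comes purely from the alphabetical tiebreak, and the reversal is one cheap hedge against whatever that choice misses.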

Jason Meaux: 39:24 Yeah. That's a good 1. You need to trademark that, seeing like a Mamba. That's gonna be in a paper title coming up, I'm sure.

Nathan Labenz: 39:30 That's right. Watch out for it. Calling our shot now. That brings us to the end of vision, and I think we've probably harped on the key points enough. Everything has to be a sequence. Figuring out how to represent different things as sequences seems important, and there seem to be a lot of ways to do that, a lot of different multiscan approaches, and seemingly a lot of flexibility in terms of exactly how you wire that up. Do you put each scan in parallel and then combine them at the end? Do you put them in sequence? We got at least 1 paper that did a systematic study of that and found that sequence works best. Most of them seem to be doing it more in parallel. So I think the final word is definitely not yet written on the best practices there. But my current read would be, it seems like if you just kind of do a multi scan approach and wire it however you want, it'll probably work.

Jason Meaux: 40:24 That was a great summary of vision. I would love to jump into Long Mamba.

Nathan Labenz: 40:28 Yeah. So now we're on to, I think, our final deep dive section. Right? And this is about actually using the super long context. And, again, this is 1 of the things that got us really excited about this in the first place, that you could process, in theory, an arbitrarily high number of tokens and develop some high dimensional internal representation of that full history, not with full fidelity necessarily, but ideally with all the important parts represented that you'd really need to be successful. That's again why I think multiple states seems inevitable, because you probably can't have 1 approach that will do everything the right way. But maybe across a few, you could really get there. But the fact that this is linear in sequence length, because it is constant time per step, is just so exciting. It's, man, what if we had 1,000,000? What if we had, you know, a 100,000,000? There's, in theory, no limit to how many tokens you could process and gradually build up this super high context state. But now, okay, cool. We saw in the original paper, DNA was running long. They didn't have super long text. So this first paper is taking a step in the direction of, okay, let's actually use some longer text episodes for training.

Jason Meaux: 41:46 Yeah. Absolutely. It was exciting when this code repo came out. Long Mamba, they used a subset of the same dataset this model was trained on, and they simply cut it off at a little over 16,000 tokens. And so after training on 16,000 tokens, the perplexity curves look good. Perplexity itself generalizes well past 16,000. You don't see an increase in perplexity till past 40,000 tokens. So that was an indication that something's happening, something's generalizing past the training window. And then they ran a test that's become popular called needle in a haystack. It was popularized by Greg Kamradt on Twitter, who ran these evaluations on Claude and GPT-4. Essentially, what that eval is meant to do is hide some information inside varying positions in the context window. So you can imagine a 10,000 window. What's the sensitivity in that window for the model to find information if it's near the beginning, in the middle, or at the end? They used a slight variant that was in the academic literature from a paper last year. I'll just read the actual text of it to give an idea, because it's funny that we're evaluating models this way, but it says, there's important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there. And then there's this simple repeating sort of noise. The grass is green, the sky is blue, the sun is yellow, kind of repeats. And then somewhere within the context, there is a passkey. For example, the passkey is 12362. Remember it, 12362 is the passkey. So imagine running that eval now hundreds and hundreds of times, and they came up with a needle in the haystack result that actually looks pretty good. At 16k context, it's nearly perfect. There's a couple spots where it looks like there may have been some imprecision.
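A sketch of how a passkey prompt like that gets assembled. The exact filler and wording vary between papers, and `passkey_prompt` is a hypothetical helper of my own, just to show the mechanics:

```python
def passkey_prompt(passkey, depth, filler_lines=400):
    """Build a passkey-retrieval prompt: repeated filler noise with the
    passkey hidden at a chosen relative depth (0.0 = start, 1.0 = end)."""
    header = ("There is important info hidden inside a lot of irrelevant "
              "text. Find it and memorize them. I will quiz you about the "
              "important information there.")
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."
    lines = [filler] * filler_lines
    lines.insert(int(depth * filler_lines), needle)   # bury the needle
    return "\n".join([header, *lines, "What is the pass key?"])

# sweep depths to map where in the window retrieval succeeds or fails
prompts = [passkey_prompt(12362, d / 10) for d in range(11)]
assert all("12362" in p for p in prompts)
```

Running that sweep at many context lengths is what produces the familiar depth-versus-length heatmap.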
At the max context length of 16k, it was actually in the initial position near the front where the model failed to make the retrieval, which is sort of interesting. There's been this tendency in transformers, in some evaluations of transformers, this lost in the middle idea, and with the SSM, it actually seems to be more lost at the beginning, lost at the top, when you try to push the model past its training window for this evaluation. It does generalize a little bit. So there is a bit of an ability to go past 16,000 tokens and retrieve the passkey, but it quickly diminishes; anything beyond 20,000 tokens, it's going to struggle. I guess 1 other thing is the author was only able to train on 16,000 tokens of context. I believe there was a memory constraint at that point. This extends a conversation that was on GitHub between Tri Dao and Zhang, which is this idea of applying this Transformer-XL style modification. If you wanna train on something like long context length, perhaps you don't have to do that all in 1 go. You can, in the case of Mamba, save the SSM hidden states after a batch and then load those SSM hidden states back into the model and then continue training from there. So rather than stopping training Mamba at 16k, you can just keep training in batches of 16k on the same hardware. So that's an interesting thing. There's some code in this repo that tries to implement that. There's nothing inherent about the architecture that would prevent this. You should be able to do this Transformer-XL style training. You should be able to do this saving of the hidden state. So that's Long Mamba. Any thoughts on that?
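Why the architecture permits this is easy to see with a toy 1-D linear recurrence: carrying the hidden state across chunks reproduces the single long pass exactly. A real implementation also has to manage gradients, for example detaching the carried state between batches; this sketch shows only the forward-pass equivalence:

```python
def run(xs, h0=0.0, A=0.95, B=1.0):
    """Run a toy scalar SSM, h_t = A*h_{t-1} + B*x_t, returning
    every output plus the final hidden state for later reloading."""
    h, outs = h0, []
    for x in xs:
        h = A * h + B * x
        outs.append(h)
    return outs, h

xs = list(range(1, 33))            # a 32-token "document"
full, _ = run(xs)                  # one long pass

# Transformer-XL style: process 16 tokens, save the state, reload it,
# and continue -- the chunked passes match the long pass exactly
out_a, saved = run(xs[:16])
out_b, _ = run(xs[16:], h0=saved)
assert out_a + out_b == full
```

Because the state is a fixed-size summary of everything before it, chunk boundaries are invisible to the recurrence, which is exactly the property that attention's growing KV cache lacks.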

Nathan Labenz: 46:01 Yeah. I guess the big bottom line takeaway is it's a pretty strong piece of evidence. We already had pretty strong evidence in the form of, like, performance on long DNA sequences from the original paper, but we didn't have examples of language going beyond a couple thousand tokens. And even in the chat fine tuned version that I tested in December, I found that it basically unraveled at around the 3,000 token level. And then I looked into the data that it was trained on, and sure enough, it was trained on pretty short chats. So not crazy, but it was just not gonna go any further. I would love to understand, and I don't know if you have the understanding, it's interesting; it seems like it works. There's 2 interventions in this paper, right, if I understand correctly, or maybe 3, but at least 2. 1 is changing this 1 hyperparameter, the delta parameter. I don't quite get what it's doing, but somehow just changing that allows it to immediately work at longer context, although not as well as just actually training it on longer context. And then it's also interesting to see that when you do train it on the 16,000, it seems like it works well for much longer than that, with aggregate scores continuing to improve up through around 40,000, a multiple of the 16,000 training length. And then it does start to get a little worse as you get out past that, but still not too bad. But the needle in a haystack is gone at that point. It suggests that it's like a kind of fuzzy memory of things that came before, still maybe. The fact that perplexity is going down would seem to suggest that it's still getting some additional value from those earlier tokens that it's seen, but it no longer has the precision to do the needle in a haystack.
It seems like, again, I'm anthropomorphizing here, but I always say, I can't remember what I had for breakfast today or yesterday, but I can remember what I generally have had for breakfast. And it seems like there's a little bit of maybe that happening, where the representation that's building up in the state is more useful than not having it. Training on longer sequences would probably overcome that challenge. Like, it seems unlikely to be a coincidence that this was trained on 16,000 token sequences and has needle in a haystack high scores up to exactly that level. And then another thing that I think about here: 1 of the more interesting things that somebody has said to me recently was Robin Hanson saying that we will likely continue to struggle with rot in AI systems in the same way that we struggle with rot in biological systems and in traditional software systems. And I'm still not a 100% sure what I think about that at the level of foundation models, but it does seem to be an issue with states. If I was to interpret the reversal of perplexity, right, where we've trained on 16,000 tokens, we're seeing improvements up through about 40,000, and then it starts to get worse again. Not too bad, but it is getting worse again. That kind of seems like a rot thing, where now you're just gumming up the works. It hasn't really been trained to handle something this long. It's not catastrophically failing, but the works are getting gummed up, and it's getting marginally worse. You might wish that you could just throw more and more at them, and maybe if it doesn't get better, hopefully, at least it doesn't get worse. That would seem to be another conceptual advance that would be needed: to figure out how to prune the state to let go of unneeded things. And that would maybe, again, go back to 1 of our earlier topics around some sort of optimization targeting the state itself. In fact, here's a prediction.
In the next 90 days, we will see some sort of regularization of the state. We might have talked about weight decay and state decay in the past: something where the old stuff is gradually let go of, some sort of strategy to clear out the gumming up of the works and keep things hopefully running at least constantly as you go through time. But you have to do that by letting go of some stuff. It's interesting that we definitely have that in our brains. Right? There's some sort of pruning of memories. We're not keeping everything. We're consolidating.

Jason Meaux: 50:25 And when you say pruning, you're thinking pruning of the state, not the weights.

Nathan Labenz: 50:30 Yeah, yes, I'm thinking about the state. Pruning may not be the right word exactly. I don't have an intuition right now for exactly how you would do it. But we've seen a lot of things in general over the last few years where it's, oh, here's a clever penalty that I'm adding to the loss function. A great example of this was Seeing is Believing, which was out of Max Tegmark's group, and Ziming Liu was the lead author on this paper. They trained, like, very simple kind of toy networks, but with a loss function that had a penalty for just the magnitude of the weights. It was optimized to solve the problem, but then also penalized by, I think, the sum of the magnitudes of all the weights. So any weight that wasn't directly contributing to solving the problem at every step was just getting turned down. And this created these sort of crystallizing effects, where you'd start off with a very messy, dense network, and over time, you would get down to the most sparse network that could solve whatever problem they were solving for. There's a couple parts maybe missing here, because we're not seeing anything yet that is optimizing on the nature of the state. So there's figuring that part out alone, and that may be a little tricky; getting things on and off of SRAM could be an interesting challenge associated with that. If you need to do that at every step, it could be tough. But 1 kind of crude version of it would be, however big the state is, penalize the sum of all the values in the state, and therefore try to filter out information that isn't carrying its weight, so that only the actual needed information over time would be retained.
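A toy version of that magnitude penalty, assuming a simple linear model rather than the networks in the paper: fit y = 2*x1 with an extra useless input, adding lam times the sum of absolute weights to the loss. The unused weight gets turned down toward zero, the crystallizing effect described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = 2.0 * X[:, 0]                     # the second input carries no signal

w, lam, lr = np.array([0.5, 0.5]), 0.05, 0.1
for _ in range(500):
    data_grad = 2 * X.T @ (X @ w - y) / len(X)   # plain squared-error term
    w -= lr * (data_grad + lam * np.sign(w))     # plus magnitude penalty

assert abs(w[1]) < 0.02               # useless weight driven to ~0
assert abs(w[0] - 2.0) < 0.1          # useful weight survives, slightly shrunk
```

The crude state version Nathan floats would apply the same kind of penalty to the values held in the state at each step rather than to the weights, which is where the SRAM traffic concern comes in.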

Jason Meaux: 52:12 Yeah. That's a great prediction. What you just said reminded me of an idea that you had riffed on, which I agree with, this idea of memory decay, state decay, in the same way that you could think of almost weight decay in a model. And I think that is very interesting, because there are all kinds of ways you could design that decay function. You could do it over time. You could do it over tasks. You can do it in all sorts of ways. And it's just intuitive that it could add a lot of value and efficiency, in the same way that human memory is both frail because it has a recency bias, but also incredibly efficient because we have a recency bias. And it somehow all equilibrates out, where we tend to have the right balance of that.

Nathan Labenz: 52:59 Thank you. I don't know that I have anything much more to offer at this point in time. But, yeah, I guess to summarize my understanding of this right now, it doesn't seem like there's any obvious barrier to super long context natural language tasks, but it does seem like we will need datasets to actually do that. The fact that you can scale up to 16,000 tokens on 8 A100s is definitely pretty encouraging, and it seems like it probably is pretty scalable. But if you don't specifically reward high attention to detail of the, like, needle in a haystack variety at a given length, then it seems like beyond a certain point, memory is allowed to get fuzzy. And beyond an even further out point, the state starts to get gummed up and essentially, as Robin Hanson would put it, rot. And at some point, it catastrophically fails. So we obviously can't train infinite length states. So if we wanna have something that can run infinitely, we're gonna need some sort of cleanup mechanism that would prevent that process from eventually ruining things. Now Robin believes that will not be easy to achieve. He notes that evolution could have created organisms that live forever and just regenerate themselves, but in practice, didn't. And instead, we have generations. He expects the same thing to happen. And we also do see that in software. Right? We do not incrementally upgrade software. Instead, it's time for a new system. Partly, you could also argue that's because the technology foundations have advanced, and maybe if they were held constant, you'd more incrementally update things. Even so, I think a lot of software developers would, you know, be screaming for the rewrite. In this case, can we come up with that mechanism that effectively clears out the old accumulated gunk that's no longer needed in the state? Yeah. Throwing down the gauntlet of that challenge to the community.
If we wanna see AIs that live alongside us as long as we do, that's a lot of tokens. They're going to have some version of that. I'm also very interested to go look in more detail, and I'm certainly not schooled in this, at what we know about how that works for us. We certainly have a better mechanism than the state space models do at this point. Right? Our brains do ultimately decay and fail, but they last a lot longer. They process a lot more tokens. So is there something that we're doing that kind of addresses that problem? It would be interesting to see what biological inspiration we might be able to draw from. And that perhaps is a good transition to the next item in this super long context section, which is this project called Evo that came out of the Arc Institute. This is something that has super big deal written all over it, honestly, I would say. The Arc Institute, if you haven't heard of it, is well described in the recent Patrick Collison episode of the Dwarkesh Patel podcast, where, basically, they observed that a lot of people would like to do science, but they don't necessarily want to be a lab leader, as you have to be if you wanna do science in today's academic environment. Right? You have to become a professor. You have to write all the grants. You have to recruit the grad students. You have to manage the grad students. You have to go acquire your own machinery in many cases and maintain it within your own lab. And there's just this whole managerial and financial and machine maintenance aspect to the enterprise of doing science that a lot of people are either not that excited about or perhaps not that well suited for. And then, of course, many people worry, jeez, what if the best scientists are just selected out before they ever even get there? The Arc Institute basically tries to tackle that problem. They essentially try to get the best scientists to just come there and just be scientists.
And they'll pay them a salary, and they don't have to do the grant stuff, and they have all the machines. Hey, you're good at science. Do science. That's what you do. You're a scientist. So it actually sounds pretty awesome, and they've had a couple hits already in a pretty short organizational history. This 1, Evo, is relevant to us and may be relevant to everyone. This might be the most important thing that we talk about, depending on exactly how things go from here. They use the StripedHyena model, which is another hybrid. This is the architecture originally put out by Together.ai, proved by Together to be an effective and fast language model, and now adapted to training on DNA sequences of up to 131,000 tokens. In this case, each token is a base pair within the DNA, working primarily on prokaryotic genomes. So, basically, I think mostly bacterial genomes so far. And they're observing all these kind of advanced behaviors out of this model, where it can now generate long sequences. And because it has this long context awareness, distant parts of these generated genomes seem to relate to each other in interesting ways. And I would even say, somewhat speculatively, but it seems pretty real to me: if you believe that a language model is developing a world model, because developing a world model is the best way to get good at next token prediction, then it seems like this is maybe the first thing that we've seen that is doing something similar for biology, in order to best understand and predict the genome, especially with these super long range dependencies. And DNA is like a terrible case for that, way worse than language. In language, something that's hundreds of thousands of tokens away could be related, but by and large, you don't need that much context to know what's going on in the text that you're currently reading.
Whereas in a cell, the different portions of the genome, they may be close to each other, they may be far from each other, but the way that they interact is to a significant degree independent of just how close they are on the genome. A biologist could probably come in and correct me; it's actually more complicated than that. But it's definitely fair to say you could have things on polar opposites of the genome each express a protein, and if 1 of those is broken, then the whole cell might die. You have these long range dependencies, which can be super important. And it seems that in order to make these long range predictions, this Evo model has started to develop what you might think of as a cell model. Again, if you look at the language models, you're like, this thing seems to have a pretty good intuitive sense of what's going on. Obviously not perfect, but it's amazing how much of a world model does seem to develop just through next token prediction. Here, we seem to have something similar, where a broader multilevel understanding of biology is starting to emerge from just training on the DNA, because the long range dependencies can be effectively handled. So I thought this 1 was probably worthy of an episode all to itself. And by the way, if anyone's still listening this far into this episode, I am actively looking for the Jason of the intersection of AI and biology. And I know there are some of you out there, but I would definitely love to team up with somebody a little bit to do some deeper dives into the intersection of foundation models and biology. This is an area where I'm aware that there are super important things happening. Arguably, it could transcend what we're seeing in language in terms of importance. And I think that the language stuff is gonna be very important, but the biology stuff, if it works, is gonna be every bit, pound for pound, as important.
And I struggle to read the papers, because at this point, I can handle the machine learning side pretty well, but the biology is far more complicated and just has far more facts. As Patrick Collison said, we don't really have a record of somebody in their early twenties revolutionizing biology. And the reason for that seems to be that it's just super, super complicated, and there's really no way to get to the point where you can lead the field without slogging through years of all these details and facts and interactions and whatever. As much as people may think machine learning is complicated, biology is way more complicated, way more messy. Everything is not designed, and it's totally haphazardly interacting. Consider this my call. If you're out there and you're interested and you're working at the intersection of machine learning and biology, specifically foundation models for biology, I would love to do a similar lit review survey episode and would welcome any outreach or even any nominations. If you have somebody in mind, let me know, and I'll be willing to go knock on their door if that's what it takes. But anyway, Evo, trained on 131,000 tokens. Again, you see that kind of ratio; they say it works up to 650,000 tokens. Perhaps that's just a coincidence, but it also definitely caught my attention that the last 1 was trained on 16,000 and, like, worked pretty well up to about 60,000. That's a similar ratio. Maybe that's nothing, but long contexts and a seemingly profound understanding are coming out of this new architecture as applied to biology.

Jason Meaux: 1:02:35 Yeah. I have a follow-up question for you. They have a piece in the paper: is DNA all you need? I think you made a good case of why it might be. Do you see it as a natural extension of some of the work we've discussed, or some of the toy problems where the model can understand a simple linear regression or can understand a board state of a game? Do you see that as just scaling up that sort of emergent understanding, or is something fundamentally different happening? How would you think about that?

Nathan Labenz: 1:03:05 I think of all the papers we've talked about, the graph 1 is maybe most analogous in some way. The DNA is a sequence, but the interactions at the RNA level and the protein level, those are gnarly graphs, basically. Right? You've got things that promote other things and that inhibit other things and that block other things from going to the target that they need to go to, and it's just, man, there is a very complicated causal web or causal graph that seems to play into any actual phenotype that we care about. Yeah. In that sense, you could maybe make an analogy between those graphs and their flattened versions, and understand the DNA as a flattened version of that graph. What we really care about is all the interactions that are happening within a cell, and whether it's working appropriately or not happens at that graph level. The DNA, while it is a sequence, is a naturally occurring flattened version of that graph that actually is the cell. Maybe I squint at it and see it that way. It's weird, because the DNA is a natural physical thing that encodes what happens at the higher level and is not a synthetically flattened thing. But I'm not sure how much that matters for the purposes of the machine learning challenge with these techniques anyway. If you did have the graph representation of all the different parts of the causal machinations that go on in a cell, and you said, okay, here's my big gnarly graph representation of everything that determines what's happening in a cell, you might just be like, okay, I'm gonna have to make a couple cuts here in these cycles. Now I'm gonna flatten. Now I'll reverse. Let's see how we do. And given that, if that's the best strategy that we have, then it's also not crazy to think, hey, maybe in a sense, nature did that for us with the DNA encoding of everything that's going on in a cell. And, yeah, maybe we can learn from that.

Jason Meaux: 1:05:09 Yeah. It's fascinating. This seems like a model that can be legitimately useful to researchers. It's what you might call the most production-ready version of almost everything we've described today. Do you think this is somewhat of an inflection point?

Nathan Labenz: 1:05:24 Yeah. I don't know. Arguably, maybe the inflection point has already happened. I think certainly things like AlphaFold are a huge deal. They have a whole division now within Alphabet that is dedicated to taking AlphaFold, extending it, commercializing it. They have drugs in or very near clinical trials that were originally discovered with these techniques. So I might even say the kind of critical technology threshold maybe already passed even before this. When is it worth the wet work? That's a good, memorable phrase. When is it worth the wet work? There's an infinite number of experiments that we could do in actual chemical space, but they can be slow. They can be expensive. Somebody has to measure out the stuff. You need all the reagents, blah blah blah. Obviously, the advantage of computer systems in general is that they run faster, you know, than our ability to do those sorts of things. These systems are always going to have, or at least for the foreseeable future on the current paradigm, they're going to have errors. They're gonna have false positives. They're gonna have false negatives. They're gonna be definitely imperfect models of the cell or whatever it is that they're modeling. But if they're good enough, then they can really narrow the search space. I think that's how people are broadly using them right now: to say, okay, let me just run a zillion generations, and we'll take the top whatever percent and actually look at those in actual tissue culture or whatever next. And then eventually, we can work our way into actual organisms and then into people. And so the space is just so vast.
Even if these things have some false positives, false negatives, whatever, if we can just narrow the search space by a couple orders of magnitude, or, another way to think about that, increase the hit rate at the experimental level by a couple orders of magnitude, then it's a total game changer for the ROI of doing the science. If it costs a billion dollars to bring a drug to market, a lot of that is clinical trials and probably doesn't change. But there's definitely a part upfront that is: I need to just throw a ton of stuff at a ton of diseased cells and try to find a hit. I sometimes think of it as like playing the old kids' game Battleship. Historically, you've very largely been shooting in the dark. You didn't necessarily know the shape of the proteins, up until pretty recently. You don't have much of a sense of how things interact, and you're just throwing stuff at cells and, like, seeing if anything kills the cancer. You're just, like, hoping to get hits. That's been, I think, a lot of it. It's pretty blind, groping search. And the hit rate has been pretty low. And certainly a lot of the techniques that people use are, again, about finding things in nature. Hey, look, this frog doesn't get cancer. What's going on there? Maybe we can find something in that frog. There's definitely other things besides purely blind search. But to be able to just crank up a supercomputer and start scoring everything, and then just narrow your search to the top 1% or whatever, and hopefully get something that might be a 100x higher hit rate on the other end of that, is definitely a regime change in science, and it also would seem to increase just, like, how much of that work we would wanna do. I don't think we're gonna deplete the hits in the near future. If you could find them at 100 times the speed, you might imagine doing 100 times as much of it. Jeez, this just got 100 times more valuable. Let's go do a lot more.
That's my general high-level understanding of how this stuff might really come to change the world, and I think there certainly could be a lot more nuance to it than that. If you can get that hit rate up from what it has been, it definitely could be a real, real game changer. As for your comment that the state space models are influencing attention, I think going back to the original episode

1:09:13 or maybe it

Nathan Labenz: 1:09:13 was in Twitter discussion afterward, somebody had asked me, what are the odds that you think this actually happens? What are the odds that state space models are part of frontier systems in the next 2 years? And, obviously, we could debate exactly what the percentages are there. I think they're pretty significant, if only because the world is definitely hungry for compute, and this takes a little bit of the edge off of that hunger. That alone is enough reason, I think, to expect that it's gonna make its way into the mainstream. But I also said, and still believe, that if it doesn't happen, why wouldn't it happen? And the best answer I can come up with there is just that attention really is all we need, and it just keeps working, and there's just enough optimizations. And, hey, we already have line of sight to 10,000,000 tokens or whatever with ring attention, and we're seeing that come online in some form with Gemini 1.5. Obviously, we don't know what's under the hood with Gemini 1.5. But my guess is, as we talked about, I think there probably is some full attention there still. But, hey, maybe it just works so well that we don't need to send as many people, or optimize the state space side as much. If the first 90 days are any indication in terms of the literature that we've seen, nobody needs to incentivize people. They're just excited to go do it. So it doesn't seem like there's gonna be any critical shortage of people eager to play around with this and figure out how to make it work. But it is definitely notable that the progress continues to advance on the attention side as well. And you can imagine memory tokens or other kinds of things like this. Did we talk about memory tokens or no?

Jason Meaux: 1:10:53 No. We did not.

Nathan Labenz: 1:10:54 Okay. This would be the other way in which I could imagine something like this happening within the attention paradigm. We have these different kinds of tokens: the pause token from Think Before You Speak, the backspace token. And we have seen one paper now that has something that I think of as a memory token, which is basically just compressing a certain part of the context into embedding space, in a similar way that images get put into it. It's a joint text-image embedding space, but it's largely text-based. Right? These models are ultimately gonna output text. Right? GPT-4V, Claude 3, Gemini Ultra, Gemini 1.5, they all output text. So they take in these other things, but they're ultimately working those things through a text latent space and ultimately outputting text. If you wanted to extend the window, you know that you can represent these: hey, a picture's worth a thousand words, but a picture only costs 85 tokens at the low-res setting on OpenAI's GPT-4V. It seems like there is an inherent efficiency there, where individual tokens, the one-hot embedding space that text itself gets moved into, is not the most efficient way that you could represent that text in embedding space. So perhaps you can imagine a 1000-to-1 ratio, right, where you'd say, okay, I'm going to take this whole body of text and compress it down into a single token. Does that give you a 10-to-1, a 100-to-1, a 1000-to-1, some ratio like that? And then by compressing those full text sequences down into these memory tokens, you could potentially still have a sort of full normal attention mechanism downstream, except that in this deep history, you're maybe working at, like, a 1000-to-1 sort of ratio. Now, obviously, the compression has to be effective. The attention has to learn how to take advantage of that compressed representation.
And you might even imagine the librarian kind of setup, where, if I get a hit on a certain token and I need more, maybe I can even do an inbuilt RAG type thing where I actually go pull the original context that that compressed token represented, pull the whole thing into context, if I really wanna be extra careful about making sure I'm, you know, recalling it effectively. But these are the kinds of things that I could see substituting for a state space revolution: the just relentless march of progress on the transformer and the attention mechanism, and finding these sorts of compressed, hierarchical, native RAG formats. It might just be so good that it takes the air out of other lines of research. Certainly, I would say transformers have done that in general. Right? We've seen everybody leave whatever their pet architecture was over the last few years to largely go do stuff with transformers. Now here's something that's, like, pulling some of that energy away from transformers again. Still a very small percentage, of course, but it is pulling some away. But will the transformer have the last laugh by just continuing to work so well that nobody needs to go a different direction after a while? I certainly couldn't rule that out. Yeah. It feels like it could very much go either way, where the state space models are definitely getting momentum, as we've seen with an incredible flurry of activity in just 90 days. But don't count the transformer out either when it comes to figuring out perhaps more local modifications that could achieve similar things.

Jason Meaux: 1:14:24 Yeah. You mentioned the Gemini 1.5 Pro data point. I think that's important because this model came out, and it seems they've achieved the ability for very long context windows. And as far as I can tell, they're indicating that this is transformer-based work. It's also interesting to put out there the data point that there's another company, magic dot dev, that recently announced an additional $100 million raise from very legitimate people in the space, Nat Friedman and Daniel Gross. And they clearly state on their website, quote, transformers aren't the final architecture. We have something with a multimillion token context window. So you can only begin to speculate what they might be cooking up. At the same time that we see transformers flexing long context windows, there is this interesting data point with magic dot dev.

Nathan Labenz: 1:15:15 We may soon have an episode with somebody from magic dot dev, who'll probably not spill all the secrets, but maybe at least tell us more about what it is able to accomplish. I'm definitely really intrigued to see that. Just dropping a full code base into Gemini 1.5 has done a lot for people, and it wasn't even really productized for that. That's just dropping stuff into it and hoping for the best. Yeah. It sounds like what the Magic team has got is gonna be a big deal. It sounds like that's pretty likely at this point. Mindful of time here, we have been at it for a while. Do you have any concluding thoughts, or should we adjourn and resolve to come back and take stock of this again in either 90 days, or perhaps even sooner, when our personal context windows are hitting the point of overflowing? Should we leave it there for now?

Jason Meaux: 1:16:06 Yeah. Let's leave it there for now, Nathan.

Nathan Labenz: 1:16:08 Thank you for all your hard work in putting this together. The website where people can check out the link index at least is statespace.info, and we'll share some more stuff online as we publish this episode as well. But for now, it's my pleasure to say, Jason Meaux, fellow AI scout, Mamba scout in particular, thank you for being part of the Cognitive Revolution.

Jason Meaux: 1:16:30 Thanks, Nathan.

Nathan Labenz: 1:16:32 It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.

1:16:47 Turpentine is a network of podcasts, newsletters, and more covering tech, business, and culture, all from the perspective of industry insiders and experts. We're the network behind the show you're listening to right now. At Turpentine, we're building the first media outlet for tech people by tech people. We have a slate of hit shows across a range of topics and industries, from AI with Cognitive Revolution to Econ 102 with Noah Smith. Our other shows drive the conversation in tech with the most interesting thinkers, founders, and investors, like Moment of Zen and my show Upstream. We're looking for industry leading hosts and shows along with sponsors. If you think that might be you or your company, email me at erik@turpentine.co. That's erik@turpentine.co.
