In this episode of The Cognitive Revolution, Nathan dives deep into the world of state space models with returning co-host Jason Meaux and special guest Quentin Anthony, Head of Model Training at Zyphra. Explore the cutting-edge Zamba 2-7b model, which combines selective state space and attention mechanisms. Uncover practical insights on model training, architectural choices, and the challenges of scaling AI. From learning schedules to hybrid architectures, loss metrics to context length extension, this technical discussion covers it all. Don't miss this in-depth conversation on the future of personalized, on-device AI.
Check out more about Zyphra and Jason Meaux here:
Zyphra's website: https://www.zyphra.com
Zamba2-7B Blog: https://www.zyphra.com/post/za...
Zamba2 GitHub: https://github.com/Zyphra/Zamb...
Tree attention: https://www.zyphra.com/post/tr...
Jason Meaux's Twitter: https://x.com/KamaraiCode
Jason Meaux's website: https://www.statespace.info
Be notified early when Turpentine drops new publications: https://www.turpentine.co/excl...
SPONSORS:
Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today.
Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive
Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitivere...
LMNT: LMNT is a zero-sugar electrolyte drink mix that's redefining hydration and performance. Ideal for those who fast or anyone looking to optimize their electrolyte intake. Support the show and get a free sample pack with any purchase at https://drinklmnt.com/tcr.
CHAPTERS:
(00:00:00) Teaser
(00:00:42) About the Show
(00:01:05) About the Episode
(00:03:09) Introducing Zyphra
(00:07:28) Personalization in AI
(00:12:48) State Space Models & Efficiency (Part 1)
(00:18:59) Sponsors: Weights & Biases RAG++ | Shopify
(00:21:26) State Space Models & Efficiency (Part 2)
(00:22:23) Dense Attention to Shared Attention
(00:29:41) Zyphra's Early Bet on Mamba (Part 1)
(00:32:45) Sponsors: Notion | LMNT
(00:36:00) Zyphra's Early Bet on Mamba (Part 2)
(00:37:22) Loss vs. Model Quality
(00:44:53) Emergence & Grokking
(00:50:06) Loss Landscapes & Convergence
(00:56:55) Sophia, Distillation & Secrets
(01:09:00) Competing with Big Tech
(01:23:50) The Future of Model Training
(01:30:02) Deep Dive into Zamba 1
(01:34:24) Zamba 2 and Mamba 2
(01:38:56) Context Extension & Memory
(01:44:04) Sequence Parallelism
(01:45:44) Zamba 2 Architecture
(01:53:57) Mamba Attention Hybrids
(02:00:00) Lock-in Effects
(02:05:32) Mamba Hybrids in Robotics
(02:07:07) Ease of Use & Compatibility
(02:12:10) Tree Attention vs. Ring Attention
(02:22:02) Zyphra's Vision & Goals
(02:23:57) Outro
SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/na...
Youtube: https://www.youtube.com/@Cogni...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...
Full Transcript
Jason Meaux: (0:00) The future of AGI will involve a combination of cloud and on device deployment.
Quentin Anthony: (0:06) These large monolithic model companies, like Anthropic and OpenAI, just can't really specialize to every single person on the planet. We think that you need to have your own set of weights, and changing a system prompt per person is not enough. Right? We wanna actually bake it into the weights. You can make the model simulate learning faster than it really is by doing activation steering. Like, if the user tells the model, you're being too dry, then you can very quickly steer the activations to be a bit more fun until that night, when you can bake it into the model. I think it's gotta be continual learning, and it's gotta be per user. And the only way to do that is with weights on the phone.
Nathan Labenz: (0:42) Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, we're once again going down the state space model rabbit hole with returning cohost Jason Meaux, who regular listeners will remember from our Mamba Palooza literature review and Albert Gu interview episodes, and Quentin Anthony, head of model training at Zyphra, a large language model startup that's just released their Zamba2-7B model, which is built on a hybrid architecture that uses both the selective state space mechanism and the traditional attention mechanism, albeit with some notable tweaks relative to the standard implementation. In addition to sharing Zyphra's high level vision for highly personalized on device AI, Quentin was super generous with both his time and knowledge, sharing a wealth of practical lessons learned from the front lines of model training. Over the next 2 hours, we will cover the delicate architectural choices that balance efficiency and capability, the many practical challenges of training at scale, including choosing the right learning schedules for different phases of training, the nitty gritty details of training hybrid architectures, including why Zamba models don't need positional embeddings, the Zamba models' use of shared attention blocks and internal LoRA adapters to maximize performance on the edge, the not so simple relationship between loss metrics and model quality and capabilities, as well as the challenges of context length extension, the Zyphra team's experiments with different optimizers and why they're sticking with Adam for now, Quentin's intuitions about the relationship between model scale and loss landscapes, and finally, even their recently published work on tree attention, which offers important advantages over ring attention for multi node training. I have to say, I got a lot from this episode. And while it's technical enough that I wouldn't necessarily call it entertainment, I am confident that you will too. If so, we always appreciate it when folks take a moment to share the show with friends or write an online review on Apple Podcasts or Spotify, and we welcome your feedback via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. Now I hope you enjoy this highly technical conversation with cohost Jason Meaux and guest Quentin Anthony, model training lead at Zyphra. Jason Meaux, returning guest and cohost and chronicler of state space models at statespace.info, and Quentin Anthony, model training lead at Zyphra, which has just released the new Zamba2-7B SSM hybrid model. Welcome both of you to the Cognitive Revolution.
Quentin Anthony: (3:28) Thanks a ton. Great to be here.
Nathan Labenz: (3:30) Jason, regular listeners will know, is a partner in crime who's also obsessed with state space models and the potential that they have to unlock new capabilities in terms of potentially long term memory, extreme efficiency, all these kinds of interesting things. And we've been down the rabbit hole together on that a couple times. So I was excited to have him back to help me with this conversation about the new Zamba model and all the ins and outs of that. So, Jason, I'm gonna ask you to lead the questioning today, and I'll be in the supporting role, but listening intently and probably jumping in with a few of my own follow-up questions along the way as well. How's that sound?
Jason Meaux: (4:13) Sounds good. Let's do it. So let's start with maybe a broad question first. So Yeah. On the Zyphra website, it says that, quote, the future of AGI will involve a combination of cloud and on device deployment with an increasing shift towards local inference. So why is local inference going to be important, and how does this idea overall influence the direction of the company?
Quentin Anthony: (4:43) Yeah. There's a few angles to this. The main 1 is personalizability. We think that doing system prompt hacking and things like these large monolithic model companies do, like Anthropic and OpenAI, just can't really specialize to every single person on the planet. We think that you need to have your own set of weights, and changing a system prompt per person is not enough. Right? We wanna actually bake into the weights: you like to be talked to this way, your favorite restaurant is x, that sort of thing. A second thing here is privacy. So there's a lot of data on your laptops, your phones, that you don't really want to be communicating to the cloud. There's a lot of enterprises out there who don't wanna share all of their data with OpenAI, all of their proprietary code and things. So keeping that all on device, on prem, in the organization is where we want to keep it. And then there's just the very practical challenge of what OpenAI spends. What is the CapEx, realistically, in order to infer all these models? That's not really something that can keep up forever. It's being fueled by a lot of VC money. But if everyone's device is able to efficiently run their own models, then we offload that to the users. And then it's also just much faster. You can run it offline. All these small benefits come out from having the model locally on the device, but it's all steered by this global personalizability, privacy, these sorts of north stars. In terms of... Yeah. Go ahead. Go ahead. There are some model capabilities that are more challenging. Like, they're too challenging to have on the 1 b or a 2 b or something like that. Most people we see, actually, when they communicate with models, they just need these higher level, more simplistic tasks, just like basic chat, like factual recall. They're not asking, like, these o1-type math questions, these Olympiad questions or anything, to, like, their day to day on device model. But sometimes you will need it, and that's why we also have a cloud offering, like, in the 7 b range. The 7 b range is a turning point where you can fit it on, like, more powerful laptops, powerful on prem, but it's able to think through these more difficult problems more clearly. And we want those on device models to be able to know, like, when they need to go back and ask, like, an oracle model, something a bit more powerful. And then this is all seamless to the user. And then you can pick and choose how often your model wants to talk to the cloud.
Jason Meaux: (6:55) It sounds like, when I hear local inference, privacy is often the main bullet point that gets hit. But it sounds like the overall idea is it's some mix of capability and privacy: that if we had super efficient, super personalized, localized models, this is going to increase the capability of the full stack. Is that correct?
Quentin Anthony: (7:17) Definitely. Yeah. You can't forget personalizability. It's really hard to bake in, like, personalized outputs, personalized, like, how the model likes to talk to you, in the same way. It's just a hard problem.
Nathan Labenz: (7:28) I would love to dig into that a little bit further. Yeah. Because I'm working on a project right now to try to get a model to write as me. Something like a white whale project for me for a while. I figure if a model can write as me in a way that's compelling to me on a consistent basis, it seems like that's a pretty big threshold. So I've been trying to see, can I make that happen? And the best readily available fine tunable model today is GPT-4o. And so I've been working mostly with the OpenAI stack. They make it easy to do fine tuning. They don't tell you too much about exactly how it works. 1 presumes that it's a LoRA type process that they're running. I think there's a lot of reasons to believe it's something like that. It doesn't seem to learn facts very well. It does seem to learn, like, patterns and, like, a stylistic layer, if you will, reasonably well.
Quentin Anthony: (8:23) Okay.
Nathan Labenz: (8:23) And I wonder what you think is the right way to even approach personalization in the first place. Is it like, when you envision doing this personalization for individuals on their machines, are we talking LoRAs? Are we talking, like, full training of a 1 or 2 or 3 b model? How many tokens does that take? Do we still need to, like, stuff huge amounts of context in there? Because that's another approach I've been taking, where I'm like, maybe if I give it 50,000 tokens of everything it needs to know about me as the system prompt, then it, like, won't need to hallucinate, and I still haven't quite figured that out. I'm still getting hallucinations. So what is the overall paradigm or path to this, like, personalization for individuals? Mhmm.
Quentin Anthony: (9:05) Yeah. So the exact process will not be entirely clear until we have a bunch of people having, you know, Zamba on their phones. But we have a few, like, bets. 1 is that we just have a lot of expertise on continual learning and continual pre-training. So we envision doing, like, a weight update, like, overnight, maybe while your phone is plugged in, your laptop's not really actively being used. On consumer hardware, it's really cheap to do this when the model's really small. Updating LoRAs, as you mentioned, is another, like, that makes it even cheaper. But just continual learning, I think, is the way to go. In terms of your GPT-4o problem, I think I would need to know more details about the fine tuning dataset and the model, which I'm just never gonna get. We find continual learning on a small model per person is really great. There's some sort of, you can make the model simulate learning faster than it really is by doing things like activation steering. Like, remember Golden Gate Bridge Claude and stuff. If the user tells the model, you're being too dry or something like that, then you can, like, very quickly steer the activations to be a bit more fun until that night, when you can bake into the model: okay, this person generally likes to be a bit less dry and to talk in a more fun way. I think it's gotta be continual learning, and it's gotta be per user. And the only way to do that is with weights on the phone.
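To make the activation steering idea concrete, here is a minimal sketch of the technique Quentin alludes to: adding a fixed direction to the residual-stream activations to nudge the model's tone until the change can be baked into the weights. The dimensions, the way the steering vector is obtained, and the scale are illustrative assumptions, not Zyphra's implementation.

```python
# Minimal sketch of activation steering on a decoder layer's residual stream.
# The steering vector, hidden size, and scale below are illustrative assumptions.
import torch

hidden_dim = 512
# e.g. mean activation on "playful" prompts minus mean activation on "dry" prompts
steering_vector = torch.randn(hidden_dim)
steering_vector = steering_vector / steering_vector.norm()

def steer(hidden_states: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Add a fixed direction to every token's residual-stream activation."""
    return hidden_states + alpha * steering_vector

# Example: a batch of 1 sequence, 10 tokens
h = torch.randn(1, 10, hidden_dim)
h_steered = steer(h)
print(h_steered.shape)
```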
Nathan Labenz: (10:21) But how much of that exists today? Like, can I get a Zamba first of all, like, which, like, scale of model would you recommend that I bring locally? And can I do this continual learning? Can I, like, start to feed it my emails and my Slack messages? I assume I would have to mix in some, like, general purpose pre training mix as well. I wouldn't just go all in on, like, my content. So tell me a little bit more about what that looks like practically. I'm taking notes.
Quentin Anthony: (10:51) Yeah. Absolutely. So right now, we have the 1.2 b, the 2.7 b, and the 7 b that we just dropped, all of our Zamba 2 series of models. The 1.2 b is really good if you're, like, really edge constrained, like, you don't want your phone to use much battery, if you want it on a phone in the first place. In terms of whether you can do this yourself, right now you can. It's just a little less user friendly than it will be someday. But right now, we have our own Hugging Face fork for Zamba2. So you would just download the model weights, and then you would just have to tokenize and actually just train on whatever personal emails or whatever else you want baked into the model. The product vision here, though, for Maya is that you have, like, a Maya cloud across all of your devices. And there's some, like, social aspects here. You can upload pictures. It'll be multimodal. Your conversations and such, you can decide how much you want to share, how much you want Maya to be personalized to you. But training on that cloud of your own data, just as a continual pre-train on whatever edge hardware you have, is the high level north star here.
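As a rough sketch of the "download the weights and train on your own data" workflow described above, the snippet below uses the Hugging Face transformers and peft libraries to attach LoRA adapters and take one training step on a personal document. The model id and the target module names are assumptions for illustration; check Zyphra's Hugging Face fork for the actual ones.

```python
# Rough sketch of personal-data fine-tuning with LoRA adapters.
# Model id and target_modules are assumed, not verified against Zyphra's repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Zyphra/Zamba2-2.7B"  # assumed id; see Zyphra's Hugging Face page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Wrap the base model with small trainable LoRA matrices on some projection layers.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

# One toy "continual learning" step on a personal email.
batch = tokenizer("Reminder: dinner at my favorite ramen place on Friday.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # in practice: an optimizer loop over many samples, mixed with general pretraining data
```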
Nathan Labenz: (11:54) Are you doing this today? Like, how far have you guys taken this internally? Do you have a Quentin model that drafts your emails, or what's the frontier in application?
Quentin Anthony: (12:04) So earlier, I said the 1.2 b is good for edge constraints. The 2.7 b is really good for quality. I'm always surprised by it. We do have some fine tunes for that. So we have some role play models, some summarization models, if you wanna summarize emails, summarize meetings, and such. We have some very early results on audio as well. So generating audio in the style of people you talk to, note taking audio from the meetings you're attending. It is definitely more mature in house, but we really want this to be really seamless before we give it to everybody else. So I would say that the 2.7 b, I've seen really amazing results on in terms of summarization and, like, these tasks that I was mentioning. And it's all from this Hugging Face, just fine tuning or DPO on whatever my personal data is.
Nathan Labenz: (12:49) Cool. I'll probably have some more questions about that as we go, but maybe a good transition to, as we think about this on edge paradigm. Mhmm. Jason, you wanna take the baton and talk about why state space models, or hybrids at least, are the attractive approach to take there?
Jason Meaux: (13:08) Yeah. Absolutely. I guess when you think of what kind of models can run locally, you get hit by transformers in 2 places pretty hard. The first place is just memory. Anyone who's done their own work with a limited compute budget knows the classic out of memory error. So you quickly, whether that's too large of a model in terms of parameter count, or if your KV cache is growing, at some point, you run out of memory. So transformers, because of the way they work, they're famously pretty memory intensive. The KV cache scales as you scale your sequence. And then, of course, secondly is just computation. The transformer has amazing properties, but it is pretty computationally intensive, especially when compared to other architectures. So when you start looking at Mamba, that's 1 of the exciting things about it. It has properties that are much more memory efficient. And computationally, it's able to be more efficient as it processes through sequences due to its fixed state size. And so you even see that play out in the literature. There's been some early work with Mamba models, with a couple papers on robotics. These are models that would have to run locally. You could not do cloud API calls for split second decisions that would be made in robotics. So you can just imagine, finally, maybe we have an architecture that helps inform what the ultimate models running locally could look like for these kinds of applications. So what does that vision look like for Zyphra and the Zamba architecture? What does the ideal local inference strategy look like as it relates to Mamba?
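To put rough numbers on the memory point: a dense transformer's KV cache grows linearly with sequence length, while a Mamba-style recurrent state stays constant. The layer counts and dimensions below are illustrative assumptions, not any particular model's configuration.

```python
# Back-of-the-envelope memory comparison: KV cache vs. a fixed Mamba-style state.
# All sizes are illustrative assumptions.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values, stored at every attention layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

def mamba_state_bytes(n_layers=32, d_model=4096, d_state=16, bytes_per=2):
    # constant-size recurrent state, independent of sequence length
    return n_layers * d_model * d_state * bytes_per

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: KV cache ~{kv_cache_bytes(n)/1e9:.2f} GB, "
          f"Mamba state ~{mamba_state_bytes()/1e6:.1f} MB (constant)")
```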
Quentin Anthony: (14:54) Yeah. So we love Mamba for its systems properties. As a guiding principle in terms of model training, we try to train the models that are perfect for inference, because that's where we're gonna need them to perform the most. We can eat some training costs, but at the end of the day, it needs to be blazing fast for inference. This translates to a few different things for us. For 1, attention for dense transformers, like a Phi type model, is just not going to happen for us. We don't think that this is efficient on a phone. Earlier, I mentioned, like, we want text and email summarization. Sometimes we want news article summarization. Those inputs are too long for the KV cache. Like, it's just gonna grow outside of memory for most phones today. Attention, though, is really nice. So you get these exact cross sequence dependencies that we think are just required for some specific tasks. Remember, I say specific. So not every task needs attention, but there are some tasks, like in context learning and long sequence dependencies, for which attention is just necessary for you to be performant. Otherwise, if you eventually train on several trillion tokens, then eventually Mamba becomes somewhat performant at this on benchmarks. But I've never played with a pure SSM or pure RNN model that's able to really, like, speak well in an in context learning regime. I think that's a common intuition that people have when they actually sit down and play with the models. So pure SSMs have quality issues from what I've seen, and then pure attention dense transformers have, you know, performance issues, so they can't get by on a phone realistically, you know, with the fast time to first token that users actually want. So there's some other architectural changes here. Instead of having independent attention blocks, we have this global shared attention block. And the reason for this is that us and a few other groups have independently found that attention blocks are highly correlated across depth. So if they're telling the MLPs in a dense transformer, these are the important tokens, these are still the important tokens, these are still the important tokens. There's some small changes, but most of the specialization across depth is coming from the MLPs. So if you tie all of those weights together, then you're able to allocate more parameters towards, for example, the Mamba block for us. If you are on a dense transformer, you could apply more to the MLP blocks or something. So now, for the Zamba 1 series of models, you just have a single attention block, and you apply it every 6 Mamba blocks across the depth of the network. We improved on this a bit in Zamba 2 by seeing, okay, a little bit of specialization across depth is helpful. So we added LoRAs back in, and those LoRAs are independent across depth. So now you have this single global attention block in the 1.2 b case, and then in the 2.7 b and 7 b case for Zamba 2, you have interleaving. So you have attention block 1, attention block 2, attention block 1, attention block 2, all across depth, and then individual LoRAs across depth as well. So they get a little bit of specialization, but back to the systems angle, you're not really applying very much attention.
So your slope of, like, your KV cache memory here, you'll see this in all of our plots, is something more like 10 or 13 invocations rather than 30 or 32 in, like, a dense transformer, because we're just invoking far fewer attention blocks. Now the very first criticism people should be thinking of is that, okay, for the same parameter budget, you're applying parameters over and over. The total number of flops is going to increase, and this is true. We're applying blocks more often than, like, a typical 7 b. But the Mamba blocks, and now the Mamba 2 blocks, are so high throughput on parallel hardware in general that this totally makes up for it, and we end up with a net improvement of about 20 to 30% in time to first token and time per output token. So they make up for our sins of sharing these attention blocks. But, yeah, both of these translate to significantly lower memory overhead and significantly lower time to first token at the prefill stage, so the model responds really fast initially. And time per output token is also a 30% beat compared to similar dense transformer models.
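The block pattern Quentin describes might be sketched like this: mostly Mamba blocks, with a small number of shared attention blocks reused periodically across depth (two alternating shared blocks in the Zamba 2 case). The layer count and the every-6 interval are taken loosely from the discussion; the rest is illustrative, not Zyphra's code.

```python
# Toy sketch of a Zamba-style layer layout: Mamba blocks everywhere, with a
# shared attention block invoked every N layers, alternating between a small
# number of shared blocks. Purely illustrative of the pattern.

def zamba_style_layout(n_mamba_layers=48, attention_every=6, n_shared_attn=2):
    layout = []
    shared_idx = 0
    for i in range(n_mamba_layers):
        layout.append("mamba")
        if (i + 1) % attention_every == 0:
            # reuse the same attention parameters; alternate between the shared blocks
            layout.append(f"shared_attn_{shared_idx % n_shared_attn}")
            shared_idx += 1
    return layout

layers = zamba_style_layout()
print(layers[:16])
print("attention invocations:", sum(1 for l in layers if l.startswith("shared_attn")))
```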
Nathan Labenz: (19:00) Hey. We'll continue our interview in a moment after a word from our sponsors.
Jason Meaux: (19:05) The choice of using global shared attention blocks is really interesting, to build the hybrid model that way. I guess it aligns with something I saw on Zyphra's site, the desire to maximize performance per parameter. That seems like a choice that obviously is doing that, as well as maybe some other architectural choices. How important is it in the Zamba family of models to focus on performance per parameter?
Quentin Anthony: (19:28) It's pretty important. Right? At the end of the day, most of your memory overhead is coming from just storing the parameters. We can get away with a lot of quantization, but each parameter is really painful when you're deploying it on a phone. So trying to make all of those parameters maximally useful, for the attention block, as in tying it all together and making that attention block really well trained. Yeah. Maximizing the performance of each 1 of those parameters, we've gotta do. We don't want this really sparse model where most of what we're storing is not doing much. But, yeah, very important.
Nathan Labenz: (20:00) I think there's, like, a broader kind of trend here of, like, dense attention being the original thing. And then there's been obviously many attempts to make that more efficient with, like, linear attention or things that are, like, not quadratic. My sense is that those have all been pretty good, but somehow still fall short. And then you have these other techniques of shared query attention. Maybe you can help us understand in a technical sense, like, exactly what that is. And then I'm also aware of shared query key attention, and now you're going all the way up to a shared full attention block. So help us map that space and just understand, like, what are the different options and trade offs? And is this the end of that paradigm, or is there even, like, still further that we could go on the spectrum from full dense to full shared? Is there even further on that continuum?
Quentin Anthony: (20:57) So we use grouped query attention for our upcoming attention blocks. I would say I haven't seen, like, a non quadratic attention that really performs the same and does what we want it to. Frankly, I would say that there is a very interesting emerging area of work where people try to remove the attention from attention models. The theory there is maybe more of a play for, for example, a Liquid AI sort of player. You take a model that's already been trained with attention, and then you try to linearize that attention in a final post training step. And the hope there is that the model's already baked in the ability to do these long sequence dependencies and in context learning, and then you can just sort of copy it over to much lower cost blocks. It's a potential play.
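For reference, grouped query attention is the variant where many query heads share a smaller set of key/value heads, which shrinks the KV cache relative to full multi-head attention. A minimal PyTorch sketch, with made-up head counts:

```python
# Minimal grouped-query attention sketch: 8 query heads share 2 KV heads.
# Head counts and dimensions are illustrative, not Zamba's configuration.
import torch
import torch.nn.functional as F

batch, seq, d_model = 1, 16, 256
n_q_heads, n_kv_heads, head_dim = 8, 2, 32   # 4 query heads per KV head

x = torch.randn(batch, seq, d_model)
wq = torch.nn.Linear(d_model, n_q_heads * head_dim)
wk = torch.nn.Linear(d_model, n_kv_heads * head_dim)
wv = torch.nn.Linear(d_model, n_kv_heads * head_dim)

q = wq(x).view(batch, seq, n_q_heads, head_dim).transpose(1, 2)
k = wk(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)
v = wv(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

# Broadcast each KV head to the query heads in its group (only K and V are cached).
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, n_q_heads, seq, head_dim)
```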
Nathan Labenz: (21:42) Can you help us develop an intuition for this? This is, like, really a question I'm asking for me, and hopefully other people find it useful. I have an intuition for why the full attention would be qualitatively better versus, like, a state space mechanism or some approximation of attention. For me, that kind of boils down to: we compute a relationship between every token and every other token. We don't forget about any tokens. It's all there, and when a new token comes in, we can see how that relates to everything that came before with no compromises. When we have that, we try to map that onto a linear thing. What's the intuition for why that, like, would work or wouldn't work? Maybe there is no intuition and people are just trying stuff and seeing what works, but I'm grasping for understanding of what that transformation is. What should we expect and what should surprise us? Okay.
Quentin Anthony: (22:39) My first order intuition here is that attention is really good because the point of the attention block is to note take as much as possible about the incoming sequence. So you have this really long sequence, let's say. And if you're taking the dependency from every token to every other token, you're giving the MLPs, or whatever it is that you're projecting to, as much information as possible about the sequence. And you have multiple heads, each of them saying, okay, they are each looking at the sentence in a slightly different way. And then if you have large heads, they're keeping a lot of detail about their own specific angle of the sequence. Any sort of attempt to linearize this is saying, okay, how much of that detail do we actually need? So the original attention paper was so useful because we found that MLPs could make use of a lot of that detail, because MLPs are just, like, trying to find the deeper meaning in whatever notes the attention took. So attention takes as many notes as possible about this, but it's very computationally heavy. And then MLPs try to find deeper meaning from those relationships. They sort of mix the heads. They work across heads. So linear attention is all just trying to say, okay, are all those heads doing something important? Are we able to cut away a lot of them? Do you really need every relationship to every other token? And this is, like, a very theoretical question. Right? So it's not super clear how many notes you actually need. You also have different ways of projecting across these heads. So MLPs are a very powerful way to mix across heads and make use of these notes that are being taken. But, like, a Mamba block has much smaller projectors. It's a very different story there of, like, how much they can really do from the notes taken by, like, the selective scan versus an attention block. Maybe if you had, like, bigger MLPs for a linearized attention, maybe that would make up for it. Maybe you still need, like, more notes if you had a Mamba block without, like, powerful projectors and then an MLP. This is really, like, something I can't really say. If I did, then I would make the model. But I think all of these people are motivated by the question of how much can we throw away from these notes and still get performance, because do you really need n squared dependencies? Does this make sense?
Nathan Labenz: (24:55) Yeah. And so coming to the shared attention, I guess I'm not a 100% sure what exactly all is shared. But if I understand correctly, it's like you are literally using the same attention weights over and over again at every layer, but you are recomputing the effect of that attention at each, or, you know, every 6 layers, because the, Mhmm, computation has proceeded. And so each time the attention block is used, you're able to use the same parameters. That saves space, but it doesn't save, like, runtime computation per se, because you're still doing that attention on the input as it's been processed to that stage through the model so far. Is that right?
Quentin Anthony: (25:43) Exactly. Yes.
Nathan Labenz: (25:44) Okay. Cool. And so when you reuse an entire attention block, that's different from just reusing the queries or just reusing the queries and keys. You're reusing queries, keys, and values.
Quentin Anthony: (25:58) Yeah. So, yeah, we have taken it to the extreme of saying these attention blocks are pretty much just reminding MLPs anyway, so let's just stop storing all of these independent parameters that eventually learn to do the same thing anyway. When you're reapplying, you also get really accurate gradients, right, at training time, because it's applied 6 times instead of, you know, once if they're all independent. So we get this really highly trained, highly saturated attention block, and it's completely tied across query, key, value. Mhmm. And then we add a little bit more specialization back in for Zamba 2 by having alternating layers. So now you have 2 attention blocks. They're each tied, but then we apply them 1 after another. Then we just have some depth-wise LoRAs that are themselves independent. So you get a little bit of specialization across depth, but not too much.
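The "tied weights plus depth-wise LoRAs" idea can be written as a tiny sketch: one shared projection applied at every invocation, plus a small low-rank correction that is different at each depth. Shapes and ranks here are illustrative assumptions, not Zamba 2's actual dimensions.

```python
# Sketch: a single tied projection, specialized per depth by an independent
# low-rank LoRA delta: output = W x + B_d A_d x. Shapes are illustrative.
import torch
import torch.nn as nn

d_model, rank, n_depths = 256, 8, 6

shared_proj = nn.Linear(d_model, d_model, bias=False)  # tied across all invocations
lora_A = nn.ParameterList([nn.Parameter(torch.randn(rank, d_model) * 0.01) for _ in range(n_depths)])
lora_B = nn.ParameterList([nn.Parameter(torch.zeros(d_model, rank)) for _ in range(n_depths)])

def project(x: torch.Tensor, depth: int) -> torch.Tensor:
    """Shared weights plus a depth-specific low-rank correction."""
    return shared_proj(x) + x @ lora_A[depth].T @ lora_B[depth].T

x = torch.randn(1, 10, d_model)
y0 = project(x, depth=0)   # same shared weights...
y3 = project(x, depth=3)   # ...but a different per-depth LoRA specialization
```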
Nathan Labenz: (26:44) Yeah. Interesting. The fact that it's alternating is somewhat counterintuitive when you say there's specialization across depth. I would have thought you'd maybe segment it into halves and have a first half and a second half, or first, second, third blocks, but you're still staggering them. And then I guess with the LoRA adapter, you have some kind of depth specific thing that is fully localized to that place in the network. Yep.
Quentin Anthony: (27:11) Cool.
Nathan Labenz: (27:11) Fascinating. Alright. I'm talking more than I intended to. Jason, let's get back to your main line of questioning on the technical details.
Jason Meaux: (27:19) Yeah. I guess if I could just set the history straight real quick, we can go back to February. Because this is how Nathan and I actually first came across Zyphra: not with the release of Zamba, but the release of Black Mamba, a mixture of experts model that used Mamba. This was really early in the literature. I think Nathan and I were counting papers at that point. This may have been among the first 10 Mamba papers published.
Nathan Labenz: (27:47) We've since lost count.
Jason Meaux: (27:49) Yeah. Well over 300 now. I really don't know the full count. So what was it about Mamba in particular that got Zyphra making this early bet and going forward with Black Mamba?
Quentin Anthony: (28:03) Yeah. So you're taking me back. When I first joined Zyphra, we were all about MoEs, like, a mixture of experts with very vanilla attention blocks, no exotic shared blocks or anything like that. It was the way forward for on device. We had some ideas for reducing the inference time, memory, and compute for these, and we were all in on training. We trained some good models, actually, at the time for MoEs. Then Mamba came out, and we're like, okay. Wow. This is amazing for on device. But the question is, what does quality look like? Because we already knew that, like, attention free models had this sort of problem of quality. Our motivating figure was in the original Mamba paper. There's a figure where they try pure Mamba versus Mamba plus attention versus Mamba plus MLP. And they have the loss per flop graph for all of them. And they don't show the absolute loss, so you don't know what, like, the actual quality is at the very end. So the MLP plus Mamba hybrid, they saw, was really expensive. It had a lot of flops for equivalent loss, and we thought, we don't know what the actual quality is here. So maybe the quality is actually really good, and you, in fact, don't need all of these notes from attention, all these really rich cross sequence dependencies, which ended up being the exact wrong thing. And we replaced these really expensive attention blocks in our current MoEs, because they're the ones that were really giving us so much trouble, and then just do Mamba, and then a mixture of experts MLP block, and just do that over and over. That would be really inference efficient for what we were preparing for. So we set everything up. We trained the model. We did indeed see that in terms of long sequences, it was amazing for overhead. But Mamba on its own can't really make up for the missing attention cross sequence dependencies and in context learning. Like, those exact same things we learned. So the model was not great at MMLU, for example, which is, like, your early indicator. Not that you wanna saturate MMLU, but that's an alarm bell of, uh-oh, your model is not really able to learn in context. So that was our early lesson. That's why we didn't really proceed with doing Mamba without attention anymore. MoEs, maybe they'll come back in the future. But, yeah, that was the initial model. Any questions or thoughts on it? It was an experiment that we shared with everyone else rather than something that we wanted to, like, actually deploy.
Nathan Labenz: (30:24) Hey. We'll continue our interview in a moment after a word from our sponsors.
Jason Meaux: (30:29) It's a huge change in strategy to go from an MoE to something like Zamba. I mean, you know, MoE models are really hard to run. The parameter count's huge. If you scale that up, in terms of just memory, you've gotta store a lot more weights. And then you go to Zamba 1, way more parameter efficient. So I guess, just the thinking of what drove you to go the other way. I understand the attention part with in context learning. That makes sense. But I guess, like, the local inference part as well a little bit.
Quentin Anthony: (31:00) Oh, definitely. Like, there's a reason all of our Black Mamba plots are, like, in terms of efficiency. It is such an efficient model because there's no attention anywhere. There's no quadratic complexities. There's no KV cache at all. Like, it's just the constant state that Mamba requires, so your memory overhead is on the floor. Oh, I also forgot to mention that motivating figure. So, yeah, the reason we were doing MoEs was that we knew that, like, the flops per loss of, like, MoEs is much more amenable, because you're only routing to a few of these experts at inference time. So we were hoping that we would, like, dip that line of loss per flop quite a bit lower than both the attention plus Mamba and the MLP plus Mamba. And in terms of pure loss, oh, yeah, we did. We definitely did that. But in terms of model quality, it just wasn't where we wanted it to be.
Nathan Labenz: (31:50) Can you talk more about that divergence between pure loss and model quality? I think that is 1 of the biggest picture questions that the entire field is struggling with. Right? Like, from a sort of super big picture, like, when does AGI arrive? And also people worry about this in the context of AI safety with emergence. We believe, or we feel pretty confident, we can predict a relatively narrow range of what the loss value is gonna be as we, you know, extrapolate scaling laws out. But then everybody asks, but what does that turn into? And so you're highlighting here 1 example where you have seen this divergence between loss and quality in a negative way, where, like, the quality is less than you might have hoped based on the loss. Presumably, it could go the other way too. Right? These are, I think, more guidelines than actual laws, which is probably the right way to think about the scaling laws. So, yeah, I'd be interested in everything about how these divergences between loss and quality open up.
Quentin Anthony: (32:55) Yeah. So loss is not a very useful indicator, like, at all. Early on, I use it to detect spikes, obviously. So it tells you, okay, your model died. And we definitely saw those, especially with Black Mamba, which was a very unstable model to train, actually. In terms of model quality, though, it's not a great indicator. Like, generally, you want it to go down. Do you want it to go down to 3 or 2? It's dataset dependent. The noise of your data strongly determines this. I would say 1 thing I check it for is that it's a very early indicator of whether your model is saturating or doesn't have a high enough learning rate, like, whether it's progressing. So your loss decrease is obviously gonna slow way down as you hit a stable point of training. Like, you've reached, like, the slope of the optimizer landscape, and you're going down. So that's great. But if your learning rate, for example, was not aggressive enough, then you might just completely plateau in your loss. And this is kinda like danger, like, you've done something wrong. So that's really all I can tell you. Everything else is coming from evaluations. Now evaluations are also tough, because if you maximize evaluations at the cost of everything else, which, if I wanted to do that, I would just do synthetic textbooks for my entire training data set, I would blow up evals. Great. But when you actually talk to the model, it's super dry. You can't really fine tune it very well because it's assumed this exact, very clean distribution of data. It doesn't really respond to unclean data very well. So what you need is to understand what each eval is actually doing and providing for you. For example, HellaSwag is 1 of the evals. And that 1 is, like, general language capability. It's finding, like, does your model just model very simple sequences, and that's, like, a very smooth eval. It's like a better loss, almost, where it just, like, linearly increases across training. Then you have weirder evals like MMLU that are important indicators. For 1, they tell you how contaminated your data is. If MMLU explodes, you've probably done something wrong. You probably downloaded some data that includes some MMLU questions. And then, obviously, we do very careful test overlap checks with our training sets as well. They'll also indicate this. MMLU is a really good indicator of whether your model reached this emergence of in context learning, and it really is an emergence. So it doesn't show a signal for a very long time. And then near the end, it starts to go up, because your model is starting to grok this in context learning task, which is difficult for it. Finally, there's a bunch of others. Like, we test, like, factual understanding with the ARC Easy and ARC Challenge evals. So those really show us how well the model, like, is understanding factual recall, how much Wikipedia we put in there, how many textbooks. And then, lastly, you really just need vibes, man. We have, like, a set of prompts that show us really key things, like, is the model fun? Like, is the model able to role play, like, as a pirate? That is legitimately something that's really important to see, because if you give a model like Phi and you tell it to act like a pirate, it'll ignore you. Like, it'll stick with it a little bit and then just immediately drop it. And that's important to test.
We also test things like, not quite strawberry, but, like, how many times, like, a flower blooms, if it blooms this many times, like, in a year. Like, there's a lot of vibes questions that we ask the model across training. And at the end of the day, you want your model to perform well on things you actually want it to do, not just benchmarks. And those are the questions that we're asking, like personal assistant questions and this sort of thing. And that's what actually indicates whether the model is becoming quality. Anything there you want me to drill down on? That's kinda our process.
Nathan Labenz: (36:33) That's fascinating. I think more anecdotes, if any, on, like, moments of emergence are interesting. It sounded like you were saying MMLU you kind of understand to be a reflection of in context learning? Is that because you're, like, doing few shot testing? Like, I'm a little bit unclear on what basis you would equate MMLU with in context learning. And then I'm also just interested in what the 2 curves look like as you go through training. I'm taking away, I think, that, like, the loss curve is just doing its smooth decay thing, but you're seeing these, like, relatively sudden spikes in performance on key tasks. Is that right?
Quentin Anthony: (37:18) This is exactly true. It's also, like, all of the vibes questions are also like this, where it's like, oh, the model suddenly gets it. Like, early on and even halfway through, the model keeps getting tripped up on silly things. But when the model gets it, we're like, okay, we've reached this key milestone. So, yeah, loss is always, you just zoom in and you're like, okay, is there a slope? Is it negative? Yes. Move on. The reason MMLU is hard and the reason it tests, like, in context learning is its format. Like, it requires a very strict format. That's really it. It also very weakly tests factual recall, because these are all college based questions. But it's: answer as a letter, multiple choice, A, B, C, or D. So the model is used to filling in the next token. It's not really used to just this big long question, oh, format your answer like this. The model's not used to just saying C. It's used to saying, oh, astrophysics is this field, all these things that are used to actually using text. And for the model to understand, you are supposed to answer with a single letter, and that letter is supposed to encapsulate all of the background knowledge of whatever the question is trying to get you to answer, that is something that shows up very late. So, yeah, MMLU is, like, random for most of training. And then after, it depends on the architecture, actually. So for a dense transformer, you need to be, like, later on in your learning rate schedule. So your learning rate has to have settled down a little bit, and then you have to have seen something like several hundred billion tokens. And then suddenly, MMLU just starts to climb pretty linearly until you saturate. So it's, like, flat, and then it goes up, and then it flattens again once your learning rate is too low and you've seen enough tokens already. For pure SSMs, actually, this number is really high. For Falcon Mamba, this was, like, 3 or 4 trillion tokens, I think. It's multiple trillions before. And we theorized this is because of in context learning. Like, it takes a long time. Like, you need a little bit of attention to kick the model into understanding: early on in the question, it told me to answer with a letter. And that sort of relationship gets lost in a pure SSM block. But, yeah, this is what it all looks like.
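A toy illustration of the format point: MMLU is typically scored by asking which answer letter the model assigns the most probability to at the final position, so all of the relevant knowledge has to be compressed into a single letter token. The helper below is a generic sketch, not the harness Zyphra uses; the model and tokenizer are stand-ins for any Hugging Face causal LM.

```python
# Generic sketch of letter-based MMLU scoring. The prompt template and the
# " A"/" B"/" C"/" D" tokenization are simplifying assumptions.
import torch

def mmlu_correct(model, tokenizer, question, choices, answer_idx):
    prompt = question + "\n" + "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices)) + "\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # next-token distribution
    letter_ids = [tokenizer(f" {l}", add_special_tokens=False).input_ids[0] for l in "ABCD"]
    pred = max(range(4), key=lambda i: logits[letter_ids[i]].item())
    return pred == answer_idx
```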
Nathan Labenz: (39:24) So I often think back to that paper that was given paper of the year at 1 of the major conferences. Emergence is a mirage, basically, was the sort of counter narrative to the earlier narrative of emergence is a big deal. Mhmm. Sounds like you're on the side of emergence is a big deal, and I'm basically with you. But to try to steelman the idea of emergence as a mirage, I think 1 of the more powerful points that they made was, like, a lot of these graphs that we see are logarithmic, obviously. And so when a thing is presented as a spike, that is still often over, like, actually an order of magnitude of data. Right? You're like, 10 to the n flops, 10 to the n plus 1, 10 to the n plus 2 flops. And you're like, oh my, that came up fast. But actually, that was 10 times more than everything that came before in that plot. Are you seeing things emerge? Is that consistent with what you're seeing in terms of emergence? Does that, like, change how you think about it at all? Or are you perhaps seeing things that are spiking even faster as you really drill into what's happening? And I guess 1 more, if I could tack on even 1 more question there. You said the term grokking. Is that basically your mental model for this, that these things are learning the right algorithm or the right conceptual representation to make a phase change?
Quentin Anthony: (40:46) Yeah. It's hard. Like, everything I'm about to say is speculation. I'm just a very empirical 1 in my case. So it's definitely something to do with learning rate. So there's papers like the MiniCPM paper where they just keep a constant learning rate for a long time, and then all of a sudden, they anneal it really fast to 0. And you get all of the quality benefit during that annealing phase at the very end, when you're annealing it very close to 0. And also many tokens, so it requires a lot of tokens, and then also something about the stage in the learning schedule. My sort of theory here is that the model needs to see a lot of tokens. Like, that's why I'm saying grokking. Like, it needs to grok: okay, this type of data that requires me to follow instructions exists. That requires a lot of tokens for the model to learn. So that's the grokking part. In terms of learning rate, I think it might be that you need to settle more into the loss basin in order to, like, make these really fine grained changes to how the model views the world. It really makes me feel gross to say something that generic, but it's some sort of relationship between the 2 of these. But both of them, I would call it grokking together.
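For the learning rate point, here is a sketch of the warmup-stable-decay style schedule Quentin attributes to the MiniCPM work: a long constant phase followed by a fast anneal toward 0 at the end, where much of the quality gain shows up. The peak learning rate and phase fractions are illustrative guesses, not anyone's published hyperparameters.

```python
# Sketch of a warmup-stable-decay learning rate schedule. All constants are illustrative.

def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)    # short linear warmup
    if step < decay_start:
        return peak_lr                                  # long constant ("stable") phase
    # fast linear anneal toward ~0 at the very end of training
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + frac * (min_lr - peak_lr)

print([round(wsd_lr(s, 1000), 6) for s in (0, 5, 500, 920, 999)])
```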
Nathan Labenz: (41:58) Yeah. The learning rate, I do think, or the learning schedule, is a probably generally underappreciated part of this. It's not something I'm, like, super deep on either, but I've been watching for Sophia. I don't know if you've looked into that, like, alternative to Adam. Do you guys use that? We do not. Interesting. Have you tried?
Quentin Anthony: (42:20) Yes. We think that this is partially because of the learning rate schedule. So more aggressive learning rate schedules make a mirage of increased quality when you're really just playing with learning rate. So we stuck with Adam for now. We might be looking at, like, Shampoo here soon, but not Sophia.
Nathan Labenz: (42:39) Okay. I wanna understand that a little bit better. But just to, Mhmm, try to set it up for you: if we envision a loss landscape, of course, we're all gonna do this in different ways in our own minds. For me, it's a 3 d landscape, even though, of course, this is huge numbers of dimensions in, Yeah, reality. But I subscribe to the school of thought where we visualize things in 3 d and then say n dimensions really hard in our head, to allude to the famous cartoon about that. Sure. So these are like very strange landscapes, right, where we're seeing an observation, computing a gradient: how would we have done better on this observation? Let's move the weights a little bit in that direction, and we'll be at a slightly lower loss as a result of having made that change. And then, in theory, that gives us better performance. Although, of course, it's a super noisy thing, because an update that made us work better on 1 particular observation might not work better on another observation. There's batching that goes into this, aggregating a bunch of observations at once and making an update. And this is basically a very black magic sort of thing still, where nobody really has a great sense for what the true nature of these landscapes are or how best to navigate them. But when you talk about aggressive learning rate versus annealing, that's basically saying how big is the step size that we're gonna make, right, from 1 moment in weight space to the next moment. Like, how big is that delta? And if you find that the value is really happening most in this annealing phase where you're shrinking the step size, Mhmm, this sort of suggests to me almost like an Antarctic landscape where there's, like, lots of little crevices that you could be, like, around a good space, and then you really have to drill into this very, very local space in the landscape to find something that really works well. And because obviously we see a million things happening in the broader ecosystem, that would suggest that there's probably a lot of them, and that's where I get to this crevices notion. But then I wonder there. I'm like, I also see papers like the platonic representation hypothesis, where at the same time that's happening, there's this other line of thinking that, at scale, things seem to be converging. The basic notion of that paper, and I'm sure you've at least come across it a little bit, is that with greater scale, with more modalities, it seems like the internal representations of concepts are converging in some, like, high level statistical sense. So how do you make sense of, can you begin to make sense of, all that, where there's, like, seemingly a lot of different, like, local points that you could drill into, but there's also this sense of, like, global convergence? Can those be compatible? Does 1 have to be wrong? I find myself confused.
Quentin Anthony: (45:25) For the pocket idea, it's definitely not a flat area with lots of pockets. If we were to go with that, it would definitely be a general slope downwards with pockets along the way, and then you can settle into 1 of them as you go. This is just because, even in the constant phase when you're taking those big steps, people like the MiniCPM authors still saw improvements, just less improvement. I definitely know that your loss landscape is much smoother for larger models. You can more richly, like, model the space, and therefore just increasing the scale makes it much easier for your optimizer to make progress. This is why things like novel optimizers, like Adafactor, Sophia, or, like, second order optimizers, exist. It's like, the bumpier the landscape is, the more accurate your step should be, so that you don't get kicked out of, like, the gradual slopes that you want to reach. It's also, like, why we're seeing a lot of the conventional wisdom assuming big models, because big tech assumes big models. That's the reason people all do dense transformers with, like, large scale optimizers. The reason why people say, oh, Shampoo doesn't scale and all this kind of thing, is because when you're at large scale, your step is gonna be broadly accurate anyway when you're in this giant, smooth loss basin. Like, why bother with any of these things? And I agree with them. You shouldn't bother. But when you have a 3 b novel architecture model with a really bumpy loss landscape, that accuracy really matters. That also translates into things like batch size. So I trained the Pythia models back with Eleuther, and we were seeing that, for example, like, large batch sizes were causing instabilities. And it really confused us for a long time, because the higher your batch size is, the more accurate your gradient should be. Right? So you have these super accurate gradients, even at the cost of seeing less data. Right? As you increase batch size, you'll see less data. But now, looking back at the data, I'm fairly certain that if you have, like, bad samples, and as we were increasing the batch size, we were also increasing the learning rate, and if there's just a few outliers in that batch that sort of kick your gradient in the wrong direction, and then you have a large learning rate, you take a big step in that wrong direction. So larger batch sizes were actually bad, because you were reaching too many of those outliers per batch. And you reached too many bad batches in a row, and now you're going completely the wrong direction, and you just diverge entirely. So we actually had to use lower batch sizes, but it's complex. This is also why people clip their gradients. Because if you take 1 step in the wrong direction, make sure it's not a big step, and then hope that that step can be reversed by the next couple steps. So there's a lot of, like, weird alchemy going on to get models to actually converge. I've noticed that every run has its own, like, cursed way, and the reason people, like, really wanna use Adam, the reason we're using Adam, is because Adam is extremely robust. That's the main reason. So that's why people can publish papers like Sophia and Adafactor. If you look enough, then you'll find cases where they're better. But Adam is good at everything. It's not the best at some things, but it's good.
It doesn't have any, like, glaring weak points, unlike optimizers like Adafactor and such that might just not work, and then you spend 3 weeks, like, trying to get your initial stability for this weird combination of hyperparameters and model size to actually work. That's how I look at it, though.
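To make the momentum, variance, and gradient-clipping points concrete, here is the textbook Adam update with global-norm clipping. This is a generic sketch, not Zyphra's training loop; the hyperparameters are illustrative.

```python
# Bare-bones Adam step with global-norm gradient clipping. Hyperparameters are illustrative.
import torch

def adam_step(p, grad, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, clip_norm=1.0):
    # clip: a single bad batch can still only move the weights a bounded amount
    gnorm = grad.norm()
    if gnorm > clip_norm:
        grad = grad * (clip_norm / gnorm)
    m = b1 * m + (1 - b1) * grad          # momentum: running direction of descent
    v = b2 * v + (1 - b2) * grad**2       # variance: per-parameter step scaling
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    return p, m, v

p = torch.randn(10)
m, v = torch.zeros(10), torch.zeros(10)
for t in range(1, 6):
    grad = torch.randn(10)                # stand-in for a noisy minibatch gradient
    p, m, v = adam_step(p, grad, m, v, t)
```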
Nathan Labenz: (48:49) Somewhat novel intuitions for me that I wanna just confirm with you, that they match yours. Overall, you're like, it's easier to get big models to work. It's harder with small models. That seems to be maybe a reflection of just, like, how overloaded the small models are. If I think about just a super dense, everything packed in super tight sort of network, then I would imagine that it's just easier to mess things up if I take a step in the wrong direction. Whereas if I go to a trillion parameters at the very far end of, reportedly, a GPT-4 type architecture, then I can spread this out over many experts, and things aren't so overloaded, and 1 wrong step in 1 place might not even affect most of the network, because it's localized to experts or what have you. And so then I can be more aggressive in my step size, and I can get something like a Sophia to work well in those bigger contexts, because I just have more room for error, essentially, in my training process. I just don't have that luxury if I'm training a smaller model. Am I interpreting you correctly there?
Quentin Anthony: (49:58) You're close. So for larger models, you actually want lower learning rates, because even though it's a smoother loss landscape, you don't need the small-model ability to really jump over these crazy hurdles. Think of Adam, right? You have a variance term and you have a momentum term. That's to tell you: if there are bumps along the way downwards, you still make progress downwards, because you still have momentum. You have faith that this direction is probably the best for the loss landscape despite some bad gradients. For larger models, despite the smooth loss landscape, you want a lower learning rate just because a lot of instabilities tend to come up in general, just because you have so many parameters that can go beyond the point of no return.
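For readers who want the two terms written out, here is a minimal NumPy sketch of a single Adam update, with the momentum (first moment) and variance (second moment) accumulators Quentin refers to. Hyperparameter values are just common defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, written out to show the two running terms."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # variance: running mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    # A noisy gradient that disagrees with the running average barely moves the
    # parameters, which is the "robustness" being described above.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```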
Nathan Labenz: (50:42) So I heard the Adam one. The momentum is there to say: we're not just going to look at this particular batch of updates and calculate naively what would have made us better here, because we know there's a lot of randomness. So we're going to keep track of the general direction we're going, which is the momentum term, to make sure we're not getting randomly sent off in weird directions; we'll try to have some general sense of what direction we're going. I'd be interested to hear the similar narrative form for Sophia if you have one, something that would explain why that's different and maybe why it's better in some places than others. And then I was also going to ask about distillation. It seems like the big trend with the frontier developers has been: go huge and then distill. I wonder if that's something you guys might also be thinking about, if there's a Zamba 70B or even 400-and-however-many-B in the background somewhere.
Quentin Anthony: (51:44) In terms of Sophia, I don't have a good intuition for this, actually. I've explained why Adam is great and why I think Adam is good at everything. It's very robust, but I can't really explain why everything else is brittle in its own way. And this is not really to say that everything else is bad. It's just to say I haven't put the time into any other optimizer, because I've been burned enough times by alternative training strategies that just end up not being generalizable. I don't want a toolkit that's specialized to each training run. That's not the point, because then I have to spend as much compute as a training run to figure out how to get this specific thing to converge. I want it to just converge. And even if it doesn't converge quite as well, I'll train on more tokens because I was able to kick it off day one, and I'll still end up at a lower loss. On your MoE point: MoEs are themselves a really tricky thing to optimize because, for one, the attention block is not divided. The attention block sees everything. But for the expert blocks, you're subdividing the batch across all of them. So they're effectively seeing a different gradient noise scale than the attention block. The right batch size is clearly much higher, but do you scale the batch size up by the number of experts? Because you subdivide, that intuitively makes sense. But the problem is that now the attention block is getting this massive batch size. This is part of why MoEs are super unstable to train, I think. BlackMamba was a pain to train because of this. One way you can think of it is that the attention blocks are seeing more accurate gradients than the experts. Trying to port mu transfer is something I'm really looking into here. Actually, the team at Cerebras is really deep on this. They're really good, so we're trying to crack this. Mostly them, but, yeah, we're working together. In general, for these hybrid architectures that we're seeing, that's kind of true for us too, because we have this shared attention block, which is going to have more accurate gradients and see more data. Finding something that converges is really difficult. Being able to navigate the hyperparameter soup really determines if you live or die.
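A toy calculation of the mismatch he's describing (all numbers are made up): in a top-1 MoE, the attention block sees every token in the global batch, while each expert only sees the slice routed to it, so the two see very different effective batch sizes and therefore different gradient noise.

```python
# Assumed, illustrative numbers; not Zyphra's or anyone's real config.
tokens_per_batch = 4_000_000     # global batch size in tokens
num_experts = 8                  # expert count
top_k = 1                        # each token routed to one expert

tokens_seen_by_attention = tokens_per_batch
# With perfectly balanced routing, each expert's slice of the batch:
tokens_seen_per_expert = tokens_per_batch * top_k // num_experts

print(tokens_seen_by_attention)  # 4,000,000 tokens per step
print(tokens_seen_per_expert)    # 500,000 tokens per step, 8x noisier gradient estimate
```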
Nathan Labenz: (53:53) Fascinating. So, did we get to distillation? Are you guys going to pursue a sort of big-to-small strategy as well?
Quentin Anthony: (54:00) Yes, that is the plan. The problem is that you need a lot of compute to train those large models, so it's something we're scaling up for now. In terms of the actual distillation step itself, unfortunately, the big labs are very tight-lipped about this. Even annealing only started emerging recently. Annealing, by the way, is like a continual pretraining step that we picked up early, where you first do a standard cosine decay: you warm up the learning rate, then slowly decay it over time, but only down to some value higher than normal. And that's on your phase 1 data. This is more like web data. It's a bit noisier, but it's still important so the model knows what this data looks like and knows how to respond to it. But later on, you want to bake in more intensely a higher-quality subset of your data. Higher quality doesn't mean textbooks again; it doesn't mean you want your model to be dry. But you can only look closely at so much of your data. Everyone says look at your data, but you can't look at 3 trillion tokens. You can look at 50 billion, 100 billion tokens, apply some really detailed filtering and deduplication to those, and make sure it's the highest quality, even while it's a distribution that represents all of your data. So, anyway, you decay down to a relatively high learning rate on your first phase, and that's where most of your tokens are coming from. And then in the second phase, you very quickly rewarm your learning rate. You say, okay, I want the model to respond really quickly again; we're in a new loss landscape, we have a new distribution of data. And then you very aggressively decay down to zero. You say, okay, I want my model to respond quickly to this data and then settle very quickly into the new loss basin that emerges from the high-quality data. And you're able to stomach this new high learning rate again because your data is very clean. You don't have the problem I was mentioning earlier with Pythia, where you have these crazy outliers that kick your gradient in some different direction. So, starting from a model that's already settled into a good place, you're able to increase the learning rate a lot on this high-quality data and really bake it in so it's top of mind for the model. This also increases flexibility. I see it as a branching path. The base model is the trunk. Then you can instruct tune and then fine tune, so your model becomes progressively narrower along one branch. But if you have multiple anneals, that's like multiple sub-trunks, right? So you can anneal on role-play kind of data, or, if you want a very factual model, you can anneal on textbooks and such. And then from there, you can fine tune so that the role-play model plays specific characters, starting from the annealed role-play checkpoint. That is a much more flexible scheme than just having a single base model and a bunch of fine tunes from that base model.
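A rough sketch of that two-phase schedule, with invented step counts and learning rates. The real recipe has more knobs; this only shows the shape being described: cosine decay down to a relatively high floor on phase 1 data, then a quick rewarm and an aggressive decay to zero on the higher-quality phase 2 mix.

```python
import math

def two_phase_lr(step, phase1_steps=100_000, phase2_steps=20_000,
                 peak_lr=3e-4, phase1_floor=1e-4, rewarm_lr=2e-4,
                 warmup_steps=2_000, rewarm_steps=1_000):
    """Learning rate at a given step under the two-phase scheme (all constants invented)."""
    if step < warmup_steps:
        # Linear warmup at the very start of phase 1.
        return peak_lr * step / warmup_steps
    if step < phase1_steps:
        # Phase 1: cosine decay, but only down to a relatively high floor.
        progress = (step - warmup_steps) / max(1, phase1_steps - warmup_steps)
        return phase1_floor + 0.5 * (peak_lr - phase1_floor) * (1 + math.cos(math.pi * progress))
    s2 = step - phase1_steps
    if s2 < rewarm_steps:
        # Phase 2: quickly rewarm from the phase-1 floor on the high-quality data.
        return phase1_floor + (rewarm_lr - phase1_floor) * s2 / rewarm_steps
    # Then aggressively decay all the way to zero.
    progress = min(1.0, (s2 - rewarm_steps) / max(1, phase2_steps - rewarm_steps))
    return rewarm_lr * 0.5 * (1 + math.cos(math.pi * progress))
```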
Nathan Labenz: (56:51) Can you define anneal? I've learned this in the context of biology, and I'm looking it up online; it's also used in glass and metalwork. Maybe this is wrong, but I guess I think of it as hardening into its final shape. Or with DNA, it's actually coming back together and reaching some lower-energy stable point, basically. Is that the mental model here too?
Quentin Anthony: (57:14) Yeah. We put very detailed instructions and explanations in what we call the zcookbook, the letter z and then cookbook. If you look it up on GitHub, we have a whole section on what annealing is, what it means for us, what it means for your model. But, yeah, annealing for us is trying to harden the base model a little bit into a pre-fine-tuned checkpoint that has some type of data we want top of mind. Practically, this means rewarming your learning rate on the new distribution. It means having some replay, so you still feed some tokens from the phase 1 data in your batch so that you don't get catastrophic forgetting. And then, with that replay plus the new data distribution, you very aggressively anneal the learning rate down so that your model settles very definitively into a loss basin. This is all getting back to the reason I brought up annealing, which was your question about distillation. Since the GPT-3 paper, the big labs have learned to be a bit less open with people, because otherwise people will actually take compute and try to do what you're doing. That's why annealing was a hidden trade secret for a long time. The Llama team cracked this with Llama 2; some of them split off into the Mistral team, and they knew this sort of secret. And then it became more and more public over time, and now everyone publishes it. Distillation has been the same sort of regime: the big labs have cracked it, but they're staying tight-lipped about it so they can extract all the value before their people get poached somewhere else and it ends up being public anyway. We don't really know the secret to distillation, actually, but synthetic data from large models is a proxy for it, right? So we're trying to look at this at least, so that we can steer our annealing and post-training strategies to better approximate the output of large models via synthetic data, and then trying to actually use the logits during training to stabilize. Remember earlier we were talking about whether distillation can help you navigate the bumpy loss landscape by having basically a big Adam that's holding your hand, saying, okay, no, this is the right direction, because this is what the big model would do. The big model is able to see much better that this is just a little bump and you should keep going down. Using those logits at training time is something we're still trying to crack. We see generally positive things from a model convergence standpoint, but it's still really inefficient to have a big model running live inference at training time. So pretraining-scale distillation we have not cracked yet, but I'm confident that we will.
Nathan Labenz: (59:46) So the basic idea I have of distillation is training a small model to reproduce the outputs of a big model. Not so much at the token level, but at the logits level, or potentially, if you had direct access to the large model, you could perhaps tie in at multiple places, right? You could say: I've got my trillion-parameter model here, it has 100 layers; here I've got my 7B with however many layers. Okay, at this point I want to tie up to the big model and try to figure out a way to imitate the internal representations. Am I off base conceptually there at all?
Quentin Anthony: (1:00:30) That's exactly right, and maybe I'll expand on what I was saying to tie it to what you're describing. When I say synthetic data, it's like a poor man's way of doing distillation: you're just choosing the max of the logits, so you're only training on what the big model is paying attention to. And you can do that with closed-source models too; we could do that from GPT-4 or whatever else. Now, logits are important because they are the large model saying exactly how it would respond to the current batch of data. That's why I'm saying it's kind of like holding your hand in terms of the optimizer. Logits are also super important because you also want to know what the large model is not paying attention to. Those tiny near-zero values in the logits, the whole distribution, help the small model, not just the large values you could approximate with synthetic data alone. But, yes, you need access to the model for these logits, which is the big hurdle. It's helped now by big open models like Llama 3 405B. But when you're running inference of Llama 405B every single training step, that's why I was saying that at pretraining scale this becomes prohibitively expensive. Trying to get around that expense hurdle while still getting logits per step is what we're trying to figure out. We think there are some cheaper ways to do this, and I think we will get them figured out. But this is why it's not popping up all over for open-source people: it's either prohibitively expensive, or synthetic data is just not a rich enough representation to really help you steer in the right direction.
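As a concrete reference for the "whole distribution" point, here is a minimal, generic knowledge-distillation loss in PyTorch: the student matches the teacher's full softened logit distribution via a KL term, rather than just the argmax token you would get from synthetic data. This is the textbook formulation, not a description of any particular lab's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the full vocabulary distribution."""
    # Soft targets from the teacher, including the small probabilities it
    # assigns to tokens it would never actually sample.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # batchmean reduction averages over tokens; T^2 rescales gradients as in
    # the standard distillation setup.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```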
Nathan Labenz: (1:01:57) And just to make sure I understand what makes it prohibitively expensive: it's basically that running 405B is some meaningful fraction as expensive as training 405B. So if you want to train your small model but also have to run 405B, you're going to end up spending as much compute as if you were just training a much bigger model in the first place.
Quentin Anthony: (1:02:16) Exactly. Yeah. In a practical sense, if I have a 7B, then I'm already doing parallelism across GPUs to get it to fit in VRAM. And then if I have 400 billion parameters that I also need to store, even if it's just for inference, you don't need to store optimizer states and gradients and all of that for the big model, but you still need to fit it somewhere. You still need to use the flops to infer it. You need to synchronize its forward pass with the forward pass of your own small model. Why don't I just train, like, a 4B from the start, and why even deal with distillation? Is it really worth it? So far, these practical trade-offs have not been cracked for us. That's what I'm saying.
Nathan Labenz: (1:02:53) Yeah. Gotcha.
Quentin Anthony: (1:02:54) It's not practical. It's probably a slight improvement, but a lot of things are slight improvements. And just seeing more data is always a very direct way to improve your model. If I could see more data or have more parameters, that's a more direct way to improve model quality right now than trying to get logits from Llama 405B. That's just the practicality.
Nathan Labenz: (1:03:14) Do you think it's possible that the big labs maybe don't necessarily have a huge secret, but what they maybe have is just a different cost benefit analysis where they're running so much inference that it's worth it for them to do this?
Quentin Anthony: (1:03:28) Oh, I think this is likely. Yeah, this is likely. I think that if you're Microsoft working on Phi, then just use GPT-4: it's on your own systems, you have effectively infinite funding. So there are two possibilities, right? Either they've cracked a cheaper way to do it and we have to find it, or they have more money and we have to find the cheaper way, crack it, and then do it. And both of those cases require us to crack or find some cheaper way, unfortunately.
Nathan Labenz: (1:03:56) Are there other things that you would put in the bucket of: frontier labs are generally understood to have a solution, but it hasn't diffused to the rest of the community yet?
Quentin Anthony: (1:04:07) Data is a big one. Data and data cleaning require time, patience, and really just manpower that open source does not have. So when you're open source, you're beholden to whatever datasets the people who can completely focus on data put out, so it will always lag the big tech companies. Big tech also has access to its own data, which is very clean and in effectively infinite abundance, and we're definitely behind there. So, yeah, cleaning data and data-cleaning pipelines. This is what we talked about: we just released Zyda 2, which is our Zyphra dataset 2. There's some newer data in the open-source regime, things like DCLM and FineWeb and Dolma and Zyda 1. Cleaning them up and producing an actual at-scale dataset with NeMo Data Curator, which is NVIDIA's data processing tool, has really lowered the bar for entry on data, we found. But, yeah, data's one. Another one is hyperparameter transfer, which unfortunately has not been picked up much by the community. You might recognize this as muP or mu transfer. There's also, I think it's Modula, a new paper that's found similar things. But it's 30, 40 pages of math for these papers, and it's not approachable for open-source people or small labs. So only big labs have the compute and funding to put one person on it: okay, go read the muP paper for a month and try to port it to our ginormous monolithic code base and get it to actually work. So there are a lot of trade secrets, I would say. Distillation is another one.
Nathan Labenz: (1:05:38) Give us a little bit more on that one, because I honestly don't have a lot of intuition for what that is adding to the overall stack. Mu transfer? Yeah, exactly.
Quentin Anthony: (1:05:46) Oh, yeah. So mu transfer is a way of doing initialization and scaling across width. You do all of your hyperparameter search on some really low-width model that's really cheap to train, and you do all of your learning rate sweeps, for example, all of your batch size sweeps. And then you just scale by the ratio of your low width to the high width of the actual model you want to train, both in terms of the learning rates of your hidden layers, and things like your embedding scaling, which you can multiply by some new hyperparameter. Then you get essentially perfect transfer from the small model to the big model, as in: the learning rate I found was optimal for my small model will also be the optimal learning rate for my big model. And this is all just making sure activations don't explode, via all of these small scaling factors of width. This helps you in two ways. You're able to more accurately search the loss landscape, because compute is so cheap on a small model that you can run a wider or larger set of sweeps of learning rates and batch sizes and such. It also helps the general stability of your model, because your activations aren't exploding and your model's not learning on the fly how to account for its own explosions, so you're able to handle a much higher learning rate. Alternative architectures and optimizers also get much easier with this. For example, with mixture of experts, if mu transfer is cracked, then you know exactly how to ensure the attention block and the expert blocks are both seeing relatively equal-magnitude weight updates, which is what you want for stability. These sorts of details tend to be hoarded in big tech right now and slowly diffuse out to everyone else, just by people moving between organizations, people publishing, people in open source kind of cracking it, that sort of thing.
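Very loosely, the flavor of that width-based transfer in code. This is only a sketch of the width-ratio idea with hypothetical numbers; the real muP prescription treats embedding, hidden, and output layers differently and has additional multipliers, so treat this as illustration, not the actual recipe.

```python
def scaled_hyperparams(base_lr, base_width, target_width, base_init_std=0.02):
    """Sketch of width-based hyperparameter transfer (muP flavor, simplified).

    Tune base_lr on a narrow proxy model of width base_width, then scale the
    hidden-layer learning rate and init by the width ratio when moving to the
    real model. Only the 1/width idea is shown here.
    """
    ratio = target_width / base_width
    return {
        "hidden_lr": base_lr / ratio,                    # hidden-layer LR shrinks with width (Adam-style rule)
        "hidden_init_std": base_init_std / ratio**0.5,   # keep activation scale roughly constant
    }

# Example: sweep learning rates at width 256, then transfer to a 4096-wide model.
print(scaled_hyperparams(base_lr=6e-3, base_width=256, target_width=4096))
```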
Nathan Labenz: (1:07:38) If there are others in that category, keep running them down.
Quentin Anthony: (1:07:42) It's endless, man. This is why EleutherAI is so important. I mentioned them before; it's an open-source collective where everything, start to finish, is open source: the dataset, the training framework, the actual produced models. If you don't see all of it... For example, Llama 1 was great because they shared everything. They shared their whole data pipeline, and people made RedPajama, which is a full reproduction. With Llama 2, they stopped sharing the dataset, and as I told you before, this is because of annealing. It's just too tempting to keep the important parts to yourself and publish a paper that's vague and glosses over them. But keep going, what else? The second-order optimizer thing that I told you about for small models specifically is definitely something hoarded. That's a trade secret. I mean, Gemma also did it; they just say, we use a second-order optimizer. Which one? How do you distribute it? That's a big one, because if you have a second-order optimizer, you have a bunch more state you have to store in memory, and how you split that across GPUs is nontrivial. So that's another engineering detail they let other people play catch-up on while they extract the value from it. So: second-order optimizers on small models, distributed in a specific way. We're now cracking that with Shampoo. And all of these things I'm mentioning interplay. If you have a second-order optimizer, how do you port muP, mu transfer, to it? Because now it's not Adam; you have to reapply the theory. Do you spend another month trying to figure out the new theory? I would say, curiously, alternative architectures like Zamba are not hoarded by big tech, because big tech saw the quality issues with pure SSMs in a lot of ways and said, okay, quality is the most important thing to us, and we have effectively infinite inference-time compute, so let's not bother with it at all and ensure we have the highest-quality model. And that's more for us, right? But I think they were much slower to catch on to SSMs and alternative architectures, just because it was not clear to them at the time that you could solve this accuracy problem with some exotic architectures and by being really data-driven. What else? Parallelism schemes for all of these. I was starting to mention this with the optimizers, but parallelism schemes for new model architectures and optimizers and combinations of them tend to be proprietary. So Llama shared that they have a 4D parallelism sort of scheme, but how you set up that topology on your own, that's all empirical knowledge you have to get from training a model. Same thing with arcane things like checkpoint restart. A lot of these bigger runs use dedicated nodes that are just there to save checkpoints, and when you have thousands of GPUs, one of them dies all the time, so you have to keep slotting in the checkpoints from those backup nodes. That's really easy conceptually, but hard to set up, and it lives in some proprietary big tech stack. Everything, is what I'm getting at. There are a lot of gritty details, engineering details, scalability details that are very closely kept and tied to whatever organization has them.
And they also know that even if you, say, worked on one of those big tech teams and you know the secret and you go somewhere else, porting it to the new stack is very nontrivial, right? It's completely built on the proprietary stack of whatever company came up with the idea. Do you remember it well enough to port it to this new stack? Maybe not. So there's really a hardening of ideas and training processes that exist within each organization, and that is increasing over time.
Nathan Labenz: (1:11:10) So I think we have a few more technical questions. One obvious question, kind of pulling up for a second, would be: the story you're telling sounds pretty consistent with the allegedly leaked Anthropic fundraising deck, in which they said that the leading companies in 2025-26 might get so far ahead that nobody can catch up. And I think we've understood that, or I've understood that, broadly as sort of an accumulation of these secrets and then also just having the best models. We now have pretty credible reporting that OpenAI is using o1 to generate the data for the next scale-up. And that seems like there's not just a moat, but multiple moats in depth there. And yet you guys at Zyphra are a small company trying to figure out how to compete in this space. So what is your sense of how you and other companies that are not infinitely resourced, even if you're reasonably well resourced, can compete? Is it about finding a niche that they're not going to play in, or is it about some sort of breakthrough insight that's going to be needed? How do they not just come to dominate the entire space?
Quentin Anthony: (1:12:27) Yeah. I will say small models are overlooked by the big companies, because for OpenAI, xAI, Anthropic, all of them, their job is producing the highest-quality model. And, sure, they want to distill into a smaller model for inference reasons, but if it came down to cracking distillation or making an even better, bigger model, the bigger model wins, because it only has to be a little bit better. Claude Sonnet 3.5 only had to be a little bit better than ChatGPT to take a large portion of market share. So accuracy is really what carries them. It's also why they don't want to experiment too much with these weird hybrid architectures: if it affects model quality, and you only find that out when you train a big model, that's a lot of cost when you know that dense transformers work the way you expect. In terms of how we can compete with them, I would say it's a mixture of things. They're focused on large-scale inference and large-scale training. Even if they have things like distillation, that's only one part of the process. We had to crack annealing, we had to crack architecture, we had to crack data, the entire training process from start to finish; it wasn't just the distillation step, which only helps when you're at a big company. We're very nimble in that way. We're definitely in a unique time where, if you're nimble enough and you're focused on small models, you can very clearly have an advantage over big tech. Over time, as more of them use distillation and such, if we were to not move at all, we would definitely vaporize compared to their resources. So the way we're staying ahead is by continually pushing on to whatever the next free lunch is for small models. Second-order optimizers are one. Getting a cheap way to do distillation is one that will protect us from that angle. We're clearly very willing to make big bets on alternative ways of training, and big tech just is not looking very closely at this regime. And that's why we have the best model now: the 7B is the best for exactly this reason.
Nathan Labenz: (1:14:27) Are you studying the Entropix project that has recently been catching a lot of buzz, or anything similar?
Quentin Anthony: (1:14:34) Very loosely. I'm not an expert. I would kind of categorize it under "definitely doesn't scale." I believe it was Tim Dettmers who tried this at scale with larger parameter counts, and it was no better than Adam. Another thing is that I view it as kind of specific. So whether it's robust, or whether it's very specific to the training setup on something like nanoGPT, is yet to be determined. This is true for a lot of things: you can show huge speedups on, what is it, crowdsourced compute, like pretraining compute, but it's hard to show things that are robust. That is the actual test: is this robust? And I'm waiting for someone else to spend the compute to determine whether this Entropix thing is robust.
Nathan Labenz: (1:15:18) You're describing it in a different way than I expected. Okay. The one that I have been at least momentarily fascinated by recently is essentially an improved sampler. It tries to differentiate between when the model is confident and should just pick the most likely next token and everything's hunky-dory, versus when it's out of distribution. And then it does interesting things, like dynamically injecting an additional clarifying question if it detects that, generally, we're confused: let's maybe burn some more cycles. How can we do that? Let's artificially input a clarifying question. And sometimes it'll even back up and say, we seem to have kind of lost the thread here; let's go back to an earlier point and rerun inference again. So I've thought about this entirely as an inference thing, and you were going toward the training side. Are we talking about the same project? And if so, I missed the training side of it.
Quentin Anthony: (1:16:17) You're making me question myself. Maybe I'm confusing it.
Nathan Labenz: (1:16:20) Well, for what it's worth, in terms of a possible next free lunch, the one thing you did say that matched my intuition, even though I've read a little that it might not be working as well at larger model scales, I wouldn't say that's well established in my mind. But if there were a story as to why, it would be that the larger models are maybe grokking into some version of that on their own, especially through all the reinforcement learning they're doing in late-stage training, whereas the small models are maybe not getting there. And so layering on these sampling heuristics could be much more beneficial at small scale. This naturally has me thinking, too: what about learning a sampling function as part of the overall architecture? That feels like a candidate, if any of this is real. I haven't quite gotten to the point where I'm convinced there's a real there there, but it seems like there is. Perhaps learning a sampling function could be another relatively small number of parameters that could lead to some qualitative change.
Quentin Anthony: (1:17:27) I agree, it definitely could. I also agree on the point you made about larger models doing this on their own. For larger models, for example, data quality matters a little less, because larger models learn to self-select what is important and what's not. There's a reason that when you get to a large enough model scale, you just scale it up and churn out more and more data on more and more parameters, and the model is so robust that it's able to learn properties that we would need explicit training processes for at the smaller scale. So that is one thing big tech kind of leans on as a crutch, and we have to find the explicit process that emulates the same behavior from the model a bit faster, because they can just rely on the model figuring it out on its own. But, yeah, I do agree that it could help us; we just haven't looked too deeply into this yet.
Nathan Labenz: (1:18:18) I'll ask one more, then, Jason, let's come back to you, and you can map some of this stuff onto the specifics of Zamba and state space hybrids in particular. How would you describe where we are in this overall process? We always hear we're still so early. That would basically be my vibe-level takeaway from your comments: there's a lot that maybe hasn't even been properly investigated, let alone understood.
Quentin Anthony: (1:18:48) I would agree with this. I definitely don't think this is a solved problem. I don't think we've found the limits of scaling. We clearly have not found the limits of architecture. Whether there's some groundbreaking thing tomorrow, or, like, attention 2.0 comes out and everything is thrown away, I can't really say. But I can definitely say that we've moved very quickly into a search space that's massive. There's a lot of very low-hanging fruit that's just obvious, it's everywhere, that we have not picked up. And on these training processes and stability questions, there's still so much out there. Even if you were to just take pretrained-model papers and compare them, I'm sure there are a lot of things that people are reinventing. There are a lot of trends across scale and training process to be found, things people have implicitly found but either not recognized or purposely not shared with everyone else, that are just lying around.
Nathan Labenz: (1:19:40) That's my sense too. Doesn't seem like we're at the end of this cycle or anywhere particularly close, which is pretty crazy to contemplate because it's already come pretty far.
Quentin Anthony: (1:19:51) Yeah. One really low-hanging-fruit example of this is that the GPT-3 paper had really specific hidden dimensions. And this was both for quality reasons and for kernel paths. For example, there are lots of powers of 2 in their hidden dimensions, but not quite enough. And everyone else copied the GPT-3 model configurations for a long time without understanding that, oh, these are more efficient on GPUs in addition to being approximately the right size. So this is another thing we bake into our Zamba models: since we're trying to make the most inference-efficient thing on any parallel hardware, very simple tricks, like lots of powers of 2 in your hidden dimension and sizing these different blocks by rounding very slightly to good kernel sizes, will give you speedups on any parallel hardware. Whether it's a multi-core CPU on a phone, or an NPU on Intel, or an M1 for Apple, or something else: if you have parallelizable model sizes, then for the entire lifetime of the model, you're more efficient. And this is something big tech just presents as, Llama 3 has this size, but they don't show you that they went per block and found the efficient kernel sizes for each of these and then baked it into the model. Same with the vocab size: vocabs are all rounded to some multiple of 64 for the same reason. There are just more efficient kernel paths you can go down with that.
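A trivial illustration of that rounding habit (the target sizes below are made up): pad the vocab to a multiple of 64 and nudge hidden dimensions toward power-of-two-friendly values so matrix-multiply kernels tile cleanly on most parallel hardware.

```python
def round_up_to_multiple(x, multiple):
    """Round x up to the nearest multiple (e.g. of 64) for friendlier GPU kernel paths."""
    return ((x + multiple - 1) // multiple) * multiple

# Hypothetical sizes, just to show the idea.
vocab_size = round_up_to_multiple(50_257, 64)    # -> 50304
hidden_dim = round_up_to_multiple(3_996, 256)    # -> 4096
print(vocab_size, hidden_dim)
```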
Nathan Labenz: (1:21:11) My very simple intuition for that would be: if you want to break something out across 2, 4, or 8 GPUs and you have a ninth thing, then if you go 8 wide, the difference between 8 and 9 units is one round of waiting versus two. Are there a lot of different versions of that that you see, or is it really pretty much that simple?
Quentin Anthony: (1:21:39) It's not that simple. So you can decide how baked in your model size is going to be. For example, the thing I mentioned a few times, powers of 2 in your hidden dimension, is true for any parallel hardware. If you train on an AMD MI300X GPU and you have lots of powers of 2, it's probably also going to be efficient on an NVIDIA H100. It's going to be efficient on your phone. But you can get really deep with this. There's something called wave quantization. Let's say you have, I don't know, 100 streaming multiprocessors on the H100. If you have 101 units of work, then you need to do two time steps, right? The first one with 100 and the second one with 1. And the throughput will be roughly half, because that second time step is pretty much empty: about the same amount of work, but two total time steps, so throughput is halved. Whereas if you had 100 units of work, you'd just do one time step. So on a graph with throughput on the y-axis and problem size on the x-axis, you get waves, and each time you kick down on the wave, with lower throughput, you're on a new time step. This gets baked into your models. If I were NVIDIA and I wanted to train a model that's only efficient on NVIDIA hardware, I would bake in very specific sizes that fit the number of streaming multiprocessors on the H100, and its amount of SRAM, whereas on an AMD competitor like the MI300X it would require two waves, or the SRAM would be undersaturated with the model sizes I chose. And the bad version of this future is that everyone builds their own chips and their own models, and everyone produces a model that's only efficient on their own chip and specifically inefficient for everyone else. So it's unclear whether we'll go in this direction of perfect specialization per company, or whether companies will say, okay, I can't crack all of model training, and I need my hardware to be able to run competitor models on our own data. In that case, maybe you'd want the good future where everyone provides similar hardware that can run everyone else's models. It's unclear which way this goes, but it's just another example of: for one, there's a lot of low-hanging fruit, and for two, there's a lot of return you can get by getting out a magnifying glass, looking at every single step of your training process, and extracting everything you can from it.
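The arithmetic behind that example, as a small sketch. The 100-SM figure is just the round number used in the conversation (an H100 SXM actually has 132 SMs); the point is how one extra unit of work adds a whole extra, nearly empty wave.

```python
import math

def waves_and_utilization(work_units, num_sms):
    """Number of waves and fraction of SM slots doing useful work."""
    waves = math.ceil(work_units / num_sms)
    utilization = work_units / (waves * num_sms)
    return waves, utilization

print(waves_and_utilization(100, num_sms=100))  # (1, 1.0)   one full wave
print(waves_and_utilization(101, num_sms=100))  # (2, 0.505) second wave almost empty, ~half throughput
```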
Nathan Labenz: (1:23:59) Yeah. You follow Greg Brockman on Twitter and you realize how much of that he has personally done over the last couple of years at OpenAI, and you're like, yeah, there's a reason they're obviously extremely good at what they do. You can see the rough outlines of how that work is unfolding in his Twitter feed in particular. So I recommend that as a not-very-open window into what's going on, but at least a tiny little peek. Yeah. Jason, what's on your mind?
Jason Meaux: (1:24:30) Yes, just a pleasure, super interesting conversation to hear the ins and outs, the inside scoop on training. I guess I would really be remiss not to do a couple of deep dives with Quentin while we have him, on Zamba 1 and Zamba 2. Maybe let's just start with a deep dive on Zamba 1, if that's okay, and particularly the architecture. It seems pretty straightforward: there are 6 Mamba layers to start off that the input sequence goes into. And then this interesting thing happens that I'd like to talk about. There's this step before the attention block of concatenating the residual stream coming out of the Mamba block with the original input embedding. So this is very interesting: the attention block gets to see the Mamba residual stream plus the completely unmodified input embedding. It feels right that if you're running the attention computation, you'd want to do this, but at the same time, it's interesting. How does one arrive at a feature like that? Is it something where your team said, okay, it's intuitive, so let's try it? Or is it something that, through many, many different experiments, just rises to the top, so it's more experimentally found?
Quentin Anthony: (1:25:41) On the specific concat, if I recall, we did run a lot of experiments, but it was also intuitive, right? Our experiments matched our intuition, which is where we want to be: we want the attention block to be able to see the entire input and the residual. There's really not much more to it. It empirically works, and it intuitively makes sense, so we stuck with it. In terms of Zamba 1 in general, we kind of crystallized more and more of our thoughts on what an inference-optimal model should look like. That's when we discovered that, yes, indeed, attention blocks are correlated; yes, indeed, you can replace them with just one and get pretty much all of the same benefits of attention; yes, indeed, hybrid models with roughly 6 Mamba blocks to 1 attention block are a good split. Let's see, other things. Everyone likes to talk about model architecture, but it's one part of the holy trinity, right? Model architecture is a big one, and we really optimized that for inference efficiency and performance per parameter, as we discussed earlier. Number two is the training framework and process. We learned from the mistakes of BlackMamba about how to account for optimizers and stability, how to tune hyperparameters for exotic model architectures, and we baked all of that in. And then number three is data. Zamba 1 also came out with Zyda 1, which was our training dataset at the time, and really getting quality baked in at scale for this size of model was really important. We have some nice graphs where, at 1 trillion tokens, we're able to perform similarly to models trained on 5, 10, 15 trillion tokens. If you really optimize for quality at this smaller model scale, it really has effects. Annealing was first found with Zamba 1, so this process of two-phase training, we discovered it there. So this was the first pass of everything in our overall training process. Any thoughts, or anything you want me to drill down on more?
Jason Meaux: (1:27:36) Yeah. I guess if we could just walk through the layers, they're pretty straightforward: 6 Mamba layers. And then just a follow-up question on the concatenation. I guess the first thing is, because you're using concatenation, you have to pay for that computationally; the dimension is expanding. Obviously, the trade-off's worth it if you're getting a bump in performance. And then at some point, it's not entirely clear, I guess you're projecting back to the original dimension. So if you could just step through, imagine there's a visual on the screen, step through the Zamba 1 architecture from beginning to end.
Quentin Anthony: (1:28:14) So, yeah, no positional encodings. Then you have an embedding block, and then you alternate between Mamba 1 blocks, because only Mamba 1 was out at the time, and the shared block. So: 6 Mamba 1 blocks, then global attention plus MLP. There's also an MLP in that global block, which is important for people to know. Then 6 more Mamba 1 blocks, the same attention plus MLP, and so on. In terms of the concat, yes, empirically it is better to do the concat. I think we tried to project it back down and not eat that computational cost, and nothing really worked very well, so it's very empirical.
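For readers who want to picture it, here is a rough, unofficial PyTorch sketch of that shared block. It omits norms, the causal mask, rotary details, and Zyphra's exact projection layout; it only shows the concatenation of the residual stream with the original embeddings feeding a shared attention-plus-MLP block that is then projected back to the residual width.

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """Sketch only: shared global attention + MLP over [residual, input_embeds]."""
    def __init__(self, hidden_dim, num_heads=8):
        super().__init__()
        d2 = 2 * hidden_dim                       # width after the concat
        self.attn = nn.MultiheadAttention(d2, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d2, 4 * d2), nn.GELU(), nn.Linear(4 * d2, d2))
        self.out_proj = nn.Linear(d2, hidden_dim)  # project back down to the residual width

    def forward(self, residual, input_embeds):
        # Attention sees both the processed residual stream and the untouched embeddings.
        x = torch.cat([residual, input_embeds], dim=-1)
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # causal mask omitted for brevity
        x = x + attn_out
        x = x + self.mlp(x)
        return residual + self.out_proj(x)
```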
Jason Meaux: (1:28:52) So I guess in the literature, there's a range of results on the Mamba-to-attention layer ratio. Zamba lands on 6 to 1. That's consistent with Zamba 2, so it seems like: don't fix what's already working. But some of the literature suggests wider ranges, up to 7 to 1 or 10 to 1. How did the Zyphra team arrive at the 6 to 1 ratio? Was that experimental?
Quentin Anthony: (1:29:16) That was purely experimental. I also suspect that we have to apply our attention block more often because it's shared, so there's some sort of interplay there. I mean, 7 and 5 were slightly worse, but that's really all it is for us. I think it's similar for everyone else. Jamba, with mixture of experts, clearly has some sort of interplay with how often you apply attention. I'm sure there are some consistent rules here that apply to everyone, but the way it manifests is that we're all empirical; we don't have time to work those consistent rules out cleanly. So, empirically, we find that for us, it's every 6.
Jason Meaux: (1:29:51) Excellent. So, I guess, jumping from Zamba 1 to Zamba 2. The Mamba 2 paper comes out in May, and both of the original authors, Albert Gu and Tri Dao, published that paper. The main result is the SSD algorithm: adding a restriction, which adds structure to the A matrix, the part of the algorithm that influences how the state changes over time. And so now we can do matrix multiplication and take advantage of modern GPUs, get those tensor cores going. So now we can train much faster and, as a consequence, use much larger state sizes. When that paper came out, was this immediately on the minds of the team at Zyphra, the kind of impact it could have? Was Zamba 2 immediately planned? And what were the features of Mamba 2, the SSD, that gave your team confidence it was worth building the follow-up model?
Quentin Anthony: (1:30:47) Yeah. After Mamba 1, we evaluated a lot of architectures. And since Mamba 1 had gained our confidence initially, Mamba 2 was something we immediately wanted to test out, so we ran a lot of ablations of pure Mamba 2. We found that pure Mamba 2 has the same problems with in-context learning and long-sequence dependencies, so maybe we still can't just make a pure Mamba 2 model. But we also verified that we get significantly higher throughput on H100s with our training hardware for no model quality degradation compared to our initial Mamba 1 tests. We also found something a little different from Jamba. Jamba 1.5, or it was either 1.5 or the initial one, I don't remember which, but they found that Mamba 1 was better in a hybrid case, so they did Mamba 1 plus attention. We didn't find that; we found they performed about the same, so we stuck with Mamba 2 plus attention. But, yeah, the Mamba 1 to Mamba 2 change from Zamba 1 to Zamba 2 is literally just throughput: we get much faster models, both at training and inference time.
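As background on the restriction Jason mentions: in very rough form, the selective SSM recurrence and the Mamba-2/SSD constraint look something like the following. The notation is simplified and per channel; see the Mamba-2 paper for the precise formulation.

```latex
\begin{aligned}
h_t &= A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t \\
\text{Mamba-2 (SSD): } A_t &= a_t I, \quad a_t \in (0, 1] \\
\Rightarrow\quad y_t &= \sum_{s \le t} \Big( \prod_{r=s+1}^{t} a_r \Big)\, C_t^{\top} B_s\, x_s
\end{aligned}
```

Restricting the state transition to a scalar times the identity is what turns the unrolled computation into a structured (semiseparable) matrix form that can be evaluated with batched matrix multiplications on tensor cores.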
Nathan Labenz: (1:31:50) I'm not sure if this is the best time to interject, but my ears also perked up at "no positional embeddings," and I was wondering how you understand that working. Is that a consequence of the inherently sequential nature of the Mamba blocks, that they create sequential understanding where the attention mechanism doesn't have it as natively?
Quentin Anthony: (1:32:20) That is our exact intuition, yeah. We found that, with or without them, there wasn't much of a performance difference for the first model with Mamba, so we just went without them. For the second model, for Zamba 2, we actually put rotary positional embeddings in. This is because we found there are some cases where you can make rotary help you a little bit, so we kind of had to figure out how to ask the right questions. If I recall correctly, rotary helps us a bit with long-range dependencies, but it hurts us in terms of context length extension. Rotary is very inflexible with its context length, and that definitely did bite us. In the future, whether we continue with rotary, whether we get rid of positional embeddings again, or whether we go with something like ALiBi or learned positional embeddings instead, I'm very uncertain. But I will say they're all very slight effects, and this is because, as you correctly state, Mamba handles a lot of the position; it already encodes the position. So you don't live and die by the positional embeddings.
Nathan Labenz: (1:33:24) Yeah. The context extension is another thing that I think is really interesting, and that was one of the biggest things that inspired me about the Mamba architecture in the first place. At the simplest level, the state size doesn't grow with the sequence. So that means we can, in theory, have an arbitrarily long sequence and have something, I don't want to say human-like, but on the grand scale of the possibility space of memory, one step toward more human-like, in the sense that our memories are obviously not all-tokens-to-all-tokens, but rather some sort of lossy compression of everything that came before. And I wonder what your outlook for that is now. We've been almost a year since the original Mamba paper. I would say it's definitely shown itself to be a big deal, but outside of maybe Magic (I don't know what they're doing), I haven't seen many millions of tokens contributing to a single long-lived state. You talk about on-device, constant continued pretraining, learning about me, baking that into the weights. That sort of suggests a disbelief in the idea that we can compress my full history into some long-lived state. So what would be your outlook for, not training necessarily, but running my whole Gmail history through a Zamba 3, let's say? Do we have a line of sight to something that could handle 100 million tokens and have this holistic, fuzzy sense of who I am, what I care about, how I respond to things? Or does that still seem too far off to envision a path to?
Quentin Anthony: (1:35:12) Yes. In terms of engineering, there's some work to be done. For example, we have ring attention and tree attention for the attention blocks at these million-plus context lengths. We only have that for attention, not for Mamba, and this is just because the memory at training time is too high to support million-context Mamba. You need some sort of sequence parallelism for the Mamba blocks that just doesn't exist right now, and we're working on it; we hopefully will be the first. From the modeling angle, one thing I'll say is that I don't really want my models to be brain-like. I forget why I'm walking into a room; I can't reverse a string. The cross-sequence dependencies of attention are more powerful than the way your brain works, in a good way, so I still want them to stick around. In terms of how Mamba behaves at super long context, I don't think anyone knows. I agree that it intuitively makes sense, but I don't have a specific bet either way on whether Mamba will do super well or super badly, because there's also something to be said about losing cross-sequence dependencies for Mamba at these really large contexts, which is definitely noticeable right now, and whether attention will make up for those sins at even larger scales, I don't know. We might need more attention, for example, if we want million-token contexts. So I want it to be there, but I think there are a lot of challenges that need to be figured out first.
Jason Meaux: (1:36:37) Yeah, I'll just pick up on one detail you mentioned. Having tried to get longer sequence lengths trained into the original Mamba, you reach a point where you've gone through all the steps you normally would to do long sequences and save memory, and then you memory out. And the next big leap you'd have to do is figure out sequence parallelism. Much less straightforward, it seems, for the original Mamba. I actually beat my head against it. I was like, okay, cool, I'll do a pull request. I think for DeepSpeed there's the Ulysses library, I don't know if you're familiar with it. Okay, cool, let's get a pull request going. I could not even crack 1% of what it would have taken. With Mamba 2, it seems a little more tractable. They even mention it, I think, in the paper, that it's going to be much more transformer-like. Any insights into whether we're going to get the same kind of sequence parallelism that we're used to working with in transformers?
Quentin Anthony: (1:37:36) It will look similar, and I don't want to say exactly how, because I want us to be the first. I do think this has an interesting implication for hybrids, though. As I was mentioning before, on whether attention can make up for it: I see mixture-of-depths on the attention block being the best way to move forward, where the model decides how much attention it actually needs at inference time and also at training time. So maybe if you want a 2 million context of all of Nathan's emails for his entire life, then maybe the model wants attention at every single block. Can we also make the model decide how many LoRAs, or how aggressive the LoRAs need to be, at every block? Dynamic attention is what I'm getting at; it's going to be a big determining factor in how long context for Mamba really works. But, yeah, in terms of sequence parallelism, it's unknown, still being figured out. Mamba 2 definitely made things easier, I will say that.
Jason Meaux: (1:38:27) Excellent. Looking forward to that. Yeah. No worries. I understand you wouldn't wanna share details at this time.
Nathan Labenz: (1:38:32) For anybody who might be a little confused about the whole sequence parallelism thing, can you maybe just give a little bit more intuition for the problem and how this is even conceptually possible to solve because, again, a naive understanding would be like, the Mamba block is sequential. So how are we ever going to get around that? Isn't it generating 1 token at a time? And, like, how do we change the perspective or change the frame on the computation to go from the naive understanding to at least opening up the space of possibility for parallelism in that respect?
Quentin Anthony: (1:39:08) Sure. So sequence parallelism is purely a necessary evil to get around memory constraints. Even though Mamba is generating one token at a time, it needs to store, somewhere, the activations and the input sequence of a million tokens. Even if you only require a constant hidden state, the activations plus the actual input sequence are just too large to fit on even an H100 at training time. Training time is even worse because you have gradients and optimizer states; those have to be in higher precision, for example, and you can't fit it all. And if you don't train on a longer sequence length, then your model doesn't know how to generalize to a longer sequence length, and then you have accuracy problems. At inference time, there will definitely be less sequence parallelism needed, just because there's less pressure on your device memory. But at training time, those are the reasons why we need it. How it actually looks is the thing we're still cracking, but that's why we require it: memory. That's all.
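A back-of-envelope version of that memory argument, with assumed numbers (not Zamba's actual config): even storing just the bf16 residual stream for a million-token sequence at every layer far exceeds a single H100's 80 GB, before counting expanded inner activations, gradients, or optimizer states.

```python
# Assumed, illustrative model shape.
seq_len = 1_000_000      # tokens in one training sequence
hidden_dim = 4_096       # model width
num_layers = 64          # depth
bytes_bf16 = 2

per_layer = seq_len * hidden_dim * bytes_bf16   # ~8.2 GB just for one layer's residual stream
total = per_layer * num_layers                  # ~524 GB across the depth of the model

print(per_layer / 1e9, "GB per layer")
print(total / 1e9, "GB total, versus 80 GB of HBM on one H100")
```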
Nathan Labenz: (1:40:06) Okay. We will stay tuned for Zamba 3 for that.
Quentin Anthony: (1:40:10) Absolutely. Yeah.
Jason Meaux: (1:40:12) Yeah. I guess just one last breakdown, if you could go step by step. The Zamba 2 architecture, it's not just: hey, let's swap in the SSD algorithm rather than the original SSM and call it a day. You're making these other changes, which makes sense; if you're going to make a new architecture, you might as well update a few other things. If you could talk about exactly what you've updated, the LoRA modules on the MLPs, and then what you've done with attention, and all of that collectively. I think the highlight result from the Zamba 2 release is that it sits at this Pareto frontier of efficiency and performance, in certain respects, against the other open models. With your intuition, you already said it: I think you had 3 trillion training tokens rather than just 1 trillion, so a lot goes into this performance boost. But architecturally as well, what do you feel is giving the most significant uplift for Zamba 2?
Quentin Anthony: (1:41:14) Yeah, okay. So that's different per model. Let's start with the 1.2B. For the 1.2B, there is still this global transformer block of attention plus MLP, but if you only have one block, there are correlations across depth, and maybe we want a little more representation. So that's why we have LoRAs on both the MLP and the attention block, and these do two different things. Remember, I mentioned that attention is correlated across depth, but MLPs are not, so we wanted that MLP to do a bit more heavy lifting. We thought there was still room for it to specialize across depth, which is why we introduced that LoRA. I think it does a bit more heavy lifting than the attention LoRA, which is why we only have the MLP LoRA on the 2.7B and 7B Zamba 2 models. We stick with a single global block instead of the ABAB for the 1.2B just because we were parameter constrained; we thought those parameters were better spent on Mamba blocks. That's the 1.2B. The 1.2B is really good if you have one specific thing you want the model to do: 1.2B is enough for summarization on its own, you can 4-bit quantize it, and it's very low overhead on any Raspberry Pi or Jetson Nano in the world. The 2.7B has the ABAB, so there are two global blocks. Only the MLP has the LoRA, for the reason I mentioned, but it specializes more across depth, and for a very small overhead in flops and params we get a bit more expressivity. That was also when we put in the rotary positional encodings; that was just because we got the slight boost I was mentioning at long context. Whether that was a mistake in terms of context extension is still to be determined. That was the main change for the 2.7B, and the 7B is the exact same architecture as the 2.7B, just scaled up. The main differences from Zamba 1 to Zamba 2, if I were to summarize, are that we really tried to investigate the depth relationship with the shared block: how much do you want the model to be able to flex across depth, and what actually provides that flexibility, the MLP or the attention? And then the other two pillars of our training stack: our data, which we improved a lot and then scaled way up. We doubled the size of our annealing set and tripled the size of our phase 1. This helps the model be broader in what it's good at. The first model was a bit brittle; the trunk I was talking about was not super wide. But on 3 trillion tokens, we can anneal and fine tune the model to be much more flexible across a lot of different tasks that our customers and partners wanted, just because it's seeing more data. You can actually look at this in the weights themselves: if you look at the Llama 3 weights, you can see they're much more evenly distributed across the representation; you fill out that weight more, and you can see it. Going from 1 trillion to 3 trillion tokens, we were able to get most of the way there as well, and the model's much more flexible. But, yeah, improving the stack of our model production end to end, with a focus on depth, is what took us from Zamba 1 to Zamba 2.
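A loose sketch (not Zyphra's code) of one idea in that answer: a single shared MLP whose weights are reused at every depth, plus a small per-depth low-rank (LoRA) adapter so each invocation can specialize a little without paying for fully independent MLPs. All dimensions and ranks are hypothetical.

```python
import torch
import torch.nn as nn

class SharedMLPWithDepthLoRA(nn.Module):
    """Shared MLP weights, with one cheap LoRA delta per depth position."""
    def __init__(self, hidden_dim, num_shared_invocations, rank=16, expansion=4):
        super().__init__()
        inner = expansion * hidden_dim
        self.up = nn.Linear(hidden_dim, inner)      # shared across all depths
        self.down = nn.Linear(inner, hidden_dim)    # shared across all depths
        self.act = nn.GELU()
        # Low-rank adapters on the up-projection, one per depth where the block is applied.
        self.lora_a = nn.ModuleList([nn.Linear(hidden_dim, rank, bias=False)
                                     for _ in range(num_shared_invocations)])
        self.lora_b = nn.ModuleList([nn.Linear(rank, inner, bias=False)
                                     for _ in range(num_shared_invocations)])

    def forward(self, x, depth_index):
        # Shared projection plus a small depth-specific correction.
        h = self.up(x) + self.lora_b[depth_index](self.lora_a[depth_index](x))
        return self.down(self.act(h))
```

The design trade-off it illustrates: the LoRA pairs add only `2 * rank * (1 + expansion) * hidden_dim` parameters per depth, far less than duplicating the full MLP, which is the "performance per parameter" middle ground discussed above.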
Jason Meaux: (1:44:25) Yeah. And one thing: on the first read of the blog post, I didn't see the state size of the Mamba blocks in Zamba 2.
Quentin Anthony: (1:44:33) 64. So, yeah, all of those details are on our Hugging Face.
Jason Meaux: (1:44:37) Excellent. Okay. Yeah. I hadn't checked out the model card yet. So 64, that's interesting. You could have trained larger state sizes, of course, inference efficiency being governed by state size. Could you talk through, experimentally, how your team landed at 64? It's actually a little bit lower than my expectation. Is it because the attention blocks are doing so much heavy lifting that there's less pressure for the state size to be larger for the sake of memory?
Quentin Anthony: (1:45:06) This is pretty much right. So, if I recall, larger state sizes are mostly for improving your long-context ability, like, you can more accurately store longer-range state. And we have attention, so it's a less clear benefit for us when we try to increase it. We've tested this, but there's very clear damage at inference time. So 64, we found, was enough for our Zamba-based models. Yeah, just empirically, and it intuitively makes sense for the reasons I just mentioned.
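For readers who want to see where that number lives, here is roughly how a state size of 64 would be set when building a block with the open-source mamba-ssm package. The surrounding dimensions are made up, and the exact constructor arguments are an assumption based on that public library, not Zyphra's own code.

```python
# Hypothetical Mamba2 block with d_state=64, using the public mamba-ssm package
# (requires a CUDA GPU for the fused scan kernels). All sizes are illustrative.
import torch
from mamba_ssm import Mamba2

block = Mamba2(
    d_model=2560,   # model width (illustrative)
    d_state=64,     # SSM state size: larger helps long-range recall, costs inference memory
    d_conv=4,       # local convolution width
    expand=2,       # inner expansion factor
).to("cuda", torch.bfloat16)

x = torch.randn(1, 1024, 2560, device="cuda", dtype=torch.bfloat16)
y = block(x)        # same shape as x: [1, 1024, 2560]
```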
Jason Meaux: (1:45:37) Yeah. Fascinating. That makes total sense. I guess if we could just go deep on this: I haven't seen that many papers or architectures applying low-rank projectors to MLP blocks. It intuitively makes sense that it's a really memory-efficient way to add some specialization into the MLP blocks, because you're not storing a whole other set of weights. It's in alignment with that performance-per-parameter idea. I guess if you could just talk about that idea, what you've seen in the ablations or the experiments you've run. It's a very interesting feature.
Quentin Anthony: (1:46:19) Yeah. It's gonna boil down to what I was mentioning earlier, that the MLPs specifically are definitely not strongly correlated across depth. This even goes back to what we learned with CNNs, where the first few layers learn very general representations and the deeper layers learn more specific representations. In the CNN world: this is a face, this face is a child, this child is crying. You get more and more abstract and specific as you go deeper. And in the language world, this is all manifested in the MLPs. So it's important that the MLPs aren't forced to learn the same notes and contextualize themselves identically across depth. Attention is learning those notes anyway and reminding the MLPs that these are all the notes we have across all tokens of the sequence, but the MLPs are learning different, deeper relationships across depth. All of this is to say, that is the intuitive reason why the MLP that we were carrying around anyway in this shared block should probably have some expressivity across depth. And LoRA is a good way to get some of that expressivity, but with lower memory and compute cost, which is really what we want. If I were not bound by memory and compute, then I would probably just have independent MLPs and a shared attention. But this is a good middle ground. And then, empirically, it improved accuracy across depth, so we stuck with it.
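As a rough back-of-the-envelope illustration of why a LoRA is the "good middle ground" here, compare the parameter cost of fully independent MLPs per depth against a small low-rank adapter per depth. The width, hidden size, rank, and depth count below are made-up numbers, not Zamba's actual dimensions.

```python
# Rough parameter-count comparison (illustrative numbers, not Zamba's real dims).
d_model, d_ff, rank, n_depths = 4096, 4 * 4096, 8, 6

full_mlp_per_depth = 2 * d_model * d_ff          # up- and down-projection weights
lora_per_depth     = rank * (d_model + d_ff)     # A: d_model->rank, B: rank->d_ff

print(f"independent MLP per depth: {full_mlp_per_depth / 1e6:.1f}M params")
print(f"MLP LoRA per depth:        {lora_per_depth / 1e6:.3f}M params")
print(f"savings across {n_depths} depths: "
      f"{(full_mlp_per_depth - lora_per_depth) * n_depths / 1e6:.1f}M params")
```

With these toy numbers, each depth's independent MLP would cost about 134M parameters while a rank-8 adapter costs about 0.16M, which matches the memory-and-compute trade-off Quentin describes.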
Jason Meaux: (1:47:43) And is it the kind of thing where you also ran the experiment with independent MLPs, or do you just know that's not feasible for the constraints you want for the Zamba model, so you don't even bother running that ablation?
Quentin Anthony: (1:47:56) Yeah. Unfortunately, that's just not on the table. Like, that model is not in the cards for us, so I would hope someone does it. Ideally, I'm sure we would get some pretty big boost from doing that, and if I had more parameter budget, I would go to full MLPs.
Jason Meaux: (1:48:10) Yeah. Yeah. Makes sense.
Quentin Anthony: (1:48:14) But the specific ablation of going from no MLP LoRA to MLP LoRA is worth the jump in params and flops, for sure, in terms of loss.
Jason Meaux: (1:48:24) Very good. Interesting. I guess a little bit more of a general question: if you read the literature, it seems like almost every time someone tries some type of Mamba attention hybrid, there's this unreasonable effectiveness, at least across the metrics that they're measuring. We talked about how loss can be misleading; it's not always a measure of how well that model will actually perform, but it maybe tells you at least something, and then you can go down the list and look at other evaluations. It's not surprising to see that a Mamba hybrid model would beat Mamba by itself, if you think about the limitations of a fixed state size and the way information would propagate through that network. But it always surprises me when I see, all other factors held equal, that it's beating something with full attention blocks, a full transformer. Do you have any insight around that, how these dynamics are playing together? Almost like maybe it's greater than the sum of its parts.
Quentin Anthony: (1:49:28) When you're doing hybrids, I really think it's powerful to boil it down to tracking qualitative differences in how well your model is encoding cross-sequence dependencies and how well it's projecting them, mixing heads to find deeper relationships from those notes about the cross-sequence dependencies. Mamba does both. MLPs just take the notes and find deeper relationships from them, like, they're projecting onto some abstract feature space, and attention is just finding the dependencies themselves and not doing much else with them. At least in terms of how we approach things at Zyphra, it really just boils down to this critical intuition. And, for example, that's how we can explain why BlackMamba struggled with long sequences: Mamba's notes to itself about cross-sequence dependencies are not rich enough, so you need some attention. The unreasonable effectiveness comes from reaching the critical point where everything is balanced, where you are not spending all of your flops and memory on taking notes and then doing nothing with them, like tiny MLPs with big attention. Nor is it a weaker cross-sequence relationship from Mamba and then big MLP projections, where you're trying to find really deep things in very minimal notes. If you find the balance of these two, that's when you get really strong model architectures. And, eventually, I think we'll probably converge to a better way to do both, where maybe we boil down attention to, like, maybe you can have a retrieval head on a Mamba block, whatever that may look like, to help you patch the problem with long sequence dependencies. Maybe there's some unified block that finds the exact mixture of projections and note taking across cross-sequence dependencies, and you just scale it up and you're done. That seems to be the case, right, based on this intuition that you just need a balance. But, yeah, I would still tie it back to this: you need balance between these two effects, and these two effects are what you need for language modeling. It differs per modality.
Jason Meaux: (1:51:28) Yeah. Very interesting. Do you get the sense, and Nathan alluded to this, we see several iterations of Mamba hybrids out there. But at the same time
Quentin Anthony: (1:51:39) Mhmm.
Jason Meaux: (1:51:41) It seems like there's some lock-in, maybe even at some of the private labs. Are there lock-in effects where, once you commit to a certain type of architecture, even though something else begins to be promising, you almost ignore it purposefully to some degree, just because you've invested so many resources in a particular path? And do those lock-in effects make it more unlikely that a lab that's been committed to a completely different tech tree for years is gonna just jump ship and start over? Are those lock-in effects there?
Quentin Anthony: (1:52:17) The lock-in effects are stronger here than in most places. One thing is that as your lab continues, you have more and more ablations that you can look back on, and when you jump ship to a new architecture, you have to redo a bunch of those ablations. Right? Another thing is, we were talking about evaluations: it's really hard to evaluate specific tasks on very few tokens. Meaning that, for example, if you have two architectures, how are you gonna test how well they do on MMLU, on in-context learning? You have to train for hundreds of billions of tokens, right? And I can't just take a pretrained checkpoint and test both architectures, because you have to train these architectures from scratch. So predicting how well a model is gonna do on those emergent evaluations is a real leap of faith on new architectures that a lot of people in labs just don't wanna take. Like I mentioned, llama.cpp and Ollama, the whole hacker space, and every serving framework is built around transformers. It's slowly changing to support hybrids and SSMs and such, but do I really wanna relearn fine-tuning? Do I really want to relearn efficient CPU inference, or what efficient set of fused kernels will work at inference time? Do I really wanna get it into vLLM and TensorRT? Like, we're going through that pain, but I can definitely see why a big company with even more technical debt and more ablations would hesitate. And interpretability is another one. Big labs have really finely honed interpretability, especially somewhere like Anthropic or OpenAI; they can interpret really well what these models are doing under the hood, way beyond the intuitions that I have right now. If you jump into a new model architecture, you start from scratch with all of that in some cases. I'm sure there's some carryover, an attention block might act similarly when it's in a hybrid versus when it's in a dense transformer, but there's still a lot of resistance to the new idea. But, yeah, totally in agreement with your point. These are all examples of why I would probably not wanna jump into hybrids right away myself if I were big tech or had a big existing presence in dense transformers. There's a reason they're moving slowly.
Jason Meaux: (1:54:27) Yeah. That makes sense. So lock-in effects are a way in which you commit, but obviously, the longer you commit, you're building out not just code but best practices. If you imagine Zyphra developing this architecture over the next 5-plus years or longer, I guess the positive side of lock-in is that you get really good at managing an architecture that maybe not many others are managing. Do you have any insights as to what that could look like for you and Zyphra and the hybrid world?
Quentin Anthony: (1:55:06) There's some first-mover advantage here, where we can kind of steer the story. Like, we have internal inference frameworks that we've already ported, so if you want a Zamba model, we can run it for you, and with you, the fastest compared to everyone else. If you want community engagement, this is a negative. I think we're nimble enough to avoid the pitfalls of being locked in, so we have some very non-Zamba architectures that are cooking right now. As long as they are efficient on device and punch way above their weight, we're gonna look at whatever that architecture is, and we're gonna try and find creative ways to make it train well. In terms of avoiding lock-in, it really just comes down to effort: you need to figure out fine-tuning again, you need to figure out sequence parallelism for every new architecture you make, that kind of thing, and putting the effort in is our way forward. Like, we don't have any super crazy intuition behind it. You just have to grind it out.
Jason Meaux: (1:56:04) Nice. Yeah. That makes sense. I guess related to that, and this might be part speculation, so I apologize for that, but it seems like you do have some real-world data, given that Zyphra is developing not just one flagship Mamba hybrid model but various parameter counts, various sizes. Do you have any sense of the scaling properties of this kind of model architecture? You have Chinchilla scaling laws; we know what that looks like for base transformers. Do you have any sense of what the scaling laws could look like for Mamba hybrid models? Are they even that much different from what we've seen in transformers?
Quentin Anthony: (1:56:48) I think they're pretty similar. We just have a slightly better slope for a lot of these, I would say. I am curious to see how shared blocks affect Chinchilla, because we reach Chinchilla optimality way faster on the shared blocks. So if we train for 15 trillion tokens, at what point does the attention block saturate? It's kind of a similar thing to MoEs; I don't think anyone has really cracked scaling laws for MoEs. You can think of MoEs as a similar case, where some blocks are getting trained more than others. I really don't see any reason why a Zamba architecture wouldn't scale well to 20B, 40B, 70B, and I think we, at some point, will try this ourselves to get distillation in-house. But those are my only hot takes. I think they'll look pretty similar, just a little bit better.
Jason Meaux: (1:57:41) That's awesome. This is maybe a big-picture question. What does it take to do foundational ML research? And in particular, what is the kind of approach you and your team have? Because Zyphra seems fairly interesting: it's a company, but it also seems to be a lab doing what I would consider pretty clearly research. So it's not solely product focused, but there are commercial goals. What is the process? Is it a lot of open-ended search within the team, or are there always specific objectives with the commercial interests in mind?
Quentin Anthony: (1:58:17) We have research guided by specific objectives. One of those objectives is efficient inference. You get a big first-mover advantage if you come up with a novel architecture that actually runs efficiently on device, because now everyone else either has to use your model or figure out all of the arcane knobs that you had to turn to get your model to work. We're sort of an applied research lab. We're not, like, making Mamba 3. Right? We see all of the available blocks and training processes and LoRAs; every element of this Zamba model is built by someone else. But recognizing what all of those knobs do, and then turning them in a way that gets you the same modeling performance as a dense transformer but with much more amenable on-device performance, and even cloud performance, is the research that we're doing. How do you take everyone else's work and put it together in a way that's perfect at inference time? And then this ties into the product story: if you have that inference-efficient and training-efficient model, it's way easier. It is now achievable to make it personalized to everybody. It's achievable to put it on a phone. It's achievable to personalize it to enterprise applications and those sorts of things. It's definitely guided by product.
Jason Meaux: (1:59:35) Yeah. That makes sense. And this is just a side question. When I think about the properties that Mamba has, but in particular these hybrid models, and I've had increasing interest in robotics, I feel like that could be the next big market. It's not really a market right now. And what you need to run models locally ties in almost directly with what your lab is focusing on. Obviously, robotics usually involves a completely different stack of data, so it's multimodal; you have things like the ACT transformer, and it'd be interesting to know what that would look like for Mamba. If you were to apply your work, which is primarily focused on NLP, how big of a jump is it to now be looking at robotics use cases, or is that a bigger jump than what I'm thinking in my head? Do y'all ever think about things like that?
Quentin Anthony: (2:00:30) Oh, definitely. So robotics, gaming, there are all kinds of markets where it's really nice to have flexible models on device. For robotics specifically, if it's just, like, chat with humans and maybe the ability to do visual question answering, to recognize and respond to what the robot is seeing and hearing, that's pretty easy. That's pretty achievable, and I think we probably have the best offering for that right now. Once we get voice-to-voice done, then, yeah, a robot walking around that can talk to you and is totally on device is, like, nothing for us. What else would be required? It would all be very low power as well, so you could get it running on a Raspberry Pi or a Jetson. I think there are a lot of cases with, like, smart cities. You could put this on a Tesla, you could put it in a self-driving car, and just have a chat interface: run the windshield wipers, speed them up, run them at the speed that I like. All of these sorts of things become possible when you have a model architecture like Zamba that's multimodal and voice-to-voice, and that's why we're pushing for it so hard.
Jason Meaux: (2:01:36) Yeah. That's awesome. I guess just one more practical question: oftentimes, if you look at the adoption of models, it comes down to just how easy they are to use, and that often involves how compatible they are with the transformers library from Hugging Face. I guess if you could just talk about that. If I wanted to do some really interesting experiments with Zamba 2 right now, whether that was fine-tuning, whether that was trying some extended pretraining stuff, how easy is it to use? What do I actually do? What libraries do I need to get into to use Zamba?
Quentin Anthony: (2:02:22) So right now, you're restricted to Hugging Face transformers. Zamba 1 has been upstreamed now; it's in transformers. You can use it with Accelerate and PEFT and all of the Hugging Face class of frameworks. For Zamba 2, we have our own fork of transformers. It's currently getting upstreamed, but there's always just a lot of compatibility work with this kind of stuff. You can get basic functionality right away, which is why we have our own fork, but getting it really solidly ingrained in the entire Hugging Face ecosystem is the hard leg of the journey that we're still working on. And we also have a pure PyTorch implementation for both Zamba 1 and Zamba 2, so people who have their own custom pipelines and wanna port it to their own internal inference framework can do that; we tend to find people like that a lot. It's also more performant to have pure PyTorch than to put it in, like, the Hugging Face ecosystem: you can compile it with torch.compile, and you can create custom Triton kernels very easily. Going forward, I think we're about to finally become more ingrained, because we're about ready to crack GGML and llama.cpp and Ollama and that entire ecosystem. We've pretty much finished porting Zamba 2. I struggle to put a date on it, I don't wanna miss a date, but, like, in the next couple weeks, I think people will actually be able to use this seamlessly with vLLM, Ollama, llama.cpp, GGML, and the entire ecosystem. I think we're right about to jump on that too.
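For orientation, here is roughly what the Hugging Face path looks like. It assumes you have installed Zyphra's transformers fork (or, for Zamba 1, a sufficiently recent upstream transformers) and that the checkpoint id shown matches what Zyphra publishes on the Hub; the repo id below is illustrative.

```python
# Minimal sketch of running a Zamba 2 checkpoint through the transformers API.
# Assumes Zyphra's transformers fork is installed and that "Zyphra/Zamba2-7B"
# (illustrative id) matches the published checkpoint on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-7B"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("State space models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```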
Jason Meaux: (2:03:44) Very interesting. So the choice to fork transformers is simply that you'd like people to be able to use it immediately, and there are still some kinks to work out
Quentin Anthony: (2:03:53) Yeah.
Jason Meaux: (2:03:53) To get it integrated. That's essentially it. Right?
Quentin Anthony: (2:03:57) Yeah. And we understand that, like, people will balk when they see, oh, it's a fork? Done, I'm not jumping into it. But we produce the best model in the weight class, so there's definitely incentive. For those who want to use it, you've gotta use the fork. But for all of those who are more focused on the fastest way to get this running on their laptop or something, you will very soon be able to actually do that. But currently, the Hugging Face fork or pure PyTorch are your options.
Jason Meaux: (2:04:27) And what would be the most performant, both for inference and for, let's say, if I wanted to do fine-tuning or training? If I use one of those training frameworks, is it fully optimized, or are there some trade-offs where I have to make sacrifices for compatibility? Should I go straight to PyTorch, or could I use those frameworks without it coming with costs?
Quentin Anthony: (2:04:50) Yeah. So for small scale, and by small scale, I mean fine-tunes. If you're at the pretraining scale, you've gotta have your own thing. You've gotta have your own GPT-NeoX or Megatron-DeepSpeed or Megatron-LM; you've gotta have a pretraining framework, and those are not super Mamba friendly right now. Our stuff is in-house, unfortunately, right now. It's just the name of the game; people watch your PRs otherwise. In terms of fine-tuning, at the fine-tuning scale, Hugging Face is totally fine. We even use Hugging Face internally to do some of our fine-tunes, some of our context length extensions. When you're working on, you know, a couple billion tokens up to 10 billion, 50 billion and below, Hugging Face is totally fine, and we recommend people use that. If I had, like, a mid-range kind of thing, maybe 50 billion or so, 50 to 100 billion in that range, I would try to wrap the pure PyTorch implementation of Zamba in some simple data loader and just try and get that going. But I think most people fall into the fine-tuning range, where Hugging Face suffices.
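At the fine-tuning scale Quentin says Hugging Face is fine for, a run might look something like the sketch below. The dataset, checkpoint id, and hyperparameters are placeholders for illustration, not Zyphra's recipe.

```python
# Hedged sketch of a small fine-tune in the Hugging Face ecosystem.
# Checkpoint id, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Zyphra/Zamba2-2.7B"          # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:          # many causal LM tokenizers ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)
ds = ds.filter(lambda ex: len(ex["input_ids"]) > 1)   # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zamba2-sft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           bf16=True, learning_rate=1e-5, num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```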
Jason Meaux: (2:05:54) Very good. Makes sense. Excellent. And if people wanted to start experimenting with fine-tuning today, right now, I believe there's a GitHub repo linked from the blog where I can access the fork of the transformers library. If someone is motivated, they could spin something up today. Is that correct?
Quentin Anthony: (2:06:15) Oh, yeah. Definitely. It's not hard to just build our fork and load our checkpoint. And then if you just wanna talk to the model, we had a day-zero release with NVIDIA as a NIM, an inference microservice, and we self-host it, like the Maya endpoint on our website. So if you just wanna talk to the model, use one of those. If you wanna spin it up, you can do it today with our fork.
Jason Meaux: (2:06:37) Excellent. So I guess we've been talking a lot about Mamba and the Zamba models, but Zyphra also published a very interesting paper recently, the tree attention paper, and we barely talked about it. I've not been able to dedicate myself to a complete deep dive, but my high-level view is that it's very interesting in the sense that it seems, in some aspects, to improve upon ring attention, which was the huge thing earlier this year, when it seemed like everyone was training million-plus context length Llama models. If you could just talk about tree attention: how is it different from ring attention, and then what impact can it have? What is it really pushing on?
Quentin Anthony: (2:07:20) Sure. So ring attention is good when you have a network topology that's a bit more flat. So, like, a mesh topology, like TPUs and stuff, I think, is what it was designed more for, because you have a ring with point-to-point links across each GPU in the ring. If you're in GPU land, you have two-level topologies, where on a single node you have NVLink, which connects all of your GPUs with really high bandwidth, but across nodes you have a much slower InfiniBand, or maybe you have RoCE, a converged Ethernet thing. That's an order of magnitude lower bandwidth than the intra-node interconnect. So there's a reason why they only did up to 2 nodes in the original ring attention paper, and it's because you're bottlenecked by this cross-node link. The boundary GPUs at the cross-node link are gonna dictate how quick your inference speed is, and it's not great. As you scale up to more and more nodes, this becomes worse and worse. If you have smaller KV states, then the compute is not enough to overlap that expensive communication, so it's not scalable. The whole point of tree attention is to recognize this fact, that two-level topologies are not good for rings, and to reformulate the distributed attention operation as an energy function, which allows you to use allreduce communication operations instead of point-to-point. Allreduce operations exist in communication libraries: NVIDIA's is called NCCL, the NVIDIA Collective Communication Library, and AMD has RCCL, the ROCm communication library. Allreduces themselves are topology-aware for GPU clusters, so they account for this two-level topology, and they also let some of the computation happen in-network. With an allreduce operation, you can do some of the sum in the network card, and you can overlap that sum computation, that kernel, be it on the GPU or on the card if you're doing SHARP, with the communication of those KV states. So, basically, it can scale. In short, it can scale on two-level topologies because you're reformulating to an allreduce, which is inherently topology-aware, and some of the compute is easier to overlap, versus point-to-point operations, which are restricted and sort of assume a flat network topology.
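A stripped-down way to see the allreduce reformulation: because softmax attention can be combined through an associative, logsumexp-style reduction, each rank can attend over its local KV shard and then merge partial results with max- and sum-allreduces instead of circulating KV blocks point to point around a ring. The sketch below shows that reduction for a single decode query; it is an illustration of the general idea, not Zyphra's actual kernel, and it assumes a process group has already been initialized with one KV shard per rank.

```python
# Hedged sketch of allreduce-based distributed attention for one decode query.
# Assumes torch.distributed.init_process_group(...) has been called and that each
# rank holds its own shard of the keys and values (k_local, v_local).
import math
import torch
import torch.distributed as dist

def sharded_attention(q, k_local, v_local):
    """q: [d]; k_local, v_local: [n_local, d], sharded across ranks."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (k_local @ q) * scale                   # [n_local] local attention scores

    m = scores.max()                                 # local max for a stable softmax
    w = torch.exp(scores - m)                        # local unnormalized weights
    num = (w[:, None] * v_local).sum(0)              # local numerator, shape [d]
    den = w.sum()                                    # local denominator, scalar

    # Global max via allreduce, then rescale the local partials to that max.
    m_global = m.clone()
    dist.all_reduce(m_global, op=dist.ReduceOp.MAX)
    num = num * torch.exp(m - m_global)
    den = den * torch.exp(m - m_global)

    # Sum-allreduce the rescaled numerator and denominator. These collectives are
    # topology-aware in NCCL/RCCL, unlike the point-to-point sends a ring needs.
    dist.all_reduce(num, op=dist.ReduceOp.SUM)
    dist.all_reduce(den, op=dist.ReduceOp.SUM)
    return num / den                                 # exact softmax-attention output
```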
Jason Meaux: (2:09:34) So I guess, if you could help me, maybe let's begin with the end in mind: what will that enable me to do that ring attention would struggle with? In what cases will tree attention help me?
Quentin Anthony: (2:09:47) So this all boils down to parallelism, and whenever you have parallelism, it boils down to memory. GPUs have limited memory. If you have long context, say I wanna do a million context on a Llama 8B, or if I have a lot of parameters that I need to store in memory at inference time, then I need more GPUs. For both of those cases, super long context or super large models, you need lots of GPUs, more GPUs than are on a node. So you're going to 2 nodes, 4 nodes, etcetera. When you're going across nodes, you have a two-level topology, and allreduces work better. So if you wanted a 16-node context extension training run with, like, 2 million context, tree attention will work much better. If you have an inference run of Llama 405B on older GPUs and you need 4 nodes or 2 nodes, tree attention will scale much better on that. Anytime you're scaling across nodes on GPU clusters, or have any imbalance of communication costs, tree attention will win.
Jason Meaux: (2:10:51) Interesting. Okay. So this is not necessarily something where we had ring attention and all of a sudden everyone's just gonna implement tree attention. It's a case-by-case basis of which one you might use if you were to try to do multi-node training and, you know, distribute the memory.
Quentin Anthony: (2:11:08) Yeah. So there are cases where, if you have enough compute and your communication costs perfectly overlap one another, so communication is effectively free, and you're on a single node, ring attention's pretty good. Even though allreduces are really closely optimized, because they're used in training so much, you can't really do much better than free, totally overlapped communication. Yeah, not everyone will benefit right away, but those cases where you have a smaller model and longer context, which translates to smaller KV states, mean you're gonna be communication bottlenecked; there's not much compute to hide the communication. So those people will benefit from us, and then the people with really large models that inherently span nodes will benefit from us. But, yeah, you're right. It's not every single case.
Jason Meaux: (2:11:54) Mhmm. For, say, a very, very long training run on very, very long sequences, would it help only if it was a multi-node training setup, or does it not come into play?
Quentin Anthony: (2:12:05) It would help more. It would help more. Even on a single node, allreduces are a little bit more effective than point-to-point, just because collective libraries today are really closely tuned for that specific operation. Like, we're kind of exploiting the fact that allreduce is much better optimized than point-to-point, because you don't really use point-to-point much in training unless you're doing, like, pipeline parallelism, which is less common than tensor parallelism or data parallelism, which are both allreduce-based. So, yeah, again, it'll help you a lot more at scale. It might help you a little bit at smaller scales, but I wouldn't promise big speedups.
Jason Meaux: (2:12:42) Okay. No, that's very interesting. I guess, just because we hit that topic, I thought of another question. Zamba 1, I believe, was natively trained with, if I'm reading this correctly, a 4,096-token context length. Did Zamba 2 train at the same native context length, or is that higher?
Quentin Anthony: (2:13:03) Yeah, same. And then we were able to extend to the 16k range for free just by interpolating the rotary positional embeddings. Getting beyond that is an open question that we're still working on. I think we have a path now for 64k and 100k, which should be coming out soon, especially for the smaller models. Because remember, I mentioned it's a memory overhead thing at the end of the day, right? So, like, Zamba 2.7B, we can go to way longer context than this; the engineering challenge is just to get us to a million context. Still working on that. But, yeah, trained on 4k.
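The "free" 16k extension he mentions is position interpolation on the rotary embeddings: squeeze the extended positions back into the range the model was trained on, so the rotation angles the model sees stay familiar. A minimal sketch, with illustrative lengths and dimensions rather than Zamba's actual settings:

```python
# Hedged sketch of rotary position interpolation for context extension.
# train_len, target_len, dim, and base are illustrative values.
import torch

def rope_angles(positions, dim, base=10000.0, train_len=4096, target_len=16384):
    # Position interpolation: scale positions down by train_len / target_len so
    # the extended window maps back onto the rotation angles seen in training.
    scaled = positions.float() * (train_len / target_len)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(scaled, inv_freq)      # [seq_len, dim // 2] rotation angles

angles = rope_angles(torch.arange(16384), dim=128)
cos, sin = angles.cos(), angles.sin()         # fed into the usual rotary application
```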
Jason Meaux: (2:13:37) Yes. Yes. So, just because it's something I've experimented with on Mamba 1: basically the idea of, okay, we're natively trained at even a 2,000-token context length, but I want something capable of much more, so I'm just gonna do continuous pretraining, and now all of a sudden my data distribution changes.
Quentin Anthony: (2:13:53) Mhmm.
Jason Meaux: (2:13:54) If you wanna do it smart, maybe there's a curriculum learning approach where you don't just throw that much more context length at the training all at once. But it seems there are all these papers, and my own experimentation is that it works pretty well to do continuous pretraining to expand context length. To what extent is that a solution? If somebody wanted to grab, let's say, the Zamba model, and we even see this with some of the transformer Llama models, do you have any thoughts on continuous pretraining to extend context length? Is it sort of a hack where you're just not gonna get the quality that a natively trained model would have?
Quentin Anthony: (2:14:30) No, you can get the quality. That's what we're doing ourselves. So, yeah, definitely continue to pretrain. You definitely need to do a curriculum where you're warming up the sequence length and batch size across your continual pretraining run. But eventually you're gonna hit a memory wall, and that's really all I'm trying to say here: you've gotta finish up sequence parallelism for the Mamba blocks as well if you wanna get past that memory wall. But it totally works.
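A toy version of the warm-up curriculum he describes might look like the following: step the sequence length up in stages across the continued pretraining run while shrinking the batch size to hold tokens per step roughly constant. All numbers are invented for illustration.

```python
# Hedged sketch of a sequence-length warm-up curriculum for context extension.
# Stage lengths, step counts, and the token budget per step are made-up numbers.
stages = [
    # (sequence_length, num_steps)
    (4_096,  2_000),
    (8_192,  2_000),
    (16_384, 1_000),
    (32_768, 1_000),
]
tokens_per_step = 4_096 * 256   # keep this roughly constant across stages

for seq_len, steps in stages:
    batch_size = max(1, tokens_per_step // seq_len)
    print(f"train {steps} steps at seq_len={seq_len}, batch_size={batch_size}")
    # ...run the usual training loop here with these settings...
```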
Jason Meaux: (2:14:55) Yeah. Absolutely. So, playing around with the 2.8 billion parameter model, I could never figure out sequence parallelism. I used the DeepSpeed library because it just makes it easy to do all the ZeRO-3 stuff. Of course, you pay a tremendous cost in training efficiency when you start offloading everything to CPU memory. So sequence parallelism is the big unlock. That's gonna be exciting. Okay, cool. Does Zyphra have plans to open source that once it gets cracked?
Quentin Anthony: (2:15:23) Sequence parallelism for Mamba will definitely be open sourced. The Zyphra-specific training stack that we're using to produce Mamba models will probably not be open sourced. But the general process, and, like what we did with tree attention, the kernels that we use, the scripts that we use, those will all be open sourced, because we want the community to move towards Mamba and see that there's something there.
Jason Meaux: (2:15:43) Yeah. Awesome. Great to hear. Anything else we didn't cover about Zamba? The timing is right around the Zamba 2 release. Anything else you think we should cover before we wrap?
Quentin Anthony: (2:15:53) No. I think we've covered pretty much everything.
Jason Meaux: (2:15:55) Appreciate your time, Quentin. It's absolutely great speaking with you.
Nathan Labenz: (2:16:00) Likewise. So let's do this zoomed-out one. I think this has been fantastic. This is for the real ones out there who are, you know, very interested in learning more about what it actually takes to make these models work and just how many little nitty-gritty details go into that. I've learned from this, and I think a lot of people will appreciate all of the lessons that you've shared from many long days, and probably some long nights, working on this stuff. It's been great. Let's do the zoomed-out thing. With all that in mind, where is Zyphra going as a company? What role do you guys wanna play in the lives of users? You talked a little bit about how you see yourselves competitively against the big guys, but I'd love to hear, on whatever time scale, what you think my AI-assisted life is gonna look like and what role Zyphra and Zyphra models will play in that.
Quentin Anthony: (2:16:52) Yeah. Definitely. We've focused really hard on producing the best models for one modality, and now we're ready to expand out. So this includes other modalities. We want, like, visual question answering; I want to be able to live edit my pictures; I want to be able to talk voice to voice. Like I talk to you voice to voice, I wanna talk to my AI voice to voice as well. This includes actually deploying personalizability to people. This includes actually launching, like, a Maya, both enterprise and on-device for consumers. This includes the broader ecosystem, for example Ollama and llama.cpp and stuff. Those are all transformer based, and getting the hybrid architectures into those is one of our next priorities, so that everyone can actually deploy the model really quickly instead of using our fork of Hugging Face and all this sort of thing, and so there's a much better user experience for hackers and for consumers. And then there are also some higher-level things. We think memory, so retrieval, is really important for personalizability, so we're going to look at how retrieval interplays with long context and reinforcement learning. It's not just continual pretraining on your phone; we definitely need some sort of approximation, at least, of RLHF or RLAIF for users so we can really extract as much as possible from the little data that we're going to get per user. There are a lot of ways to expand, but those are some of the main ones.
Nathan Labenz: (2:18:14) Cool. I love it. For now, I'll say: Jason Meaux and Quentin Anthony from Zyphra, thank you both for being part of the Cognitive Revolution.
Quentin Anthony: (2:18:23) Thanks, Nathan.
Jason Meaux: (2:18:24) Thanks.
Nathan Labenz: (2:18:25) It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.