Meta's MEGABYTE Revolution with Lili Yu of Meta AI
Nathan Labenz and Lili Yu discuss MEGABYTE, a transformative AI architecture for predicting million-byte sequences without tokenization.
Watch Episode Here
Video Description
Nathan Labenz sits down with Lili Yu, a research scientist at Meta AI, to discuss the paper she authored: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers. In this conversation, they discuss the architecture and breakthroughs of their research, and the opportunity to eliminate the need for tokenization.
LINK:
MEGABYTE Paper: https://arxiv.org/pdf/2305.07185.pdf
TIMESTAMPS:
(00:00) Episode preview
(07:41) Takeaways from Lili Yu's paper: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
(17:00) Architecture
(24:59) Embeddings
(27:43) Different local models
(34:23) Encoder model
(36:35) Transformer Architecture
(48:10) Choosing patch size
(01:08:21) What happens when you scale up?
(01:19:20) Big picture for Meta AI
(01:22:57) Responsible AI
(01:27:02) China and AI
TWITTER:
@labenz (Nathan)
@liliyu_lili (Lili)
@eriktorenberg (Erik)
@cogrev_podcast
SPONSOR:
Thank you Omneky (www.omneky.com) for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
MUSIC CREDIT:
MusicLM
Full Transcript
Transcript
Lili Yu: 0:00 To model a 600 by 600 image, you have to have 1,000,000 tokens, and current architectures just cannot support it. That naturally introduces a different problem: we need a new architecture to solve this. And that's why we have this very efficient way of modeling, involving a multiscale transformer. We are very excited about being able, in the future, to directly model the raw format of any file, any input, any modality. I think that's a really exciting direction for us.
Nathan Labenz: 0:33 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg. Hello, and welcome back to the Cognitive Revolution. Today, my guest is Lili Yu, research scientist at Meta and author of the recent hit paper MEGABYTE: Predicting Million-Byte Sequences with Multiscale Transformers. I first encountered this paper via a viral tweet from OpenAI's Andrej Karpathy, who called it a promising way to potentially move language models beyond tokenization. If you're not familiar with the term tokenization in the context of language models, it's the process of breaking natural language down into a fixed vocabulary of frequently occurring strings. In the case of GPT-3 and GPT-4, this vocabulary consists of more than 50,000 words, word parts, numbers, and symbols. All language model inputs are chopped up into these tokens prior to embedding into numbers, and all next token predictions are selected from this set of tokens as well. In other modalities, there are other conceptions of tokens. Image tokens, for example, might represent small squares within the larger image. Tokenization is used because existing AI architectures struggle to work with really long sequences of data. With the best performing methods available today, the compute costs ultimately get out of control, and there is a GPU shortage, as you may know. So since sequence length is limited, some higher level, more semantic compression of the data is necessary. And tokenization is the first-pass, still sort of hacky way that that's currently done. However, it does cause a bunch of problems, which Karpathy explains a bit in the tweet and also links to deeper reading on if you're interested. Suffice it to say for our purposes that it gets super weird. So weird, in fact, that there is a redditor whose username has been immortalized as part of the GPT-3 vocabulary, and there are other super rare tokens that famous models seemingly can't be made to say at all. Karpathy summed it up by saying, the list goes on. TLDR, everyone should hope that tokenization could be thrown away. Maybe even more importantly, we may find general purpose strategies for multiscale training in the process. And that's what we're talking to Lili about today, because in her strategy to eliminate the need for tokenization, she just might have made a more fundamental contribution. I'm not an algorithm expert. As you'll hear, I ask some pretty simple questions. But recently, I have come to believe that given the presence of web scale data and web scale compute, it was really only a matter of time until somebody figured out a workable algorithm. Transformers are just one architecture, as the human brain is just one architecture, and neither is the end of history. The basic idea of this research is that the MEGABYTE architecture operates at multiple scales. Unlike in a single transformer, where all the layers tend to be the same size, in the MEGABYTE architecture, information is first encoded in patches, then there is a global model that shuffles all the information around, and then the final byte level predictions are again made by separate local patch models, which can be run in parallel. It seems this architecture just might have it all.
For starters, yes, the byte level prediction eliminates tokenization. Now the model looks at everything as raw bytes, and it's always predicting just the next byte. There are only 256 possible bytes, just 2 to the 8. So each one is ultimately just a series of 8 zeros and ones. It's kind of crazy low level to me, at least, if you think about it that way. But because everything is bytes, music, video, text, all of these are bytes on some level, if this does work super well, it will naturally extend across modalities. It also has more attractive scaling laws, subquadratic, as they say, and, again, it's more parallelizable too. These advantages allow it to work up to 1,000,000 byte length sequences, hence the title MEGABYTE. But what is a megabyte? In text, it's 1,000,000 characters, which does exceed, for practical purposes, Claude 100k's 100k token length. In music, it's about 1 minute of sound. This architecture, because it's constituted of transformers, will continue to benefit from improvements to transformers more generally. So obviously, the key question is how it performs. In their experiments, Lili Yu and her teammates show that it does appear to be competitive with the standard transformer methods. So the next step at this point is for Meta AI to take this architecture up to LLaMA scale and see what happens there. It sounds, as you'll hear from Lili's comments, that they are very optimistic. But this is such an experimental science that we can't and won't know for sure until they try. Assuming it is successful, we could see mass adoption remarkably quickly. They have open sourced their methods, after all. But we still might find performance quirks and behavioral surprises for quite a while to come. And even if they're not successful, I think we should continue to watch this sort of research space closely, as the paradigm could definitely still evolve in unpredictable ways. Now I hope you enjoy this conversation about some cutting edge architectural research with Lili Yu from Meta AI.
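To make the byte-level framing above concrete, here is a minimal Python sketch (an illustration written for these show notes, not code from the paper): any string is already a sequence of integers between 0 and 255 once you look at its UTF-8 bytes, so a byte-level model only ever chooses among 256 output symbols.

```python
# Minimal illustration: text as raw bytes (not code from the paper).
text = "MEGABYTE predicts bytes, not tokens."

byte_ids = list(text.encode("utf-8"))  # each element is an integer in [0, 255]
print(byte_ids[:8])                    # [77, 69, 71, 65, 66, 89, 84, 69] for "MEGABYTE"
print(len(byte_ids))                   # sequence length = number of bytes, no tokenizer needed

# A byte-level model always predicts the next value out of only 256 possibilities,
# whereas a BPE tokenizer would first map the string into a vocabulary of ~50,000 entries.
```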
Nathan Labenz: 6:28 Lili Yu, welcome to the Cognitive Revolution.
Lili Yu: 6:31 Yeah. It's a pleasure.
Nathan Labenz: 6:33 Yeah. I'm really excited about this conversation. You've recently published this paper along with colleagues at Meta called Predicting Million Byte Sequences with Multiscale Transformers. And, as soon as I came across it on Twitter and saw the headline figure from the paper, I said this is one that I definitely want to dig a little bit deeper into. So this has been a theme for a couple recent episodes. Definitely want to encourage listeners to take a second and go look at the figure. A picture's worth a thousand words. Kind of get that visual in your head of the shape of the architecture that we're gonna be digging into. It should only take a minute to study it before going on to listen to the rest of the conversation. But I think that will be extremely helpful.
Lili Yu: 7:27 We also have all the math and formulas in figure 2. For some people who are really curious and want to know exactly the details.
Nathan Labenz: 7:35 Well, that's what I hope to understand better at the end of this hour than I do coming in. So maybe just for starters, let me kind of bounce off what I took away from the paper, and you can tell me if I'm missing anything or how you would frame it differently. The big thing that I saw, and I try to avoid analogies wherever possible to describe AI systems because I think they so often can confuse and mislead. But when I look at this architecture, it does kind of look like a fork where you've got kind of the main body, the global model, and then you've got these kind of smaller local models that branch off from that. And so it has this kind of general fork shape. And what I was thinking is, okay. This seems like a sort of different take on a somewhat common theme lately, which is models talking to models in high dimensional space, except you've created a hierarchy that can be trained end to end. So now we have kind of multiple transformers in a single architecture, all operating under one loss function and one optimization function. How'd I do for starters?
Lili Yu: 8:46 It's a good start, but first, the idea of separate agents is a bit surprising to me. Second, the interaction is a little bit different when we think about different agents communicating with each other, compared with end-to-end training of an autoregressive language model. When we say different agents, there are different types of actions coming from different agents. Here, in MEGABYTE, we try to do prediction; everything is bytes, and everything has a causal relationship. The local models follow each other predicting the bytes sequentially; even though the prediction is parallelized, it's actually a sequential, uniform stream of bytes. Right. But architecture-wise, yes, we have these different patches. We use a simple concat operation, which is the patch embedder, the patch embed in the paper. The different patches group different bytes together, and they go to the global model. And then they separate out, as you said, fork out. I think that's the correct understanding.
Nathan Labenz: 10:11 So it seems like there's several big advantages to this, and I'd love to hear you kind of describe each one and maybe talk about which ones motivated you and which ones you think are ultimately most powerful. But the big advantages seem to be, one, that there is the ability just to scale to far larger sequences than a typical transformer can. So you're literally predicting up to 1,000,000 bytes. In this case, the byte is sort of the unit of prediction. So it takes the place of the token that people are familiar with if they're API users of OpenAI or Anthropic or what have you. But going up, anybody who's used these sorts of products would recognize that, hey, we have been kind of living at the 8,000 token level for a while. Now that's starting to expand. We've got Claude 100k, but now you're taking this up to 1,000,000 bytes. So that's a big deal. There's also the performance or, let's say, the compute efficiency advantage, because certain things can be run in parallel and they're both smaller. So I guess it's compute efficiency and kind of ready parallelization. And then third, related to the fact that you're predicting bytes, you don't have to worry about tokens. So you can kind of handle everything as largely sort of raw data. Tell us more about all of those, and maybe start with the one that you think is most important.
Lili Yu: 11:42 Yeah. So I think for this work, it actually all comes hand in hand, basically. What we really want to solve is getting rid of the tokenizer. We want a tokenizer-free language model. However, there must be a reason why people use a tokenizer. One thing is to effectively compress information so it's easier to do the compute; it's cheaper. However, the tokenizer indeed introduces lots of problems. The common problem people may experience right now in the text space is, you have this BPE tokenizer, and the biggest headache is the space character. Sometimes you also see people do certain prompt engineering by prefixing some random combination of letters, and then it's gonna tokenize to certain things, and the model is gonna continue generating such nonsense. Those are all drawbacks of the tokenizer. Of course, another difficulty of the tokenizer is, say one day you want to use a large language model, which is very powerful like ChatGPT, but you want to use it in bio, in chemistry, in some foreign language or code, and then the tokenizer is out of distribution for your new domain, and then you're gonna get problems with fine-tuning or something. So that's what people experience with text. However, it's also a big problem for multimodal. Right now we are focused on text and everybody is using ChatGPT, but we also know GPT-4 is already multimodal. And in the future, multimodal is gonna be a big thing. However, it's a big issue when you want to model images or you want to model audio. For example, in image, there are two main architectures to solve it. One is the diffusion model, which takes pixels, but diffusion models are very expensive and have their own issues. If you want to do autoregressive image modeling, then you get even longer sequences. For example, in our paper we do a 600 by 600 pixel image; that's already 1,000,000 bytes. And then people use VQGAN as an image tokenizer. So the text tokenizer gives you a headache, but the image tokenizer can actually give you a nightmare.
Ads: 14:25 Hey. We'll continue our interview in a moment after a word from our sponsors. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use Cog Rev to get a 10% discount.
Lili Yu: 14:46 Because it's lossy. It's actually truly lossy. For a text tokenizer, the tokenization may not be reasonable, but if you take, say, an input string, tokenize, and detokenize, you get the original string back. But for an image, if you take the image input and do tokenization and detokenization, you get a slightly different image. Some details are off, some colors are off, fingers get mixed up, and if the original has words in it, you cannot read them clearly. Same thing for audio; it's also a lossy process. So that's why we really want to get rid of the tokenizer, whether it's to enable true multimodal modeling in the future, or to be able to easily mix image, text, and audio, or to easily adapt to new domains. We really want to get rid of the tokenizer. So that's one of the biggest problems we want to solve. And that comes to another question. Nothing is for free, right? As I just said, the reason people want a tokenizer is that you cannot just handle such a long raw input sequence for an image. To model a 600 by 600 image, you have to have 1,000,000 tokens, and current architectures just cannot support it. So that naturally introduces a different problem. We need a new architecture to solve this, and that's why we have this very efficient way of modeling, involving a multiscale transformer. So, yeah, I think it's naturally a one-stone, two-birds situation here. We remove the tokenizer, and we have this very efficient architecture to support it, and we demonstrate that across different modalities, across text, image, and audio, we can achieve state-of-the-art performance. Okay, so that comes to another question, right? Can you combine the tokenizer and this model architecture to enable very long sequence modeling? I think that's definitely part of our future work, and we are thinking about it very carefully. But for the current paper per se, that's not what we're trying to solve.
Nathan Labenz: 17:26 So that's all very helpful. The big motivation is, and I didn't even mention the natural extension to multimodality in my already too long first question. So thank you for bringing that up as well, because that's definitely a key part of the result is that you have the same architecture working across different modalities. Let me just try to understand this core figure. Again, listeners, look at the figure. This is figure 1 from the paper. We'll put a link in the show notes. Very easy to find, and pretty easy to sort of get in general. But there are a couple things I want to dig into a little bit more. First, the input side. Everything is bytes, right? This is all bytes in and bytes out, and the architecture itself doesn't care what these bytes represent, right? So I got that. What is a little less clear to me is what are the patches doing on the input side? Because if I understood correctly, and maybe I didn't, I understood that the embedding was lossless. So then I was kind of like, well, if it's lossless and it's all just going into this global model that kind of sits at the middle of the whole thing, what is the meaning of those patches, or what is the function of those patches as opposed to just saying, here's all the bytes that are inputs, just feed them directly into the global model. What's happening in those patch embed modules?
Lili Yu: 19:04 So there are a couple of things here. In figure 1, right, we have this whole string as input, and it's actually input as bytes, so we have 16 bytes. The sequence length is 16, and we separate them into patches, so it's 4 x 4. That means for either the local models or the patch embedders, we have 4 duplicates of them, and each has an input of 4. And the global model only sees the patched input. Each patch embedder takes 4 bytes as input and gives you back 1 vector. So the 4 bytes are already merged into 1 vector, and now the global model only handles 4 vectors instead of 16. That's why the global model is able to see a much shorter sequence, or equivalently, if it's a shorter sequence and you have the same compute, you can make the model bigger. So that's a big difference. If you just put everything into the global model raw, it's 16 as a sequence length, and now, with the help of the patch embedder and the local model, the global model only needs to see a sequence of 4, so it's much shorter and much more efficient. Another thing, which I think is also relevant to your later question, is this padding. The padding is actually really, really critical. A current language model does next token prediction, and how it does next token prediction is actually by padding. If you look at the upper part, the local model, you have, for example, MEGABYTE: the Mega, and each full patch of input. For the local model, you actually need to pad the beginning with this padding token. The way the model works is it sees the padding token and predicts the M letter, and then it sees the M letter and predicts the E letter. That's to make sure it doesn't see the future. Everything it predicts, it only sees the things before it. That's the essence of next token, or here in our paper, next byte, prediction. So that's why we need to add this token. And because the model can see all the previous patches, the input to the global model is not padded with just a single byte; you have to pad a whole patch.
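As a rough sketch of the patching and padding Lili describes here (a simplified reconstruction with toy dimensions, not the paper's actual code): per-byte embeddings are concatenated patch by patch into single vectors for the global model, and the global input is shifted right by one whole patch of padding so that each patch is predicted only from the patches before it.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only.
V, D_LOCAL, P = 256, 32, 4        # byte vocab, byte-embedding dim, patch size
T = 16                            # total sequence length (4 patches of 4 bytes)

byte_embed = nn.Embedding(V, D_LOCAL)

byte_ids = torch.randint(0, V, (1, T))          # (batch, T) raw byte ids
h = byte_embed(byte_ids)                        # (1, 16, 32)

# "Patch embedder": concatenate P consecutive byte embeddings, so the global
# model sees T / P vectors of size P * D_LOCAL instead of T byte vectors.
global_in = h.reshape(1, T // P, P * D_LOCAL)   # (1, 4, 128)

# Shift right by one whole patch of padding: patch k is predicted from patches < k.
pad_patch = torch.zeros(1, 1, P * D_LOCAL)
global_in = torch.cat([pad_patch, global_in[:, :-1]], dim=1)   # still (1, 4, 128)

# Each local model is likewise fed its patch shifted right by one padding byte,
# so byte j within a patch is predicted from bytes < j plus the global context.
print(global_in.shape)   # torch.Size([1, 4, 128])
```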
Nathan Labenz: 22:27 Are all the patch embed modules the same? Are they all identical? Even down to the lookups, the sort of conversion of a byte to, I guess it's even, oh, it's a little different than what I normally think of as embedding, right? Because I normally think of a token having a sort of long form vector representation. But here, because everything is bytes, there's only 256 bytes. They're just represented explicitly literally, right?
Lili Yu: 22:59 I see. So, actually, we still have an embedding. Again, as you said, the embedding matrix is much, much smaller. For example, in a normal large language model, the vocab size is 32k, 65k, or 50k; it depends on what the researchers choose for their task. For that, you have a large vocab mapped to an embedding space. It's actually the same here; it's just that the vocab is so much smaller, the vocab is only 256. And for the 256 you map to an embedding. Again, as you said, because the vocab is much smaller, the embedding size we need should also be smaller, right? Otherwise, it's such a waste if you have a 4,000 dimensional embedding to capture a vocab of 256; that doesn't make sense. So in our paper, the embedding size we pick is 32, which is very small. So you map a one-hot vector of 256, which is the naive way to represent the vocab, to a dense vector of 32: a one-dimensional dense floating-point vector with dimension 32.
Nathan Labenz: 24:34 It's an eighth as big.
Lili Yu: 24:38 Yes. Yes. So we found that it doesn't really impact that much. We can actually go even smaller, but it's actually a very small module. We didn't optimize that much.
Nathan Labenz: 24:53 So there's still this embedding, and you're learning your own embeddings as part of this process? There's not an off the shelf embedding for this, I suppose, right?
Lili Yu: 25:03 Yes. Everything is end to end. So, basically, that matrix, that small matrix, is learned as part of training.
Nathan Labenz: 25:08 The patches are the same. So it doesn't matter if I'm in patch 1 or patch 2, patch 3. If I have the same bytes, that gets converted to the same embeddings and the same data flows into the global model regardless of which patch that is coming from?
Lili Yu: 25:28 Yes. 100%. So there's a slight difference. The difference is the position embedding. There are separate position embeddings for the local model and for the global model. You're going to know which patch you are coming from and which letter you are inside the patch, so the model actually knows your position fully. But the byte embedding is the same, actually. It's the same as for any transformer, because the transformer has no notion of sequence order; everything is parallel. The matrix has no idea of the sequence, and all position information is learned from these position embeddings.
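A small sketch of the two position signals described above (again a reconstruction with assumed dimensions, not the paper's code): one shared byte-embedding table, plus a learned "which patch" embedding and a learned "which offset within the patch" embedding, so the same byte maps to the same vector regardless of where it appears.

```python
import torch
import torch.nn as nn

V, D, P, N_PATCHES = 256, 32, 4, 4   # assumed toy sizes

byte_embed = nn.Embedding(V, D)           # shared: same byte -> same vector in any patch
global_pos = nn.Embedding(N_PATCHES, D)   # which patch (0..3)
local_pos = nn.Embedding(P, D)            # which byte offset inside the patch (0..3)

byte_ids = torch.randint(0, V, (1, N_PATCHES * P))         # (1, 16)
patch_idx = torch.arange(N_PATCHES).repeat_interleave(P)   # 0 0 0 0 1 1 1 1 ...
offset_idx = torch.arange(P).repeat(N_PATCHES)             # 0 1 2 3 0 1 2 3 ...

# Content embedding plus the two position embeddings (summing them is an assumption
# of this sketch; the point is only that patch index and in-patch offset are both known).
h = byte_embed(byte_ids) + global_pos(patch_idx) + local_pos(offset_idx)
print(h.shape)   # torch.Size([1, 16, 32])
```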
Nathan Labenz: 26:12 Helpful in clarifying. So I think I understand everything now, up to the global model. And the global model, if I understand correctly, is basically a transformer, right? It is, so it has all the normal features that we would expect from a transformer, which is to say your residual stream, your attention blocks, your MLP blocks, your nonlinear gate. And the difference in this case is that its output gets directed to all the, instead of being the final output, its output is the input to each of the local models that are going to do the final set of predictions.
Lili Yu: 27:00 That's perfect. Yeah. That's very, very correct.
Nathan Labenz: 27:03 Going back to that padding concept. So, I mean, here's a really naive question. When I'm thinking about predicting text, and maybe the answer is going to be don't think about text, maybe think about a different modality for this. But I'm looking at the diagram and it's like, you've got kind of MegaByteTran are the sort of 12 letters that have been predicted. And then as it works its way through, then the output is that the next, the fourth patch is predicted. And now at the end, you've got 4 more characters, megabyte trans, and the last 4 are SFOR. So you've predicted that chunk of 4 letters. My first question is, why do we even need all of the different local models? What are they doing? What is even happening there? There's got to be something wrong with my understanding. But what I'm kind of thinking is, okay, you take all this stuff. You do the embeddings. You feed it into the global model. The global model does its thing, but it's really only the last patch that's predicting new content.
Lili Yu: 28:13 I see what you mean. Yes. Okay. So this actually comes down to how a language model works. Even with tokens, right? Even with tokens, you predict every word. So with a sentence, for example, "today is sunny," you still predict "today" with empty input, and you predict "is" with "today," and so on. You are not only predicting "sunny." It's the same here. Basically, the model needs to know that, see the left column for example, with a padded patch, the most red color in the figure, basically an empty patch, it needs to predict this "Mega." That's equivalent to seeing empty input and predicting "today." And when it sees the padding plus "Mega," it needs to predict the "Byte," the B-Y-T-E. The one efficient thing about autoregressive prediction is that it predicts every word. That's why this learning is efficient, and that's done by two things. One is this padding; the other one is the triangular masking, the causal mask. So basically, we also want to make a prediction on every local patch. It's the same. We want to predict so that the loss has rich information about the modeling. There are 4 local models, and every local model's prediction is going to be compared with the ground truth to get a loss. So, if we simplify it, we're going to have 16 losses, one for every position, because there are 16 characters. Each character has a ground truth to compare to, and then we're going to get a loss. And in the end, the loss is averaged.
Nathan Labenz: 30:43 So maybe just starting with a vanilla transformer that I'm most familiar with and everybody in the audience will be more familiar with. I don't know how well understood this is. It's easy to kind of forget and just kind of background this fact. But when you do a forward pass through a transformer, you are not just predicting the final token, but you are actually generating a full set of predictions, which, in the case of a GPT-3 type of model, would be a full distribution across the entire vocabulary of 50,000 tokens at every token position. Now you're not using that at the end because you already know ground truth there for what all the input tokens were. But you can kind of examine those if you want, assuming you have access to the outputs, which increasingly, you have to run your own thing to have that. But you can see along the way that, oh, this token was a surprising token, this token was an expected token, et cetera. So there's lots of kind of interesting things you can glean from that. But then also in the vanilla transformer, there is this kind of look back mechanism that happens through every layer, right? So I guess when I think about a traditional transformer, I'm thinking there's 2 reasons to do that. One is what you said, which is because I'm making a prediction at every single token, I have that many more predictions that I can feed back and use to power my optimization. But then also with the regular transformer, up until the last attention layer, you have some possibility, right, for information from earlier tokens to influence the last token that you're actually here to predict. Right? Okay. Now flipping over to the MegaByte architecture, it seems like maybe, if I'm understanding correctly, one of those holds, but maybe not both. So you would still have, you make the prediction for every single byte, which in this case means, because there are 256 possible byte values, you're making a prediction over every possible byte value at every byte position. Because you're doing all those predictions, you have all those loss measures that you can then use to power your backpropagation optimization. That's presumably particularly useful for informing the global model. But in this case, once you're past the global model, now there's no more communication, right, between the local models. So I get kind of one of those two benefits, but not both.
Lili Yu: 33:54 I see. So I have to comment on this. One comment is, in the autoregressive model, I feel like it's underappreciated that we make a prediction at every token. Basically, you have 16 tokens, you make 16 predictions. And that's different from representation models like BERT or RoBERTa, which are encoder models. GPT-2 and GPT-3 at scale are decoder-only models, and we are a decoder-only model too. For an encoder-only model, the loss is only on the masked tokens; that's only 15% of the tokens. So you do 1 forward pass, you only get loss on 15% of the tokens, while for a decoder-only model, you do 1 forward pass, you get a loss on every token. Personally, I feel that's one of the big reasons why decoder-only models gradually got so popular and were able to enable very, very big, very powerful, very uniform models. That's one of the big reasons: it can learn so efficiently, you get loss information, and you are able to leverage the data really well. So that's the second comment, basically. And of course, that's what we want to leverage to make a prediction on every token. And that's supported by the transformer. So basically, think about this: this could be sequential. What does that mean? During inference, you have this padded patch as input, and that goes to the local model. And the local model is going to see the letters 1 by 1. It sees the pad, it predicts 1 letter, M, and then it takes the M as input, M and pad together as input, and predicts E, and it takes the pad and the 2 letters as input and predicts G. So theoretically, it's supposed to be done sequentially; we predict 1 by 1. But thanks to the parallelism of the transformer, during training you can do the forward pass and get every loss, and then you can aggregate the loss all together. So that's one advantage of the transformer architecture. That's also why everyone is using the transformer. Of course, I'm not shunning other architectures. People are also making efforts to make other architectures scalable and parallelizable, but the transformer is native for this, and it works really, really well. And then, as you said, for inference, yes. Basically, at inference, we decode patch by patch. If we want to decode the "Tran," we need to decode the first patch out and take the first patch as input to feed into the global model and predict the second patch, and take the first two patches as input to predict the third patch.
Nathan Labenz: 37:26 And then within each patch, of course, byte by byte.
Lili Yu: 37:29 Yeah. Yeah. Yeah. And within each patch, byte by byte. So that's a small difference between how the model works in training and how it works in inference.
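A compact way to see the "loss on every position" point from the training discussion above: with a right-shifted input and a triangular (causal) mask, one forward pass produces a prediction, and therefore a cross-entropy term, for all 16 bytes at once. In this minimal sketch a single ordinary PyTorch transformer layer stands in for the full global-plus-local stack:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, T = 256, 32, 16                     # byte vocab, width, sequence length (toy sizes)
embed = nn.Embedding(V, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
head = nn.Linear(D, V)

target = torch.randint(0, V, (1, T))      # the 16 ground-truth bytes
pad = torch.zeros(1, 1, dtype=torch.long) # padding id 0 is an assumption for the sketch
inp = torch.cat([pad, target[:, :-1]], dim=1)   # shift right by one byte

causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # block the future
hidden = layer(embed(inp), src_mask=causal)
logits = head(hidden)                     # (1, 16, 256): one prediction per byte position

# One cross-entropy term per byte, averaged: 16 supervised predictions per forward pass.
loss = F.cross_entropy(logits.reshape(-1, V), target.reshape(-1))
print(loss.item())
```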
Nathan Labenz: 37:37 I'm with you on the training portion. If we are using the MegaByte architecture for inference, though, is there any reason that I have to run the earlier patches? Or if I'm just concerned with inference, could I just run the final local patch?
Lili Yu: 38:02 So yes, if you know the previous patches, you can run just the last patch, right? It comes down to how you want your language model to work. For example, if you ask ChatGPT, I don't know, a coding question, then what the model is going to generate is the whole coding answer. Your question, the prompt, doesn't need to run 1 by 1; it can run in parallel. But everything the model generates has to run token by token. It's the same here. If you know the previous patches, you can input those and run them in parallel. But everything the model doesn't know and needs to continue generating, it needs to run byte by byte.
Nathan Labenz: 38:54 But I'm stuck on this point only because it will either help me confirm or poke a hole in my understanding. But if I am doing new inference with the MegaByte architecture, I have to run the whole thing in a loop because I'm predicting 1 patch at a time, and I can only predict 1 patch at a time because any future patches depend on that patch first. That's basically the meaning of autoregressive. But when I look at figure 1 and I'm thinking, okay, there's 4 patches here. Do I, in practice, could I just not run the first 3 patches at inference time and just embed, you know, do all my local embeddings, do my global model, and then just run the final patch?
Lili Yu: 39:45 Do you know about the first 3 patches? Is that given?
Nathan Labenz: 39:48 Yeah. So let's say I'm doing exactly this example. So my first 3 patches are megabyte tran, and I input those. They get embedded. They get passed into the global model. The global model does its transformer thing. It then produces outputs that would feed into all 4 patches. But I'm trying to confirm what I think I understand, which is that at that point, there's no further information exchange between the patches. And so if I'm only concerned with the new patch output, then I could just say, okay, I'm gonna ignore those first 3 patches of output from the global model and only process the fourth patch from the global model because that's all I really need the output for.
Lili Yu: 40:34 Yes. That's 100%. So after this global model, right, you get 4 outputs, labeled as h global-out in the paper, our hidden representation of the global output. At this point, you can take only the last one, and then you're going to use that together with the local model to decode the last bytes you need. Yes.
Nathan Labenz: 41:00 Okay. Cool. But in training, you do have to do or you don't, I guess, maybe you don't have to, but you do run all of them because they all contribute to the presence of all these loss values that then you can use to optimize the model end to end in the first place.
Lili Yu: 41:18 Yes. Yes. We want a loss from every byte. Yes.
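Schematically, the inference pattern just described looks like the loop below (a pseudocode-level sketch; `patch_embed`, `global_model`, and `local_model` are hypothetical stand-ins, not the released code): one global step per new patch, using only the last global output, then byte-by-byte decoding inside the patch.

```python
import random

# Hypothetical stand-ins so the sketch runs; in MEGABYTE these are transformers.
def patch_embed(seq, patch_size):
    return [seq[i:i + patch_size] for i in range(0, len(seq), patch_size)]

def global_model(patches):
    return patches                      # stand-in: pass patch summaries through unchanged

def local_model(global_ctx, partial_patch):
    return random.randrange(256)        # stand-in: pick one of the 256 byte values

def generate(prompt_bytes, n_new_patches, patch_size=4):
    seq = list(prompt_bytes)
    for _ in range(n_new_patches):
        # Global step: only the LAST global output is needed to start a new patch.
        ctx = global_model(patch_embed(seq, patch_size))[-1]
        # Local step: decode the new patch byte by byte (autoregressive).
        patch = []
        for _ in range(patch_size):
            patch.append(local_model(ctx, patch))
        seq.extend(patch)
    return bytes(seq)

print(generate(b"MEGABYTETRAN", n_new_patches=1))
```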
Nathan Labenz: 41:22 So that is really interesting. This is not a strong theory of mind, but increasingly, I see kind of a lot of things, and I'm like, a lot of these things seem isomorphic to other things or kind of nearly so. So I was thinking, is there a sort of modified vanilla transformer that I could kind of imagine that would be akin to this? And what I came up with was, you know, typically, the transformer has the same width throughout. It has kind of the same dimension at each block. But here, you're kind of shrinking the dimensionality of the input in order to have a more efficient global model and then kind of chunking in order for more efficient parallelization. Do you think you could kind of do something similar where if you devise a transformer, but you said the early layers are gonna be wide, but then they'll go through a bottleneck, and then at the end, they'll get particularly narrow. Does that seem like it would potentially have some similar value?
Lili Yu: 42:34 Yeah. Yes. So there are a couple of things. One thing we really want to do is take advantage of the powerful transformer. In some sense, the global model is an intact transformer, and the local model is also an intact transformer. That's actually by design, because, as you said, there are many, many optimizations of the transformer architecture; some of them are very memory efficient, some of them are FLOP friendly. But in the end, what we really care about is how well it scales. There is lots of great work, but it needs to be tested at large scale. Nowadays, most people doing large language models still use the dense transformer, because it's the most well tested and it performs well when you have a big model, a lot of data, and a lot of compute. I think that's why we don't want to touch that. And in many senses, if one day people find a more efficient attention mechanism, we can just swap it in, right? We can swap the attention in the local model and the attention in the global model with this new attention mechanism. So that's by design. The second thing is this autoregressive prediction, and this masking is actually very important. If you want to do an autoregressive transformer, you have to guarantee that. Otherwise, you have information leaking, and then your model will learn nothing. For large language model learning, you need to make the task challenging; if you already know the ground truth, you just copy from your input. When you patch the transformer locally, it's really hard to guarantee that. Because we have this triangular masking on the input, if you arbitrarily rescale your inner dimension, or, as you said, make the model wider or narrower, you have to add a very, very complex masking strategy inside, and it's really, really hard. And the third thing is, there's something quite pretty about how we design this global model. The global model has a shorter sequence, and a shorter sequence actually has a couple of advantages. One is on the attention; that's what people care about the most when they think about long sequences, because this quadratic memory scaling gives people issues when you have a long sequence. But if you make the sequence shorter, it also reduces the feedforward, which we also discussed in the paper. Nowadays, the feedforward is the most compute-expensive part of your model; in GPT-3, the attention only takes about 3% of your whole compute, and the feedforward is a big chunk. And the way we do it, the global model is 1 over P of the FLOPs of a naive transformer. So, say we take a patch size of 8; now our global transformer only needs 1 eighth of the compute, and we can make it bigger and more powerful. I think that's why we stick to this current design, for many reasons.
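To put rough numbers on the 1-over-P argument (a back-of-envelope illustration using common per-layer FLOP approximations, not figures from the paper): shortening the global sequence by a factor of P cuts the attention term quadratically and the feedforward term linearly.

```python
# Back-of-envelope FLOP comparison per layer (illustrative assumptions, not paper numbers).
# Rough approximations: attention ~ 2 * T^2 * d, feedforward ~ 8 * T * d^2
# (for the usual 4x hidden expansion); other constants are ignored.
def layer_flops(T, d):
    attention = 2 * T**2 * d
    feedforward = 8 * T * d**2
    return attention + feedforward

T, d, P = 8192, 2048, 8                   # sequence length, model width, patch size (assumed)

dense = layer_flops(T, d)                 # one layer over every byte
global_layer = layer_flops(T // P, d)     # the global model sees only T / P patch vectors

print(f"dense layer:        {dense:.2e} FLOPs")
print(f"global layer (T/P): {global_layer:.2e} FLOPs")
print(f"ratio: {dense / global_layer:.1f}x cheaper at the global level")
```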
Nathan Labenz: 46:22 Yeah. That's really interesting. And then I suppose too, it also sort of suggests the potential for modularity. Right? I mean, we see all these projects where certain, you know, a language model is frozen and combined to something else. And I can, as you're talking about this more, I can start to easily imagine freezing different parts of the model, you know, having different kinds of local models that could perhaps be swappable depending on, you know, we may only need to fine tune the local models for certain tasks or, you know, who knows what. Right? But it seems like this architecture would lend itself quite nicely to modularity downstream.
Lili Yu: 47:06 Yeah. Yeah. I totally agree. I think there's actually, in the VLM, the video language model space, such an architecture to adapt a pre-trained text language model, adding image understanding and maybe even image generation functionality on top of it. In essence, it's quite similar, yes.
Nathan Labenz: 47:34 Yeah, almost like a different head, so to speak, if you had a similar, maybe even the same concept. But yeah, if you can train a single global model that can handle different kinds of inputs and represent them in a meaningful way, then you could have different local models that create different sorts of outputs based on that single understanding.
Lili Yu: 47:57 Yes. 100%.
Nathan Labenz: 47:58 Yeah. This is really interesting. So the more I learn about it, it makes sense why some of the leading thinkers in the space have been very excited about this paper. You've kind of covered a little bit already the how you choose the patch size. It seems like that's basically a trade off. The way I was thinking about it, and you can, you know, complicate this, but I was kinda thinking in the limit, if you just had 1 giant patch, then you basically just have 1 giant model or 2 kind of stacked, you know, global models, which would have all the normal downsides of that. If you, on the other hand, you know, took the other limit, then you would have an insane number of patches. And I guess I don't know exactly what goes wrong there, but it seems kind of silly. So you're kind of looking for this happy medium in the middle that trades off these 2 extremes, and that's an experimental process, it seems.
Lili Yu: 49:04 Yes. Yes. 100%. As you said, either you choose patch size as the whole thing, or you choose patch size as 1. That's both 2 extremes. We just become a normal transformer, basically.
Nathan Labenz: 49:20 Okay. So the patch size of 1, yeah, it doesn't have to do anything. Then it's just the global model chooses the output. Okay. That makes sense.
Lili Yu: 49:26 At either end, we become the naive transformer, and in our experiments we also ablate that. We actually found that the combination of both is really, really helpful. Then it comes to the question of how to choose the patch size. Theoretically, if you want to optimize the memory of your attention, we have a formula for how to choose the patch size. But we actually pick the patch size quite heuristically right now. For text, we try not to go too far away from the BPE token. Right now, 1 BPE token normally corresponds to around 4 bytes, so we pick a patch size of 8 to be not far from that, but also compute efficient. And for image, we pick patches of 8x8 or 16x16; that's what people do with image tokenizers. So basically, how we pick the patch size is kind of inspired by how people do tokenization, but we don't want tokenization at all in our paper. That being said, in our paper we also studied how to pick a certain patch size, but I will say that's not comprehensive at all. We do need to study a wider range, especially going back to the long sequence story. If we make the local model as strong as a normal transformer, say 2,000 tokens for the local model, and we have, say, 8 or 16 of those patches, then that means we're effectively training a 16k or 32k normal transformer. So there's a whole wide spectrum of how we can pick the patch size. But interestingly, overall, we find a wide range of patch sizes actually works quite well for our task. So that's something that definitely needs more analysis, which region works best and how we pick the model size too. We do find it quite stable with respect to the patch size at this point.
Nathan Labenz: 52:05 Yeah. It's interesting. I guess the intuition for that would maybe just be that there's some curve and it's relatively flat near the local minimum, and so you have a range there that you won't see too much difference with. Maybe you could just describe the different multimodal experiments. Because we've talked largely about text, but there's also stuff pertaining to image and even music. Maybe just give a little rundown of even just what those datasets are, because I don't think people have a great sense when they see a name of a dataset, what even is that task? So give us just a sense of the generality that you've been able to demonstrate.
Lili Yu: (52:48) Yes. So basically, this autoregressive language modeling has been very popular on text, but people have also been doing it for both image and audio for a very long time. However, for image, due to the challenge I just described, the image has a very long sequence, so it's not as popular. But even early on, right after the Transformer, there was the Image Transformer. However, they have to compress the inputs, so it's also a lossy process. Same with audio. Back in, I believe it was 2016, there was WaveNet, where you predict the audio wave values one by one. So the task is basically the same: predict something, that something is a byte, and predict the bytes one by one. For image, it normally comes with an image size, say a 4K image. When you read the raw image, it's 4K times 4K, so that's the square, times 3 for the RGB channels; that many bytes is your whole image. And for us, we just take it; one image is a 3D matrix, and we just flatten it out, and that's your input sequence. It becomes a 1D sequence, and now you can just treat that as text, and you predict one by one. There's a little bit of a caveat there. We actually didn't just purely flatten it out. We take it patch by patch, like 8x8 patch by 8x8 patch, and then flatten it out. That's illustrated in the figure. But for simplicity: for image, you just take the input pixel values, make it 1D, and predict one by one. It's the same for a wave. When you read audio, the audio is represented as a wave. So you can read the wave input, change that into a 1D byte sequence, and predict one by one. So basically that's the difference. And for image, you just predict one by one and compare with the ground truth, and that gives you bits per dimension. There are many names for the metric, but it's all the same, basically: how accurately can you predict one value out of 256. And we actually achieved state of the art for the smaller images, matching the state-of-the-art value on ImageNet 64. And of course, we don't want to stop there. The reason we compare on ImageNet 64 is that people rarely do longer sequences of images, just because the earlier models didn't support that. But to illustrate the capability of our model, we did images of 256 as well as images of 640 pixels, which is equivalent to 1 million bytes when we flatten it out into 1D. So that's the image experiment. On the audio experiment, basically, we take some audiobook data as well as some speech data. In the end it's just a wave, and you want to know how well your model can accurately predict or recover that wave information. There's actually a very interesting thing here. Normally audio is encoded not in bytes, not with a bit depth of 8, which would be 256 values; it's actually 16 bit depth. So that means when you read one value, that's not picking one out of 256; that's actually picking one out of about 65,000. That gave earlier research a big headache, because the softmax over 65,000 values is very expensive. So what people have done is either use a quantization-based approach to map each value into one out of 256, or they do hierarchical decoding: on the decoding side, they do tree-like decoding, decoding into coarse bins and then further decoding into smaller bin values. So that has been a big issue.
But what we are doing here is, we just read the file bytes. So we ignore the particular encoding and decoding of the audio file. We just take the file, read the bytes, and model that. I think that's actually very interesting, and that's the future direction we want to explore. Basically, for a given file, forget about whether it's an image, audio, or video; you just read the byte values, and then you model that. I think that's going to be very interesting.
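A sketch of the "just read the bytes" idea (hypothetical file path; NumPy used only to illustrate the patch-wise flattening Lili mentions for images):

```python
import numpy as np

# 1) The most literal version: any file is already a 1-D byte sequence.
#    (hypothetical path; the same call works for audio, image, or text files)
# raw = np.fromfile("example.wav", dtype=np.uint8)   # values in [0, 255]

# 2) The image variant discussed above: an H x W x 3 RGB array, flattened
#    patch by patch (8x8 here) rather than row by row.
H = W = 64
patch = 8
img = np.random.randint(0, 256, size=(H, W, 3), dtype=np.uint8)  # stand-in image

blocks = img.reshape(H // patch, patch, W // patch, patch, 3)
blocks = blocks.transpose(0, 2, 1, 3, 4)   # (H/8, W/8, 8, 8, 3): one block per 8x8 patch
seq = blocks.reshape(-1)                   # 1-D byte sequence of length H * W * 3

print(seq.shape)   # (12288,); at 640 x 640 x 3 this would be about 1.2 million bytes
```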
Nathan Labenz: (58:50) I want to cover results, and then I wonder if there are any things that you would flag in terms of, at a high level, are there tradeoffs here? You present the results in two ways, as I understand it. One is basically a compute-matched experiment where you say, okay, I have a certain compute budget. I'm going to train a classic Transformer, and also the Perceiver AR architecture, and then you've got the new Megabyte architecture. Same compute budget, I'll train them each as well as I can on this given compute budget, and then I'll compare how they do in the end. And that is done on a byte by byte basis, right? So we're using the same encoding in that experiment and finding that when you're dealing with bytes, the Megabyte architecture just blows away the others. But then the other question is, okay, well, people don't usually do transformers in that way, for all the reasons that we've discussed. So then the other approach is, let's take a dataset, the same dataset. And here, I was a little bit confused, but it seemed that you're taking the same dataset and training a Transformer, Perceiver AR, and a Megabyte on that same dataset. Say it's a text dataset. I wasn't quite clear how the compute budget compared across those different architectures, but the key punch line is the results are comparable.
Lili Yu: (1:00:31) Yeah. So it's actually a hard experiment for us to do, I mean, the latter case. Let me first explain why compute matching is so important. It's actually quite well known to everyone that, for a scalable and capable model architecture, adding more data as well as more compute hours, with the right setup, normally gets you better results as you train longer and add parameters. So that's why we think compute matching is very important. It's about opportunity cost. If I take the local Transformer and just add the global Transformer on top, that's for free, right? In most cases it's definitely going to get better, but what we want to consider is the opportunity cost: if you want to add a global model, you have to make your local model smaller, because you only have so much compute. That's the general idea. We want to do compute matching, and that's why we implemented Perceiver AR in house. Of course, we ran experiments to validate that we get results comparable to the original paper. In the compute-matched experiments, we run the baselines, which are Perceiver AR, of course, and the naive Transformer, and we run all three models ourselves. The three models see exactly the same data, and they train for exactly the same hours. So we think that's a fair comparison, and that's what really makes us happy and gives us the confidence that Megabyte is good. However, as you said, most people don't do it this way. When they publish a paper, they don't tell you exactly how many hours they trained, on which GPU, what the batch size was, et cetera. So we also want to convince ourselves, and let ourselves know: are we comparable with other implementations? We want to compare to more models, but for many models we don't know the details. So in that case, for the benchmark run, we try to run the model longer, until the model converges; that means the validation loss is not going down anymore. And we take that value to compare with what people have reported in their papers. Again, we don't know exactly what happened in their papers, but that's the idea: we want to at least reach the ballpark, or at least know we are not using more compute than them while still getting comparable performance. So, at a high level, those are the two sets of experiments we are running.
Nathan Labenz: (1:03:33) So this is table 3, right, in the paper. And the bottom section, the bottom three lines, are all the experiments that you ran with the byte level encoding, and you're basically showing there that Megabyte dramatically outperforms the other architectures with the byte level encoding. And then comparing to the top section, it's that validation column, and I guess also the test column, where you're sort of saying, okay. Now, and this is where I'm a little bit unclear as to exactly, and maybe you're also saying it's not fully clear, it's not always known exactly what somebody else did. But if I compare the Megabyte row at the bottom to, let's just take the top one, the Transformer XL from the top, do we know the relative compute budgets of those, or do we just know the dataset it was trained on? Like, what is the constant in that comparison?
Lili Yu: (1:04:37) Yeah. The only constant in the comparison is the dataset. Unfortunately, for many of these works, we don't know how much compute they used. But one thing is that PG-19 is a small dataset, so basically that means you run multiple epochs of it. For our other in-house experiments, for example in table 2, those datasets are much larger than PG-19. But unfortunately, people didn't really report on all those datasets. So, yeah, we are definitely doing a confined optimization here.
Nathan Labenz: (1:05:17) So, the bottom line is, if you use byte-level encoding, byte-level prediction, then the Megabyte architecture is blowing the others out of the water. But if you're comparing on, let's say, something like each architecture working in its optimal condition, then you can get comparable performance. And then the punch line is, but you still get all these other advantages of the Megabyte architecture that we've discussed.
Lili Yu: (1:05:50) Actually, that's not the full story. I think the full story is: for bytes, we believe Megabyte is the best. What people are also interested in is, if you have byte input, what's optimal? We already know it's Megabyte; we are pretty sure. On the byte prediction task, we are the best. But another question is, does byte-level prediction as a language modeling task work as well as when you have a tokenizer for the language task? I think that's what we try to get insights on from table 3. We didn't really answer that in this paper, but we are working on that right now. Byte level has its advantages, but we want to know: can we just replace the tokenizer as a whole in the future and get wide adoption, so we don't need to worry about the tokenizer, with all the troubles I mentioned at the beginning of our conversation? That's something we are currently working on.
Nathan Labenz: (1:07:02) So there's a conversion process there where you're basically saying we're going to rescore our byte level prediction output as if it were delivered in token form, and then we can compare on a more apples to apples basis.
Lili Yu: (1:07:19) Yes. Yes. That's exactly what happened here. When you see table 3 and table 2, their values are in totally different ranges, right? Because table 2 also has PG-19 results, and the BPB is around 1. But when you see table 3, that's converted: that's converted to word-level perplexity, and it's in the range of 42. So there is a conversion between them. In table 3, that word-level perplexity of 42 is actually equivalent to about 0.8 BPB. So maybe that's the confusion. The reason is that the other work in the top section doesn't report BPB values, so there's no way we can compare; we have to convert our results to word-level perplexity. Yes.
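For reference, the usual conversion between the two metrics is word-level perplexity = 2^(bits per byte x average bytes per word). A worked example (the 6.7 bytes-per-word figure is an assumed average for book text including the trailing space, not a number quoted in the conversation):

```python
# Worked example of the standard BPB <-> word-level-perplexity conversion.
bpb = 0.8                   # bits per byte, as mentioned above
bytes_per_word = 6.7        # assumed average bytes per word (illustrative, not from the paper)

word_ppl = 2 ** (bpb * bytes_per_word)
print(round(word_ppl, 1))   # ~41.1, in the ballpark of the ~42 word-level perplexity above
```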
Nathan Labenz: (1:08:21) The upside of this, hopefully, is pretty well established. You've got the avoidance of all the tokenization issues. You've got the natural flexibility to handle all these different modalities. There's the compute efficiency. The performance is all there. What do you think happens as you scale this up? I mean, there are a couple of interesting little wrinkles in the paper, particularly around strided inference, where there is some sort of performance penalty associated with the boundaries between the patches, and so you have an interesting strategy for overcoming that. But then I've also seen, most of the commentary on this paper online has been effusively positive and really hyping it up. But one of the more interesting things I saw, and I really don't have an intuition for this myself, but I wonder if you do, was somebody said: this looks amazing, but I do wonder if it will demonstrate the same sort of in-context, few-shot learning type behavior that the vanilla Transformer does. Maybe it wouldn't, because of the sort of early localization of information, perhaps. So maybe just for starters, tell us what you observed at the patch boundaries and how you've overcome that so far, and then how that may develop and what you expect as you think about scaling this up much bigger.
Lili Yu: 1:09:52 Yeah. So, basically, again, as you said, if we were to answer this question, will people adopt this in a wide range? I think the only way to test that is to scale it up. Because nowadays, people don't really care if you train a model that's less than 1 billion parameters and you only train on less than, let's say, 100 billion tokens. People don't really care about that, because we already know there is a regime where the model can be more powerful. So that's what we're currently working on. Can we train using similar compute to LLaMA and get matching performance to LLaMA? And then we don't need the tokenizer either. That's what we're currently working on, and we think that's the ultimate test of whether this is going to work well, be able to get adopted, and be widely used. One thing about the boundary is, that's something really, really interesting, and we have a couple of ways of thinking about how to solve it. One thing we explored a little bit is a convolutional layer. The convolutional layer actually improves the model slightly, but it didn't improve it too much. Ultimately, our hope is that this is going to actually be solved with scaling. When the model sees enough data, the model should learn the invariance across the boundary. So I think that's exactly the ultimate goal. We're very hopeful that with more data, the model can learn that. On the other side, for in-context learning, I think you shouldn't really be worried, because it's all shown in the perplexity, right? Ultimately, if the model can understand and predict the next word well, which shows up in the perplexity, then your in-context learning shouldn't be a problem. So I think ultimately, again, both questions should be answered when we have this large-compute and large-model-size experiment done.
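For readers wondering what working around the patch-boundary penalty can look like at inference time, here is a rough sketch of a strided scoring scheme: score the sequence twice with patch grids offset by half a patch, and for each byte keep the score from the pass in which that byte sits deeper inside its patch, i.e. has more local context. This is a simplified illustration with a toy scorer standing in for the real model, not the paper's exact procedure.

```python
from typing import Callable, List

def strided_scores(seq: bytes, patch: int,
                   score: Callable[[bytes], List[float]]) -> List[float]:
    """Combine two scoring passes whose patch grids are offset by patch // 2."""
    pass_a = score(seq)                  # patch grid aligned at offset 0
    pass_b = score(seq[patch // 2:])     # patch grid shifted by half a patch
    combined = []
    for i in range(len(seq)):
        depth_a = i % patch              # how deep byte i sits in its pass-A patch
        j = i - patch // 2               # index of the same byte in pass B
        if j < 0:
            combined.append(pass_a[i])   # pass B never covers the first half patch
            continue
        depth_b = j % patch
        # keep the score from whichever pass gives this byte more local context
        combined.append(pass_a[i] if depth_a >= depth_b else pass_b[j])
    return combined

if __name__ == "__main__":
    P = 8
    # Toy scorer standing in for the real model: loss is highest right after a
    # patch boundary, which is the effect being smoothed out.
    def toy_score(s: bytes) -> List[float]:
        return [1.0 - (k % P) / P for k in range(len(s))]
    print(strided_scores(bytes(range(32)), P, toy_score))
```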
Nathan Labenz: 1:12:10 There are a lot of surprises in these systems, right? In some sense, it's all a giant surprise that Transformers have scaled as well as they have and seem to keep scaling. Even Yann LeCun has recently said some things about like, yeah, they probably do have some nascent world models. He wouldn't go as far as calling it a full world model, whatever. Obviously, there's a huge debate around that. But it does seem like these things keep surprising us on the upside in general. Do you have any sense for how the surprises with this architecture might be different from the surprises we've seen with the vanilla Transformer, or is that just something that we have no way to guess except to go scale and find out?
Lili Yu: 1:12:59 I will say, if it's a surprise, at least an emergent behavior, most of the time it just lets us get surprised. But one thing we hope it can be better at is actually code and math. It's less talked about, but coding, as well as math problems, are kind of a big issue with the tokenizer, as well as with sequence length. For coding, you always need a longer sequence to be able to model a coding problem. Also, the way you tokenize it is just really, really unnatural. If we get on par with the perplexity of a normal language model like LLaMA, we are also really looking forward to seeing how this scaled MegaByte works on math and coding problems. Yeah. That's one of the domains we want to really watch closely.
Nathan Labenz: 1:14:11 Yeah. The math in particular definitely makes a lot of sense. Some of the explorations of what a typical language model has to deal with in order to figure out math at all can be pretty crazy sometimes where it's like these long integers people are trying to get it to add together are actually being parsed into 2 and 3 digit tokens. And it's like, oh my god. No wonder it can't do the math when it's looking at it like that. It's definitely coming at it from a serious deficit for starters. The coding one, I have a little less of an intuition for just because it seems like tokenization there, I would guess, is usually less weird. But
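As a quick illustration of the digit-splitting issue, the snippet below pushes a few numbers through the GPT-2 encoding via the tiktoken library and prints the resulting pieces. The library and encoding name are real; the splits shown in the comment are only indicative, since the exact chunking varies from number to number.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for number in ["1234567", "98765432101234", "3.14159265358979"]:
    ids = enc.encode(number)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{number!r} -> {pieces}")
    # Typically prints uneven 1-3 digit chunks, e.g. something like
    # ['123', '45', '67']; the model never sees aligned digit columns.

# Byte-level input, by contrast, is perfectly uniform: one symbol per character.
print(list("1234567".encode("utf-8")))
```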
Lili Yu: 1:14:58 Yeah. Yeah. Yeah. So one interesting experiment is, again, at the very beginning we talked about why people need tokenization to compress, right? If you take your domain and run your BPE on it, you get a different tokenizer, and that may give you a different compression rate. What we did in one of the earlier works of my colleagues, InCoder, which is basically almost like a Codex model developed by Meta, a code-specific language model: if you take the common BPE, like the GPT-2 BPE tokenizer, and compare it with, let's say, an in-house tokenizer we train, we can get 30% fewer tokens. That just tells you a GPT-2 tokenizer on code is not efficient. It's 30% less efficient. So it's also very domain specific. Also think about code: sometimes there are multiple spaces or something. It's really hard to figure out how a BPE tokenizer handles that and how that's actually meaningful. Yeah, I think there are issues with the current way people do that.
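One rough way to see how domain-dependent compression is: count tokens per byte for a code snippet under different tokenizers. The sketch below uses two off-the-shelf tiktoken encodings purely as stand-ins; it does not reproduce the InCoder comparison or the 30% figure, which involved an in-house tokenizer trained on code.

```python
import tiktoken

code = """\
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

# Compare how many tokens each encoding needs for the same bytes of code.
for name in ["gpt2", "p50k_base"]:
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(code))
    n_bytes = len(code.encode("utf-8"))
    print(f"{name}: {n_tokens} tokens for {n_bytes} bytes "
          f"({n_bytes / n_tokens:.2f} bytes per token)")
```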
Nathan Labenz: 1:16:23 Yeah, this probably also connects to, I mean, they haven't published a ton about this, but it's kind of gradually come to public light that the latest OpenAI models have a lot of code early in their training. And, presumably, this is related: with presumably more code-friendly tokenization, it has an easier time learning certain logical structures. And then that can benefit all tasks, in a way that is really hard to get to happen with just traditional natural language and traditional tokenizers. So, yeah, that makes me actually start to think that maybe this might just be strictly better, which I'm sure is what you're hoping too. Yeah.
Lili Yu: 1:17:14 Yeah. I think, in many senses, we are happy even just with the simplicity part. I don't know if you are familiar with the Galactica work, also from Meta. It's particularly trained on arXiv and science. There is actually a whole strategy for how you tokenize it. It's like, oh, in this case we use BPE, in that case we do bytes, in that case we do digit splitting, in that case we do this. So there are, I believe, 3 or 4 rules combined together just to do the tokenization. Users won't really feel the problem, because maybe the only concern, as you said, is that you run the tokenizer, see how many tokens you have, and that's going to change your expense. But for developers, it's actually quite tricky to pick all these rules and handle them correctly and effectively. And when you decode, you also have different ways of decoding. It's actually a hassle. Yeah. So we are hoping this is going to simplify everything. Yeah.
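To give a flavor of what a multi-rule tokenization pipeline involves, here is a small sketch of a rule-based preprocessing stage that could sit in front of a BPE step. The two rules shown, splitting numbers into single digits and marking runs of spaces, are illustrative choices inspired by the discussion above, not Galactica's actual recipe.

```python
import re

def preprocess(text: str) -> str:
    # Rule 1: split multi-digit numbers into space-separated single digits,
    # so arithmetic is seen digit by digit.
    text = re.sub(r"\d+", lambda m: " ".join(m.group(0)), text)
    # Rule 2: replace runs of 2+ spaces with an explicit width marker, so a
    # downstream BPE step sees indentation or alignment as a single unit.
    text = re.sub(r" {2,}", lambda m: f"<sp{len(m.group(0))}>", text)
    return text

print(preprocess("total = 12345   # three-space gap before this comment"))
# -> 'total = 1 2 3 4 5<sp3># three-space gap before this comment'
```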
Nathan Labenz: 1:18:29 Anything else in this work that you want to highlight that we haven't got to so far?
Lili Yu: 1:18:34 We are very excited about either scaling it up or, say, being able in the future to directly model the raw format of any file, any input, any modality. I think that's a really exciting direction for us.
Nathan Labenz: 1:18:52 Yeah. So I'll be looking out for the next paper. I just looked, this one was published May 19, and we're recording here about a month later. Knowing the size of the clusters that you guys are working with over there, I feel like I can set my stopwatch, and it shouldn't be too much longer before we get some of these questions answered. I think it'll certainly be very interesting to see how that comes out. Maybe just zooming out for a couple of big-picture questions. Meta AI has really been on a sort of incredible roll lately and has been the subject of quite a bit of speculation. There's the Google memo, and I don't necessarily endorse all the conclusions of that memo, quite the contrary, but the idea there is that Meta is winning because it's got this big open source community. What's it like to be at Meta AI right now, with kind of hit after hit? What is the vibe? What is the big vision? It seems like it's probably very different from some of the other places, but I'd love to just hear your reflections on what it's like to be a part of that.
Lili Yu: 1:20:04 I have to tell it from my personal angle. I'm also working on, in general, multimodal language models. So I think, as I said, that's one of the inspirations for why we did MegaByte. And in our team, we also believe the future belongs to multimodal or mixed models. We are not thinking: now we have ChatGPT, now we have Midjourney. Why can't we have a unified model that's able to do everything? That's definitely one big trend we're working on, and it's really, really exciting. I cannot ask for more at this stage.
Nathan Labenz: 1:20:42 It's been striking to see. Obviously, lately, so much focus has shifted toward AI safety, and different organizations have come out at the leadership level and signed on to the extinction risk statement. OpenAI, DeepMind, Anthropic, leadership at those companies have all signed on. Is there an active dialogue about this within the Meta AI team? I imagine there has to be, but we haven't really heard as much from Meta about how we develop all this technology without losing control of it.
Lili Yu: 1:21:25 Yeah. Of course. Of course. We have formed a couple of Responsible AI teams, and we are always very, very careful about the data. We are always very, very careful about the bias coming out of the model. We are very, very strict about how we do open source. At least right now in my team, in the big org, the Meta research lab, we still have open science as our major goal. But the process of doing open source is very, very strict. We have to pass all these bias filters, we have to answer all these questions, so that's basically embedded into every project. Of course, we do hear, every day, everyone complaining why it's so hard to use this model, why it's so hard to open source this data or open source this model. But it's the world we have to live in right now. We have to do responsible AI. And, again, on the other side, there are particular teams forming. So I think at Meta, we are very much following this. But no matter how hard it is, the researchers here are interested, and it seems like the leadership is aligned with open science. We want to do open science. I think, again, and this only represents my own opinion, we believe that to be able to do responsible AI, we don't necessarily need only a very small number of companies to do it. We should open source it, and everybody can help debug, help develop safer models. So yeah, I think it's definitely heated. I will say everyone has a different opinion, but the baseline is every open source model has to pass some internal filter. It cannot be biased. Yeah.
Nathan Labenz: 1:23:40 Do you think we're going to start to see a sorting of researchers across these orgs? Because you've obviously published this paper, and along with that comes some fun stuff. Folks at OpenAI still do publish some, but obviously not nearly as much. If they had invented the MegaByte architecture, I doubt they would have published it at this point. So it almost seems like a kind of divergence may be developing: people that value the open source ethos and want to share go to places like Meta, and people that maybe have more of a "we want to build a world-changing product" or "we feel like we have to keep this stuff secret because it ultimately is too dangerous to share" mindset go to these other labs. Do you see kind of a polarization or a self-sorting starting to take place?
Lili Yu: 1:24:38 Yeah. Yeah. 100%. I think that's unavoidable when the AI industry is mature enough that you can have quick prototyping or quick productionizing of certain models or certain ideas. I think it's also understandable for certain companies to say they don't release their models. It's just a little bit unfortunate for the community. For us here at Meta AI, one of the reasons we are here is that we believe in open science, and we are sticking to that. But I do see, I agree with you, I do see this polarization. Another thing is, and it hasn't happened to Meta, but it may happen: when the product team and the research team are doing work that is more and more similar, the resources could be pulled into the product team easily, because they are optimizing the customer-facing experience, the customer experience, while we, as a pure research team, are doing research that's far ahead. So in such a competitive space, the product team has more priority. So one sad possibility could be that the resources shift towards the product side, and then we get less.
Nathan Labenz: 1:26:08 And you mean by resources there, you mean just compute, like access to GPUs?
Lili Yu: 1:26:15 100%.
Nathan Labenz: 1:26:16 Yeah. Well, it seems like so far so good on that.
Lili Yu: 1:26:19 So far so good.
Nathan Labenz: 1:26:21 I think one of the most sort of unfortunate and problematic dynamics developing in the world today is the US China rivalry, which seems like it's constantly escalating and it seems like nobody can do anything about it. And that would be bad enough unto itself. But now it seems like it's kind of feeding back into the AI discourse and almost every other conversation I have about big picture AI questions. Are we rushing into this? Do we have a good plan to keep ourselves safe as we build these more and more powerful systems? So often, the conversation kind of ends in, Well, but China will do it if we don't. And so, there's just kind of this low trust, obviously, and kind of, almost fatalism around, like, what other choice do we have? So for one thing, I always love to just highlight any positive connection or collaboration that crosses the US China divide, and you grew up in China, going to university there, then coming to the United States, working here. In some sense, you embody that. What is it like for you right now to be at kind of the center of both of these super hot topics? AI is obviously super hot. US, China is super hot. The center of that Venn diagram is maybe the most focal thing in the world. And here you are publishing research right from the kind of center of all of that. What is that like? And do you have any feelings about that?
Lili Yu: 1:28:03 It's actually, to me, it's not between the US or China. Every country can have a different policy about how to develop AI models, and that could dramatically impact things. One or two days ago, Israel, as well as Japan, had laws that AI models could train on any data. So that's not even in China, right? It's in Israel, and Israel nowadays has so many startups. I can imagine a scenario where not everybody, but some startups, are going to become Israeli companies so that they can train with any data, or so they have more freedom to develop their AI models. So, again, I think this indeed is a big concern. In the end, the easier way is for everybody, of course it's hard, for everybody to be on a similar page, so they follow the same rules for developing safe AI. But I would say it definitely goes beyond the US and China, it's worldwide. Every country could be the one that allows training the wildest AI things there.
Nathan Labenz: 1:29:15 I just hope that clearer heads can prevail. It seems to me that any, again, any flicker of positive relations between US and China should be celebrated and elevated and made more visible. And certainly, I've always been a big believer that to the degree that the United States can attract people from around the world and including from China, it's a great thing to bring people to the same research environments and have this kind of cross cultural collaboration. So I don't expect you to have all the answers on that by any means. You've got your hands full, obviously, just doing the research itself. But, boy, I hope we can continue to have the degree or, hopefully, even more of the sort of academic and intellectual collaboration that has existed and continues to exist, but seems like it's under some threat. And I just really hope that that can be built upon and not something we turn our backs on.
Lili Yu: 1:30:27 I think, personally, I haven't felt this threat. I think overall, what I have felt is quite a friendly environment.
Nathan Labenz: 1:30:37 I think this is great. Phenomenal conversation. I really learned a lot from it. I'm glad to hear that you have not felt any of that sort of thing personally. I don't know to what degree people are feeling it personally, but if you just look at kind of macro numbers, the number of grad students coming is down, the number of Americans living in China is way down. And just in general, there's this kind of decoupling phenomenon, and especially as we head into this AI future, I would like to see more coupling and not less. So that's just kind of my overall point of view. But I'm actually very heartened to hear that you have not found this to be a practical issue in your daily life. So that's good news as I see it. Well, we'll leave it there then for now. Again, phenomenal work on the MegaByte architecture. I'm looking forward to the next paper and seeing whether you've just changed the AI game across the board or whether there are any limitations or surprises that get unveiled. I'll certainly be keeping a close eye out and reading your next publications with all that in mind. For now, Lili Yu, thank you for being part of the Cognitive Revolution.
Lili Yu: 1:31:52 Thanks. Thanks for this nice conversation. It's a pleasure.
Ads: 1:31:56 Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use Cog Rev to get a 10% discount.