Your Model, Your Weights with MosaicML's Abhi Venigalla and Jonathan Frankle
Nathan discusses MosaicML's custom LLMs, customer journeys, and the new MPT-7B-StoryWriter-65k+ model with scientists Jonathan Frankle and Abhi Venigalla.
Watch Episode Here
Video Description
In this episode, Nathan sits down with Jonathan Frankle, Chief Scientist, and Abhi Venigalla, Research Scientist of MosaicML. They chat about Mosaic’s custom LLMs, the customers seeking Mosaic out and what their journeys and use cases look like, and exciting developments in Mosaic’s research: including their new inference platform, as well as Mosaic’s MPT-7B-StoryWriter-65k+ model.
The Cognitive Revolution is a part of the Turpentine podcast network. To learn more: www.turpentine.co
LINKS:
MosaicML: https://www.mosaicml.com/
MPT-7B StoryWriter Model: https://huggingface.co/mosaicml/mpt-7b-storywriter
TIMESTAMPS
(00:00) Episode Preview
(06:04) Mosaic’s business model
(07:28) Who uses Mosaic’s custom LLMs? What does their data look like?
(09:55) Mosaic’s use cases for custom LLMs
(12:47) How much extraction and summarization was done by humans pre-LLMs?
(15:28) Sponsor: Omneky
(21:50) The journeys of Mosaic’s customers and would a Wendy’s LLM know about a Big Mac?
(25:46) The curriculum model and fine-tuning
(29:10) Language models in the life sciences
(33:20) How raw can data be before it becomes a problem?
(35:44) Using the output of bulk pre-training process vs additional after training
(38:30) Redteaming as a service
(39:40) Mosaic’s inference platform
(41:53) Spending one cent on 20,000 tokens, how is that cent distributed?
(46:00) Selling compute on a dedicated capacity basis
(47:30) Oracle and AWS
(49:50) The storywriter model and 65,000 token window
(54:35) The transition from finite parameters into infinite attention matrix
TWITTER:
@jefrankle (Jonathan)
@abhi_venigalla (Abhi)
@MosaicML (Mosaic)
@CogRev_Podcast
@labenz (Nathan)
@eriktorenberg (Erik)
SPONSOR:
Thank you Omneky for sponsoring The Cognitive Revolution. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
Music Credit: MusicLM
Full Transcript
Jonathan Frankle: 0:00 Who's training these models? Everybody. Really, the question you should ask is who has interesting proprietary data? Everybody. I mean, as Abhi mentioned, there's a model size for everybody. There's a good entry point for everybody. At the end of the day, it's everything from small startups for whom the model is their main product. Companies like Replit that are very tech forward and AI first and recognize the power of this, to the kinds of companies that if I mentioned them, you'd say, wow, that's a really big, boring company. They're doing AI? The answer is, yeah. They have amazing data. This is how you activate it.
Abhi Venigalla: 0:31 In the old days, you used to say, data is your moat. And then in the past year or so, there's been this new kind of wave of, well, actually, training the model is so hard, maybe that's the moat. And so what Mosaic's doing is, we're making that easy again. So it's almost making ML training boring again.
Nathan Labenz: 0:47 Hello, and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week, we'll explore their revolutionary ideas, and together, we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my cohost, Erik Torenberg.
Nathan Labenz: 0:47 Hello, and welcome back to the Cognitive Revolution. Today, we're talking to Jonathan Frankle and Abhi Venigalla, chief scientist and research scientist at MosaicML. In today's world of AI hype, it seems almost any project has at least some chance of going viral on Twitter. A few of those will prove deserving and enduring, but most will quickly fade away. To make headline news repeatedly, however, as Mosaic has done over the last year, is something that only truly top notch organizations can do. Mosaic specializes in creating custom, proprietary language models for corporate clients. They were the first to offer GPT-3 quality models for $500,000 in September 2022, and they were the first to train Stable Diffusion from scratch for under $160,000 in January. They then turned around and did it again, announcing Stable Diffusion for under $50,000 in April. One funny note from the conversation. At one point, you'll hear Jonathan give me a bit of a hard time for quoting the $160,000 number instead of the latest $50,000 number. I went back and checked, and it turns out that they announced that additional price drop in the few days between when I prepped for and when we recorded this episode. Just goes to show how fast things are moving. Most recently, Mosaic has released their own open source models as well as an inference service that allows you to use their servers to power your applications. According to lmsys.org, a leaderboard that collects human evaluations on blind head to head language model comparisons, their MPT-7B-Chat model is currently the number 9 rated chat model in the world. Now that's already an impressive accomplishment, but what's even more impressive? When you remove the OpenAI, Anthropic, Google, and LLaMA-derived models from the list, it turns out that MPT-7B-Chat is actually the number 1 rated open source model that is available for easy fine tuning and commercial use. Additionally, the release of the MPT-7B-StoryWriter-65k+ model, which allows actually even more than 65,000 tokens of context, legitimately shocked much of the AI world and set a new standard for long context models, which are already quickly becoming the norm. We'll talk about the ALiBi technique that they used to achieve this. We only had an hour for this call. As you'll hear Jonathan say, demand for MosaicML services is through the roof. They are reaching the point where they're making tough choices between serving additional customers and conducting additional research. Obviously, that doesn't leave a ton of time for podcasting, but I still think this episode is a great window into who is training their own language models, why they're going that route, what they're using them for, and the techniques that are powering this trend. Before jumping into the episode, a quick thank you to everyone who has shared the Cognitive Revolution with friends or posted a review online. We're now up to roughly 25,000 unique monthly listeners, and I am having a ton of fun sharing all of these conversations and everything I'm learning from them with all of you. Now I hope you enjoy this conversation with Jonathan Frankle and Abhi Venigalla of MosaicML. Abhi Venigalla, Jonathan Frankle, welcome to the Cognitive Revolution.
Abhi Venigalla: 4:28 Hey, it's great to be here.
Nathan Labenz: 4:29 Really excited to have you both. As you know, I've been a close watcher of and big fan of Mosaic for a little while now, and you guys have built an awesome platform and made a bunch of news recently with a number of product and model releases in the LLM space. So I've got a ton of questions and thought we could maybe structure things by starting first with what I believe to be the foundational layer of the business, which is the custom large language model training. Then I want to get into the new inference side of the business as well. Definitely want to make sure we get to the 65k model that you guys recently released, because I think that's super interesting and even want to dig in a little bit to how that was done and some of the new techniques that you have, I'm sure, not only applied, but refined in practice. And then if we have time, we could even get into some bigger picture stuff. How's all that sound?
Abhi Venigalla: 5:24 It sounds great.
Nathan Labenz: 5:24 Cool. So I guess when I think of Mosaic, what I have understood you guys to be up until recently is the go to place for presumably larger businesses, although you've done impressive work to bring the entry point sticker price down, but presumably mostly larger businesses that want to create their own custom large language models, I think usually from scratch. So the first thing I wanted to just sanity check myself on is, do I have that right? And then we could dig in a little deeper to who those companies are.
Abhi Venigalla: 5:56 Yeah, no, totally. I think you got it right. And I think we've been expanding from that initial position recently, in the past month or so. But yes, we started off basically trying to help people build their own custom models, whether it's large language models or diffusion models, or originally even BERT models and the rest of the old school models back in the day. We really want to help people who have valuable proprietary data turn it into valuable models that they own, rather than necessarily leveraging APIs and such. One way we've expanded from that is to try and make it even easier for people to build these custom models, so they can start from pre-trained checkpoints, like our recent MPT models that we released, or some of these public checkpoints where it's already a pretty good language model, and you can start on top of it and continue. And finally, the last thing we realized is that every person who trains models with us wants to deploy them at scale. So we figured, well, we should probably help them with that too. So that's also what we're trying to do. We're really trying to help people go end to end from the data to these private models, private endpoints that they own.
Nathan Labenz: 6:58 Let's unpack that a little bit more. Who in today's world is building a custom large language model and how big are these models that they're building? We could talk about that in terms of parameters or token count on the training side. You've kind of answered already, but I was curious to what degree you're seeing people use a combination of open source data sets versus just how many of these customers actually have enough of their own data where they don't even need any of the standard data sets.
Abhi Venigalla: 7:27 Yeah, for sure. I mean, I think we have a very wide spectrum of customers. We have people training sub billion parameter models, models in the single digit billions of parameters, and some even going well beyond that. You can see some of our customer stories. Replit recently trained a really powerful 3 billion parameter model that actually, in some ways, outperformed the first for code insertion, so it's kind of punching above its weight class. We really focus a lot on efficiency in this way. We want people to be able to produce things that are smaller, cheaper for inference, cheaper for training, that still match the quality that they need. Now in terms of data, this is the thing I have really focused on nowadays, which is that sometimes customers come with a lot of pre-training data. They may have, across all of their customer interactions or logs and stuff, tens of billions of tokens of data, but they want the model to also start with a general knowledge base. So there's a good amount of research that goes into figuring out what data mixes we want. And I actually spent a lot of time creating the data for the MPT models that we released. But, yeah, I would say I would expect customers to have a mix of both public and private data.
Jonathan Frankle: 8:37 I'll throw in one more thing, which is to put a really fine point on answering your first question. Who's training these models? Everybody. Really, the question you should ask is who has interesting proprietary data? Everybody. As Abhi mentioned, there's a model size for everybody. There's a good entry point for everybody. At the end of the day, it's everything from small startups for whom the model is their main product. Companies like Replit that are very tech forward and AI first and recognize the power of this, to the kinds of companies that if I mentioned them, you'd say, wow, that's a really big, boring company. They're doing AI? The answer is, yeah. They have amazing data. This is how you activate it. So it really is everyone and everybody trying to train these models.
Abhi Venigalla: 9:18 Actually, you remind me of something that I thought a lot, which is that in the old days, you used to say data is your moat. Then in the past year or so, there's been this new kind of wave for, well, actually training the model is so hard, maybe that's the moat. So what Mosaic's saying is we're making that easy again. So it's almost making ML training boring again. We're bringing it back to a position where actually your proprietary data is what makes your model so much better.
Nathan Labenz: 9:43 Can you expand a little bit on just use cases? I mean, obviously you've got some names on your website and then I'm sure lots more customer names that maybe you can't disclose. But if you can kind of abstract away from the identifiable details of some of these customers, I think people are really curious about use cases. Are we building chat type agents to help enterprises interact with their customers? Are we doing task automation back office? What's that kind of breadth and mix look like?
Jonathan Frankle: 10:18 The boring answer is it's a little bit of everything, but I think the main thing we see, especially with big enterprises, I can really sum it up as two tasks: extraction and summarization. Those seem to be the core workhorse tasks that people want to get done. You've got a huge amount of information. You may get there because a new court case came out and it's 100-something pages long and you want to understand what it's about right away. You may get there because you're using something like LangChain and a vector database, and you've pulled up a bunch of really relevant documents and you have a lot of information that is relevant to some question. But at the end of the day, you really want to extract out the relevant piece of information or the relevant passage, or you want to get a summary and get useful information out of it. I think for a lot of our enterprise customers, it's as simple as that. For a lot of other customers, they are looking at some specific application. We certainly have customers who care a lot about chat. I wouldn't say it's anywhere near the majority. We certainly have some folks who either care about chat as their main application or care about chat as a good user interface on top of one of these systems. And then of course, we have customers like Replit that are doing something that doesn't really fall cleanly into any of these categories and is genuinely novel. So really, the models are as multipurpose as any other language model, but at the end of the day, those are the boring useful things that honestly these models are best at. And they're both things where the model doesn't have to be perfectly right, just heading in the right direction. Trying to get useful information out of some data you already have is the most important part. And you're not about to make an important medical decision on the basis of a summary that your model gives you, for example. There's still a human decision maker in the loop to make sure that the information is acted upon in an appropriate way.
Nathan Labenz: 11:57 Yeah, I think that's great grounding. There's so much stuff going on in the corporate world that is ultimately not necessarily super flashy, but there's just a ton of value to unlock because of how much of it there is and how much more cheaply you could do it, or potentially also how much more you can scale it versus what you could do in the past. I wonder if you could comment on that. I see a lot of, in the context of task automation, sometimes it's, I have this task, I do it today in a human powered way, and that can be slower and more expensive than I'd like, so maybe I can take time and cost out. But then the other thing that people often quickly turn to once they start to wrap their head around this is, maybe I can scale previously unscalable processes. So if it is this kind of extraction and summarization are the main things, how much of that extraction and summarization do you think in the pre language model era was done by humans versus just not done at all because nobody could get to it?
Abhi Venigalla: 13:02 Yeah, I would say I'm much more excited about the latter. And that's also where I think customers are probably excited too. One thing I think about a lot is actually in the context of some of our work, our open source repos. Lots of times when people use it, they have questions. They're like, oh, how do I actually use the script? What should I be doing? What's the workflow? And right now, I think the best that humans can do really is to write documentation, or write FAQs and stuff like that. But there's no scalable way to say, have someone next to you as a support person helping you through it. But with these models and stuff, we could actually build some system. It seems we can actually build higher and higher quality interactions with people. Previously, it just could not be done because of the cost and scalability of having one person for every customer. So I think that's where I would be most excited. Not necessarily replacing things that are done today, but enabling new things that can't be done today.
Nathan Labenz: 13:56 Yeah. I certainly see that in some of my task automation work. It quickly becomes, geez, we could do 100x of this. And some of these things, almost every business has certain use cases. It's like scaling outbound recruiting outreach. How are we going to get to all the candidates we'd like to get to? Most companies just don't have the resources to do as much of that as they would like. And we're starting to see that kind of thing turn out. It's going to interestingly lead to some probably different dynamics in terms of how that communication actually works and what it takes to break through and be effective. But we're in this interesting moment where the early adopters are getting these kind of early benefits, and then presumably there's going to be a new equilibrium on some of these things. Digging a little bit deeper still into why does somebody come to Mosaic? Because I can imagine, okay, I'm an enterprise. I've got these, probably at the end of the day, fairly mundane information processing objectives. I could go to an OpenAI, and they also make it pretty easy to do fine tuning with just a supervised input output paradigm of fine tuning. What is the main reason that folks say, I don't want to go with them, I instead want to work with Mosaic? Is it control of data, cost? I mean, you're going to say all of the above again. But give me a little bit more than just the all of the above, because I kind of know what the suspects are, but I want to know what the mix is and what you're seeing. Hey, we'll continue our interview in a moment after a word from our sponsors.
Abhi Venigalla: 15:29 a word from our sponsors. Yeah. Totally. I think I'll focus on kind of two customer profiles in the city. One is the type of customer, often startups, where they want to build a new product or experience that truly cannot be done any other way than to produce a custom model. They either have to train a really custom domain, like a coding model, or they need a model that handles multiple languages, or basically things to where it's like, they've tried the APIs. It's not quite good enough or it's not cost effective enough. And they really need to have their own custom model to do it. And so for them, it just makes sense. Well, if I need to build an element house, I need to get the compute, I need to get the engineers, same old researchers. It's a really expensive and honestly today, pretty hard to find group of people to build something out there. Or kind of like the other option, that you can just go to Mosaic and you build them and with a very small team, you can actually build these custom models for yourself. I would say that's one type. Another type are people who actually are our users of the things API providers, OpenAI, Cohere, others, where maybe they build the first version of the product on top of these APIs and it's going great. They've scaled users, they're getting lots of revenue, that's fantastic. And now it's sort of like, okay, now it's an optimization question. Can I potentially build smaller models or custom models myself and deploy them for cheaper? And that's part of what we're targeting on this inference. We're basically trying to very clearly separate the cost of training the model and the cost of deploying the model. And there's only small thing. It's actually very, very close to the actual inference GPU cost. Whereas opposed with OpenAI system, you're paying for token thing and that per token cost actually goes a lot when you fine tune. So I think for instance, some of the fine tune models think cost 6x as much when you go from the base model to fine tune. With us, there's no such thing like that. It's sort of we're very explicit about, here's the compute cost and cost power. It doesn't matter what weights you're putting on, be the base models or quantum models and so on. That's where I think we're trying to be a bit more transparent about all of this and passing the savings to customers and give them a lot
Jonathan Frankle: 17:39 more flexibility than you'd
Abhi Venigalla: 17:40 get potentially with these findings in The U.S. Do you want say, John? Yeah. I'll
Jonathan Frankle: 17:45 Yeah, I'll throw in that from the big enterprise perspective, it's really three things. It is customization, control, and cost, and I love that they all start with C. That was convenient. On the customization side, a lot of these enterprises have huge amounts of data that you can't put into a few input output pairs. We've had plenty of customers come to us and say, I have several hundred billion tokens, or I have over a trillion tokens worth of data. There's no other way to get that data into your model beyond doing a lot of pre-training, potentially from scratch. A lot of customers come to us and say, we want to customize the pre-training data for the model fundamentally. They look at open source models and they say, well, I don't like that dataset. I want a little more code. I want a little more of this. We basically have a menu at this point of what's available open source and how they want to put it together. With any of the big APIs, you have no idea, and that's part of the secret sauce. You have to hope the thing you're worried about just isn't in that model. On the control side, we can do it within their cloud VPC such that the data never leaves. We can do it on prem if they want to. And they own the model weights when they're done. Not just an API to the model where they rent their own model forever, but they actually own the model. If they want to take it and serve it another way or open source it, we're fine with that. That's up to the customer, not to us. Abhi, I think, has comprehensively mentioned the cost side. It's just cheaper to train a domain specific three billion parameter model than it is even to try to fine tune, in many cases, a gigantic, I guess, 1.6 trillion parameter model from what I've heard, especially if you're going to do it on more than a few input output pairs. So that difference really matters.
Nathan Labenz: 19:29 I've done a lot of fine tuning on the OpenAI platform in particular, and I've had some things where it's worked really well for me. And then I've had other things where it hasn't worked so well. Where it's worked well has been, I have a very defined task and I'm kind of dialing it in. I have a certain format requirement. I have certain length output requirements. And I typically find it doesn't take that many examples to get there. But then I've also tried training it to write in my voice. And that hasn't worked so well, because I only have so many tokens for one thing, and it sort of learns who I am, but it's still madly hallucinating, at least with the sort of dataset that I'm able to muster. It sounds surprising to me on some level that there are this many established companies that are already this far in their large language model journeys. And everything's kind of happening all at once. We've got reports of people banning ChatGPT at work. And then we've got reports of other people saying, if you don't learn to use it by July 1, you're fired. I just saw a story like that the other day. But I am surprised that people are seemingly moving on to the next generation already. So I'm curious about that. And I'm also curious about the nature of the models that they create. Are they creating models that have a very particular point of view on the universe? We saw, and I don't think this is your customer, but Wendy's has FreshAI. And I'm wondering, do those AIs know about a Big Mac? Or do they have no conception of what exists outside of their corporate knowledge base universe? There's a lot there, but I'd love to hear a little bit more about the color commentary of the sorts of journeys that people are on and the nature of the models that they're creating.
Abhi Venigalla: 21:35 Yes. So I think I'll just start with the first question, the number of companies that are building LLMs today. I think we were also pleasantly surprised by the number of companies that we found since, let's say, last winter or so, when we really started doing this LLM effort. In the beginning, I think it was mostly startups, but especially since the ChatGPT explosion, basically once the world realized this is possible, we've seen many, many more enterprises want that kind of utility internally. One that they can trust, one where they can actually control the data that's coming in and the recency, one which their internal employees can use safely, as opposed to some of these reports of confidential data leaking into the ChatGPT system. So yeah, I think even if they're not super far on the journey yet, they know right away that they're going to have to develop some kind of proprietary in house system, not necessarily just rely on APIs. Yeah. So I can't speak too much to what exactly our customers are doing, but I will say it's not too diverse in that way yet. It's not like people are ripping out parts of Wikipedia and trying to control things like that. I think right now, a lot of it's just sort of, how do we incorporate our proprietary data in addition to general world knowledge? A lot of it is making sure there's no harmfulness, no toxic behavior, that kind of stuff too. Some of it is even just, how do I incorporate new things like long context length or long form documents, which you couldn't do until very recently on the APIs. Of course, Anthropic now has this 100k token length support as well. We're really excited about that and the possibilities. But I think most things have been additive. I'm not sure I've seen too much of the pruning away that you may be talking about. Maybe do you want to comment on that, Jonathan?
Jonathan Frankle: 23:27 Yeah. I think what we tend to see is that all of our customers want a mix of some open source data and their internal data. If you were to just train it on your internal knowledge base, that model's going to have a pretty limited understanding of the world. The knowledge base may not even be in very clear prose in a lot of cases. So a lot of customers are mixing together open source datasets they think are a good fit with their own internal data. I do also want to, popping up one level, I think you're framing this as, do you use GPT-4 or do you train your own model? And I think that's a false dichotomy. What we're seeing for a lot of our customers is that the answer is yes. There are certain use cases where GPT-4 is both great and allowable. There are a lot of use cases where GPT-4 simply can't be used because there's sensitive data. There are a lot of cases where it's just not effective because it needs to know about internal business processes that it simply hasn't been trained on because that's internal documentation. So what we see with a lot of our customers is, are they using GPT-4? Yeah. Are they using us? Yeah. These are different tools for different use cases. But I don't think we're trying to say replace GPT-4 with MosaicML. We're trying to say there are a lot of different ways to solve these problems and a lot of different tools to do it.
Nathan Labenz: 24:39 Yeah, I think that's a good call out. And it's almost one of my mantras these days too: in AI right now, it's never either or. It's always kind of both. And that extends to societal impact questions as well as which models and tools people are going to use. So it totally makes sense to me that it's not all one way or the other. And it makes sense what you're saying there with the mix of data. It makes sense then why you have these checkpoints that are like a base to build on. No need to rerun that Wikipedia pre-training time and time again if you can kind of embody it and have the ability to branch off from there. When you do that fine tuning, are you continuing to mix in... It seems like we're kind of moving towards this curriculum model where there's this general knowledge base that is the checkpoint and you're layering on the proprietary data. Are you still mixing in the open source data, or do you just kind of pre-train on their data and that works out okay? I don't really have a sense for whether you can make that kind of shift from dataset A to B and it all works out okay, or if you have to continue to bring some of the base forward into the continued pre-training.
Jonathan Frankle: 26:01 Yeah. I mean, I think the first thing I would mention here, and I'm going to be really picky right now, I hate the word fine tuning because when you show up and you say, I want to fine tune on 200 billion tokens, there's nothing fine about that. It's as hardcore as pre-training, if not more so. What we tend to see is a mix of things. If they're going to build on one of our checkpoints, you do continue to mix in open source data and create the right mix. It may not be the exact same open source data, but often if a customer doesn't have the right representation of data anyway, simply training on a specific subset of the data they want the model to eventually know will lead to catastrophic forgetting. And so you do want to make sure that you've got a good mixture of both some more generic data and some more domain specific data. This is in some sense a question as old as time when it comes to transfer learning and fine tuning. There are papers from the BERT era, which was so long ago, three years ago, that were all about this question of how do you fine tune properly. There's a paper called Don't Stop Pre-Training, which should tell you the whole story there. So there are a lot of different approaches to this, especially when you get into instruction fine tuning and RLHF. You are doing a different task. So you do want to change the nature of your data, and you probably don't want to mix, for example, instruction following and just normal continuation of natural language. But I think the bottom line is all this is science yet to be done. We're experimenting, our customers are experimenting, we're doing things that seem intuitive, we're trying to test everything rigorously, and we're all learning every day. A lot of the best questions haven't even been answered in a really rigorous way yet.
Nathan Labenz: 27:36 Yeah, so basically no sharp line between continued pre-training and fine tuning in many instances anyway. Starting to see too, some of these what had been understood or at least introduced as kind of late stage training techniques are also being back ported into the pre-training, right? We're seeing, just from the last week, I think, pre-training with few shot structure as kind of an earlier thing. I could probably cite several examples, but I won't waste your time. But yeah, it's interesting that both are kind of blending together and the techniques seem to be kind of going both directions at the same time as well.
Abhi Venigalla: 28:18 Yeah. I think especially once we start getting to this situation where we just kind of few shot the models, I mean, few shot prompt the models, even the loss functions that we're using for fine tuning and pre-training are exactly the same. We're effectively just giving sequences and saying, please match the sequence. Please auto complete it. Right? And so previously, I think some of the distinctions came from, oh, if I'm doing fine tuning, I'm going to add a new head, I'm going to add a new classification head or an LM head or something. Now all of that's kind of out the window and we're just continuing to train the original weights with new formats effectively.
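To make Abhi's point concrete, here is a minimal sketch showing that the objective is plain next-token prediction in both cases, with only the text changing between a "pre-training style" and an "instruction style" example. It uses the Hugging Face transformers API purely for illustration (it is not MosaicML's training stack), and the checkpoint name and example strings are placeholders.

```python
# Minimal sketch: the same causal-LM (next-token prediction) loss applies whether the
# sequence is raw pre-training text or an instruction-formatted example.
# Hugging Face transformers used for illustration only; not MosaicML's stack.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint behaves the same way here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

pretraining_text = "The filing runs to 137 pages and raises three separate claims..."
instruction_text = "### Instruction: Summarize the filing.\n### Response: It raises three claims..."

for text in (pretraining_text, instruction_text):
    batch = tokenizer(text, return_tensors="pt")
    # labels == input_ids: the model is simply asked to predict each next token.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()  # identical loss and update machinery in both regimes
```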
Nathan Labenz: 28:52 Yeah. The great convergence. It is fascinating that obviously the architectures have kind of unified, but also the loss functions are converging to the same thing. Man, that is crazy. So what I wanted to ask about specifically is life sciences, because one of my personal goals is to have no major blind spots in my understanding of the AI landscape. And I can't quite ever achieve that, I don't think, because it's moving too quickly. But I realized in looking at some of the stuff on your website that that is kind of a gap in my knowledge right now. Maybe it's not even language models that you're training for them, but I had kind of understood it to be, I think, language models in the context of the life sciences. What does that look like?
Abhi Venigalla: 29:42 Yeah. So I think we've done a bit of work on biomedical language models in the past. We worked with a team from Stanford, the CRFM team, to build, I think, BioMedLM. And that was just late last year, where basically we wanted to have a model trained on just custom domain data, which was, I think, PubMed papers, basically. So that you can actually ask questions about medical diagnosis and things like that and have it actually answer well. And you're seeing a very similar trend happening at Google right now, I think, with their Med-PaLM models, Med-PaLM and Med-PaLM 2, where they're actually able to put them head to head with physicians and actually see, oh, this is an augmenting tool that helps you diagnose patients, helps you respond to them and stuff like that. So that's one part where it isn't just general language models, which may not cover biomedical language. There's another direction which we haven't gone too much into yet, I understand, which is genomics or these more chemistry applications, where protein synthesis and amino acid chains are also just sequences. Especially with some of the new support we have for long contexts, can we help people in those domains too? Basically tuning transformers for protein sequences or something like that. I think that's a really exciting question for the future.
Nathan Labenz: 31:05 Cool. Any other customers or perhaps surprising language model use cases that you would highlight?
Jonathan Frankle: 31:13 I don't know. I personally still find every use case of language models a little bit surprising in some sense. Fair. It's still new and so everything is new and interesting. Honestly, our customers are really creative. I think there's a lot of really cool multilingual work that we're seeing happening where existing models are okay at doing multilingual stuff. But if you really want a model that's focused on a particular bilingual scenario, that seems to be incredibly popular right now. Languages that I would not have chosen if I were trying to build, say, a five language or ten language model. So I found that to be really cool and really exciting so far. I hope our customers will talk about that at some point. I don't want to talk about it out of place for them, but we've seen some really awesome applications in that area and it has me excited. It's not something I thought that much about. It puts a lot of pressure on the tokenizer, which is kind of interesting. Again, something that I find is a very fraught subject. But it's cool to see all that activity and to see that we've reduced the barriers to entry such that these applications are now within reach. You can build an LLM for an interesting language pair that certainly nobody would have picked right off the bat.
Nathan Labenz: 32:22 Yeah. That tokenizer, there's been interesting research this week that has been billed as maybe the beginning of the end of the tokenizer. We'll see if that pans out. But I've seen also, one character from a Hindi script, for example, might in fact be eight tokens because of the way it's all broken down under the hood. And that is definitely something for listeners: if you want to go down that rabbit hole, check out how certain Indian language alphabets get tokenized. It's quite gnarly. I want to move on to the inference business, which you guys just introduced, in a second, but first just a little bit more calibration on this data. People have these hundreds of billions, a trillion tokens. It sounds like it's often kind of raw. How raw can it be before it becomes a problem? And then how tightly do the resulting language models stay to that knowledge base, versus do they start hallucinating plausible products that these businesses might have? I imagine that same problem must exist, right?
Jonathan Frankle: 33:35 Yeah. I mean, I think to a first approximation, your model is what you put into it. If you train a language model on a web dataset that has a bunch of advertisements in it, your model will learn to start spitting out advertisements in its sentences. That sort of behavior is definitely in these models. And so the data needs to be reasonably clean. You can't just shove everything in there. Not all tokens are created equal, but to a first approximation, all tokens are created equal. You have to do a lot of work to get a high enough quality dataset that you start to see this matter. This is why we place such a high premium on things like Wikipedia, things like code, that tend to be relatively well curated, all things considered, from the beginning, and so just punch above their weight token for token against other datasets. But if you're going to pull data from the web, let alone data from your internal use cases, doing good work on that data to clean it up can be really, really important. And I imagine for models like GPT-4, that was the lion's share of the work in some sense. Once you have a good system down for training the model, data is a never ending problem.
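As a side note on the data-cleaning point above, the work involved is often mundane filtering and deduplication. The sketch below is purely illustrative: the thresholds are arbitrary placeholders, and this is an example of the shape such a pass can take, not MosaicML's actual pipeline.

```python
# Illustrative only: a lightweight cleaning pass over raw text before pre-training.
# Thresholds are arbitrary placeholders, not MosaicML's pipeline.
import hashlib
import re

def clean_corpus(docs: list[str]) -> list[str]:
    seen_hashes: set[str] = set()
    kept = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()    # normalize whitespace
        if len(text) < 200:                        # drop tiny fragments and boilerplate
            continue
        letter_ratio = sum(c.isalpha() for c in text) / len(text)
        if letter_ratio < 0.6:                     # drop markup debris, symbol tables, etc.
            continue
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen_hashes:                  # exact-duplicate removal
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```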
Abhi Venigalla: 34:44 To address the other question, how do you make sure that a company's products are represented faithfully and they don't lose things? I think this is where long context can really help. The more you can shove into that initial prompt to the model, the more it's going to attend to that rather than whatever is in its memory. And that way you can ensure, hey, here's a list of actual June 2023 products. Please refer only to these. I think that's a lot more reliable than potentially retraining every so often, stuff like that too. So our research team is investigating lots of different ways to incorporate both long context and, potentially in the future, retrieval as well, to make it so that enterprises can actually trust these models.
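The pattern Abhi describes, passing the current, authoritative list in the prompt and telling the model to answer only from it, looks roughly like the sketch below. The product names, wording, and helper function are invented for illustration; the same idea applies to whatever completion endpoint and catalog you actually use.

```python
# Illustrative sketch of grounding answers in an up-to-date list supplied via (long) context.
def build_grounded_prompt(question: str, current_products: list[str]) -> str:
    catalog = "\n".join(f"- {p}" for p in current_products)
    return (
        "You are a support assistant. Answer ONLY from the product list below, "
        "current as of June 2023. If the answer is not in the list, say you don't know.\n\n"
        f"Product list:\n{catalog}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "Do you sell a spicy chicken sandwich?",
    ["Classic Burger", "Spicy Chicken Sandwich", "Garden Salad"],  # toy catalog
)
# Send `prompt` to your model's completion endpoint; with a long-context model the
# catalog can run to thousands of items instead of this toy list.
```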
Nathan Labenz: 35:26 How often are the customers able to kind of use something that comes out of the bulk pre-training process versus layering on some sort of finishing instruction tuning or RLHF?
Abhi Venigalla: 35:42 I think for most enterprise applications, you have to go through some kind of this post-training or polishing. I think especially, you can even tell from API providers. They usually have a version of this kind of either instruction tuning or RLHF or something like that so that you can actually speak to the model naturally rather than prompt it very specifically to do what you want. I think we offer, especially on our Models as a Service, instruction tuned models mostly. If you build your own, obviously you can start with a base and do however you wish. But my gut feeling is that instruction tuned models are a little bit more usable.
Nathan Labenz: 36:20 Yeah. Makes sense. I was kind of, from that, I'm kind of inferring also what the pattern of use is where, again, in some of the very highly tailored fine tuning, which is genuine, just few example fine tuning that I've done, it doesn't necessarily need to be instruction tuned because it's in such a controlled environment. There's a high degree of kind of developer architecture that surrounds what the inputs are even going to be. But it sounds like there's also a pretty good mix of companies just kind of creating these things and then making them available to their internal users to say, now you can use this whenever you want. Is that a fair conclusion?
Abhi Venigalla: 37:02 Yeah. No. I mean, I think in terms of evaluation stuff, the best place to put data is in front of people, internal ones, and just see how they use it. I think as you're saying, right, especially if you're fine tuning on a lot of data, you could have an input output format. Maybe it's good enough to use the base versus the instruction tuned model. I don't think there's any capabilities gap necessarily with instruction tuning. It's really just this kind of massaging of the input output so that you can talk to it as you would to your friend or something like that. But yeah. So a common path might be you start from a base model, right, either one of our pre-trained ones or one that you built yourself. You deploy it once, get feedback. Then we have a nice little playground where you can actually talk to it and kind of investigate how it's doing. Based on how that performs, you share that out with your employees. Then you build a fine tuned model and then the next one, the next one. And it's really an iterative process to get to the model to where you actually want it. We're putting a lot of work into evaluation for the next few months. We started off with kind of academic benchmarks, and we have some blog posts as well on how to do really fast evaluation of these kind of in-context learning tasks. But we have to graduate beyond that. These models are getting so good that effectively, it's hard to even judge their quality from these simple scale tasks. So interactive evaluation, automated evaluation, all this stuff.
Nathan Labenz: 38:24 If I understand correctly, that's kind of frontier right now. Most clients are not, for example, red teaming models yet, it sounds like, but you see that as the kind of thing that more will start to do in the not too distant future.
Jonathan Frankle: 38:39 I see this as something that we're going to provide as a service to our customers. At the end of the day, we are the one stop shop for making sure that you get a good model all the way from data to training to inference. A big part of that is making sure that the model you put out there is something you're proud of. So we invest a lot of resources when we work with customers to make sure they're evaluating their models carefully. Red teaming is a part of that. And so, I can't make any promises about a MosaicML evaluation product. I'm not going to announce that today, but you should keep your eyes on it.
Abhi Venigalla: 39:10 I know, I guess, that's more detail.
Nathan Labenz: 39:12 I'm always very interested in the kind of red teaming risk mitigation side of that in particular. But obviously, confirming that it does the happy path effectively is also super important. Let's talk about the inference business. You guys have layered on an inference platform to the business. And it's interesting because you both have open sourced these foundation models. People can go do whatever they want to do with them. Obviously, you don't see that as a threat to your ability to build a good inference business. So I'd kind of love to maybe just start with the pitch for the practical kind of way that people understand, even though this model's right here and I could go do it, what is the calculus that ultimately leads them to choose you? And then I'd love to dig into the pricing a little bit more and kind of understand the cost structure to the degree that we can do that as well.
Abhi Venigalla: 40:06 Yeah, absolutely. So I think, kind of starting at the basics, you have some of these open source models on the Hugging Face hub and stuff. Why not just try and spin up a GPU yourself and serve it? And I think there's a lot of learning that happens once you actually start to try and do this and serve it on an API to many customers. You realize that with the libraries that you have today, it's easy to serve one customer, but very hard to serve many. You need batching. You need things like lower precision weights. You need things like multiprocessing and auto scaling and stuff like that. And tuning all these things is quite difficult, especially as you have these very large models. So the challenge, I would say, is that when it comes time that you actually want to serve, say, a 30 billion parameter language model, you find there are very few existing solutions out there that can actually achieve the type of performance or latency that you expect from OpenAI and services like that. So that's one thing. We want to simplify all that infrastructure the same way we simplify it for training. And in terms of the cost side of it, I'm willing to take your questions so I can answer.
Nathan Labenz: 41:12 Yeah. So I study this somewhat obsessively. So just looking at the pricing page, I noticed that it is five hundredths of a cent, $0.0005, per 1,000 tokens for the new MPT-7B-Instruct model, which I gather is kind of the new mainline workhorse base model. And to put that pricing a different way, at 20,000 tokens, or 40 pages, you've now spent 1 cent. So you've got quite a bit of room to run there before you're spending much money. Where does that money go? I don't know if this is something that you can even characterize this way. On some level, it could be, but it may be so many layers of abstraction. But do we have a sense for, if I spend 1 cent on 20,000 tokens, how much of that 1 cent went to electricity? How much of it went to buying GPU cards? How much of it went to, I don't know if you guys use AWS or multiple clouds or even run your own data center. So that'd be interesting. But how much of it goes to data center management?
Abhi Venigalla: 42:28 Well, without going into too many details of the exact numbers, I can help build up the stack of everything that goes into that final cost there. So Mosaic, we deploy anywhere, on any compute provider. So you could bring your own GPUs if you have your own AWS account or Oracle account, or you could rent compute directly from us, and we have some preferred partners, mainly Oracle right now, where we rent a lot of these GPUs. So as you build up towards that cost per serving, what you're really trying to figure out is, to satisfy a given workload, let's say you need to satisfy 10 requests per second or something, given a particular model size, let's say MPT-7B or a comparable model, how many GPUs can satisfy that demand. And then you're basically renting those and running them full time, 24/7. That's the thing about inference, right, is that you have these servers that are basically running all the time and are always getting requests. And so it's funny, with training at least the job finishes at some point, but inference is running forever. So building up towards that 1 cent cost you're talking about, you're paying a certain charge for renting GPUs, and that amortizes both the cost of the GPU, usually over about 3 years or something, plus the electricity being used by the data center. But for most people, that whole cost gets packaged up into the rental price per hour. So maybe just putting a number out there, for an A100, it costs $2 per hour. Maybe to your data center provider, your cloud provider, it costs them only a dollar to actually run the whole thing because they've got such economies of scale. But that's the first part, the price per hour. Then on top of that, the Mosaic platform service, we have a few margins ourselves that we put on top of that. But finally, when you're an enterprise customer, you basically just pay a per hour price. So we help you figure out, for your request workload, how many GPUs you need. We help point you towards the right GPU types that you want as well. Not necessarily A100s, which are really powerful but a little bit of overkill for certain workloads. Maybe you want A10s instead. And we help break that down effectively into a price bracket. And so then finally, when it comes to your single request rate, what we're saying basically is that 1 request, if it was happening constantly over the course of a month, breaks down to only a little bit, 1 cent, versus the monthly cost, maybe a few thousand dollars.
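For readers who want to play with the arithmetic walked through above, here is a rough cost model. The $2/hour A100 figure comes from the conversation; the throughput, utilization, and GPU-count numbers are assumptions for illustration, not MosaicML's internal figures.

```python
# Rough, back-of-envelope cost model for self-served inference.
GPU_PRICE_PER_HOUR = 2.00    # A100 rental price mentioned above
TOKENS_PER_SECOND = 1500.0   # assumed aggregate throughput for a 7B-class model with batching
UTILIZATION = 0.5            # assumed fraction of each hour spent on useful work

tokens_per_hour = TOKENS_PER_SECOND * 3600 * UTILIZATION
cost_per_1k_tokens = GPU_PRICE_PER_HOUR / tokens_per_hour * 1000
print(f"~${cost_per_1k_tokens:.4f} per 1,000 tokens")   # ~$0.0007 with these assumptions

# Dedicated-capacity view: monthly bill for a workload that needs, say, 4 GPUs running 24/7.
gpus_needed = 4
monthly_cost = gpus_needed * GPU_PRICE_PER_HOUR * 24 * 30
print(f"~${monthly_cost:,.0f} per month for {gpus_needed} dedicated GPUs")  # ~$5,760
```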
Nathan Labenz: 44:52 So as it stands today, can I come and buy on sort of an API basis, like one call against that new Instruct model? So you have both a kind of pay per use or pay per token model. But it sounds like more of the business is ultimately, because people have their own custom models, I guess they can't use shared infrastructure anymore, right, when they've got custom models? You don't want to have too much load, cold start problems. So you're more often selling compute on a dedicated capacity basis, and the value add is helping people understand what dedicated capacity they need so that they can spend efficiently.
Abhi Venigalla: 45:36 Yeah, exactly. So we have two pathways with our inference service. The first is the starter series, where there's a list of supported models, basically models we build or open source models, that you can just pay per use. So you can imagine that as kind of a shared service, like other providers out there. I think that's maybe the place where you got the 1 cent per 40 pages. That's the starter MPT setting. Then once you train your own custom models and you want to deploy them, that's where you're going to transition to the enterprise service, where you're not paying per request, but you're paying, as you said, for dedicated capacity that we help you figure out. And there, it's really basically a cost per hour, and that can satisfy a certain workload.
Nathan Labenz: 46:25 And so do people buy that through you in a way where they don't even, do they know, oh, I'm deploying this on AWS, and it's kind of a transparent model, or are you reselling the underlying cloud compute?
Abhi Venigalla: 46:39 It is fully transparent. I think one of the core features of Mosaic is that we will run the workloads wherever is required based on your security constraints, based off your existing contracts that you may have. And optionally you can rent it through us, but it's just an option. So we treat ourselves just like all the other cloud providers.
Nathan Labenz: 47:00 Gotcha. Okay, cool. So if you had to estimate, going back to that original question, if I rent GPUs by the hour, let's say on AWS or on Oracle, what would you estimate their stack looks like in terms of how much of their per hour cost goes to electricity? How much of their per hour cost goes to buying the physical hardware?
Jonathan Frankle: 47:23 I mean, I think the simple answer is that's their internal numbers, and we don't know. And I don't think it's a good time for us to speculate on that. I would guess that the electricity is probably a relatively small amount compared to the hardware, but that's just a complete guess. Happy to refer you to our friends at Oracle if you want to ask them.
Nathan Labenz: 47:45 Yeah, all right. Well, we might take you up on that. This is something I definitely am trying to triangulate on, because ultimately, some of this stuff goes on the edge eventually, right? And then you have no marginal cost of the device, arguably, and it's just electricity. So I'm trying to scout that out a little bit. Not that that's going to be the end all be all, but there's some part of the future that seemingly ends at the marginal cost of electricity. So I'm trying to sketch out what that is, and it isn't super easy to figure out. Anything else we should cover on inference? And then my next thing is to switch gears and go to the new 65K context window model.
Abhi Venigalla: 48:26 The only thing about inference I'd say is, if you're interested, please reach out, because I think we have very competitive pricing and a lot more transparent model than some of the other API providers out there. So if you want to build custom models and serve them, just let us know.
Jonathan Frankle: 48:40 And I will throw in on that. If you want, you better reach out ASAP because we're definitely getting, as our research team is learning, we're getting booked up very, very quickly. Certainly, we're making hard choices between research and customers right now because there's so much demand.
Nathan Labenz: 48:58 Yeah, I'm not surprised by that at all. Following your trajectory from GPT-3 quality for $500,000 however many months ago, to the more recent Stable Diffusion for, I think it was, 150 or so, and now all this new stuff as well, I would expect the phone is ringing off the hook.
Jonathan Frankle: 49:20 Less than 50. Gotta get that number.
Nathan Labenz: 49:22 Oh, less than 50. Yeah. Okay. Thank you.
Abhi Venigalla: 49:25 I think we've crossed that price tag a few times, so it's changed a bit.
Nathan Labenz: 49:29 Yeah, you guys are hard to keep up with; it's a lot of releases. So, okay, let's talk about one aspect of the recent set of releases: the StoryWriter model, with the 65,000-token window that extends even beyond that. I've read the paper a little bit, but I'd love to get a better intuition for it. It seems like this ALiBi approach is kind of the heart of the upgrade: essentially, you're replacing the original, old-school positional embedding scheme with a new, seemingly more intuitively justifiable, principled positional approach. But it still seems like the model gets big, right? To have that long token window, attention still scales with the square of the context length. Do I have that right? Nothing has changed about that, has it?
Abhi Venigalla: 50:29 Yeah, no, 100%. It's still the same amount of work as before; we just swapped out the positional embedding and replaced it with a bias in the attention. But one of the nuances here is that the quadratic portion of the attention is actually not as big as many people might think. For a lot of very large transformers, if you're training with a context of, let's say, 2K, the work being done in that attention dot product can be less than 10% of the real work of the network. So yes, as we stretch the context from 2K to 4K to 8K, we're stretching that 10% slice of the pie a lot bigger, but it's not ridiculously large. As a concrete data point, when we trained our StoryWriter model with a 65K context window, we took the base model and fine-tuned it on long context, and the work done per token went up by about 4X. So the cost to train, say, a couple billion tokens was 4X larger at that context length than it would have been at a 2K context. It's bigger, but it's not ridiculous.
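To make those proportions concrete, here is a back-of-the-envelope sketch in Python. The model width (d_model = 4096, roughly a 7B-class transformer) and the FLOP-counting conventions are my own illustrative assumptions, not numbers from the conversation, but the arithmetic lands in the same ballpark as the figures quoted above: attention is under 10% of per-token work at 2K context, and total per-token work grows by roughly 3 to 4X at 65K.

```python
# Back-of-the-envelope estimate (illustrative only): how much of a
# transformer's per-token forward-pass compute is the quadratic attention
# term, and how much total per-token work grows going from 2K to 65K context.
# d_model = 4096 is an assumed 7B-class width, not a quoted number.

def per_token_flops(d_model, seq_len):
    # QKV + output projections (~4*d^2) plus a 4x-wide MLP (~8*d^2),
    # counting each multiply-add as 2 FLOPs.
    matmul = 2 * (4 * d_model**2 + 8 * d_model**2)
    # Attention scores (seq*d) plus the weighted sum over values (seq*d).
    attention = 2 * (2 * seq_len * d_model)
    return matmul, attention

d = 4096
for seq in (2048, 65536):
    matmul, attn = per_token_flops(d, seq)
    print(f"seq={seq:>6}: attention is {attn / (matmul + attn):5.1%} of per-token work")

m2, a2 = per_token_flops(d, 2048)
m65, a65 = per_token_flops(d, 65536)
print(f"65K vs 2K work per token: {(m65 + a65) / (m2 + a2):.1f}x")
# Roughly 8% at 2K, about 73% at 65K, and about 3.4x overall, in line with
# the "less than 10%" and "about 4X" figures above.
```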
Nathan Labenz: 51:40 So from 2K to 65K, and it's actually even bigger than that, right? Because you allow it to go further.
Abhi Venigalla: 51:48 Yeah. So at inference time, you can technically go up to any context length, but practically speaking, maybe 2X more than it's been trained with. But yes.
Nathan Labenz: 51:59 So how can you have an infinite context? Is something just getting truncated eventually, or are you rounding down to zero past some distance? Because otherwise, traditionally, that would mean an infinite-by-infinite attention matrix, right?
Abhi Venigalla: 52:16 Yeah, no, absolutely. The "infinite" definitely has a lot of asterisks there. You're hard-walled by memory at some point, so it's not really infinite, and eventually you'll get tired of waiting as well. But what happens in ALiBi is that there are these sloped biases being added to the attention. You can imagine a slope running from negative 1 to 1, from the zero position all the way to the end. When we go up to a very, very long context, all we're doing is stretching that slope out, because it's continuous, to whatever target inference context we want. So we train with 2K, and we want it to go from negative 1 to 1 across those 2,000 tokens. Then for inference, if we want to do 4K, we just stretch it out farther. The bias being added can be easily morphed in that way.
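A minimal sketch of building an ALiBi-style additive bias for an arbitrary sequence length. Nothing in it is learned, which is the key point: the bias can be regenerated at whatever length you want at inference time. The per-head slope recipe below follows the published ALiBi paper (a geometric series per head); MPT's exact implementation details may differ.

```python
import torch

def alibi_bias(n_heads, seq_len):
    """ALiBi-style additive attention bias of shape (n_heads, seq_len, seq_len).
    No learned parameters, so it can be rebuilt for any sequence length."""
    # Per-head slopes form a geometric series, as in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # Distance from query position i back to key position j (positive when j is in the past).
    distance = pos[:, None] - pos[None, :]
    bias = -slopes[:, None, None] * distance
    # Mask out future positions for causal attention.
    return bias.masked_fill(distance < 0, float("-inf"))

# The same function serves the training length and a stretched inference
# length; only memory limits how far you go.
bias_train = alibi_bias(n_heads=8, seq_len=2048)
bias_long = alibi_bias(n_heads=8, seq_len=4096)
```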
Nathan Labenz: 53:03 So in the 65K model that you have released, is there still some sort of hard cap, a maximum number of tokens that this thing can handle?
Abhi Venigalla: 53:16 No, not really. In the model config, which you can see on the Hugging Face Hub as well, we have the max sequence length set at 65K, but you can totally adjust that. You download the model, change that one line in the config, and when the model is instantiated it will create its ALiBi bias for whatever sequence length you wish, and you'll be able to run inference up to, say, 130K or something like that.
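The pattern for overriding that config value looks roughly like the sketch below. The max_seq_len attribute and the use of trust_remote_code reflect my understanding of the MPT models on the Hugging Face Hub; check the mosaicml/mpt-7b-storywriter model card for the exact recipe, and note that the 130,000 value here is just the kind of stretched length being described, not a recommended setting.

```python
import transformers

name = "mosaicml/mpt-7b-storywriter"

# Load the config, then override the maximum sequence length before
# instantiating the model so the ALiBi bias is built at the longer length.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 130_000  # up from the shipped 65K default

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
)
```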
Jonathan Frankle: 53:39 I think the only 2 limitations you'll encounter with ALiBi, and this is why we love ALiBi, are: number 1, you'll just run out of memory. That is your first limitation; you'll need a bigger, beefier GPU with more RAM, or fancier parallelism or fancier caching or things like that. The only other limitation is that, empirically, if you've trained at a certain context window, ALiBi doesn't seem to extrapolate that well beyond about 2X whatever you trained on. So you'll run into algorithmic issues there, where the quality of the outputs will start to degrade if you go much longer than that. But otherwise, ALiBi can just keep going with whatever amount of memory you have. Find more memory, and that can give you a longer sequence.
Nathan Labenz: 54:21 So I'm still a little bit lost, and this is one thing I do: I just fixate on things and keep asking questions until, hopefully, I eventually understand. I've got finite parameters, but in principle, if I had infinite memory, I could go to infinite tokens in the context window. Can you unpack that transformation of these finite learned parameters into that approaching-infinite attention matrix?
Abhi Venigalla: 54:53 Totally. Yeah. Maybe the best way is to start with something close to a transformer that has no position knowledge at all. So we have all the learned weights, the word embeddings and the weights for all the matrix multiplies and so on, but nothing for the attention matrix. As you feed in tokens and save their keys and values, you can always keep attending to more and more and more; it's just that there's no position information. You're looking at a bag of words, a whole jumble of words, and your next token is attending to them but has no idea which came first. That model you could stretch out to any length, simply, because there are no learned parameters in the attention map. The one thing ALiBi adds is this bias that is set up at initialization and can be stretched. It's not learned; it's a fixed set of slopes that go from negative K to K. And so that's the matrix you can adjust dynamically at different times: say, I want to stretch this out over 1,000 positions, or 2,000 positions, or 100,000 positions. The mechanism is not learned, and that's what makes it possible to use whatever sequence length you want.
Jonathan Frankle: 56:16 I think to get to the heart of your question, the attention weights are shared. For each position, for each encoded token, you're using the same attention weights; you've got parameter reuse. So technically your sequence length doesn't matter. You can just keep reusing the same attention weights for any token you have, in the same way that a convolutional network, in many cases, can be completely agnostic to resolution, because you're just doing the same convolution across different positions in the image. That's why, fundamentally, there's no issue doing an infinite sequence other than running out of memory. You need somewhere to store the activations, but in a finite-parameter model you're just reusing that same fixed set of weights over and over again at every sequence position.
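A tiny sketch of that weight-reuse point. The dimensions are arbitrary for illustration, not MPT's: the same projection weights are applied at every position, so the parameter count stays fixed while only the activations (and the attention matrix) grow with sequence length.

```python
import torch
import torch.nn as nn

d_model = 256  # illustrative width, not a real model's
qkv_proj = nn.Linear(d_model, 3 * d_model)  # the same weights serve every position

n_params = sum(p.numel() for p in qkv_proj.parameters())
for seq_len in (16, 1024, 65536):
    x = torch.randn(1, seq_len, d_model)
    qkv = qkv_proj(x)  # applied position-wise; nothing in it depends on seq_len
    print(f"seq_len={seq_len:>6}  params={n_params:>8}  activation values={qkv.numel():>10}")
# The parameter count never changes; memory for activations (and the
# attention matrix) is what ultimately caps the context length.
```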
Nathan Labenz: 56:56 Anything else we want to cover on the 65K? Anything on the flash attention or other enhancements there that you think people ought to know about?
Abhi Venigalla: 57:05 When we went out with the 65K model, we showed some demonstrations with The Great Gatsby, writing out an epilogue afterwards. But I'd say one focus coming up soon is inference performance, basically making sure that this doesn't take minutes to write. I think one really impressive part of Claude and its recent release is that the model actually writes quite quickly. You'll probably see some content from us breaking down how to actually use long context windows and where it gets faster and slower. It turns out reading is a lot faster than writing with these transformers: you can feed 65K of context into the input, and that will go relatively fast, but then generating every token afterwards takes some time. So you'll see a lot of improvements from us on this in the coming weeks and months, as well as potentially some new architecture features.
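A rough sketch of why "reading" is faster than "writing" here (conceptual only, not a benchmark, and the token counts are arbitrary): the prompt is processed in a single parallel prefill pass, while generation runs one forward pass per new token, each step attending back over the cached keys and values.

```python
# Conceptual sketch, not a benchmark: with a long prompt and a modest
# continuation, the prompt ("reading") is handled in one parallel pass,
# while every generated token ("writing") needs its own sequential pass
# that attends over the growing key/value cache.

prompt_tokens = 65_000   # fed in all at once
new_tokens = 1_000       # generated one at a time

prefill_passes = 1
decode_passes = new_tokens

# Attention positions visited per decode step grow as the cache grows.
attention_reads = sum(prompt_tokens + t for t in range(new_tokens))

print(f"prefill: {prompt_tokens} tokens in {prefill_passes} parallel pass")
print(f"decode:  {new_tokens} tokens in {decode_passes} sequential passes")
print(f"decode attends over ~{attention_reads:,} cached positions in total")
```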
Jonathan Frankle: 57:56 I'll throw in, perhaps as we get the last word: keep your eyes out for things that are coming from us. We've released a lot over the past couple of weeks, and that cadence is not going to be slowing down at all. We have a lot more work that's going to be churning out over the coming days, weeks, and months, so I hope we'll be having these conversations a lot over the next while. MPT-7B and Stable Diffusion for less than $50. Those are the boring baselines that we're going to be crushing over the next little while.
Nathan Labenz: 58:27 Well, that's a great teaser. I look forward hopefully to a part 2 in the not too distant future. And it sounds like you'll have some exciting new stuff to help us break down and understand. But for now, Abhi Venigalla and Jonathan Frankle, thank you for being part of the Cognitive Revolution.
Jonathan Frankle: 58:45 Thank you so much for having us.
Abhi Venigalla: 58:47 Thanks so much.